From hlapp at drycafe.net Mon Aug 1 18:36:27 2011 From: hlapp at drycafe.net (Hilmar Lapp) Date: Mon, 1 Aug 2011 18:36:27 -0400 Subject: [Biopython] Job opportunity: User Interface Design and Web Application Developer Message-ID: <7F0AE58E-6052-469B-ACD0-207FAD060472@drycafe.net> (Apologies if you have received this already or if this is considered spam - we're trying to reach out as broad as possible and I know that quite a few in the Bio* communities would be well qualified. Please feel free to pass on to anyone who might be interested, or might know someone who is.) User Interface Design and Web Application Developer The National Evolutionary Synthesis Center (NESCent) seeks a creative and enthusiastic individual to design user interfaces and web applications for scientific applications that manage, analyze, visualize and share data in support of evolutionary research. The incumbent will work as part of a small informatics team in close collaboration with domain scientists. NESCent (http://nescent.org) is an NSF-funded center dedicated to cross-disciplinary research in evolutionary science. Our informatics team works closely with visiting and resident scientists to support their custom software and database development needs (http://informatics.nescent.org ), and collaborates broadly with other biodiversity informatics projects. All NESCent software products are open-source, and the Center has a number of initiatives to actively promote collaborative development of community software resources. Above all, we are enthusiastic about our work, about the mission of the Center, and about the contribution of informatics to that mission. Job description: The incumbent will design and develop user interfaces and web applications for databases and other software tools for sponsored scientists and staff. 
The job responsibilities include all stages of the software development process, including requirements gathering, design, implementation, release packaging and documentation, as part of a small team (typically 2-3 individuals). We expect the incumbent to present their work at conferences and contribute to publications with scientific collaborators; interact regularly with visiting and resident scientists, other members of the informatics team and Center staff; and generally serve as an expert resource for Center personnel. The position provides opportunities for professional development and encourages research into new technologies. Most informatics staff work at our Durham NC offices, located adjacent to Duke University, but we support a wide range of technologies for virtual communication with off-site staff and collaborators. Salary range: $70,000 - $80,000, depending on education and experience Required Qualifications: * Demonstrated success collaborating with clients on custom software solutions * Experience with various stages of the software development cycle * Expertise in development and testing of user interface designs * Excellent communication skills, both virtual and face-to-face Preferred Qualifications: * M.S. or Ph.D. in Computer Science, Bioinformatics or related field * Demonstrated interest in science, particularly biology * Expertise in dynamic and interactive web technologies (JavaScript, CGI) * Expertise in rapid application development and respective programming technologies and languages (e.g., modern scripting languages and web-application frameworks such as Python/Django, Ruby/ Ruby-on-Rails, and Perl/Catalyst). 
* Expertise in graphic design
* Expertise in data visualization and/or scientific data integration
* Expertise in software usability design and assessment
* Expertise in web service (SOAP, REST, XML, JSON) and semantic web technologies
* Fluency in Java programming
* Prior experience in relational database programming (PostgreSQL or MySQL)
* Experience with open-source, collaborative software development

How to apply: Please send a cover letter, resume and contact information for three references to Dr. Karen Cranston, Training Coordinator and Bioinformatics Project Manager (karen.cranston at nescent.org). Please also complete the online application at the University of North Carolina HR website: http://bit.ly/r9HQ8r. Informal inquiries or requests for additional information may be directed to Dr. Cranston by email or phone (+1-919-613-2275). Closing date is August 15, 2011.

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

From lueck at ipk-gatersleben.de Tue Aug 2 09:15:40 2011
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Tue, 2 Aug 2011 15:15:40 +0200
Subject: [Biopython] clustalW output format
Message-ID: <000001cc5116$46def910$1022a8c0@ipkgatersleben.de>

Hello!

I'm using ClustalW2 for my alignments and would like to have the aln1 output format (http://www.ebi.ac.uk/Tools/msa/clustalw2/help/). This should show the line numbers. Actually this should be the default but it's not.

I tried to add aln1 in \Bio\Align\Applications\_Clustalw.py in line 100+

    Option(["-output", "-OUTPUT", "OUTPUT", "output"], ["input"], lambda x: x in ["GCG", "GDE", "PHYLIP", "PIR", "NEXUS", "ALN1" "gcg", "gde", "phylip", "pir", "nexus", "aln1"]

but it doesn't work. Any ideas?

Thanks in advance!
Stefanie

From p.j.a.cock at googlemail.com Tue Aug 2 10:54:42 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 2 Aug 2011 15:54:42 +0100
Subject: [Biopython] clustalW output format
In-Reply-To: <000001cc5116$46def910$1022a8c0@ipkgatersleben.de>
References: <000001cc5116$46def910$1022a8c0@ipkgatersleben.de>
Message-ID:

On Tue, Aug 2, 2011 at 2:15 PM, Stefanie Lück wrote:
> Hello!
>
> I'm using ClustalW2 for my alignments and would like to have the aln1 output
> format (http://www.ebi.ac.uk/Tools/msa/clustalw2/help/). This should show
> the line numbers. Actually this should be the default but it's not.

I have version 2.1 installed and the default output format is traditional Clustal output with no residue/base numbers (according to -help). Which version of ClustalW2 are you using?

I'm expecting we'll need a new wrapper for Clustal Omega (I don't know why they didn't call it Clustalw v3). We'll probably also need to update the Clustalw parser to cope with base/residue numbering in the output as well.

Peter

From steven.irvin at monsanto.com Tue Aug 2 11:47:41 2011
From: steven.irvin at monsanto.com (IRVIN, STEVEN (AG-Contractor/1000))
Date: Tue, 2 Aug 2011 15:47:41 +0000
Subject: [Biopython] Bio.Blast.Applications issue with outfmt="quoted string"
Message-ID: <8F46CBF672774F4C8A6B288A246B4468A532BF@stlwexmbxprd02.na.ds.monsanto.com>

Hello,

I am having an issue with the Biopython module making BLAST+ queries.

I am wondering if there is any support in Bio.Blast.Applications for using the multiple arguments to -outfmt allowed by NCBI BLAST+ programs such as blastn.

I need to use this for example:

    blastn_cline = NcbiblastnCommandline(query='somefastafile.fas', db='tomato_cdna.db', evalue=1000, word_size=7, outfmt='10 qseqid sseqid length pident', out='outfile.txt')

The multiple arguments allowed to blastn -outfmt allow the choice of specific columns output to the csv or tab separated file, such as subject_id, etc.
Biopython is returning non-zero exit status 1: USAGE when I run my program with the above statement.

Here is an example command line for BLAST+:

    prompt_: blastn -query seq_fasta.fas -db local_db.db -out csv_out.csv -dust no -num_alignments 20 -num_descriptions 20 -evalue 1000 -word_size 7 -task blastn -outfmt "10 qseqid sseqid length pident"

I do not yet know if something in Bio.Blast.Applications needs to be modified to support this.

Steve

Steven D Irvin, MS
Bioinformatics Analyst
CC214-A Monsanto Research Center
Chesterfield Village, MO
Steven.Irvin at monsanto.com
(636) 737-1756

-------------- next part --------------
A non-text attachment was scrubbed...
Name: image001.png
Type: image/png
Size: 3721 bytes
Desc: image001.png
URL:

From p.j.a.cock at googlemail.com Tue Aug 2 12:14:25 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 2 Aug 2011 17:14:25 +0100
Subject: [Biopython] Bio.Blast.Applications issue with outfmt="quoted string"
In-Reply-To: <8F46CBF672774F4C8A6B288A246B4468A532BF@stlwexmbxprd02.na.ds.monsanto.com>
References: <8F46CBF672774F4C8A6B288A246B4468A532BF@stlwexmbxprd02.na.ds.monsanto.com>
Message-ID:

On Tue, Aug 2, 2011 at 4:47 PM, IRVIN, STEVEN (AG-Contractor/1000) wrote:
> Hello,
>
> I am having an issue with the Biopython module making BLAST+ queries.
>
> I am wondering if there is any support in Bio.Blast.Applications for using the multiple arguments to -outfmt allowed by NCBI BLAST+ programs such as blastn.
>
> I need to use this for example:
>
>     blastn_cline = NcbiblastnCommandline(query='somefastafile.fas', db='tomato_cdna.db', evalue=1000, word_size=7, outfmt='10 qseqid sseqid length pident', out='outfile.txt')
>
> The multiple arguments allowed to blastn -outfmt allow the choice of specific columns output to the csv or tab separated file such subject_id, etc.

Yes, and they are very useful. Try:

    blastn_cline = NcbiblastnCommandline(query='somefastafile.fas', db='tomato_cdna.db', evalue=1000, word_size=7, outfmt='"10 qseqid sseqid length pident"', out='outfile.txt')

i.e. Include the extra quotes explicitly. That's single quote, double quote, text, double quote, single quote. (There are other ways to embed double quote characters in a Python string but that works for me.)

Peter

From lueck at ipk-gatersleben.de Wed Aug 3 07:19:50 2011
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Wed, 3 Aug 2011 13:19:50 +0200
Subject: [Biopython] clustalW output format
In-Reply-To:
References: <000001cc5116$46def910$1022a8c0@ipkgatersleben.de>
Message-ID: <000001cc51cf$42d16100$1022a8c0@ipkgatersleben.de>

Thanks Peter!

I'm also using version 2.1.
I didn't check the -help, only the homepage where they say "Default value is: Aln w/numbers [aln1]", which confused me...

Thanks for mentioning Clustal Omega, I didn't know that they changed the names.

Stefanie

-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock at googlemail.com]
Sent: Tuesday, 2 August 2011 16:55
To: Stefanie Lück
Cc: biopython at lists.open-bio.org
Subject: Re: [Biopython] clustalW output format

On Tue, Aug 2, 2011 at 2:15 PM, Stefanie Lück wrote:
> Hello!
>
> I'm using ClustalW2 for my alignments and would like to have the aln1 output
> format (http://www.ebi.ac.uk/Tools/msa/clustalw2/help/). This should show
> the line numbers. Actually this should be the default but it's not.

I have version 2.1 installed and the default output format is traditional Clustal output with no residue/base numbers (according to -help). Which version of ClustalW2 are you using?

I'm expecting we'll need a new wrapper for Clustal Omega (I don't know why they didn't call it Clustalw v3). We'll probably also need to update the Clustalw parser to cope with base/residue numbering in the output as well.

Peter

From jgrant at smith.edu Mon Aug 8 14:08:25 2011
From: jgrant at smith.edu (Jessica Grant)
Date: Mon, 8 Aug 2011 14:08:25 -0400
Subject: [Biopython] deleting in-group paralogs from newick trees
Message-ID: <5D7AD333-66EC-4B23-950E-523E2FBD2A62@smith.edu>

Hello,

I am looking at large phylogenetic trees that have many paralogs. I would like to simplify my trees so that all monophyletic paralog groups are collapsed--or all sequences except the shortest branch are deleted. Is there a Biopython module that can help? I started looking at Phylo, but couldn't see an obvious way.
Thanks,
Jessica

From eric.talevich at gmail.com Mon Aug 8 15:33:52 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 8 Aug 2011 15:33:52 -0400
Subject: [Biopython] deleting in-group paralogs from newick trees
In-Reply-To: <5D7AD333-66EC-4B23-950E-523E2FBD2A62@smith.edu>
References: <5D7AD333-66EC-4B23-950E-523E2FBD2A62@smith.edu>
Message-ID:

On Mon, Aug 8, 2011 at 2:08 PM, Jessica Grant wrote:
> Hello,
>
> I am looking at large phylogenetic trees that have many paralogs. I would
> like to simplify my trees so that all monophyletic paralog groups are
> collapsed--or all sequences except the shortest branch are deleted. Is
> there a Biopython module that can help? I started looking at Phylo, but
> couldn't see an obvious way.
>

Hi Jessica,

Yes, Phylo is the right module to use. If I understand your problem correctly, the tree methods you want are is_monophyletic() and collapse_all(). Both operate on a clade within the tree. You'd traverse the tree with get_nonterminals(), check if a paralog group under a clade is monophyletic, and if so, collapse it.

Do you have a list of paralogs already? And, do you know which groups might be monophyletic? If you have groups/clades already, it's simple:

>>> from Bio import Phylo
>>> tree = Phylo.read('mytree.nwk', 'newick')
>>> for clade in tree.get_nonterminals(order='postorder'):
...     mono_parent = clade.is_monophyletic([SOME_PARALOG_GROUP])
...     if mono_parent:
...         mono_parent.collapse_all()

If you don't know the groups yet, then the test inside the loop is a little more elaborate. You can look for overlaps between a clade's tips and the paralog list using sets:

>>> paralogs = set(PARALOG_LIST)
# Inside the loop:
>>> tips = set([str(t) for t in clade.get_terminals()])
>>> overlap = tips.intersection(paralogs)
>>> if len(overlap) >= 2:
# The rest of the loop...
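A minimal, standard-library-only sketch of the set-overlap test described above. It does not use Bio.Phylo: the tree is a toy nested list, and the helper names `tips` and `collapse_paralogs`, the gene labels, and the rule of keeping the alphabetically first tip are all assumptions made for illustration (keeping the shortest branch, as asked in the original question, would need branch lengths, which this toy tree does not carry).

```python
def tips(node):
    """Return the set of leaf names under a node.

    A leaf is a string; an internal node is a list of child nodes.
    """
    if isinstance(node, str):
        return {node}
    names = set()
    for child in node:
        names |= tips(child)
    return names


def collapse_paralogs(node, paralogs):
    """Replace each clade whose tips are all paralogs by one representative tip."""
    if isinstance(node, str):
        return node
    clade_tips = tips(node)
    overlap = clade_tips.intersection(paralogs)
    # The clade is a pure paralog group when every tip is in the paralog
    # set and there are at least two of them.
    if len(overlap) >= 2 and overlap == clade_tips:
        # Arbitrary choice for this sketch: keep the alphabetically first tip.
        return sorted(clade_tips)[0]
    return [collapse_paralogs(child, paralogs) for child in node]


# Hypothetical toy tree and paralog list.
tree = [["geneA_1", "geneA_2"], ["geneB", ["geneA_3", "geneC"]]]
paralogs = {"geneA_1", "geneA_2", "geneA_3"}
collapsed = collapse_paralogs(tree, paralogs)
print(collapsed)
```

Only the first clade is collapsed here: geneA_3 is a paralog too, but its clade also contains geneC, so that group is not purely paralogs and is left alone.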
Hope that helps,
Eric

From cjfields at illinois.edu Tue Aug 9 16:09:05 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 9 Aug 2011 15:09:05 -0500
Subject: [Biopython] [BioRuby] Interesting BLAST 2.2.25+ XML behaviour
In-Reply-To:
References: <398303E2-1195-4CC2-8B73-09C6C1117892@illinois.edu> <4DC12371.3040204@gmail.com>
Message-ID:

I'm reviving this thread to see what the current status is (if anything has changed). The bioperl parser has the same problem; at the moment we're basically stuck until NCBI gives some indication as to whether this is a bug or not. Any word back from them yet?

(and agreed, it would be nice to have an external bug tracker from NCBI).

chris

On May 4, 2011, at 5:36 AM, Peter Cock wrote:
> On Wed, May 4, 2011 at 10:59 AM, Michal wrote:
>> Hi Peter,
>> Do you have the script which read
>>
>> https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml
>>
>>
>> and what would be the correct output?
>>
>> Thank you in advance.
>>
>> Cheers,
>> Michal
>
> Hi Michal,
>
> I'm not quite sure what you're asking, but I'll try.
First, the three > data files: > > $ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml > $ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/four_human_proteins.fasta > $ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/rhodopsin_proteins.fasta > > The query file has four sequences, > > $ grep -c "^>" four_human_proteins.fasta > 4 > > $ grep "^>" four_human_proteins.fasta >> sp|Q9BS26|ERP44_HUMAN Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1 >> sp|Q9NSY1|BMP2K_HUMAN BMP-2-inducible protein kinase OS=Homo sapiens GN=BMP2K PE=1 SV=2 >> sp|P06213|INSR_HUMAN Insulin receptor OS=Homo sapiens GN=INSR PE=1 SV=4 >> sp|P08100|OPSD_HUMAN Rhodopsin OS=Homo sapiens GN=RHO PE=1 SV=1 > > Based on past experience, I would expect 4 iteration blocks in the > XML, but in this case I have 24: > > $ grep "" -c blastp_four_human_vs_rhodopsin.xml > 24 > > Notice we get 6 iterations for each query (4 times 6 is 24): > > $ grep "" blastp_four_human_vs_rhodopsin.xml > sp|Q9BS26|ERP44_HUMAN > sp|Q9BS26|ERP44_HUMAN > sp|Q9BS26|ERP44_HUMAN > sp|Q9BS26|ERP44_HUMAN > sp|Q9BS26|ERP44_HUMAN > sp|Q9BS26|ERP44_HUMAN > sp|Q9NSY1|BMP2K_HUMAN > sp|Q9NSY1|BMP2K_HUMAN > sp|Q9NSY1|BMP2K_HUMAN > sp|Q9NSY1|BMP2K_HUMAN > sp|Q9NSY1|BMP2K_HUMAN > sp|Q9NSY1|BMP2K_HUMAN > sp|P06213|INSR_HUMAN > sp|P06213|INSR_HUMAN > sp|P06213|INSR_HUMAN > sp|P06213|INSR_HUMAN > sp|P06213|INSR_HUMAN > sp|P06213|INSR_HUMAN > sp|P08100|OPSD_HUMAN > sp|P08100|OPSD_HUMAN > sp|P08100|OPSD_HUMAN > sp|P08100|OPSD_HUMAN > sp|P08100|OPSD_HUMAN > sp|P08100|OPSD_HUMAN > > Now, using the two FASTA files directly and re-running blastp, what do I get? 
> > $ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query > four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5 > | grep "" -c > 24 > > Or again with -parse_deflines, which changes how the hit ID/def is presented: > > $ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query > four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5 > -parse_deflines | grep "" -c > 24 > > How about older versions? > > $ ~/Downloads/ncbi-blast-2.2.24+/bin/blastp -query > four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5 > BLAST engine error: XML formatting is only supported for a database search > > I'll have to make a blast database first... > > $ ~/Downloads/ncbi-blast-2.2.24+/bin/makeblastdb -in > rhodopsin_proteins.fasta -dbtype prot > > Building a new DB, current time: 05/04/2011 11:22:57 > New DB name: rhodopsin_proteins.fasta > New DB title: rhodopsin_proteins.fasta > Sequence type: Protein > Keep Linkouts: T > Keep MBits: T > Maximum file size: 1073741824B > Adding sequences from FASTA; added 6 sequences in 0.105655 seconds. > > $ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query > four_human_proteins.fasta -db rhodopsin_proteins.fasta -outfmt 5 | > grep "" -c > 4 > > Look - just four identifiers as I expect! This also works if the database > is built with the -parse_seqids switch. > > The same happens with older versions of BLAST+, one > block per query, so four iteration blocks for this example. I tried all > of 2.2.21+, 2.2.22+, 2.2.23+ and 2.2.24+ (running makeblastdb to > give a fresh database, then blastp). > > That seems to demonstrate that bug is specific to the XML output > from FASTA vs FASTA (not FASTA vs DB), which is a new feature > in NCBI BLAST 2.2.25+ > > I will raise this with the NCBI, and report back. 
>
> However, even if the NCBI fix it in the next release, we (Bio*) may
> want to update our parsers to cope with this quirk, or at least put a
> warning in our BLAST XML parser documentation, as there will be
> lots of installations of NCBI BLAST 2.2.25+ in the wild.
>
> Peter

From p.j.a.cock at googlemail.com Wed Aug 10 05:15:18 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 10 Aug 2011 10:15:18 +0100
Subject: [Biopython] [BioRuby] Interesting BLAST 2.2.25+ XML behaviour
In-Reply-To:
References: <398303E2-1195-4CC2-8B73-09C6C1117892@illinois.edu> <4DC12371.3040204@gmail.com>
Message-ID:

On Tue, Aug 9, 2011 at 9:09 PM, Chris Fields wrote:
> I'm reviving this thread to see what the current status is (if anything has
> changed). The bioperl parser has the same problem; at the moment we're
> basically stuck until NCBI gives some indication as to whether this is a
> bug or not. Any word back from them yet?
>
> (and agreed, it would be nice to have an external bug tracker from NCBI).
>
> chris

Hi Chris,

My email to the NCBI on 17 May had a reply from Tao Tao (NCBI User services) saying it would be brought to their developers' attention.

For reference, the email subject line was:
"Multiple iteration blocks per query in FASTA vs FASTA BLAST XML"

I have just emailed back to enquire if there is any news to report.

Peter

From cjfields at illinois.edu Wed Aug 10 21:35:37 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 10 Aug 2011 20:35:37 -0500
Subject: [Biopython] [BioRuby] Interesting BLAST 2.2.25+ XML behaviour
In-Reply-To:
References: <398303E2-1195-4CC2-8B73-09C6C1117892@illinois.edu> <4DC12371.3040204@gmail.com>
Message-ID:

On Aug 10, 2011, at 4:15 AM, Peter Cock wrote:
> On Tue, Aug 9, 2011 at 9:09 PM, Chris Fields wrote:
>> I'm reviving this thread to see what the current status is (if anything has
>> changed).
The bioperl parser has the same problem; at the moment we're
>> basically stuck until NCBI gives some indication as to whether this is a
>> bug or not. Any word back from them yet?
>>
>> (and agreed, it would be nice to have an external bug tracker from NCBI).
>>
>> chris
>
> Hi Chris,
>
> My email to the NCBI on 17 May had a reply from Tao Tao (NCBI User
> services) saying it would be brought to their developers' attention.
>
> For reference, the email subject line was:
> "Multiple iteration blocks per query in FASTA vs FASTA BLAST XML"
>
> I have just emailed back to enquire if there is any news to report.
>
> Peter

Wonder if it's worth a second prod from someone else. Sometimes that gets their attention.

chris

From p.j.a.cock at googlemail.com Thu Aug 11 05:09:13 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Aug 2011 10:09:13 +0100
Subject: [Biopython] [BioRuby] Interesting BLAST 2.2.25+ XML behaviour
In-Reply-To:
References: <398303E2-1195-4CC2-8B73-09C6C1117892@illinois.edu> <4DC12371.3040204@gmail.com>
Message-ID:

On Thu, Aug 11, 2011 at 2:35 AM, Chris Fields wrote:
>> Hi Chris,
>>
>> My email to the NCBI on 17 May had a reply from Tao Tao (NCBI User
>> services) saying it would be brought to their developers' attention.
>>
>> For reference, the email subject line was:
>> "Multiple iteration blocks per query in FASTA vs FASTA BLAST XML"
>>
>> I have just emailed back to enquire if there is any news to report.
>>
>> Peter
>
> Wonder if it's worth a second prod from someone else. Sometimes
> that gets their attention.
>
> chris

Tao replied yesterday morning (US time) to confirm the test files so he (she?) could try this on the latest code.
Peter

From mok at bioxray.dk Thu Aug 11 08:40:57 2011
From: mok at bioxray.dk (Morten Kjeldgaard)
Date: Thu, 11 Aug 2011 14:40:57 +0200
Subject: [Biopython] Unable to convert alignment to nexus format
Message-ID: <1313066457.10034.11.camel@yeti.daimi.au.dk>

Hi,

I am getting an exception when trying to output an alignment in nexus format:

    ValueError: Need a DNA, RNA or Protein alphabet

The alignment is read by AlignIO.read() in fasta format from an output file written by Muscle, and so the alphabet specified in the sequences is IUPACProtein. Apparently, NexusIO checks for ProteinAlphabet and thus fails. I am using BioPython 1.56.

Here is a 4-line test program generating the exception:

    from Bio import AlignIO
    alignment = AlignIO.read(open("aln.muscle"), "fasta")
    g = open("aln.nexus", "w")
    g.write (alignment.format("nexus"))

Any (safe) workarounds here?

Cheers,
Morten

--
Morten Kjeldgaard, asc. professor, MSc, PhD
BiRC - Bioinformatics Research Center, Aarhus University
C.F. Møllers Alle, Building 1110, DK-8000 Aarhus C, Denmark.
Lab +45 8942 3130 * Mobile: +45 5186 0147 * Home +45 8618 8180

From chapmanb at 50mail.com Thu Aug 11 09:52:38 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 11 Aug 2011 09:52:38 -0400
Subject: [Biopython] Unable to convert alignment to nexus format
In-Reply-To: <1313066457.10034.11.camel@yeti.daimi.au.dk>
References: <1313066457.10034.11.camel@yeti.daimi.au.dk>
Message-ID: <20110811135238.GF3143@kunkel>

Morten;

> I am getting an exception when trying to output an alignment in nexus
> format:
>
> ValueError: Need a DNA, RNA or Protein alphabet
>
> The alignment is read by AlignIO.read() in fasta format from an output
> file written by Muscle, and so the alphabet specified in the sequences
> is IUPACProtein. Apparently, NexusIO checks for ProteinAlphabet and thus
> fails. I am using BioPython 1.56.

If you specify the alphabet to AlignIO, Nexus will be happy.
Here's your modified test program:

    from Bio import AlignIO
    from Bio.Alphabet import IUPAC, Gapped
    alignment = AlignIO.read(open("aln.muscle"), "fasta", alphabet=Gapped(IUPAC.protein))
    g = open("aln.nexus", "w")
    g.write (alignment.format("nexus"))

Hope this helps,
Brad

From p.j.a.cock at googlemail.com Thu Aug 11 10:04:50 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Aug 2011 15:04:50 +0100
Subject: [Biopython] Unable to convert alignment to nexus format
In-Reply-To: <1313066457.10034.11.camel@yeti.daimi.au.dk>
References: <1313066457.10034.11.camel@yeti.daimi.au.dk>
Message-ID:

On Thu, Aug 11, 2011 at 1:40 PM, Morten Kjeldgaard wrote:
> Hi,
>
> I am getting an exception when trying to output an alignment in nexus
> format:
>
> ValueError: Need a DNA, RNA or Protein alphabet
>
> The alignment is read by AlignIO.read() in fasta format from an output
> file written by Muscle, and so the alphabet specified in the sequences
> is IUPACProtein.

Yes, but it was written out as a FASTA file which does not record the alphabet. Biopython does not try to guess this, you must be explicit.

> Apparently, NexusIO checks for ProteinAlphabet and thus
> fails. I am using BioPython 1.56.

As Brad described, when you parse the FASTA alignment, tell Biopython it is protein.

Peter

From aaronquinlan at gmail.com Fri Aug 12 14:39:34 2011
From: aaronquinlan at gmail.com (Aaron Quinlan)
Date: Fri, 12 Aug 2011 14:39:34 -0400
Subject: [Biopython] Working with genomic intervals
Message-ID:

All,

I apologize in advance if this is a naive question. I am wondering if BioPython provides libraries for working with genomic intervals in BED, GFF, or any other like format? I am looking for libraries that handle the parsing of files in these formats into Python objects, as well as libraries for manipulating (intersection, merging, counting, etc.) intervals. I know this exists in Galaxy's bx-python, but am wondering if there are similar libraries in BioPython?
Gratefully,
Aaron

From p.j.a.cock at googlemail.com Sun Aug 14 07:11:37 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 14 Aug 2011 12:11:37 +0100
Subject: [Biopython] Working with genomic intervals
In-Reply-To:
References:
Message-ID:

On Friday, August 12, 2011, Aaron Quinlan wrote:
> All,
>
> I apologize in advance if this is a naive question.
> I am wondering if BioPython provides libraries for
> working with genomic intervals in BED, GFF, or
> any other like format? I am looking for libraries
> that handle the parsing of files in these formats
> into Python objects, as well as libraries for
> manipulating (intersection, merging, counting,
> etc.) intervals. I know this exists in Galaxy's
> bx-python, but am wondering if there are similar
> libraries in BioPython?
>
> Gratefully,
> Aaron

Hi Aaron,

Have a look at http://biopython.org/wiki/GFF_Parsing where Brad is working on this. He's also spoken highly of bx-python as I recall.

Peter

From sdavis2 at mail.nih.gov Sun Aug 14 07:48:15 2011
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Sun, 14 Aug 2011 07:48:15 -0400
Subject: [Biopython] Working with genomic intervals
In-Reply-To:
References:
Message-ID:

On Sun, Aug 14, 2011 at 7:11 AM, Peter Cock wrote:
> On Friday, August 12, 2011, Aaron Quinlan wrote:
>> All,
>>
>> I apologize in advance if this is a naive question.
>> I am wondering if BioPython provides libraries for
>> working with genomic intervals in BED, GFF, or
>> any other like format? I am looking for libraries
>> that handle the parsing of files in these formats
>> into Python objects, as well as libraries for
>> manipulating (intersection, merging, counting,
>> etc.) intervals. I know this exists in Galaxy's
>> bx-python, but am wondering if there are similar
>> libraries in BioPython?
>>
>> Gratefully,
>> Aaron
>
> Hi Aaron,
>
> Have a look at http://biopython.org/wiki/GFF_Parsing
> where Brad is working on this. He's also spoken
> highly of bx-python as I recall.
I would second the bx-python vote. Not only are the "normal" interval classes covered, but there are also some variants (clustering is one that comes to mind).

Sean

From lgautier at gmail.com Mon Aug 15 02:17:28 2011
From: lgautier at gmail.com (Laurent Gautier)
Date: Mon, 15 Aug 2011 08:17:28 +0200
Subject: [Biopython] Working with genomic intervals
In-Reply-To:
References:
Message-ID: <4E48B9F8.4070605@gmail.com>

On 2011-08-14 18:00, biopython-request at lists.open-bio.org wrote:
> On Sun, Aug 14, 2011 at 7:11 AM, Peter Cock wrote:
>> > On Friday, August 12, 2011, Aaron Quinlan wrote:
>>> >> All,
>>> >>
>>> >> I apologize in advance if this is a naive question.
>>> >> I am wondering if BioPython provides libraries for
>>> >> working with genomic intervals in BED, GFF, or
>>> >> any other like format? I am looking for libraries
>>> >> that handle the parsing of files in these formats
>>> >> into Python objects, as well as libraries for
>>> >> manipulating (intersection, merging, counting,
>>> >> etc.) intervals. I know this exists in Galaxy's
>>> >> bx-python, but am wondering if there are similar
>>> >> libraries in BioPython?
>>> >>
>>> >> Gratefully,
>>> >> Aaron
>> >
>> > Hi Aaron,
>> >
>> > Have a look at http://biopython.org/wiki/GFF_Parsing
>> > where Brad is working on this. He's also spoken
>> > highly of bx-python as I recall.
> I would second the bx-python vote. Not only are the "normal" interval
> classes covered, but there are also some variants (clustering is one
> that comes to mind).
>
> Sean

One can also access from Python the utilities for ranges available in Bioconductor, for example using the Bioconductor extension to rpy2 or rpy2 directly (maybe using dynamic class mapping features, as shown below):

    from rpy2.robjects.packages import importr
    iranges = importr("IRanges")
    # Python class IRanges as an API to Bioconductor's IRanges::IRanges
    from rpy2.robjects.methods import RS4, RS4Auto_Type
    class IRanges(RS4):
        __metaclass__ = RS4Auto_Type
        __rpackagename__ = "IRanges"
        __rname__ = "IRanges"

    # now in action
    >>> from rpy2.robjects.vectors import IntVector
    >>> ir = IRanges(iranges.IRanges(start = IntVector(range(10)), width = 11))
    >>> print(ir)
    IRanges of length 10
         start end width
    [1]      0  10    11
    [2]      1  11    11
    [3]      2  12    11
    [4]      3  13    11
    [5]      4  14    11
    [6]      5  15    11
    [7]      6  16    11
    [8]      7  17    11
    [9]      8  18    11
    [10]     9  19    11
    >>> print(IRanges(ir.reduce__IRanges(ir)))
    IRanges of length 1
        start end width
    [1]     0  19    20

From aaronquinlan at gmail.com Mon Aug 15 19:54:31 2011
From: aaronquinlan at gmail.com (Aaron Quinlan)
Date: Mon, 15 Aug 2011 19:54:31 -0400
Subject: [Biopython] Working with genomic intervals
In-Reply-To: <4E48B9F8.4070605@gmail.com>
References: <4E48B9F8.4070605@gmail.com>
Message-ID: <10D5E6D5-7114-406C-A2CF-8EB211CCE8D2@gmail.com>

Dear Peter, Sean, and Laurent,

Thanks so much for the useful suggestions.

Best,
Aaron

On Aug 15, 2011, at 2:17 AM, Laurent Gautier wrote:
> On 2011-08-14 18:00, biopython-request at lists.open-bio.org wrote:
>> On Sun, Aug 14, 2011 at 7:11 AM, Peter Cock wrote:
>>> > On Friday, August 12, 2011, Aaron Quinlan wrote:
>>>> >> All,
>>>> >>
>>>> >> I apologize in advance if this is a naive question.
>>>> >> I am wondering if BioPython provides libraries for
>>>> >> working with genomic intervals in BED, GFF, or
>>>> >> any other like format?
I am looking for libraries
>>>> >> that handle the parsing of files in these formats
>>>> >> into Python objects, as well as libraries for
>>>> >> manipulating (intersection, merging, counting,
>>>> >> etc.) intervals. I know this exists in Galaxy's
>>>> >> bx-python, but am wondering if there are similar
>>>> >> libraries in BioPython?
>>>> >>
>>>> >> Gratefully,
>>>> >> Aaron
>>> >
>>> > Hi Aaron,
>>> >
>>> > Have a look at http://biopython.org/wiki/GFF_Parsing
>>> > where Brad is working on this. He's also spoken
>>> > highly of bx-python as I recall.
>> I would second the bx-python vote. Not only are the "normal" interval
>> classes covered, but there are also some variants (clustering is one
>> that comes to mind).
>>
>> Sean
>
> One can also access from Python the utilities for ranges available in
> Bioconductor, for example using the Bioconductor extension to rpy2 or rpy2
> directly (maybe using the dynamic class mapping features, as shown below):
>
> from rpy2.robjects.packages import importr
> iranges = importr("IRanges")
> # Python class IRanges as an API to Bioconductor's IRanges::IRanges
> from rpy2.robjects.methods import RS4, RS4Auto_Type
> class IRanges(RS4):
>     __metaclass__ = RS4Auto_Type
>     __rpackagename__ = "IRanges"
>     __rname__ = "IRanges"
>
> # now in action
>
> >>> from rpy2.robjects.vectors import IntVector
> >>> ir = IRanges(iranges.IRanges(start = IntVector(range(10)), width = 11))
> >>> print(ir)
> IRanges of length 10
>      start end width
> [1]      0  10    11
> [2]      1  11    11
> [3]      2  12    11
> [4]      3  13    11
> [5]      4  14    11
> [6]      5  15    11
> [7]      6  16    11
> [8]      7  17    11
> [9]      8  18    11
> [10]     9  19    11
> >>> print(IRanges(ir.reduce__IRanges(ir)))
> IRanges of length 1
>     start end width
> [1]     0  19    20
>
>

From brandonjbreitling at gmail.com Wed Aug 17 17:44:21 2011
From: brandonjbreitling at gmail.com (Brandon Breitling)
Date: Wed, 17 Aug 2011 21:44:21 +0000 (UTC)
Subject: [Biopython] Question on your Methods in Enzymology paper
References:
Message-ID:

Hi Mr.
Lunt,

My name is Brandon Breitling and I'm a statistics graduate student in the United States. I was wondering if you had the scripts or code available from your "Inference of Direct Residue Contacts in Two-Component Signaling" paper. I'm trying to see if I can do the same for a eukaryotic protein pair that my lab studies.

I have created the concatenated strings dataset for my protein as described in your paper and have attempted to make scripts for the MI steps, but would really benefit if I could get them for all the steps in the Direct Coupling analysis. If you could also email me the accession number for your dataset so that I can verify that I have the scripts working, that would be most appreciated as well.

Regards,
Brandon Breitling

From p.j.a.cock at googlemail.com Wed Aug 17 18:21:44 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 23:21:44 +0100
Subject: [Biopython] Question on your Methods in Enzymology paper
In-Reply-To:
References:
Message-ID:

On Wed, Aug 17, 2011 at 10:44 PM, Brandon Breitling wrote:
>
> Hi Mr. Lunt,
>
> My name is Brandon Breitling and I'm a statistics
> graduate student in the United States. I was
> wondering if you had the scripts or code available
> from your "Inference of Direct Residue Contacts
> in Two-Component Signaling" paper. I'm trying
> to see if I can do the same for a eukaryotic
> protein pair that my lab studies.
>
> I have created the concatenated strings dataset
> for my protein as described in your paper and
> have attempted to make scripts for the MI steps
> but would really be benefited if I could get
> them for the all steps in the Direct Coupling
> analysis. If you could also email me the
> accession number for your dataset so that I
> can verify that I have the scripts working,
> that would be most appreciated as well.
>
> Regards,
> Brandon Breitling

Hi Brandon,

It looks like you've mixed up your email addresses.
As it happens I did my PhD on TCS, and used Biopython's Bio.PDB module to get crude distances from a PDB complex (and also looked at MI). I'm not sure if I've read this paper though...

Bryan Lunt, Hendrik Szurmant, Andrea Procaccini, James A. Hoch, Terence Hwa and Martin Weigt
"Chapter Two - Inference of Direct Residue Contacts in Two-Component Signaling".
Methods in Enzymology Volume 471, 2010, Pages 17-41
http://dx.doi.org/10.1016/S0076-6879(10)71002-8

Peter

From lunt at ctbp.ucsd.edu Thu Aug 18 12:46:05 2011
From: lunt at ctbp.ucsd.edu (Bryan Lunt)
Date: Thu, 18 Aug 2011 09:46:05 -0700
Subject: [Biopython] Biopython Digest, Vol 104, Issue 10
In-Reply-To:
References:
Message-ID:

Oh!, Yeah, we used BioPython extensively, but I thought I sent Brandon the code already...

We have a decent module for getting distances from Bio.PDB, though unfortunately it uses far far too much disk space (it outputs a large text file with every residue compared to every other residue, allowing AWK or some other tool to filter the file.)

And a large set of tools for creating putative pairings, mainly for TCS, but of course generalized to pair any set of protein domains...

-Bryan

From p.j.a.cock at googlemail.com Thu Aug 18 15:32:57 2011
From: p.j.a.cock at googlemail.com (Peter)
Date: Thu, 18 Aug 2011 20:32:57 +0100
Subject: [Biopython] Biopython 1.58 released
Message-ID: <75327C54-CF88-43BC-BACF-87139456FE67@googlemail.com>

Dear All,

Biopython 1.58 is out:
http://news.open-bio.org/news/2011/08/biopython-1-58-released/

Thank you to everyone who has contributed.

Peter

P.S.
We're on Twitter as @Biopython

From lueck at ipk-gatersleben.de Tue Aug 2 13:15:40 2011
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Tue, 2 Aug 2011 15:15:40 +0200
Subject: [Biopython] clustalW output format
Message-ID: <000001cc5116$46def910$1022a8c0@ipkgatersleben.de>

Hello!

I'm using ClustalW2 for my alignments and would like to have the aln1 output format (http://www.ebi.ac.uk/Tools/msa/clustalw2/help/). This should show the line numbers. Actually this should be the default but it's not.

I tried to add aln1 in \Bio\Align\Applications\_Clustalw.py in line 100+

Option(["-output", "-OUTPUT", "OUTPUT", "output"], ["input"],
       lambda x: x in ["GCG", "GDE", "PHYLIP", "PIR", "NEXUS","ALN1"
                       "gcg", "gde", "phylip", "pir", "nexus", "aln1"]

but it doesn't work. Any ideas?

Thanks in advance!
Stefanie

From p.j.a.cock at googlemail.com Tue Aug 2 14:54:42 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 2 Aug 2011 15:54:42 +0100
Subject: [Biopython] clustalW output format
In-Reply-To: <000001cc5116$46def910$1022a8c0@ipkgatersleben.de>
References: <000001cc5116$46def910$1022a8c0@ipkgatersleben.de>
Message-ID:

On Tue, Aug 2, 2011 at 2:15 PM, Stefanie Lück wrote:
> Hello!
>
> I'm using ClustalW2 for my alignments and would like to have the aln1 output
> format (http://www.ebi.ac.uk/Tools/msa/clustalw2/help/). This should show
> the line numbers. Actually this should be the default but it's not.

I have version 2.1 installed and the default output format is traditional Clustal output with no residue/base numbers (according to -help). Which version of ClustalW2 are you using?

I'm expecting we'll need a new wrapper for Clustal Omega (I don't know why they didn't call it Clustalw v3). We'll probably also need to update the Clustalw parser to cope with base/residue numbering in the output as well.

Peter

From steven.irvin at monsanto.com Tue Aug 2 15:47:41 2011
From: steven.irvin at monsanto.com (IRVIN, STEVEN (AG-Contractor/1000))
Date: Tue, 2 Aug 2011 15:47:41 +0000
Subject: [Biopython] Bio.Blast.Applications issue with outfmt="quoted string"
Message-ID: <8F46CBF672774F4C8A6B288A246B4468A532BF@stlwexmbxprd02.na.ds.monsanto.com>

Hello,

I am having an issue with the Biopython module making BLAST+ queries.

I am wondering if there is any support in Bio.Blast.Applications for using the multiple arguments to -outfmt allowed by NCBI BLAST+ programs such as blastn.

I need to use this for example:

    blastn_cline = NcbiblastnCommandline(query='somefastafile.fas', db='tomato_cdna.db', evalue=1000, word_size=7, outfmt='10 qseqid sseqid length pident', out='outfile.txt')

The multiple arguments allowed to blastn -outfmt allow the choice of specific columns output to the csv or tab separated file, such as subject_id, etc.
Biopython is returning non-zero exit status 1: USAGE when I run my program with the above statement.

Here is an example command line for BLAST+:

prompt_: blastn -query seq_fasta.fas -db local_db.db -out csv_out.csv -dust no -num_alignments 20 -num_descriptions 20 -evalue 1000 -word_size 7 -task blastn -outfmt "10 qseqid sseqid length pident"

I do not yet know if something in Bio.Blast.Applications needs to be modified to support this.

Steve

Steven D Irvin, MS
Bioinformatics Analyst
CC214-A Monsanto Research Center
Chesterfield Village, MO
Steven.Irvin at monsanto.com
(636) 737-1756

From p.j.a.cock at googlemail.com Tue Aug 2 16:14:25 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 2 Aug 2011 17:14:25 +0100
Subject: [Biopython] Bio.Blast.Applications issue with outfmt="quoted string"
In-Reply-To: <8F46CBF672774F4C8A6B288A246B4468A532BF@stlwexmbxprd02.na.ds.monsanto.com>
References: <8F46CBF672774F4C8A6B288A246B4468A532BF@stlwexmbxprd02.na.ds.monsanto.com>
Message-ID:

On Tue, Aug 2, 2011 at 4:47 PM, IRVIN, STEVEN (AG-Contractor/1000) wrote:
> Hello,
>
> I am having an issue with the Biopython module making BLAST+ queries.
>
> I am wondering if there is any support in Bio.Blast.Applications for using the multiple arguments to -outfmt allowed by NCBI BLAST+ programs such as blastn.
>
> I need to use this for example:
>
>     blastn_cline = NcbiblastnCommandline(query='somefastafile.fas', db='tomato_cdna.db', evalue=1000, word_size=7, outfmt='10 qseqid sseqid length pident', out='outfile.txt')
>
> The multiple arguments allowed to blastn -outfmt allow the choice of specific columns output to the csv or tab separated file, such as subject_id, etc.

Yes, and they are very useful. Try:

blastn_cline = NcbiblastnCommandline(query='somefastafile.fas', db='tomato_cdna.db', evalue=1000, word_size=7, outfmt='"10 qseqid sseqid length pident"', out='outfile.txt')

i.e. Include the extra quotes explicitly. That's single quote, double quote, text, double quote, single quote. (There are other ways to embed double quote characters in a Python string but that works for me.)

Peter

From lueck at ipk-gatersleben.de Wed Aug 3 11:19:50 2011
From: lueck at ipk-gatersleben.de (Stefanie Lück)
Date: Wed, 3 Aug 2011 13:19:50 +0200
Subject: [Biopython] clustalW output format
In-Reply-To:
References: <000001cc5116$46def910$1022a8c0@ipkgatersleben.de>
Message-ID: <000001cc51cf$42d16100$1022a8c0@ipkgatersleben.de>

Thanks Peter!

I'm also using version 2.1.
I didn't check the -help, only the homepage where they say "Default value is: Aln w/numbers [aln1]", which confused me...

Thanks for mentioning Clustal Omega, I didn't know that they changed the names.

Stefanie

-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock at googlemail.com]
Sent: Tuesday, 2 August 2011 16:55
To: Stefanie Lück
Cc: biopython at lists.open-bio.org
Subject: Re: [Biopython] clustalW output format

On Tue, Aug 2, 2011 at 2:15 PM, Stefanie Lück wrote:
> Hello!
>
> I'm using ClustalW2 for my alignments and would like to have the aln1 output
> format (http://www.ebi.ac.uk/Tools/msa/clustalw2/help/). This should show
> the line numbers. Actually this should be the default but it's not.

I have version 2.1 installed and the default output format is traditional Clustal output with no residue/base numbers (according to -help). Which version of ClustalW2 are you using?

I'm expecting we'll need a new wrapper for Clustal Omega (I don't know why they didn't call it Clustalw v3). We'll probably also need to update the Clustalw parser to cope with base/residue numbering in the output as well.

Peter

From jgrant at smith.edu Mon Aug 8 18:08:25 2011
From: jgrant at smith.edu (Jessica Grant)
Date: Mon, 8 Aug 2011 14:08:25 -0400
Subject: [Biopython] deleting in-group paralogs from newick trees
Message-ID: <5D7AD333-66EC-4B23-950E-523E2FBD2A62@smith.edu>

Hello,

I am looking at large phylogenetic trees that have many paralogs. I would like to simplify my trees so that all monophyletic paralog groups are collapsed--or all sequences except the shortest branch are deleted. Is there a Biopython module that can help? I started looking at Phylo, but couldn't see an obvious way.
Thanks,
Jessica

From eric.talevich at gmail.com Mon Aug 8 19:33:52 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 8 Aug 2011 15:33:52 -0400
Subject: [Biopython] deleting in-group paralogs from newick trees
In-Reply-To: <5D7AD333-66EC-4B23-950E-523E2FBD2A62@smith.edu>
References: <5D7AD333-66EC-4B23-950E-523E2FBD2A62@smith.edu>
Message-ID:

On Mon, Aug 8, 2011 at 2:08 PM, Jessica Grant wrote:

> Hello,
>
> I am looking at large phylogenetic trees that have many paralogs. I would
> like to simplify my trees so that all monophyletic paralog groups are
> collapsed--or all sequences except the shortest branch are deleted. Is
> there a Biopython module that can help? I started looking at Phylo, but
> couldn't see an obvious way.
>

Hi Jessica,

Yes, Phylo is the right module to use. If I understand your problem correctly, the tree methods you want are is_monophyletic() and collapse_all(). Both operate on a clade within the tree. You'd traverse the tree with get_nonterminals(), check if a paralog group under a clade is monophyletic, and if so, collapse it.

Do you have a list of paralogs already? And, do you know which groups might be monophyletic? If you have groups/clades already, it's simple:

>>> from Bio import Phylo
>>> tree = Phylo.read('mytree.nwk', 'newick')
>>> for clade in tree.get_nonterminals(order='postorder'):
...     mono_parent = clade.is_monophyletic([SOME_PARALOG_GROUP])
...     if mono_parent:
...         mono_parent.collapse_all()

If you don't know the groups yet, then the test inside the loop is a little more elaborate. You can look for overlaps between a clade's tips and the paralog list using sets:

>>> paralogs = set(PARALOG_LIST)
# Inside the loop:
>>> tips = set([str(t) for t in clade.get_terminals()])
>>> overlap = tips.intersection(paralogs)
>>> if len(overlap) >= 2:
        # The rest of the loop...
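
The set-overlap test can also be tried out without Biopython. The following is only a rough sketch under stated assumptions: the tree is held as nested tuples of tip names instead of Bio.Phylo objects, and the helper names `tips` and `paralog_clades` are hypothetical, not part of any library.

```python
def tips(clade):
    """Return the set of tip names under a clade given as nested tuples."""
    if isinstance(clade, str):
        return {clade}
    names = set()
    for child in clade:
        names |= tips(child)
    return names


def paralog_clades(clade, paralogs):
    """Yield the largest clades (two or more tips) made up only of paralogs."""
    if isinstance(clade, str):
        return  # a single tip is not a group
    names = tips(clade)
    if len(names) >= 2 and names <= paralogs:
        yield clade  # monophyletic paralog group; do not descend further
        return
    for child in clade:
        yield from paralog_clades(child, paralogs)


# Toy tree: A1 and A2 form a clade of paralogs, A3 sits next to C.
tree = ((("A1", "A2"), "B"), ("A3", "C"))
paralogs = {"A1", "A2", "A3"}
groups = list(paralog_clades(tree, paralogs))
print(groups)
```

Only the (A1, A2) clade is reported here, because A3 shares its parent with a non-paralog; each reported clade is a candidate for collapsing.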
Hope that helps,
Eric

From cjfields at illinois.edu Tue Aug 9 20:09:05 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 9 Aug 2011 15:09:05 -0500
Subject: [Biopython] [BioRuby] Interesting BLAST 2.2.25+ XML behaviour
In-Reply-To:
References: <398303E2-1195-4CC2-8B73-09C6C1117892@illinois.edu> <4DC12371.3040204@gmail.com>
Message-ID:

I'm reviving this thread to see what the current status is (if anything has changed). The bioperl parser has the same problem; at the moment we're basically stuck until NCBI gives some indication as to whether this is a bug or not. Any word back from them yet?

(and agreed, it would be nice to have an external bug tracker from NCBI).

chris

On May 4, 2011, at 5:36 AM, Peter Cock wrote:

> On Wed, May 4, 2011 at 10:59 AM, Michal wrote:
>> Hi Peter,
>> Do you have the script which read
>>
>> https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml
>>
>>
>> and what would be the correct output?
>>
>> Thank you in advance.
>>
>> Cheers,
>> Michal
>
> Hi Michal,
>
> I'm not quite sure what you're asking, but I'll try.
First, the three > data files: > > $ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/blastp_four_human_vs_rhodopsin.xml > $ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/test-data/four_human_proteins.fasta > $ wget https://bitbucket.org/galaxy/galaxy-central/src/8eaf07a46623/rhodopsin_proteins.fasta > > The query file has four sequences, > > $ grep -c "^>" four_human_proteins.fasta > 4 > > $ grep "^>" four_human_proteins.fasta >> sp|Q9BS26|ERP44_HUMAN Endoplasmic reticulum resident protein 44 OS=Homo sapiens GN=ERP44 PE=1 SV=1 >> sp|Q9NSY1|BMP2K_HUMAN BMP-2-inducible protein kinase OS=Homo sapiens GN=BMP2K PE=1 SV=2 >> sp|P06213|INSR_HUMAN Insulin receptor OS=Homo sapiens GN=INSR PE=1 SV=4 >> sp|P08100|OPSD_HUMAN Rhodopsin OS=Homo sapiens GN=RHO PE=1 SV=1 > > Based on past experience, I would expect 4 iteration blocks in the > XML, but in this case I have 24: > > $ grep "" -c blastp_four_human_vs_rhodopsin.xml > 24 > > Notice we get 6 iterations for each query (4 times 6 is 24): > > $ grep "" blastp_four_human_vs_rhodopsin.xml > sp|Q9BS26|ERP44_HUMAN > sp|Q9BS26|ERP44_HUMAN > sp|Q9BS26|ERP44_HUMAN > sp|Q9BS26|ERP44_HUMAN > sp|Q9BS26|ERP44_HUMAN > sp|Q9BS26|ERP44_HUMAN > sp|Q9NSY1|BMP2K_HUMAN > sp|Q9NSY1|BMP2K_HUMAN > sp|Q9NSY1|BMP2K_HUMAN > sp|Q9NSY1|BMP2K_HUMAN > sp|Q9NSY1|BMP2K_HUMAN > sp|Q9NSY1|BMP2K_HUMAN > sp|P06213|INSR_HUMAN > sp|P06213|INSR_HUMAN > sp|P06213|INSR_HUMAN > sp|P06213|INSR_HUMAN > sp|P06213|INSR_HUMAN > sp|P06213|INSR_HUMAN > sp|P08100|OPSD_HUMAN > sp|P08100|OPSD_HUMAN > sp|P08100|OPSD_HUMAN > sp|P08100|OPSD_HUMAN > sp|P08100|OPSD_HUMAN > sp|P08100|OPSD_HUMAN > > Now, using the two FASTA files directly and re-running blastp, what do I get? 
> > $ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query > four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5 > | grep "" -c > 24 > > Or again with -parse_deflines, which changes how the hit ID/def is presented: > > $ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query > four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5 > -parse_deflines | grep "" -c > 24 > > How about older versions? > > $ ~/Downloads/ncbi-blast-2.2.24+/bin/blastp -query > four_human_proteins.fasta -subject rhodopsin_proteins.fasta -outfmt 5 > BLAST engine error: XML formatting is only supported for a database search > > I'll have to make a blast database first... > > $ ~/Downloads/ncbi-blast-2.2.24+/bin/makeblastdb -in > rhodopsin_proteins.fasta -dbtype prot > > Building a new DB, current time: 05/04/2011 11:22:57 > New DB name: rhodopsin_proteins.fasta > New DB title: rhodopsin_proteins.fasta > Sequence type: Protein > Keep Linkouts: T > Keep MBits: T > Maximum file size: 1073741824B > Adding sequences from FASTA; added 6 sequences in 0.105655 seconds. > > $ ~/Downloads/ncbi-blast-2.2.25+/bin/blastp -query > four_human_proteins.fasta -db rhodopsin_proteins.fasta -outfmt 5 | > grep "" -c > 4 > > Look - just four identifiers as I expect! This also works if the database > is built with the -parse_seqids switch. > > The same happens with older versions of BLAST+, one > block per query, so four iteration blocks for this example. I tried all > of 2.2.21+, 2.2.22+, 2.2.23+ and 2.2.24+ (running makeblastdb to > give a fresh database, then blastp). > > That seems to demonstrate that bug is specific to the XML output > from FASTA vs FASTA (not FASTA vs DB), which is a new feature > in NCBI BLAST 2.2.25+ > > I will raise this with the NCBI, and report back. 
>
> However, even if the NCBI fix it in the next release, we (Bio*) may
> want to update our parsers to cope with this quirk, or at least put a
> warning in our BLAST XML parser documentation, as there will be
> lots of installations of NCBI BLAST 2.2.25+ in the wild.
>
> Peter

From p.j.a.cock at googlemail.com Wed Aug 10 09:15:18 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 10 Aug 2011 10:15:18 +0100
Subject: [Biopython] [BioRuby] Interesting BLAST 2.2.25+ XML behaviour
In-Reply-To:
References: <398303E2-1195-4CC2-8B73-09C6C1117892@illinois.edu> <4DC12371.3040204@gmail.com>
Message-ID:

On Tue, Aug 9, 2011 at 9:09 PM, Chris Fields wrote:
> I'm reviving this thread to see what the current status is (if anything has
> changed). The bioperl parser has the same problem; at the moment we're
> basically stuck until NCBI gives some indication as to whether this is a
> bug or not. Any word back from them yet?
>
> (and agreed, it would be nice to have an external bug tracker from NCBI).
>
> chris

Hi Chris,

My email to the NCBI on 17 May had a reply from Tao Tao (NCBI User services) saying it would be brought to their developers' attention.

For reference, the email subject line was:
"Multiple iteration blocks per query in FASTA vs FASTA BLAST XML"

I have just emailed back to enquire if there is any news to report.

Peter

From cjfields at illinois.edu Thu Aug 11 01:35:37 2011
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 10 Aug 2011 20:35:37 -0500
Subject: [Biopython] [BioRuby] Interesting BLAST 2.2.25+ XML behaviour
In-Reply-To:
References: <398303E2-1195-4CC2-8B73-09C6C1117892@illinois.edu> <4DC12371.3040204@gmail.com>
Message-ID:

On Aug 10, 2011, at 4:15 AM, Peter Cock wrote:

> On Tue, Aug 9, 2011 at 9:09 PM, Chris Fields wrote:
>> I'm reviving this thread to see what the current status is (if anything has
>> changed). The bioperl parser has the same problem; at the moment we're
>> basically stuck until NCBI gives some indication as to whether this is a
>> bug or not. Any word back from them yet?
>>
>> (and agreed, it would be nice to have an external bug tracker from NCBI).
>>
>> chris
>
> Hi Chris,
>
> My email to the NCBI on 17 May had a reply from Tao Tao (NCBI User
> services) saying it would be brought to their developers' attention.
>
> For reference, the email subject line was:
> "Multiple iteration blocks per query in FASTA vs FASTA BLAST XML"
>
> I have just emailed back to enquire if there is any news to report.
>
> Peter

Wonder if it's worth a second prod from someone else. Sometimes that gets their attention.

chris

From p.j.a.cock at googlemail.com Thu Aug 11 09:09:13 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Aug 2011 10:09:13 +0100
Subject: [Biopython] [BioRuby] Interesting BLAST 2.2.25+ XML behaviour
In-Reply-To:
References: <398303E2-1195-4CC2-8B73-09C6C1117892@illinois.edu> <4DC12371.3040204@gmail.com>
Message-ID:

On Thu, Aug 11, 2011 at 2:35 AM, Chris Fields wrote:
>> Hi Chris,
>>
>> My email to the NCBI on 17 May had a reply from Tao Tao (NCBI User
>> services) saying it would be brought to their developers' attention.
>>
>> For reference, the email subject line was:
>> "Multiple iteration blocks per query in FASTA vs FASTA BLAST XML"
>>
>> I have just emailed back to enquire if there is any news to report.
>>
>> Peter
>
> Wonder if it's worth a second prod from someone else. Sometimes
> that gets their attention.
>
> chris

Tao replied yesterday morning (US time) to confirm the test files so he (she?) could try this on the latest code.
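
On the parser-side workaround raised earlier in this thread (coping with several iteration blocks for one query), a minimal sketch of one possible approach is to group hits by query definition. This is an illustration only, not how the Biopython, BioPerl or BioRuby parsers behave; the sample XML is a heavily trimmed, made-up stand-in for real BLAST output, and `hits_by_query` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import OrderedDict

# Made-up, trimmed BLAST XML: two <Iteration> blocks for query_A
# (as seen with FASTA vs FASTA searches) and one for query_B.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query_A</Iteration_query-def>
      <Iteration_hits><Hit><Hit_id>subject_1</Hit_id></Hit></Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>query_A</Iteration_query-def>
      <Iteration_hits><Hit><Hit_id>subject_2</Hit_id></Hit></Iteration_hits>
    </Iteration>
    <Iteration>
      <Iteration_query-def>query_B</Iteration_query-def>
      <Iteration_hits></Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def hits_by_query(xml_text):
    """Group hit IDs by query, merging repeated <Iteration> blocks."""
    grouped = OrderedDict()
    root = ET.fromstring(xml_text)
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        hit_ids = [hit.findtext("Hit_id") for hit in iteration.iter("Hit")]
        # setdefault keeps queries in first-seen order and also records
        # queries that have no hits at all.
        grouped.setdefault(query, []).extend(hit_ids)
    return grouped


merged = hits_by_query(SAMPLE_XML)
for query, hit_ids in merged.items():
    print(query, hit_ids)
```

Keying on the query definition assumes query names are unique within one run; if they are not, the query ID element would be a safer key.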
Peter

From mok at bioxray.dk Thu Aug 11 12:40:57 2011
From: mok at bioxray.dk (Morten Kjeldgaard)
Date: Thu, 11 Aug 2011 14:40:57 +0200
Subject: [Biopython] Unable to convert alignment to nexus format
Message-ID: <1313066457.10034.11.camel@yeti.daimi.au.dk>

Hi,

I am getting an exception when trying to output an alignment in nexus
format:

    ValueError: Need a DNA, RNA or Protein alphabet

The alignment is read by AlignIO.read() in fasta format from an output
file written by Muscle, and so the alphabet specified in the sequences
is IUPACProtein. Apparently, NexusIO checks for ProteinAlphabet and thus
fails. I am using BioPython 1.56.

Here is a 4-line test program generating the exception:

    from Bio import AlignIO
    alignment = AlignIO.read(open("aln.muscle"), "fasta")
    g = open("aln.nexus", "w")
    g.write(alignment.format("nexus"))

Any (safe) workarounds here?

Cheers,
Morten

--
Morten Kjeldgaard, asc. professor, MSc, PhD
BiRC - Bioinformatics Research Center, Aarhus University
C.F. Møllers Alle, Building 1110, DK-8000 Aarhus C, Denmark.
Lab +45 8942 3130 * Mobile: +45 5186 0147 * Home +45 8618 8180

From chapmanb at 50mail.com Thu Aug 11 13:52:38 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 11 Aug 2011 09:52:38 -0400
Subject: [Biopython] Unable to convert alignment to nexus format
In-Reply-To: <1313066457.10034.11.camel@yeti.daimi.au.dk>
References: <1313066457.10034.11.camel@yeti.daimi.au.dk>
Message-ID: <20110811135238.GF3143@kunkel>

Morten;
> I am getting an exception when trying to output an alignment in nexus
> format:
>
> ValueError: Need a DNA, RNA or Protein alphabet
>
> The alignment is read by AlignIO.read() in fasta format from an output
> file written by Muscle, and so the alphabet specified in the sequences
> is IUPACProtein. Apparently, NexusIO checks for ProteinAlphabet and thus
> fails. I am using BioPython 1.56.

If you specify the alphabet to AlignIO, Nexus will be happy.
Here's your modified test program:

    from Bio import AlignIO
    from Bio.Alphabet import IUPAC, Gapped
    alignment = AlignIO.read(open("aln.muscle"), "fasta",
                             alphabet=Gapped(IUPAC.protein))
    g = open("aln.nexus", "w")
    g.write(alignment.format("nexus"))

Hope this helps,
Brad

From p.j.a.cock at googlemail.com Thu Aug 11 14:04:50 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Aug 2011 15:04:50 +0100
Subject: [Biopython] Unable to convert alignment to nexus format
In-Reply-To: <1313066457.10034.11.camel@yeti.daimi.au.dk>
References: <1313066457.10034.11.camel@yeti.daimi.au.dk>
Message-ID:

On Thu, Aug 11, 2011 at 1:40 PM, Morten Kjeldgaard wrote:
> Hi,
>
> I am getting an exception when trying to output an alignment in nexus
> format:
>
>   ValueError: Need a DNA, RNA or Protein alphabet
>
> The alignment is read by AlignIO.read() in fasta format from an output
> file written by Muscle, and so the alphabet specified in the sequences
> is IUPACProtein.

Yes, but it was written out as a FASTA file, which does not record
the alphabet. Biopython does not try to guess this; you must be
explicit.

> Apparently, NexusIO checks for ProteinAlphabet and thus
> fails. I am using BioPython 1.56.

As Brad described, when you parse the FASTA alignment, tell
Biopython it is protein.

Peter

From aaronquinlan at gmail.com Fri Aug 12 18:39:34 2011
From: aaronquinlan at gmail.com (Aaron Quinlan)
Date: Fri, 12 Aug 2011 14:39:34 -0400
Subject: [Biopython] Working with genomic intervals
Message-ID:

All,

I apologize in advance if this is a naive question.
I am wondering if BioPython provides libraries for
working with genomic intervals in BED, GFF, or
any other like format? I am looking for libraries
that handle the parsing of files in these formats
into Python objects, as well as libraries for
manipulating (intersection, merging, counting,
etc.) intervals. I know this exists in Galaxy's
bx-python, but am wondering if there are similar
libraries in BioPython?

Gratefully,
Aaron

From p.j.a.cock at googlemail.com Sun Aug 14 11:11:37 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 14 Aug 2011 12:11:37 +0100
Subject: [Biopython] Working with genomic intervals
In-Reply-To:
References:
Message-ID:

On Friday, August 12, 2011, Aaron Quinlan wrote:
> All,
>
> I apologize in advance if this is a naive question.
> I am wondering if BioPython provides libraries for
> working with genomic intervals in BED, GFF, or
> any other like format? I am looking for libraries
> that handle the parsing of files in these formats
> into Python objects, as well as libraries for
> manipulating (intersection, merging, counting,
> etc.) intervals. I know this exists in Galaxy's
> bx-python, but am wondering if there are similar
> libraries in BioPython?
>
> Gratefully,
> Aaron

Hi Aaron,

Have a look at http://biopython.org/wiki/GFF_Parsing
where Brad is working on this. He's also spoken
highly of bx-python as I recall.

Peter

From sdavis2 at mail.nih.gov Sun Aug 14 11:48:15 2011
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Sun, 14 Aug 2011 07:48:15 -0400
Subject: [Biopython] Working with genomic intervals
In-Reply-To:
References:
Message-ID:

On Sun, Aug 14, 2011 at 7:11 AM, Peter Cock wrote:
> On Friday, August 12, 2011, Aaron Quinlan wrote:
>> All,
>>
>> I apologize in advance if this is a naive question.
>> I am wondering if BioPython provides libraries for
>> working with genomic intervals in BED, GFF, or
>> any other like format? I am looking for libraries
>> that handle the parsing of files in these formats
>> into Python objects, as well as libraries for
>> manipulating (intersection, merging, counting,
>> etc.) intervals. I know this exists in Galaxy's
>> bx-python, but am wondering if there are similar
>> libraries in BioPython?
>>
>> Gratefully,
>> Aaron
>
> Hi Aaron,
>
> Have a look at http://biopython.org/wiki/GFF_Parsing
> where Brad is working on this. He's also spoken
> highly of bx-python as I recall.

I would second the bx-python vote. Not only are the "normal" interval
classes covered, but there are also some variants (clustering is one
that comes to mind).

Sean

From lgautier at gmail.com Mon Aug 15 06:17:28 2011
From: lgautier at gmail.com (Laurent Gautier)
Date: Mon, 15 Aug 2011 08:17:28 +0200
Subject: [Biopython] Working with genomic intervals
In-Reply-To:
References:
Message-ID: <4E48B9F8.4070605@gmail.com>

On 2011-08-14 18:00, biopython-request at lists.open-bio.org wrote:
> On Sun, Aug 14, 2011 at 7:11 AM, Peter Cock wrote:
>> On Friday, August 12, 2011, Aaron Quinlan wrote:
>>> All,
>>>
>>> I apologize in advance if this is a naive question.
>>> I am wondering if BioPython provides libraries for
>>> working with genomic intervals in BED, GFF, or
>>> any other like format? I am looking for libraries
>>> that handle the parsing of files in these formats
>>> into Python objects, as well as libraries for
>>> manipulating (intersection, merging, counting,
>>> etc.) intervals. I know this exists in Galaxy's
>>> bx-python, but am wondering if there are similar
>>> libraries in BioPython?
>>>
>>> Gratefully,
>>> Aaron
>>
>> Hi Aaron,
>>
>> Have a look at http://biopython.org/wiki/GFF_Parsing
>> where Brad is working on this. He's also spoken
>> highly of bx-python as I recall.
> I would second the bx-python vote. Not only are the "normal" interval
> classes covered, but there are also some variants (clustering is one
> that comes to mind).
>
> Sean

One can also access from Python the utilities for ranges available in
Bioconductor, for example using the Bioconductor extension to rpy2 or rpy2
directly (maybe using dynamic class mapping features, as shown below):

    from rpy2.robjects.packages import importr
    iranges = importr("IRanges")
    # Python class IRanges as an API to Bioconductor's IRanges::IRanges
    from rpy2.robjects.methods import RS4, RS4Auto_Type
    class IRanges(RS4):
        __metaclass__ = RS4Auto_Type
        __rpackagename__ = "IRanges"
        __rname__ = "IRanges"

    # now in action

    >>> from rpy2.robjects.vectors import IntVector
    >>> ir = IRanges(iranges.IRanges(start = IntVector(range(10)), width = 11))
    >>> print(ir)
    IRanges of length 10
         start end width
    [1]      0  10    11
    [2]      1  11    11
    [3]      2  12    11
    [4]      3  13    11
    [5]      4  14    11
    [6]      5  15    11
    [7]      6  16    11
    [8]      7  17    11
    [9]      8  18    11
    [10]     9  19    11
    >>> print(IRanges(ir.reduce__IRanges(ir)))
    IRanges of length 1
        start end width
    [1]     0  19    20

From aaronquinlan at gmail.com Mon Aug 15 23:54:31 2011
From: aaronquinlan at gmail.com (Aaron Quinlan)
Date: Mon, 15 Aug 2011 19:54:31 -0400
Subject: [Biopython] Working with genomic intervals
In-Reply-To: <4E48B9F8.4070605@gmail.com>
References: <4E48B9F8.4070605@gmail.com>
Message-ID: <10D5E6D5-7114-406C-A2CF-8EB211CCE8D2@gmail.com>

Dear Peter, Sean, and Laurent,

Thanks so much for the useful suggestions.

Best,
Aaron

From brandonjbreitling at gmail.com Wed Aug 17 21:44:21 2011
From: brandonjbreitling at gmail.com (Brandon Breitling)
Date: Wed, 17 Aug 2011 21:44:21 +0000 (UTC)
Subject: [Biopython] Question on your Methods in Enzymology paper
References:
Message-ID:

Hi Mr. Lunt,

My name is Brandon Breitling and I'm a statistics
graduate student in the United States. I was
wondering if you had the scripts or code available
from your "Inference of Direct Residue Contacts
in Two-Component Signaling" paper. I'm trying
to see if I can do the same for a eukaryotic
protein pair that my lab studies.

I have created the concatenated strings dataset
for my protein as described in your paper and
have attempted to make scripts for the MI steps
but would really benefit if I could get
them for all the steps in the Direct Coupling
analysis. If you could also email me the
accession number for your dataset so that I
can verify that I have the scripts working,
that would be most appreciated as well.

Regards,
Brandon Breitling

From p.j.a.cock at googlemail.com Wed Aug 17 22:21:44 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 17 Aug 2011 23:21:44 +0100
Subject: [Biopython] Question on your Methods in Enzymology paper
In-Reply-To:
References:
Message-ID:

On Wed, Aug 17, 2011 at 10:44 PM, Brandon Breitling wrote:
>
> Hi Mr. Lunt,
>
> My name is Brandon Breitling and I'm a statistics
> graduate student in the United States. I was
> wondering if you had the scripts or code available
> from your "Inference of Direct Residue Contacts
> in Two-Component Signaling" paper. I'm trying
> to see if I can do the same for a eukaryotic
> protein pair that my lab studies.
>
> I have created the concatenated strings dataset
> for my protein as described in your paper and
> have attempted to make scripts for the MI steps
> but would really benefit if I could get
> them for all the steps in the Direct Coupling
> analysis. If you could also email me the
> accession number for your dataset so that I
> can verify that I have the scripts working,
> that would be most appreciated as well.
>
> Regards,
> Brandon Breitling

Hi Brandon,

It looks like you've mixed up your email addresses.

As it happens I did my PhD on TCS, and used
Biopython's Bio.PDB module to get crude distances
from a PDB complex (and also looked at MI). I'm
not sure if I've read this paper though...

Bryan Lunt, Hendrik Szurmant, Andrea Procaccini,
James A. Hoch, Terence Hwa and Martin Weigt
"Chapter Two - Inference of Direct Residue Contacts
in Two-Component Signaling". Methods in Enzymology
Volume 471, 2010, Pages 17-41
http://dx.doi.org/10.1016/S0076-6879(10)71002-8

Peter

From lunt at ctbp.ucsd.edu Thu Aug 18 16:46:05 2011
From: lunt at ctbp.ucsd.edu (Bryan Lunt)
Date: Thu, 18 Aug 2011 09:46:05 -0700
Subject: [Biopython] Biopython Digest, Vol 104, Issue 10
In-Reply-To:
References:
Message-ID:

Oh! Yeah, we used BioPython extensively, but I thought I sent Brandon
the code already...

We have a decent module for getting distances from Bio.PDB, though
unfortunately it uses far, far too much disk space (it outputs a large
text file with every residue compared to every other residue, allowing
AWK or some other tool to filter the file).

And a large set of tools for creating putative pairings, mainly for
TCS, but of course generalized to pair any set of protein domains...

-Bryan

From p.j.a.cock at googlemail.com Thu Aug 18 19:32:57 2011
From: p.j.a.cock at googlemail.com (Peter)
Date: Thu, 18 Aug 2011 20:32:57 +0100
Subject: [Biopython] Biopython 1.58 released
Message-ID: <75327C54-CF88-43BC-BACF-87139456FE67@googlemail.com>

Dear All,

Biopython 1.58 is out:
http://news.open-bio.org/news/2011/08/biopython-1-58-released/

Thank you to everyone who has contributed.

Peter

P.S. We're on Twitter as @Biopython