From chapmanb at 50mail.com Tue Feb 1 11:03:04 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 1 Feb 2011 11:03:04 -0500
Subject: [Biopython] internal function to convert illumina quality scores to phred
In-Reply-To:
References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu> <97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu>
Message-ID: <20110201160304.GH17835@sobchak.mgh.harvard.edu>

Alan and Peter;
Alan, nice suggestions on conversion from phred. On the barcode sorting side there was just some discussion of this on the development list; I have a script that does barcode sorting and trimming with mismatches using Biopython:

https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py

It does not use qualities, but this might be a framework you could build off to add that support.

Peter, how hard do you think it would be to have SeqIO only convert from the fastq encoding to phred scores on demand? Most of the time when dealing with fastq I do not need any conversion at all and use the FastqGeneralIterator to just pull out the name, sequence and quality.

You've done a lot of nice work with the correct conversions and it would be great to expose that directly through on-demand conversion as Alan is suggesting. Ideally you would use SeqIO as normal with fastq files, but the quality score would not be converted to solexa during parsing until letter_annotations["solexa_quality"] was accessed.

Another option would just be to expose a function so folks could do:

    convert_fastq_illumina_to_quality(illumina_encoded_string)

to get the phred quality scores for a string they were interested in. This way you could use FastqGeneralIterator for no SeqRecord/Seq overhead, but still make use of your conversion work.
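
[Editorial sketch] The helper proposed above could look roughly like the following. This is only an illustration, not Biopython code: the function name is taken from the message, while the constants, the second helper illumina_to_sanger_string, and the choice of Python 3 are assumptions. It assumes the Illumina 1.3+ encoding (PHRED scores at ASCII offset 64); older Solexa files use a different score scale and are not handled here.

```python
# Hypothetical helpers, not part of Biopython. They assume Illumina 1.3+
# FASTQ quality strings, i.e. PHRED scores stored at ASCII offset 64.
ILLUMINA_OFFSET = 64
SANGER_OFFSET = 33
MAX_PHRED = 62  # assumed upper bound so that 64 + 62 stays printable ASCII

# Precomputed table mapping each Illumina quality character to the
# Sanger (offset 33) character for the same PHRED score.
_ILLUMINA_TO_SANGER = str.maketrans(
    "".join(chr(ILLUMINA_OFFSET + q) for q in range(MAX_PHRED + 1)),
    "".join(chr(SANGER_OFFSET + q) for q in range(MAX_PHRED + 1)),
)


def convert_fastq_illumina_to_quality(illumina_encoded_string):
    """Decode an Illumina 1.3+ quality string into a list of PHRED scores."""
    scores = []
    for char in illumina_encoded_string:
        score = ord(char) - ILLUMINA_OFFSET
        if not 0 <= score <= MAX_PHRED:
            raise ValueError("Invalid Illumina quality character: %r" % char)
        scores.append(score)
    return scores


def illumina_to_sanger_string(illumina_encoded_string):
    """Re-encode a quality string without building a list of integers.

    Characters outside the table pass through unchanged, so this does
    no range checking.
    """
    return illumina_encoded_string.translate(_ILLUMINA_TO_SANGER)
```

With helpers like these, the quality string returned by FastqGeneralIterator could be decoded only for the reads that need it.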
Brad

From biopython at maubp.freeserve.co.uk Tue Feb 1 11:16:18 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Feb 2011 16:16:18 +0000
Subject: [Biopython] internal function to convert illumina quality scores to phred
In-Reply-To: <20110201160304.GH17835@sobchak.mgh.harvard.edu>
References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu> <97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu> <20110201160304.GH17835@sobchak.mgh.harvard.edu>
Message-ID:

On Tue, Feb 1, 2011 at 4:03 PM, Brad Chapman wrote:
>
> Alan and Peter;
> Alan, nice suggestions on conversion from phred. On the barcode
> sorting side there was just some discussion of this on the
> development list; I have a script that does barcode sorting
> and trimming with mismatches using Biopython:
>
> https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py
>
> It does not use qualities, but this might be a framework you could
> build off to add that support.
>
> Peter, how hard do you think it would be to have SeqIO only convert
> from the fastq encoding to phred scores on demand? Most of the time
> when dealing with fastq I do not need any conversion at all and use
> the FastqGeneralIterator to just pull out the name, sequence and
> quality.
>
> You've done a lot of nice work with the correct conversions and it
> would be great to expose that directly through on-demand conversion
> as Alan is suggesting. Ideally you would use SeqIO as normal with
> fastq files, but the quality score would not be converted to solexa
> during parsing until letter_annotations["solexa_quality"] was
> accessed.

I actually implemented a proof of concept that does that. In order to not alter the SeqRecord behaviour, it was a new object which acted like a list of integers in many respects. The data is held as a FASTQ encoded string, and decoded (and then cached) on demand only.
On output if it was already in the right encoding the string could be used as is, otherwise the conversion could be done very quickly with a precomputed table and the string translate() method (without having to go via a list of integers). It seemed to work, but I wasn't convinced about the benefits (given the complexity). I'd really want some real world FASTQ benchmarks to try it on... something you might have in the form of your scripts and the real data they were written for?

I'm pretty sure this code is in a local git branch on one of my machines (probably at home), but I don't think I pushed it to github. I should do that...

> Another option would just be to expose a function so folks
> could do:
>
> convert_fastq_illumina_to_quality(illumina_encoded_string)
>
> to get the phred quality scores for a string they were interested
> in. This way you could use FastqGeneralIterator for no
> SeqRecord/Seq overhead, but still make use of your
> conversion work.

Yeah, three or four helper functions for the three decodings would be sensible. It looks like there is demand for it then...

Peter

From jp.verta at gmail.com Tue Feb 1 11:28:57 2011
From: jp.verta at gmail.com (Jukka-Pekka Verta)
Date: Tue, 1 Feb 2011 11:28:57 -0500
Subject: [Biopython] Bio.Emboss.Primer3 -parser and Primer3 2.2.3
Message-ID: <249537ED-8F00-4AAF-B4BF-97BED5D64BB6@gmail.com>

Hello,

I'm having trouble parsing the Bio.Emboss.Primer3Commandline output when using the Whitehead Institute Primer3 version 2.2.3 (and a compatible eprimer3 version). The parser (Bio.Emboss.Primer3) does not write out the reverse primer sequence, i.e. the Primers class member "reverse_seq" is empty. The parser worked fine with the 1.1.4 version of Primer3 and the compatible eprimer3.
The output of the primer3-2.2.3 compatible eprimer3 version looks nearly identical to the old distributions:

# EPRIMER3 RESULTS FOR GQ0197_O05.1

# FORWARD PRIMER STATISTICS:
# considered 5541
# GC content failed 14
# GC clamp failed 573
# low tm 4213
# high any compl 10
# high end compl 40
# long poly-x seq 8
# ok 683

# REVERSE PRIMER STATISTICS:
# considered 5629
# GC content failed 211
# GC clamp failed 564
# low tm 4100
# high end compl 10
# long poly-x seq 12
# ok 732

# PRIMER PAIR STATISTICS:
# considered 6607
# unacceptable product size 6551
# high end compl 20
# ok 36

#                 Start  Len     Tm    GC%  Sequence

FORWARD PRIMER      799   23  63.93  69.57  AGCCACCAGGGGGTGCTCTCCAG
REVERSE PRIMER      979   20  62.11  70.00  TGGCGACTCGGCCCATGCAC

FORWARD PRIMER      799   20  60.18  70.00  AGCCACCAGGGGGTGCTCTC
REVERSE PRIMER      979   20  62.11  70.00  TGGCGACTCGGCCCATGCAC

FORWARD PRIMER      800   22  63.01  72.73  GCCACCAGGGGGTGCTCTCCAG
REVERSE PRIMER      975   23  64.22  69.57  GGCGACTCGGCCCATGCACTGTC

FORWARD PRIMER      799   23  63.93  69.57  AGCCACCAGGGGGTGCTCTCCAG
REVERSE PRIMER      975   23  64.22  69.57  GGCGACTCGGCCCATGCACTGTC

...and that's why I'm kinda lost.

Thank you for your help!

JP Verta

From biopython at maubp.freeserve.co.uk Tue Feb 1 11:41:08 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Feb 2011 16:41:08 +0000
Subject: [Biopython] Bio.Emboss.Primer3 -parser and Primer3 2.2.3
In-Reply-To: <249537ED-8F00-4AAF-B4BF-97BED5D64BB6@gmail.com>
References: <249537ED-8F00-4AAF-B4BF-97BED5D64BB6@gmail.com>
Message-ID:

On Tue, Feb 1, 2011 at 4:28 PM, Jukka-Pekka Verta wrote:
> Hello,
>
> I'm having trouble parsing out the Bio.Emboss.Primer3Commandline ...

Could you file a bug (including version numbers of Biopython, EMBOSS, etc), with a short bit of Python code showing how you parse the file, and then (after filing the bug you can) attach the example Primer3 file to it.
http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython

I could try cut and pasting the example from the email, but white space and line wrapping tends to make a mess of things.

Could you also clarify what happens with the other reverse primer attributes, like reverse_tm?

Thanks,

Peter

From jp.verta at gmail.com Tue Feb 1 13:10:49 2011
From: jp.verta at gmail.com (Jukka-Pekka Verta)
Date: Tue, 1 Feb 2011 13:10:49 -0500
Subject: [Biopython] Bio.Emboss.Primer3 -parser and Primer3 2.2.3
In-Reply-To:
References: <249537ED-8F00-4AAF-B4BF-97BED5D64BB6@gmail.com>
Message-ID:

Thanks, bug 3173.

JP

On 2011-02-01, at 11:41 AM, Peter wrote:
> On Tue, Feb 1, 2011 at 4:28 PM, Jukka-Pekka Verta wrote:
>> Hello,
>>
>> I'm having trouble parsing out the Bio.Emboss.Primer3Commandline ...
>
> Could you file a bug (including version numbers of Biopython, EMBOSS,
> etc), with a short bit of Python code showing how you parse the file, and
> then (after filing the bug you can) attach the example Primer3 file to it.
> http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython
>
> I could try cut and pasting the example from the email, but white space
> and line wrapping tends to make a mess of things.
>
> Could you also clarify what happens with the other reverse primer
> attributes, like reverse_tm?
>
> Thanks,
>
> Peter

From biopython at maubp.freeserve.co.uk Tue Feb 1 15:39:30 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Feb 2011 20:39:30 +0000
Subject: [Biopython] internal function to convert illumina quality scores to phred
In-Reply-To:
References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu> <97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu> <20110201160304.GH17835@sobchak.mgh.harvard.edu>
Message-ID:

On Tue, Feb 1, 2011 at 4:16 PM, Peter wrote:
> On Tue, Feb 1, 2011 at 4:03 PM, Brad Chapman wrote:
>>
>> Peter, how hard do you think it would be to have SeqIO only convert
>> from the fastq encoding to phred scores on demand?
>> Most of the time
>> when dealing with fastq I do not need any conversion at all and use
>> the FastqGeneralIterator to just pull out the name, sequence and
>> quality.
>>
>> You've done a lot of nice work with the correct conversions and it
>> would be great to expose that directly through on-demand conversion
>> as Alan is suggesting. Ideally you would use SeqIO as normal with
>> fastq files, but the quality score would not be converted to solexa
>> during parsing until letter_annotations["solexa_quality"] was
>> accessed.
>
> I actually implemented a proof of concept that does that. In order
> to not alter the SeqRecord behaviour, it was a new object which
> acted like a list of integers in many respects. The data is held
> as a FASTQ encoded string, and decoded (and then cached) on
> demand only. On output if it was already in the right encoding
> the string could be used as is, otherwise the conversion could
> be done very quickly with a precomputed table and the string
> translate() method (without having to go via a list of integers).
> It seemed to work, but I wasn't convinced about the benefits
> (given the complexity). I'd really want some real world FASTQ
> benchmarks to try it on... something you might have in the form
> of your scripts and the real data they were written for?
>
> I'm pretty sure this code is in a local git branch on one of my
> machines (probably at home), but I don't think I pushed it to
> github. I should do that...

Found it and pushed it:
https://github.com/peterjc/biopython/tree/fastq-tricks

Note there are unit test failures (e.g. as currently implemented there is no range checking on the characters in the quality strings at parse time).

We may want to continue this on the dev mailing list...

Peter

From rik at cogsci.ucsd.edu Tue Feb 1 18:02:27 2011
From: rik at cogsci.ucsd.edu (richard k belew)
Date: Tue, 01 Feb 2011 15:02:27 -0800
Subject: [Biopython] Entrez.read(handle) AND handle.read() on the same handle?
Message-ID: <4D489103.5020806@cogsci.ucsd.edu>

there must be a simple way to do this, but i've not figured it out:

i want to sniff at aspects of a record (using the dictionary returned by Entrez.read(handle)) and then cache the XML version (returned by handle.read()), only if it meets certain criteria. and without doing two separate efetch's!

has anyone else bumped into this pattern?

rik

From p.j.a.cock at googlemail.com Tue Feb 1 18:20:07 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 1 Feb 2011 23:20:07 +0000
Subject: [Biopython] Entrez.read(handle) AND handle.read() on the same handle?
In-Reply-To: <4D489103.5020806@cogsci.ucsd.edu>
References: <4D489103.5020806@cogsci.ucsd.edu>
Message-ID:

On Tue, Feb 1, 2011 at 11:02 PM, richard k belew wrote:
> there must be a simple way to do this, but i've
> not figured it out:
>
> i want to sniff at aspects of a record (using the
> dictionary returned by Entrez.read(handle)) and then
> cache the XML version (returned by handle.read()),
> only if it meets certain criteria. and without
> doing two separate efetch's!
>
> has anyone else bumped into this pattern?
>
> rik

The simplest solution is to use StringIO (or cStringIO) and cache it all in memory, then parse it:

    from StringIO import StringIO
    from Bio import Entrez
    # keep the raw XML so it can be cached later if the record qualifies
    raw_data = Entrez.efetch(...).read()
    record = Entrez.read(StringIO(raw_data))

Peter

From brettpthomas at gmail.com Wed Feb 2 10:04:03 2011
From: brettpthomas at gmail.com (Brett Thomas)
Date: Wed, 2 Feb 2011 10:04:03 -0500
Subject: [Biopython] Use biopython to create database of genome intervals?
Message-ID:

Hi all,

I'm looking to create a database of genome variants of varying size: some single base and some not. It needs to provide efficient range queries, such as "get me all genome variants in region X". Has anybody used biopython for something like this?

I think this will require an interval tree, or something like it. Are there any implementations of interval trees in Biopython?
Thanks,
Brett

From chapmanb at 50mail.com Wed Feb 2 10:25:25 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 2 Feb 2011 10:25:25 -0500
Subject: [Biopython] Use biopython to create database of genome intervals?
In-Reply-To:
References:
Message-ID: <20110202152525.GE2151@kunkel>

Brett;

> I'm looking to create a database of genome variants of varying size: some
> single base and some not. It needs to provide efficient range queries, such
> as "get me all genome variants in region X". Has anybody used biopython for
> something like this?
>
> I think this will require an interval tree,

I'd recommend using bx-python, which contains an excellent IntervalTree implementation:

https://bitbucket.org/james_taylor/bx-python/wiki/Home

If you search GitHub there are several scripts you can use as examples to get started:

https://github.com/search?langOverride=&language=python&q=intervaltree+bx&repo=&start_value=1&type=Code&x=0&y=0

But the basic usage is:

    import collections
    from bx.intervals.intersection import IntervalTree

    # build an interval tree
    itree = collections.defaultdict(IntervalTree)
    for chrom, start, end, data_dict in your_intervals:
        itree[chrom].insert(start, end, data_dict)

    # query the tree
    for chrom, start, end in regions_of_interest:
        overlaps = itree[chrom].find(start, end)

Hope this helps,
Brad

From bnbowman at gmail.com Wed Feb 2 18:54:38 2011
From: bnbowman at gmail.com (Brett Bowman)
Date: Wed, 2 Feb 2011 15:54:38 -0800
Subject: [Biopython] Multiple Sequence Alignment Conversion: A2M and A3M from Fasta
Message-ID:

I'm writing a Biopython script to pipeline the following process:
1) Parse Fasta From File
2) Blast it against NCBI and pull down a range of solid hits
3) Align the sequence with Muscle or ClustalW
4) Build a HMM profile of the alignment with HHmake

1-3 I've got down pat, it's step 4 that seems to be the problem.
In particular, HHmake appears to prefer A2M or A3M format alignments, and produces inferior results when fed an Aligned Fasta (*.AFA). Both alignment programs output to Fasta or ClustalW, but not A2M or A3M, and in addition I can't seem to find a definition for either format online anywhere.

So: Does anyone know if there is a way to convert to A2M or A3M with Biopython? They do not appear supported by AlignIO. Otherwise, does anyone know where I could find a definition for the formats online so that I can write my own conversion?

Brett Bowman
Woelk Lab
UCSD Medical School

From biopython at maubp.freeserve.co.uk Wed Feb 2 19:10:02 2011
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 3 Feb 2011 00:10:02 +0000
Subject: [Biopython] Multiple Sequence Alignment Conversion: A2M and A3M from Fasta
In-Reply-To:
References:
Message-ID:

On Wed, Feb 2, 2011 at 11:54 PM, Brett Bowman wrote:
> I'm writing a Biopython script to pipeline the following process:
> 1) Parse Fasta From File
> 2) Blast it against NCBI and pull down a range of solid hits
> 3) Align the sequence with Muscle or ClustalW
> 4) Build a HMM profile of the alignment with HHmake
>
> 1-3 I've got down pat, its step 4 that seems to be the problem.
> In particular, HHmake appears to prefer A2M or A3M format alignments,
> and produces inferior results when fed an Aligned Fasta (*.AFA). Both
> alignment programs output to Fasta or ClustalW, but not A2M or A3M, and in
> addition I can't seem to find a definition for either format online
> anywhere.
>
> So: Does anyone know if there is a way to convert to A2M or A3M
> with Biopython? They do not appear supported by AlignIO. Otherwise, does
> anyone know where I could find a definition for the formats online so that I
> can write my own conversion?

Have you seen this HHmake manual:

ftp://toolkit.lmb.uni-muenchen.de/HHsearch/HHsearch1.5.1/HHsearch-guide.pdf

This describes the A2M and A3M formats, which I had not heard of before.
I suspect these are file formats specific to HHmake. It also says HHmake comes with a perl script reformat.pl which can be used to convert Clustal (or Stockholm format) to A3M - so just use that instead?

Peter

From p.j.a.cock at googlemail.com Thu Feb 3 11:40:31 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 Feb 2011 16:40:31 +0000
Subject: [Biopython] help possible? error while installing
In-Reply-To: <4DDEB591-27B7-4D80-BF64-9CFFBE2206DF@med.umcg.nl>
References: <4DDEB591-27B7-4D80-BF64-9CFFBE2206DF@med.umcg.nl>
Message-ID:

On Thu, Feb 3, 2011 at 4:35 PM, Ruben Mars wrote:
> Hi there,
>
> I'm a biologist trying to get Biopython to work but I am getting the following error while installing and can't find anything about it in the help pages:
>
> mw159059:~ Rubenmars$ python /Users/Rubenmars/Downloads/biopython-1.56/setup.py install
> Traceback (most recent call last):
>   File "/Users/Rubenmars/Downloads/biopython-1.56/setup.py", line 323, in <module>
>     for line in open('Bio/__init__.py'):
> IOError: [Errno 2] No such file or directory: 'Bio/__init__.py'
>
> It would be great if you could help me with this.
>
> I am using terminal from macOSX development tools.
>
> Best wishes,
>
> Ruben
>

Hi Ruben,

You should first change to the directory then run setup:

    cd /Users/Rubenmars/Downloads/biopython-1.56
    python setup.py install

Note you need to send your emails to biopython at lists.open-bio.org (or biopython at biopython.org), not biopython-owner at lists.open-bio.org which just goes to a handful of people who help look after the mailing list.

Peter

From laserson at mit.edu Thu Feb 3 11:49:52 2011
From: laserson at mit.edu (Uri Laserson)
Date: Thu, 3 Feb 2011 11:49:52 -0500
Subject: [Biopython] Output SeqRecord as XML?
Message-ID:

Does anyone have any experience or code that will write SeqRecord objects to XML (and also parse them)?

Uri
...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu

From p.j.a.cock at googlemail.com Thu Feb 3 11:59:18 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 Feb 2011 16:59:18 +0000
Subject: [Biopython] Output SeqRecord as XML?
In-Reply-To:
References:
Message-ID:

On Thu, Feb 3, 2011 at 4:49 PM, Uri Laserson wrote:
> Does anyone have any experience or code that will write SeqRecord objects to
> XML (and also parse them)?
>
> Uri

What kind of XML do you have in mind? INSDC, UniProt, TinySeq, ...

Peter

From p.j.a.cock at googlemail.com Thu Feb 3 13:40:45 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 3 Feb 2011 18:40:45 +0000
Subject: [Biopython] Output SeqRecord as XML?
In-Reply-To:
References:
Message-ID:

On Thu, Feb 3, 2011 at 6:15 PM, Uri Laserson wrote:
> I am not familiar with those particular XML standards.
> Basically, something that is clean/simple and that doesn't have some of the
> arbitrary limitations of GenBank/EMBL format (like length of qualifiers,
> etc.). What would you recommend? (If something is already supported in
> biopython, that would obviously be a big plus.)
> Uri

There is nothing built into Biopython's SeqIO yet beyond the UniProt XML parser, which you are familiar with - right? ;).

I suspect that either that or the INSDC XML format (INSDSeq, an XML version of GenBank/EMBL text files) might be good choices - although they too may have field limitations.

I certainly wouldn't encourage you to invent yet another file format.

Peter

From clementsgalaxy at gmail.com Thu Feb 3 20:01:01 2011
From: clementsgalaxy at gmail.com (Dave Clements)
Date: Thu, 3 Feb 2011 17:01:01 -0800
Subject: [Biopython] Galaxy Community Conference, May 25-26, Lunteren, The Netherlands
Message-ID:

We are pleased to announce the *2011 Galaxy Community Conference*, being held *May 25-26 in Lunteren, The Netherlands*.
The meeting will feature two full days of presentations and discussion on extending Galaxy to use new tools and data sources, deploying Galaxy at your organization, and best practices for using Galaxy to further your own and your community's research.

See http://galaxy.psu.edu/gcc2011/ for complete details.

*About Galaxy:* Galaxy is an open, web-based platform for *accessible, reproducible, and transparent* computational biomedical research.

- *Accessibility:* Galaxy enables users without programming experience to easily specify parameters and run tools and workflows.
- *Reproducibility:* Galaxy captures all information necessary so that any user can repeat and understand a complete computational analysis.
- *Transparency:* Galaxy enables users to share and publish analyses via the web and create Pages--interactive, web-based documents that describe a complete analysis.

Galaxy is open source for all organizations. The public Galaxy service (http://usegalaxy.org) makes analysis tools, genomic data, tutorial demonstrations, persistent workspaces, and publication services available to any scientist that has access to the Internet. Local Galaxy servers can be set up by downloading the Galaxy application and customizing it to meet particular needs.

*Conference Overview:* This event aims to engage a broader community of developers, data producers, tool creators, and core facility and other research hub staff to become an active part of the Galaxy community. We'll cover defining resources in the Galaxy framework, increasing their visibility and making them easier to use and integrate with other resources, how to extend Galaxy to use custom data sources and custom tools, and best practices for using Galaxy in your organization.
Additional topics include, but are not limited to:

* Talks submitted by the Galaxy community
* Integration of tools (including NGS analysis tools) and distributed job management
* Deployment of Galaxy instances on local resources and on the Cloud
* Management of large datasets with the Galaxy Library System
* Using the Galaxy LIMS functionality at NGS sequencing facilities
* Visualizing Data without leaving Galaxy
* Performing reproducible research
* Performing and sharing complex analyses with Workflows
* An "Introduction to Galaxy" session, offered on May 24, for Galaxy newcomers.

*Registration:* The conference fee is €100 on or before April 24, and €120 after that. The meeting is being held at the Conference Centre De Werelt in Lunteren, The Netherlands, which is also the conference hotel. You are encouraged to register early, as space at the hotel (and at the "Intro to Galaxy" session) is limited and is likely to fill up before the conference itself does. See http://galaxy.psu.edu/gcc2011/Register.html

*Abstract Submission:* Abstracts are now being accepted for short oral presentations. Proposals on any topic of interest to the Galaxy community are welcome and encouraged. The abstract submission deadline is the end of February 28. See http://galaxy.psu.edu/gcc2011/Abstracts.html

*Sponsors:* The 2011 Galaxy Community Conference is co-sponsored by the US National Science Foundation (NSF, http://www.nsf.gov/), and the Netherlands Bioinformatics Centre (NBIC, http://www.nbic.nl/). NBIC is a collaborative institute of the bioinformatics groups in the Netherlands. Together, these groups perform cutting-edge research, develop novel tools and support platforms, create an e-science infrastructure and educate the next generations of bioinformaticians.

We are looking forward to a great conference and hope to see you in the Netherlands!
The Galaxy and NBIC Teams

--
http://galaxy.psu.edu/gcc2011/
http://getgalaxy.org
http://usegalaxy.org/

From bnbowman at gmail.com Sat Feb 5 14:10:37 2011
From: bnbowman at gmail.com (Brett Bowman)
Date: Sat, 5 Feb 2011 11:10:37 -0800
Subject: [Biopython] Bad Gateway IOerror when querying Entrez
Message-ID:

I'm running a script that takes a multi-fasta file and runs a local PSI-Blast on each seq, then queries NCBI for the full record of each one of those hits. It works fine most of the time, but occasionally it throws an IOError telling me that there is a "Bad Gateway". Curiously, this causes the script to crash sometimes but not others, and sometimes it gives me a warning of a "J" in the sequence when there are no Js in any of my files, except in the Fasta header.

Does anyone know what causes this or how to stop it? Or do I just need to trap IO exceptions from every web query that I run, with appropriate code to retry failed queries?

Brett Bowman
Woelk Lab
UCSD Medical School

*** WARNING *** Assuming Amino (see -seqtype option), invalid letters found: J
Traceback (most recent call last):
  File "AlignmentFromBlast.py", line 246, in <module>
    seq_data = fetch_seqs_from_entrez(len(gi_nums), webenv, query_key)
  File "AlignmentFromBlast.py", line 125, in fetch_seqs_from_entrez
    webenv=webenv, query_key=query_key)
  File "/usr/lib/pymodules/python2.6/Bio/Entrez/__init__.py", line 109, in efetch
    return _open(cgi, variables)
  File "/usr/lib/pymodules/python2.6/Bio/Entrez/__init__.py", line 353, in _open
    raise IOError("Bad Gateway!")
IOError: Bad Gateway!
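
[Editorial sketch] The "trap IO exceptions and retry" approach raised in the message above might be written along the following lines. This is a generic illustration rather than anything from Bio.Entrez: the helper name retry_on_io_error and its parameters (attempts, delay, backoff, sleep) are hypothetical, and the code is written for Python 3.

```python
import time


def retry_on_io_error(func, attempts=3, delay=5.0, backoff=2.0, sleep=time.sleep):
    """Call func() and return its result, retrying when it raises IOError.

    Hypothetical helper: waits `delay` seconds after the first failure and
    multiplies the wait by `backoff` after each further failure. The last
    failure is re-raised so the caller still sees the original error.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    wait = delay
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except IOError:
            if attempt == attempts:
                raise
            sleep(wait)
            wait *= backoff


# Possible use with the call shown in the traceback (the db value is an
# assumption made for illustration only):
#
#     handle = retry_on_io_error(
#         lambda: Entrez.efetch(db="protein", webenv=webenv, query_key=query_key)
#     )
```

Passing the sleep function as a parameter keeps the helper easy to exercise without real waiting; the pause between attempts also helps a long script stay within the NCBI usage guidelines.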
From p.j.a.cock at googlemail.com Sat Feb 5 16:48:38 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 5 Feb 2011 21:48:38 +0000
Subject: [Biopython] Bad Gateway IOerror when querying Entrez
In-Reply-To:
References:
Message-ID:

On Sat, Feb 5, 2011 at 7:10 PM, Brett Bowman wrote:
> I'm running a script that takes a multi-fasta file and runs a local
> PSI-Blast on each seq, then queries NCBI for the full record of each one of
> those hits. It works fine most of the time, but occasionally it throws an
> IOError telling me that there is a "Bad Gateway". Curiously, this causes
> the script to crash sometimes but not others, and sometimes it gives me a
> warning of a "J" in the sequence when there are no Js in any of my files,
> except in the Fasta header.
>
> Does anyone know what causes this or how to stop it? Or do I just need to
> trap IO exceptions from every web query that I run, with appropriate code to
> retry failed queries?
>
> Brett Bowman

Hi Brett,

I've never had this 'J' problem (is this for PSI-BLAST?)... not sure what is going on there.

However, a "Bad Gateway" or other IOErrors are almost to be expected with a long Entrez script (especially if run during peak times - don't forget to check the NCBI usage guidelines). I recommend you add a try/except block and some sensible retry code.

Peter

From hlapp at drycafe.net Sat Feb 5 18:45:47 2011
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Sat, 5 Feb 2011 18:45:47 -0500
Subject: [Biopython] NESCent Seeks Hackathon Whitepapers
In-Reply-To: <0D7D89E4-C0D4-4347-A94C-21800E927746@ad.unc.edu>
References: <0D7D89E4-C0D4-4347-A94C-21800E927746@ad.unc.edu>
Message-ID: <066A2391-6041-408C-B26E-9B867DE785C7@drycafe.net>

The National Evolutionary Synthesis Center (NESCent), in keeping with its objective to promote collaborative development of open-source, reusable, and standards-supporting informatics resources, sponsors highly collaborative, face-to-face software development events, called "hackathons" (see [1]). To ensure that this program continues to be responsive to user needs and to tap into the expertise and creativity of the evolutionary biology community, NESCent is soliciting short whitepapers (2-6 pages) [2] on potential target areas for future hackathons.

To further encourage submissions, we have now distilled specific guidelines for proposing hackathon events, based on the experiences gained from events we have sponsored in the past:

http://informatics.nescent.org/wiki/Hackathon_Whitepaper_Guidelines

The Center's Call for Informatics Whitepapers [3] includes not only hackathons, but also a large spectrum of other initiatives to be undertaken by the Center, including training, software development, collaborative ontology development, and coordination of data standards. Whitepapers are accepted at any time and reviewed on an ongoing basis.
URLs:

[1] Collaborative cyberinfrastructure events and programs organized by NESCent:
http://informatics.nescent.org/wiki/Main_Page

[2] NESCent Call for Informatics Whitepapers
http://www.nescent.org/informatics/whitepapers.php

[3] Hackathon Whitepaper Guidelines:
http://informatics.nescent.org/wiki/Hackathon_Whitepaper_Guidelines

[4] Past NESCent-sponsored hackathons:
http://informatics.nescent.org/wiki/Main_Page#Hackathons

From dimitrakopoul at gmail.com Sun Feb 6 15:37:01 2011
From: dimitrakopoul at gmail.com (chris dimitrakopoulos)
Date: Sun, 6 Feb 2011 22:37:01 +0200
Subject: [Biopython] Feature selection techniques modules
Message-ID:

Hello everyone,

I am an MSc student at the University of Patras, Greece, in the research field of Bioinformatics. I recently became a member of the OBF and I appreciate the open source work of your OBF project.

I had a discussion with Mr. Robert Buels about this year's GSoC, because I am looking forward to making an application and I found that OBF would be the organization most suitable for me. Generally, I was looking through the projects announced in previous years and I found them very interesting. As this year's potential projects have not been announced yet, I wanted to express to you an idea of mine, say briefly what I am thinking of doing, and ask you if you think it is a good idea and whether it is worth making an application on this subject after March 28.

Well, I think that feature selection techniques have become a very important issue in many bioinformatics implementations. In many cases (like protein interaction prediction), you have to find a way to collect the best set of features that leads to the best classification performance. I looked in the Biopython libraries and I didn't find anything related to applying FS techniques to a dataset of features (like t-test, ANOVA, Wilcoxon, CFS etc...). Hence, I think that the creation of a library focused on FS techniques would be a good idea.
Moreover, that library can have a hierarchical structure as there are different types of FS techniques, like filter, wrapper and embedded techniques. Furthermore, each type is divided into more groups (e.g. filter methods are divided into univariate and multivariate methods, according to whether feature dependencies are considered) etc...

Only some of the methods I am thinking of implementing are:

T-test, ANOVA, Gamma, bivariate methods, CFS, MRMR, which are some known filter feature selection techniques.
In wrapper and embedded methods, the classifiers are used in the process of feature selection, so we have techniques based on Genetic algorithms, Random forests, logistic regression, Decision Tree Learners, Bayesian Classifiers, etc. In this case, the existing Biopython modules Bio.LogisticRegression, Bio.GA and Bio.NaiveBayes could be used.

More information on the techniques I describe can be found at the following links:

http://bioinformatics.oxfordjournals.org/content/23/19/2507.full.pdf+html

http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=3570EDE4C7E11AAE7CA5F727800DC58A?doi=10.1.1.37.4643&rep=rep1&type=pdf

New functions computing the above measures can be created. The calculation can be done between vectors of features, between a feature vector and the output vector, or even on large datasets (with many features) read from a file, to which we want to apply feature selection.

I am sending you this email in order to briefly express my idea. Please let me know what you think about it and whether it is worth proposing as one of my student applications in GSoC 2011 to the Open Bioinformatics Foundation. If you want me to tell you any further details about my thinking just ask me!
:-) Look forward to hearing from you, Chris Dim
From sdavis2 at mail.nih.gov Sun Feb 6 16:35:09 2011 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Sun, 6 Feb 2011 16:35:09 -0500 Subject: [Biopython] Feature selection techniques modules In-Reply-To: References: Message-ID: On Sun, Feb 6, 2011 at 3:37 PM, chris dimitrakopoulos < dimitrakopoul at gmail.com> wrote: > Only some of the methods i am thinking of implementing are: > > T-test, ANOVA, Gamma, bivariate methods, CFS, MRMR which are some known filter feature selection techniques. > In wrapper and embedded methods, the classifiers are been used in the process of feature selection, so we have techniques based on Genetic algorithms, Random forests, logistic regression, Decision Tree Learners, Bayesian Classifiers, etc.. In this case, the existing Biopython modules Bio.LogisticRegression, Bio.GA and Bio.NaiveBayes could be used. > Hi, Chris. You might want to look at the Rpy project. All of the above machine learning and feature selection algorithms (and many more) are implemented in R and can be wrapped fairly easily in python using Rpy. Sean
From p.j.a.cock at googlemail.com Sun Feb 6 17:05:49 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 6 Feb 2011 22:05:49 +0000 Subject: [Biopython] Feature selection techniques modules In-Reply-To: References: Message-ID: Hello Chris, This sounds interesting - provided we can find some suitable mentors it could turn into a Google Summer of Code project.
Something you could start with (now or as one of the first tasks if you write up a GSoC proposal) could be to understand the existing code in Biopython in this area (Bio.LogisticRegression, Bio.GA, Bio.NaiveBayes etc) and perhaps write extra documentation for them (they are not covered in the tutorial at all), and perhaps some more unit tests too. One thing I would suggest checking is how much of the statistical code you mention is already written in other Python libraries (e.g. SciPy). For something as complicated as statistical testing there is no point reimplementing it. Tiago has previously said there are statistics routines in SciPy he may want to use in his Biopython code for population genetics. So, check out SciPy: http://scipy.org/ Regards, Peter
From bnbowman at gmail.com Mon Feb 7 17:30:19 2011 From: bnbowman at gmail.com (Brett Bowman) Date: Mon, 7 Feb 2011 14:30:19 -0800 Subject: [Biopython] Pulling Alignment From PSI-Blast Output Message-ID: I'm trying to use the PSI-Blast results from a series of proteins to detect distant homologues, using HMMs of various sorts. Currently I'm pulling down the sequence IDs with PSI-Blast, downloading the full sequences from NCBI, then aligning everything with ClustalW or Muscle. However this is eating up way more processor time than I have to spare, so I want to just pull the full multi-sequence alignment from the PSI-blast results if possible (OUTFMT option #3 or 4), for use in building the HMMs. But it doesn't look like AlignIO has a module for reading the peculiar format that PSI-Blast generates... Has this been done before, or will I need to write my own parser?
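For readers unfamiliar with the query-anchored layout that those output options present (as far as I understand, formats 3 and 4 are the flat query-anchored views), the sketch below shows the general idea of stacking pairwise hits on the query's coordinates. It is not a parser for PSI-BLAST output and not Biopython API: the function name project_onto_query and the tuple layout are hypothetical, and the sequences are made up. Residues inserted relative to the query are simply dropped, so the result is only an approximation of a true multiple alignment.

```python
def project_onto_query(query_len, hsps):
    """Place each pairwise hit on the query's coordinate system.

    query_len -- length of the ungapped query sequence
    hsps      -- iterable of (name, query_start, aligned_query, aligned_subject)
                 where query_start is 1-based and both aligned strings have
                 the same length, using '-' for gaps
    Returns a list of (name, row) pairs, each row query_len characters long.
    """
    rows = []
    for name, q_start, q_aln, s_aln in hsps:
        if len(q_aln) != len(s_aln):
            raise ValueError("aligned strings differ in length for %s" % name)
        row = ["-"] * query_len      # positions outside the hit stay as gaps
        pos = q_start - 1            # convert to a 0-based query index
        for q_char, s_char in zip(q_aln, s_aln):
            if q_char == "-":
                continue             # insertion relative to the query: dropped
            row[pos] = s_char        # match, mismatch, or '-' for a deletion
            pos += 1
        rows.append((name, "".join(row)))
    return rows


# Made-up example: a 10-residue query and two pairwise hits.
query = "MKTAYIAKQR"
hsps = [
    ("hit1", 2, "KTA-YIA", "KSAGYLA"),   # hit1 has an extra G after the A
    ("hit2", 5, "YIAKQR", "YI-KQR"),     # hit2 lacks the query's A
]
stacked = project_onto_query(len(query), hsps)
for name, row in [("query", query)] + stacked:
    print("%-6s %s" % (name, row))
```

Because every row is forced onto the query's columns, two hits that share an insertion the query lacks are never aligned to each other, which is one reason a stack of pairwise alignments can be a poor substitute for a real multiple alignment.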
Brett Bowman Woelk Lab UCSD School of Medicine
From mjldehoon at yahoo.com Mon Feb 7 20:20:09 2011 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 7 Feb 2011 17:20:09 -0800 (PST) Subject: [Biopython] Pulling Alignment From PSI-Blast Output In-Reply-To: Message-ID: <216797.39164.qm@web161211.mail.bf1.yahoo.com> One option you could try is to let PSI-Blast generate its output in XML and check if the information you need is present in the XML. If it is, you can parse the XML with the read() function in Bio.Entrez. You may find that Bio.Entrez needs an additional DTD file to be able to parse the PSI-Blast XML output (Bio.Entrez will tell you which one and where to store it). If so, please let us know, so we can include the required DTDs in the next release of Biopython. --Michiel.
From bnbowman at gmail.com Tue Feb 8 02:40:10 2011 From: bnbowman at gmail.com (Brett Bowman) Date: Mon, 7 Feb 2011 23:40:10 -0800 Subject: [Biopython] Pulling Alignment From PSI-Blast Output In-Reply-To: <216797.39164.qm@web161211.mail.bf1.yahoo.com> References: <216797.39164.qm@web161211.mail.bf1.yahoo.com> Message-ID: I thought about that, but there doesn't appear to be any multiple-alignment data in the XML file - just pair-wise alignments of the query with each hit. In addition, when I parse the output file with NCBIXML I get a Bio.Blast.Record.Blast object, instead of a Bio.Blast.Record.PSIBlast object. The Biopython cookbook describes how to work with a PSIBlast object, but it doesn't really cover how to make one...
From bnbowman at gmail.com Tue Feb 8 02:44:08 2011 From: bnbowman at gmail.com (Brett Bowman) Date: Mon, 7 Feb 2011 23:44:08 -0800 Subject: [Biopython] Pulling Alignment From PSI-Blast Output In-Reply-To: References:

Message-ID: I had never heard of jackHMMER until now, so I'll look into it. However the outline of my current project is essentially "This paper used X to find Y, so I want you to use X to find a homologue of Y in this other background", so I'm not sure how much wiggle room I have to change the methodologies used. The source paper used HHmake to create HMMs from the output of PSI-Blast, so I am trying to do the same if possible. -Brett On Mon, Feb 7, 2011 at 9:41 PM, Ruchira Datta wrote: > If you're using HMMs anyway, why not use jackhmmer? It's been shown to be more sensitive than PSI-BLAST at the same number of iterations, and with an option it will output the alignment. > > Note that its alignment is in Stockholm format though, and if you want something else, BioPython's Stockholm parsing is very slow. > > --Ruchira
From ruchira.datta at gmail.com Tue Feb 8 04:22:20 2011 From: ruchira.datta at gmail.com (Ruchira Datta) Date: Tue, 8 Feb 2011 01:22:20 -0800 Subject: [Biopython] Pulling Alignment From PSI-Blast Output In-Reply-To: References:

Message-ID: Oh, I see what you mean. Yes, I've had issues with HHmake on HMMER HMMs, so I'd stick with your current protocol and let it make HMMs the way it wants to. --Ruchira
From mjldehoon at yahoo.com Tue Feb 8 07:05:22 2011 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 8 Feb 2011 04:05:22 -0800 (PST) Subject: [Biopython] Pulling Alignment From PSI-Blast Output In-Reply-To: Message-ID: <172275.39440.qm@web161202.mail.bf1.yahoo.com> I am surprised that the multiple alignment is not in the XML at all. It can not be constructed from the information in the XML? Anyway, if it is in there, I would suggest to use Bio.Entrez to parse the XML instead of the parser in Bio.Blast. The Bio.Entrez parser will give you all the information in the XML; the parser in Bio.Blast is more polished but may not give you all the information present in the PSI-Blast output. --Michiel.
From bnbowman at gmail.com Tue Feb 8 12:48:37 2011 From: bnbowman at gmail.com (Brett Bowman) Date: Tue, 8 Feb 2011 09:48:37 -0800 Subject: [Biopython] Pulling Alignment From PSI-Blast Output In-Reply-To: <172275.39440.qm@web161202.mail.bf1.yahoo.com> References: <172275.39440.qm@web161202.mail.bf1.yahoo.com> Message-ID: Sadly no - I tried lining up the output sequence alignments, but the result is meaningless because they are all just aligned pair-wise to the query. I'm wondering if maybe I just need to go back and use BlastPGP somehow?
I know they cut out a lot of features to make the PSIblast standalone executable. Though why they would remove things from the output makes no sense to me... So I guess that goes back to my previous question - if parsing the PSIBlast XML output only gives me a Bio.Blast.Record.Blast object, then where do the Bio.Blast.Record.PSIBlast objects, which are supposed to have that alignment built in, come from? -Brett
From biopython at maubp.freeserve.co.uk Tue Feb 8 12:56:54 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 8 Feb 2011 17:56:54 +0000 Subject: [Biopython] Pulling Alignment From PSI-Blast Output In-Reply-To: References: <172275.39440.qm@web161202.mail.bf1.yahoo.com> Message-ID: On Tue, Feb 8, 2011 at 5:48 PM, Brett Bowman wrote: > Sadly no - I tried lining up the output sequence alignments, but the result is meaningless because they are all just aligned pair-wise to the query. I'm wondering if maybe I just need to go back and use BlastPGP somehow? I know they cut out a lot of features to make the PSIblast standalone executable. Though why they would remove things from the output makes no sense to me... > > So I guess that goes back to my previous question - if parsing the PSIBlast XML output only gives me a Bio.Blast.Record.Blast object, then where do the Bio.Blast.Record.PSIBlast objects, which are supposed to have that alignment built in, come from? Currently you'd have to use the old text based PSI-BLAST parser to get Bio.Blast.Record.PSIBlast objects. The XML parser doesn't (yet) do anything special for PSI-BLAST. But in any case the PSI-BLAST XML just has the pairwise alignments (right?). Peter
From ruchira.datta at gmail.com Tue Feb 8 12:59:28 2011 From: ruchira.datta at gmail.com (Ruchira Datta) Date: Tue, 8 Feb 2011 09:59:28 -0800 Subject: [Biopython] Pulling Alignment From PSI-Blast Output In-Reply-To: <172275.39440.qm@web161202.mail.bf1.yahoo.com> References: <172275.39440.qm@web161202.mail.bf1.yahoo.com> Message-ID: On Tue, Feb 8, 2011 at 4:05 AM, Michiel de Hoon wrote: > I am surprised that the multiple alignment is not in the XML at all.
It can > not be constructed from the information in the XML? It might be there implicitly in HSPs (high-scoring segment pairs). Trying to make a multiple sequence alignment out of these is a huge pain. Brett, might you not run all of HHmake (which includes running PSI-BLAST)? It extracts the multiple sequence alignment. (It might "clean up" after itself by deleting it -- you might need to go in and take that line out.) Then, if you wanted to do more stuff to the MSA before running HHmake, you could just discard the rest of the results of the first run and run HHmake again at the point that you want. --Ruchira
From bnbowman at gmail.com Tue Feb 8 15:28:45 2011 From: bnbowman at gmail.com (Brett Bowman) Date: Tue, 8 Feb 2011 12:28:45 -0800 Subject: [Biopython] Pulling Alignment From PSI-Blast Output In-Reply-To: References: <172275.39440.qm@web161202.mail.bf1.yahoo.com> Message-ID: > It might be there implicitly in HSPs (high-scoring segment pairs). Trying to make a multiple sequence alignment out of these is a huge pain. > It isn't, I checked - the alignments are only pairwise, so combining them produces nonsense. That's why I have been using Muscle and ClustalW to create an alignment after the Blast, despite the large computational cost. > Brett, might you not run all of HHmake (which includes running PSI-BLAST)? It extracts the multiple sequence alignment. > HHmake does not include PSI-Blast - all HHmake does is convert a multiple sequence alignment to an HMM profile. The original paper that inspired this project used PSI-Blast and then used HHmake, but makes no mention of alignments. I have sent the authors an e-mail for clarification, but have not received a reply. But from the way the paper was written, it does not appear that they used an alignment program, hence my confusion at my inability to do the same. -Brett
From ruchira.datta at gmail.com Tue Feb 8 15:54:34 2011 From: ruchira.datta at gmail.com (Ruchira Datta) Date: Tue, 8 Feb 2011 12:54:34 -0800 Subject: [Biopython] Pulling Alignment From PSI-Blast Output In-Reply-To: References: <172275.39440.qm@web161202.mail.bf1.yahoo.com>

Message-ID:

Hi Brett,

I have run HHmake numerous times myself and the overall script does include
running PSI-BLAST. I suspect they didn't include it in the paper because
it's not part of their algorithm which they invented.

From the hh_1.5.1 version, from the script buildali.pl, here are some
relevant lines:

my $blastpgp=$ncbidir."/blastpgp -I T -s T"; # blastpgp executable

our $blastpgp = $blastpgp . " -I T -s T"; # show gi's in defline; use Smith-Waterman

# Fast psiblast search for very similar sequences
&System("$blastpgp -I T -b 10 -v 10 -e 1e-6 -A 10 -f 15 -d $dbfile -i $tmp.seq &> $tmp.bla");

if ($nseqin<=1) {
    &blastpgp("-e $Eult -d $db -i $seqfile","$blafile");
} else {
    if ($core==1) {
        &blastpgp("-e $Eult -d $db -i $seqfile","$blafile");
        &System("perl $perl/alignhits.pl -cov $covcore -e $E2 $qid $bcore $pmaxopt -best -psi -q $seqfile $blafile $coreali",$v2);
        if ($bcore ne "") {$bcore.=" -B $coreali";} # From here on use $coreali for end pruning of $coreali
    }
    &blastpgp("-e $Eult -d $db -i $seqfile -B $psifile","$blafile");
}

It's possible you could *only* use their script alignhits.pl:

#! /usr/bin/perl -w
# Extract a multiple alignment of hits from Blast or PsiBlast output
# Usage: alignhits.pl [options] blast.out alignment-file

Very sorry for subjecting you to Perl, but it does seem like it might be
helpful in your case. :-)

--Ruchira

From bnbowman at gmail.com Tue Feb 8 17:46:02 2011
From: bnbowman at gmail.com (Brett Bowman)
Date: Tue, 8 Feb 2011 14:46:02 -0800
Subject: [Biopython] Pulling Alignment From PSI-Blast Output
In-Reply-To:
References: <172275.39440.qm@web161202.mail.bf1.yahoo.com>
Message-ID:

I stand corrected. I couldn't get the buildali script to work, and it
wasn't mentioned in the paper that I am trying to emulate, so I tried to
build something in Python instead. But it appears to confirm my suspicions
that the PSI-Blast standalone doesn't have the output that I want, as they
use the less efficient blastpgp algorithm and trap the output, so I think I
will give that a try...

-Brett

From gebauer-jung at ice.mpg.de Wed Feb 9 04:23:50 2011
From: gebauer-jung at ice.mpg.de (St. Gebauer-Jung)
Date: Wed, 09 Feb 2011 10:23:50 +0100
Subject: [Biopython] Pulling Alignment From PSI-Blast Output
Message-ID: <4D525D26.5080209@ice.mpg.de>

Hello,

only recently I stumbled upon CSI-Blast, which runs on top of PSI-Blast but
claims to be more sensitive and to need fewer iterations. It has an option
to output the multiple alignment of good hits.

Steffi

From oriolebaltimore at gmail.com Wed Feb 9 16:42:08 2011
From: oriolebaltimore at gmail.com (Adrian Johnson)
Date: Wed, 9 Feb 2011 16:42:08 -0500
Subject: [Biopython] SeattleSeq type library
Message-ID:

Dear group,

Is there a library in biopython that takes a pileup file as an input
and processes the file and gives back an output with annotations
with gene name, NM accession, mutation type and type of change
(splice site, coding etc) and many other details related
to protein and gene sequence.

I use SeattleSeq Annotation and the limitation is that it is a web based tool
and limits the ability to use it in a pipeline. I am interested in similar tools
that can be used through a command line parameters.

The ideal situation would be a BioPython module that uses SOAP interface
to either NCBI or UCSC.

Thanks

From goyo1987 at gmail.com Wed Feb 9 17:07:42 2011
From: goyo1987 at gmail.com (Gregorio Manuel Iraola Bentancor)
Date: Wed, 9 Feb 2011 20:07:42 -0200
Subject: [Biopython] NcbiblastxCommandline error
In-Reply-To:
References:
Message-ID:

Hello, my name is Gregorio from Uruguay. I am trying to perform some local
blast in my machine, but I got this error:

Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio.Blast.Applications import NcbiblastxCommandline
>>> blastx_cline = NcbiblastxCommandline(cmd='blastx',db='K10909.fasta', query='query.fasta',out='salida.xml')
>>> blastx_cline()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'NcbiblastxCommandline' object is not callable

I do not know what is happening, because the same code runs well in other
computers. I would really appreciate any advice or solution. Thank you.

--
Lic. Gregorio Manuel Iraola
Grupo Genética de Microorganismos
Sección Genética Evolutiva
Facultad de Ciencias, Montevideo, Uruguay
Iguá 4225 esq. Mataojo Piso 5 ala Sur
+59825258618 int. 141
+59899394982

From eric.talevich at gmail.com Wed Feb 9 21:02:21 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 9 Feb 2011 21:02:21 -0500
Subject: [Biopython] NcbiblastxCommandline error
In-Reply-To:
References:
Message-ID:

On Wed, Feb 9, 2011 at 5:07 PM, Gregorio Manuel Iraola Bentancor <goyo1987 at gmail.com> wrote:

> Hello, my name is Gregorio from Uruguay. I am trying to perform some local
> blast in my machine, but I got this error:
>
> Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
> [GCC 4.4.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from Bio.Blast.Applications import NcbiblastxCommandline
> >>> blastx_cline = NcbiblastxCommandline(cmd='blastx',db='K10909.fasta',
> query='query.fasta',out='salida.xml')
> >>> blastx_cline()
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: 'NcbiblastxCommandline' object is not callable
>
> I do not know what is happening, because the same code runs well in other
> computers. I would really appreciate any advice or solution.

Hi Gregorio,

If the same code has worked on other machines, then you probably have
different versions of Biopython installed on these machines. The
"Commandline" objects became callable in Biopython version 1.55, but in
earlier versions we would first serialize the command line as a string, then
launch it as a subprocess.

Can you update the Biopython installation on your local computer? If not,
I'd recommend converting the NcbiblastxCommandline to a string --
str(blastx_cline) -- then using os.system() or subprocess.call() to execute
the command (the string).

Most of the subprocess examples have been dropped from the current tutorial,
but you can draw some inspiration from the Muscle section (look at the second
example, which uses subprocess.Popen):
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc77

Best wishes,
Eric

From yvan.strahm at uni.no Thu Feb 10 04:11:49 2011
From: yvan.strahm at uni.no (Yvan)
Date: Thu, 10 Feb 2011 10:11:49 +0100
Subject: [Biopython] NcbiblastxCommandline error
In-Reply-To:
References:
Message-ID: <4D53ABD5.1030309@uni.no>

On 10/02/11 03:02, Eric Talevich wrote:
> On Wed, Feb 9, 2011 at 5:07 PM, Gregorio Manuel Iraola Bentancor
> <goyo1987 at gmail.com> wrote:
>
>> Hello, my name is Gregorio from Uruguay. I am trying to perform some local
>> blast in my machine, but I got this error:
>>
>> Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
>> [GCC 4.4.3] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> from Bio.Blast.Applications import NcbiblastxCommandline
>> >>> blastx_cline = NcbiblastxCommandline(cmd='blastx',db='K10909.fasta',
>> query='query.fasta',out='salida.xml')
>> >>> blastx_cline()
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> TypeError: 'NcbiblastxCommandline' object is not callable
>>
>> I do not know what is happening, because the same code runs well in other
>> computers. I would really appreciate any advice or solution.
>>
> Hi Gregorio,
>
> If the same code has worked on other machines, then you probably have
> different versions of Biopython installed on these machines. The
> "Commandline" objects became callable in Biopython version 1.55, but in
> earlier versions we would first serialize the command line as a string, then
> launch it as a subprocess.
>
> Can you update the Biopython installation on your local computer? If not,
> I'd recommend converting the NcbiblastxCommandline to a string --
> str(blastx_cline) -- then using os.system() or subprocess.call() to execute
> the command (the string).
>
> Most of the subprocess examples have been dropped from the current tutorial,
> but you can draw some inspiration for the Muscle section (look at the second
> example, which uses subprocess.Popen):
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc77
>
>
> Best wishes,
> Eric
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

Hello,

I am not sure I am remembering well but it seems I had this error message
already. It was a blast version problem. Did you check it?

my 2 cents
cheers,
yvan

From p.j.a.cock at googlemail.com Thu Feb 10 05:21:27 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Feb 2011 10:21:27 +0000
Subject: [Biopython] NcbiblastxCommandline error
In-Reply-To:
References:
Message-ID:

On Thu, Feb 10, 2011 at 2:02 AM, Eric Talevich wrote:
> On Wed, Feb 9, 2011 at 5:07 PM, Gregorio Manuel Iraola Bentancor
> <goyo1987 at gmail.com> wrote:
>
>> Hello, my name is Gregorio from Uruguay. I am trying to perform some local
>> blast in my machine, but I got this error:
>>
>> Python 2.6.5 (r265:79063, Apr 16 2010, 13:57:41)
>> [GCC 4.4.3] on linux2
>> Type "help", "copyright", "credits" or "license" for more information.
>> >>> from Bio.Blast.Applications import NcbiblastxCommandline
>> >>> blastx_cline = NcbiblastxCommandline(cmd='blastx',db='K10909.fasta',
>> query='query.fasta',out='salida.xml')
>> >>> blastx_cline()
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>> TypeError: 'NcbiblastxCommandline' object is not callable
>>
>> I do not know what is happening, because the same code runs well in other
>> computers. I would really appreciate any advice or solution.
>>
>
> Hi Gregorio,
>
> If the same code has worked on other machines, then you probably have
> different versions of Biopython installed on these machines. The
> "Commandline" objects became callable in Biopython version 1.55, but in
> earlier versions we would first serialize the command line as a string, then
> launch it as a subprocess.

I agree with Eric, that is the most likely explanation. You can check the
version of Biopython installed with:

import Bio
print Bio.__version__

> Can you update the Biopython installation on your local computer? If not,
> I'd recommend converting the NcbiblastxCommandline to a string --
> str(blastx_cline) -- then using os.system() or subprocess.call() to execute
> the command (the string).
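
A rough sketch of the fallback quoted above for Biopython releases older than 1.55 (convert the command line to a string, then run it as a subprocess). The helper name run_command_line is made up for illustration and is not part of Biopython, and a harmless Python one-liner stands in for the real blastx command so that the snippet runs without BLAST installed. With Biopython, the argument would be the NcbiblastxCommandline object itself, since str() on it gives the command line.

```python
import shlex
import subprocess
import sys


def run_command_line(cline):
    """Run a command given as a string, or as an object whose str() is the command.

    Hypothetical helper for illustration only; it is not a Biopython function.
    """
    # Split into an argument list so that no shell is involved.
    args = shlex.split(str(cline))
    proc = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # return text instead of bytes
    )
    stdout, stderr = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError("command failed (%d): %s" % (proc.returncode, stderr))
    return stdout, stderr


# Stand-in for str(blastx_cline); in real use, pass the command line object.
demo_cline = "%s -c \"print('ok')\"" % shlex.quote(sys.executable)
stdout, stderr = run_command_line(demo_cline)
print(stdout.strip())
```

Raising on a non-zero return code is a design choice of this sketch; os.system() or subprocess.call() as suggested above would simply hand back the exit status.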
>
> Most of the subprocess examples have been dropped from the current tutorial,
> but you can draw some inspiration for the Muscle section (look at the second
> example, which uses subprocess.Popen):
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc77

Alternatively look at the tutorial shipped with Biopython 1.54, which
will be in the zip or tar balls for the release under the Doc folder:

http://biopython.org/DIST/biopython-1.54.tar.gz
http://biopython.org/DIST/biopython-1.54.zip

Peter

From p.j.a.cock at googlemail.com Thu Feb 10 05:25:27 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 10 Feb 2011 10:25:27 +0000
Subject: [Biopython] SeattleSeq type library
In-Reply-To:
References:
Message-ID:

On Wed, Feb 9, 2011 at 9:42 PM, Adrian Johnson wrote:
> Dear group,
>
> Is there a library in biopython that takes a pileup file as an input
> and processes the file and gives back an output with annotations
> with gene name, NM accession, mutation type and type of change
> (splice site, coding etc) and many other details related
> to protein and gene sequence.

No, nor a parser for the related mpileup format. Would you like
to write something?

>
> I use SeattleSeq Annotation and the limitation is that it is a web based tool
> and limits the ability to use it in a pipeline. I am interested in similar tools
> that can be used through a command line parameters.
>

Is working with GFF3 files an option? Brad has some GFF code we hope
to integrate into Biopython shortly (available now as an extra install).

>
> The ideal situation would be a BioPython module that uses SOAP interface
> to either NCBI or UCSC.
>

Which SOAP interface? There are lots of them... How about the NCBI
Entrez REST API? See our Bio.Entrez module.

Peter

From p.j.a.cock at googlemail.com Fri Feb 11 04:18:08 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 11 Feb 2011 09:18:08 +0000
Subject: [Biopython] NcbiblastxCommandline error
In-Reply-To:
References:
Message-ID:

On Fri, Feb 11, 2011 at 2:01 AM, Gregorio Manuel Iraola Bentancor wrote:
> Thank you Peter, I solved the problem updating my Biopython version to 1.55.
> Regards,
> Gregorio.

Great :)

Peter

P.S. Please try to keep the mailing list included in replies.

From jordan.r.willis at Vanderbilt.Edu Sat Feb 12 20:52:42 2011
From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R)
Date: Sat, 12 Feb 2011 19:52:42 -0600
Subject: [Biopython] PDBParser Class --> Output
Message-ID: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu>

Hello,

I am loving the PDB tools that are available with Biopython. I am getting a
weird output I can't figure out when I parse a PDB. My code looks something
like this.

parser = PDBParser()
for file in open("list_of_pds.txt").readlines():
    struct = parser.get_structure("random", file.rstrip())

My script does what I want, except I get some standard output that looks
like this for every single atom:

.....Thousands of lines
G --> ?
CD --> ?
CE --> ?
NZ --> ?
H --> ?
HA --> ?
1HB --> ?
2HB --> ?
1HG --> ?
2HG --> ?
1HD --> ?
2HD --> ?
1HE --> ?
2HE --> ?
1HZ --> ?
2HZ --> ?
3HZ --> ?
N --> ?
CA --> ?
C --> ?
O --> ?
CB --> ?
CG --> ?
CD --> ?
NE --> ?
CZ --> ?
NH1 --> ?
NH2 --> ?
H --> ?
HA --> ?
1HB --> ?
2HB --> ?
1HG --> ?
2HG --> ?
1HD --> ?
2HD --> ?
HE --> ?
1HH1 --> ?
2HH1 --> ?
1HH2 --> ?
2HH2 --> ?
N --> ?
CA --> ?
C --> ?
O --> ?
OXT --> ?
CB --> ?
OG1 --> ?
CG2 --> ?
H --> ?
HA --> ?
HB --> ?
HG1 --> ?
1HG2 --> ?
2HG2 --> ?
.......More lines

I wouldn't mind so much except i have to parse thousands of pdbs and the
unnecessary printing adds to the computation time. Any ideas how to turn
this off?
Jordan From p.j.a.cock at googlemail.com Sun Feb 13 06:40:26 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 13 Feb 2011 11:40:26 +0000 Subject: [Biopython] PDBParser Class --> Output In-Reply-To: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu> References: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu> Message-ID: On Sun, Feb 13, 2011 at 1:52 AM, Willis, Jordan R wrote: > Hello, > > I am loving the PDB tools that are available with Biopython. I am > getting a weird output I can't figure out when I parse a PDB. My > code looks something like this. > > .....Thousands of lines > G --> ? > CD --> ? > CE --> ? > .......More lines > > I wouldn't mind so much except i have to parse thousands of pdbs and > the unnecessary printing adds to the computation time. > Any ideas how to turn this off? That looks like it is failing to assign the element to those atoms. Do your PDB files have the element column filled in? Which version of Biopython do you have as there have been a few changes to this bit of code recently... could you try the latest code from github? http://www.biopython.org/wiki/SourceCode Peter From jordan.r.willis at Vanderbilt.Edu Sun Feb 13 12:39:22 2011 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Sun, 13 Feb 2011 11:39:22 -0600 Subject: [Biopython] PDBParser Class --> Output In-Reply-To: References: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu> Message-ID: <8C3CE2AE-0C15-4E2F-9060-5C94BCCE3CB1@Vanderbilt.Edu> 1.56, As it turns out, our molecular modeling suite does not output an element field. I went into Bio/PDB/Atom.py and commented out: if element is None : import warnings from PDBExceptions import PDBConstructionWarning warnings.warn("Atom object (name=%s) without element" % name, PDBConstructionWarning) element = "?" print name, "--> ?" 
elif len(element)>2 or element != element.upper() or element != element.strip(): But it is still taking the time on checking the error even though its not printing it. I wonder if you can just turn it off the error checking completely. Jordan On Feb 13, 2011, at 5:40 AM, Peter Cock wrote: > On Sun, Feb 13, 2011 at 1:52 AM, Willis, Jordan R > wrote: >> Hello, >> >> I am loving the PDB tools that are available with Biopython. I am >> getting a weird output I can't figure out when I parse a PDB. My >> code looks something like this. >> >> .....Thousands of lines >> G --> ? >> CD --> ? >> CE --> ? >> .......More lines >> >> I wouldn't mind so much except i have to parse thousands of pdbs and >> the unnecessary printing adds to the computation time. >> Any ideas how to turn this off? > > That looks like it is failing to assign the element to those atoms. Do > your PDB files have the element column filled in? Which version of > Biopython do you have as there have been a few changes to this > bit of code recently... could you try the latest code from github? > http://www.biopython.org/wiki/SourceCode > > Peter From p.j.a.cock at googlemail.com Sun Feb 13 12:54:44 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 13 Feb 2011 17:54:44 +0000 Subject: [Biopython] PDBParser Class --> Output In-Reply-To: <8C3CE2AE-0C15-4E2F-9060-5C94BCCE3CB1@Vanderbilt.Edu> References: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu> <8C3CE2AE-0C15-4E2F-9060-5C94BCCE3CB1@Vanderbilt.Edu> Message-ID: On Sun, Feb 13, 2011 at 5:39 PM, Willis, Jordan R wrote: > 1.56, > > As it turns out, our molecular modeling suite does not output an > element field. Could you ask the authors to populate this field? > I went into Bio/PDB/Atom.py and commented out: > > ?if element is None : > ? ? ? ? ? ?import warnings > ? ? ? ? ? ?from PDBExceptions import PDBConstructionWarning > ? ? ? ? ? ?warnings.warn("Atom object (name=%s) without element" % name, > ? ? ? ? ? ? ? ? ? ? ? ? 
PDBConstructionWarning)
>             element = "?"
>             print name, "--> ?"
>         elif len(element)>2 or element != element.upper() or element != element.strip():
>
> But it is still taking the time on checking the error even though its
> not printing it. I wonder if you can just turn it off the error checking
> completely.

Commenting it out for now should be harmless - but as I said,
this bit of code has changed a bit since Biopython 1.56, so I'd
be interested to hear how the current code in github works for
you.

Peter

From anaryin at gmail.com Sun Feb 13 17:43:21 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Sun, 13 Feb 2011 22:43:21 +0000
Subject: [Biopython] PDBParser Class --> Output
In-Reply-To:
References: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu> <8C3CE2AE-0C15-4E2F-9060-5C94BCCE3CB1@Vanderbilt.Edu>
Message-ID:

Hello,

That was indeed a dumb thing to leave there...My apologies.. However, as Peter pointed out, it has been corrected (and improved) in later versions so if you download the Git version that printing will be gone and actually the elements of the atoms should be automatically guessed!

Best!

João

From jordan.r.willis at Vanderbilt.Edu Sun Feb 13 18:50:23 2011
From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R)
Date: Sun, 13 Feb 2011 17:50:23 -0600
Subject: [Biopython] PDBParser Class --> Output
In-Reply-To:
References: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu> <8C3CE2AE-0C15-4E2F-9060-5C94BCCE3CB1@Vanderbilt.Edu>
Message-ID:

Hello,

Now I am getting this warning which automatically prints,

PDBConstructionWarning: Used element 'C' for Atom (name=CE1) with given element ''
  warnings.warn(msg, PDBConstructionWarning)

I think this is what you meant by guessing the elements, but it is still putting this to std.out. Any suggestions?

Thanks,
Jordan

On Feb 13, 2011, at 4:43 PM, João Rodrigues wrote:

> Hello,
>
> That was indeed a dumb thing to leave there...My apologies..
However, as Peter pointed out, it has been corrected (and improved) in later versions so if you download the Git version that printing will be gone and actually the elements of the atoms should be automatically guessed!
>
> Best!
>
> João

From edvin.fuglebakk at gmail.com Mon Feb 14 03:29:13 2011
From: edvin.fuglebakk at gmail.com (Edvin Fuglebakk)
Date: Mon, 14 Feb 2011 09:29:13 +0100
Subject: [Biopython] PDBParser Class --> Output
In-Reply-To:
References: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu> <8C3CE2AE-0C15-4E2F-9060-5C94BCCE3CB1@Vanderbilt.Edu>
Message-ID: <95E27938-F262-4F25-AF29-FBE387DB8782@gmail.com>

On 13. feb. 2011, at 18.54, Peter Cock wrote:

> On Sun, Feb 13, 2011 at 5:39 PM, Willis, Jordan R
> wrote:
>> 1.56,
>>
>> As it turns out, our molecular modeling suite does not output an
>> element field.
>
> Could you ask the authors to populate this field?

I would just like to comment on this remark that prior to pdb 2.0 there were no
slots for element symbols in the ATOM / HETATM records. So it is probably
not uncommon for old software to not populate this field. I have encountered
missing elements myself from time to time.

cheers
-Edvin

>
>> I went into Bio/PDB/Atom.py and commented out:
>>
>> if element is None :
>>     import warnings
>>     from PDBExceptions import PDBConstructionWarning
>>     warnings.warn("Atom object (name=%s) without element" % name,
>>                   PDBConstructionWarning)
>>     element = "?"
>>     print name, "--> ?"
>> elif len(element)>2 or element != element.upper() or element != element.strip():
>>
>> But it is still taking the time on checking the error even though its
>> not printing it. I wonder if you can just turn it off the error checking
>> completely.
>
> Commenting it out for now should be harmless - but as I said,
> this bit of code has changed a bit since Biopython 1.56, so I'd
> be interested to hear how the current code in github works for
> you.
> > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Mon Feb 14 07:02:04 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 14 Feb 2011 12:02:04 +0000 Subject: [Biopython] PDBParser Class --> Output In-Reply-To: References: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu> <8C3CE2AE-0C15-4E2F-9060-5C94BCCE3CB1@Vanderbilt.Edu>

Message-ID:

On Sun, Feb 13, 2011 at 11:50 PM, Willis, Jordan R wrote:
> Hello,
>
> Now I am getting this warning which automatically prints,
>
> PDBConstructionWarning: Used element 'C' for Atom (name=CE1) with given element ''
>   warnings.warn(msg, PDBConstructionWarning)
>
> I think this is what you meant by guessing the elements, but it is still putting this to std.out. Any suggestions?
>
> Thanks,
> Jordan

That should be going to stderr not stdout IIRC. You can use the
Python warnings module to ignore these (i.e. ignore any
PDBConstructionWarning).

http://docs.python.org/library/warnings.html

Peter

From p.j.a.cock at googlemail.com Mon Feb 14 07:04:55 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 14 Feb 2011 12:04:55 +0000
Subject: [Biopython] PDBParser Class --> Output
In-Reply-To: <95E27938-F262-4F25-AF29-FBE387DB8782@gmail.com>
References: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu> <8C3CE2AE-0C15-4E2F-9060-5C94BCCE3CB1@Vanderbilt.Edu> <95E27938-F262-4F25-AF29-FBE387DB8782@gmail.com>
Message-ID:

On Mon, Feb 14, 2011 at 8:29 AM, Edvin Fuglebakk wrote:
>
> On 13. feb. 2011, at 18.54, Peter Cock wrote:
>
>> On Sun, Feb 13, 2011 at 5:39 PM, Willis, Jordan R
>> wrote:
>>> 1.56,
>>>
>>> As it turns out, our molecular modeling suite does not output an
>>> element field.
>>
>> Could you ask the authors to populate this field?
>
> I would just like to comment on this remark that prior to pdb 2.0 there were no
> slots for element symbols in the ATOM / HETATM records. So it is probably
> not uncommon for old software to not populate this field. I have encountered
> missing elements myself from time to time.
>
> cheers
> -Edvin

Hmm - thinking out loud, maybe the warning could be silenced when the PDB
parser is used in permissive mode? Does that sound possible João?
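
For readers of the archive, the warnings-module suggestion made in this exchange can be sketched roughly as follows. This is only an illustration under stated assumptions: the `PDBConstructionWarning` class here is a local stand-in so that the snippet runs without Biopython installed (real code would import the class from `Bio.PDB.PDBExceptions` instead), and `parse_quietly` and `noisy_parse` are hypothetical helper names, not part of Biopython.

```python
import warnings


class PDBConstructionWarning(UserWarning):
    """Stand-in for Bio.PDB.PDBExceptions.PDBConstructionWarning (assumption:
    used here only so the sketch runs without Biopython installed)."""


def parse_quietly(parse_func, *args):
    """Call parse_func(*args) while ignoring PDBConstructionWarning only.

    Other warning categories are left untouched, and the previous filter
    settings are restored when the with-block exits.
    """
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", PDBConstructionWarning)
        return parse_func(*args)


def noisy_parse(atom_name):
    """Hypothetical parser that emits the kind of warning seen in this thread."""
    warnings.warn(
        "Used element 'C' for Atom (name=%s) with given element ''" % atom_name,
        PDBConstructionWarning,
    )
    return atom_name


# The warning is suppressed inside parse_quietly, the return value is kept.
print(parse_quietly(noisy_parse, "CE1"))
```

With Biopython itself, the same filter would presumably be placed around the call that parses the structure; a process-wide `warnings.simplefilter("ignore", PDBConstructionWarning)` at the top of a batch script is the blunter alternative.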
Peter

From anaryin at gmail.com Mon Feb 14 18:01:17 2011
From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=)
Date: Tue, 15 Feb 2011 00:01:17 +0100
Subject: [Biopython] PDBParser Class --> Output
In-Reply-To:
References: <748D99AD-22C5-4FAA-9DD6-926516EDE6CD@vanderbilt.edu> <8C3CE2AE-0C15-4E2F-9060-5C94BCCE3CB1@Vanderbilt.Edu> <95E27938-F262-4F25-AF29-FBE387DB8782@gmail.com>
Message-ID:

Yup, sounds sensible, i'll patch it tomorrow!

Meanwhile, as Peter pointed out, with the warnings module it should be pretty simple to silence these messages.

Cheers!

João [...] Rodrigues
http://doeidoei.wordpress.com

On Mon, Feb 14, 2011 at 1:04 PM, Peter Cock wrote:

> On Mon, Feb 14, 2011 at 8:29 AM, Edvin Fuglebakk
> wrote:
> >
> > On 13. feb. 2011, at 18.54, Peter Cock wrote:
> >
> >> On Sun, Feb 13, 2011 at 5:39 PM, Willis, Jordan R
> >> wrote:
> >>> 1.56,
> >>>
> >>> As it turns out, our molecular modeling suite does not output an
> >>> element field.
> >>
> >> Could you ask the authors to populate this field?
> >
> > I would just like to comment on this remark that prior to pdb 2.0 there
> were no
> > slots for element symbols in the ATOM / HETATM records. So it is probably
> > not uncommon for old software to not populate this field. I have
> encountered
> > missing elements myself from time to time.
> >
> > cheers
> > -Edvin
>
> Hmm - thinking out loud, maybe the warning could be silenced when the PDB
> parser is used in permissive mode? Does that sound possible João?
> > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From sharanya.raghunath at utah.edu Thu Feb 17 15:59:28 2011 From: sharanya.raghunath at utah.edu (SHARANYA RAGHUNATH) Date: Thu, 17 Feb 2011 13:59:28 -0700 Subject: [Biopython] Retieving CEL Files from GEO Message-ID: <33C50A4964FC7847A7CE1CEAF0E6615E5916827A8B@C8V1.xds.umail.utah.edu> Hello, I understand that there is a way to parse through the GEO data repository in Biopython; however, is there a way to download the actual CEL files or the SOFT files using a GSE or GDS id? Thank You, Sharanya From sdavis2 at mail.nih.gov Thu Feb 17 17:51:36 2011 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 17 Feb 2011 17:51:36 -0500 Subject: [Biopython] Retieving CEL Files from GEO In-Reply-To: <33C50A4964FC7847A7CE1CEAF0E6615E5916827A8B@C8V1.xds.umail.utah.edu> References: <33C50A4964FC7847A7CE1CEAF0E6615E5916827A8B@C8V1.xds.umail.utah.edu> Message-ID: On Thu, Feb 17, 2011 at 3:59 PM, SHARANYA RAGHUNATH < sharanya.raghunath at utah.edu> wrote: > Hello, > > I understand that there is a way to parse through the GEO data repository > in Biopython; however, is there a way to download the actual CEL files or > the SOFT files using a GSE or GDS id? > > Hi, Sharanya. You might take a look at both the GEOquery and GEOmetadb Bioconductor packages. I know it is not a biopython solution, but I wanted to make sure that you were aware of them. However, where the raw data are available (and they are not always available), you can simply download them from the NCBI ftp site: ftp://ftp.ncbi.nih.gov/pub/geo/DATA/ Sean Disclaimer: I may be slightly biased about the relevance of the above software packages.... 
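
As a rough illustration of the FTP route mentioned above, the listing step could be done from Python with the standard library alone. This is a sketch under assumptions: the address is the one quoted in the message and may have moved since 2011, the layout of the directories below it is not described in this thread (so nothing here guesses at file names), and `split_ftp_url` and `list_geo_directory` are hypothetical helper names.

```python
from ftplib import FTP
from urllib.parse import urlparse

# Address as quoted in the message above; it may no longer be current.
GEO_FTP_URL = "ftp://ftp.ncbi.nih.gov/pub/geo/DATA/"


def split_ftp_url(url):
    """Split an ftp:// URL into (host, path)."""
    parts = urlparse(url)
    return parts.hostname, parts.path


def list_geo_directory(url=GEO_FTP_URL):
    """Return the names found in the given FTP directory.

    Logs in anonymously; needs network access, so it is not called here.
    """
    host, path = split_ftp_url(url)
    with FTP(host) as ftp:
        ftp.login()      # anonymous login
        ftp.cwd(path)    # change into the directory
        return ftp.nlst()


print(split_ftp_url(GEO_FTP_URL))
```

Browsing the listing first and then fetching individual files with `FTP.retrbinary` avoids hard-coding any assumption about how the series and sample files are organised.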
From lgautier at gmail.com Fri Feb 18 12:13:37 2011 From: lgautier at gmail.com (Laurent) Date: Fri, 18 Feb 2011 18:13:37 +0100 Subject: [Biopython] Retieving CEL Files from GEO In-Reply-To: References: Message-ID: <4D5EA8C1.10107@gmail.com> > > On Thu, Feb 17, 2011 at 3:59 PM, SHARANYA RAGHUNATH< > sharanya.raghunath at utah.edu> wrote: > >> Hello, >> >> I understand that there is a way to parse through the GEO data repository >> in Biopython; however, is there a way to download the actual CEL files or >> the SOFT files using a GSE or GDS id? >> >> > Hi, Sharanya. > > You might take a look at both the GEOquery and GEOmetadb Bioconductor > packages. I know it is not a biopython solution, but I wanted to make sure > that you were aware of them. However, where the raw data are available (and > they are not always available), you can simply download them from the NCBI > ftp site: > > ftp://ftp.ncbi.nih.gov/pub/geo/DATA/ > > Sean > > Disclaimer: I may be slightly biased about the relevance of the above > software packages.... > A similar question on this list received a similar answer ( http://lists.open-bio.org/pipermail/biopython-dev/2010-March/007503.html ), and despite the disclosed bias in the answer it was spot on (or so I heard). If you have a way to call R from Python, you can immediately use libraries available in R. Disclaimer: The above answer tries to avoid biases regarding the relevance of a specific software package. ;-) From clementsgalaxy at gmail.com Tue Feb 22 12:16:12 2011 From: clementsgalaxy at gmail.com (Dave Clements) Date: Tue, 22 Feb 2011 09:16:12 -0800 Subject: [Biopython] Galaxy Community Conference, May 25-26, Lunteren, The Netherlands In-Reply-To: References: Message-ID: Hello all, Just a reminder that the abstract submission deadline for the Galaxy Community Conference is next Monday, February 28. See http://galaxy.psu.edu/gcc2011/Abstracts.html for details. Cheers, Dave C. 
On Thu, Feb 3, 2011 at 5:01 PM, Dave Clements wrote: > We are pleased to announce the *2011 Galaxy Community Conference*, being > held *May 25-26 in Lunteren, The Netherlands*. The meeting will feature > two full days of presentations and discussion on extending Galaxy to use new > tools and data sources, deploying Galaxy at your organization, and best > practices for using Galaxy to further your own and your community's > research. See http://galaxy.psu.edu/gcc2011/* for complete details. > * > *About Galaxy: > *Galaxy is an open, web-based platform for *accessible, reproducible, and > transparent* computational biomedical research. > > - *Accessibility:* Galaxy enables users without programming experience > to easily specify parameters and run tools and workflows. > - *Reproducibility:* Galaxy captures all information necessary so that > any user can repeat and understand a complete computational analysis. > - *Transparency:* Galaxy enables users to share and publish analyses > via the web and create Pages--interactive, web-based documents that describe > a complete analysis. > > Galaxy is open source for all organizations. The public Galaxy service ( > http://usegalaxy.org) makes analysis tools, genomic data, > tutorial demonstrations, persistent workspaces, and publication services > available to any scientist that has access to the Internet. Local > Galaxy servers can be set up by downloading the Galaxy application and > customizing it to meet particular needs. > > *Conference Overview: > * > This event aims to engage a broader community of developers, data > producers, tool creators, and core facility and other research hub staff to > become an active part of the Galaxy community. We'll cover defining > resources in the Galaxy framework, increasing their visibility and making > them easier to use and integrate with other resources, how to extend Galaxy > to use custom data sources and custom tools, and best practices for using > Galaxy in your organization. 
>
> Additional topics include, but are not limited to:
> * Talks submitted by the Galaxy community
> * Integration of tools (including NGS analysis tools) and distributed job
> management
> * Deployment of Galaxy instances on local resources and on the Cloud
> * Management of large datasets with the Galaxy Library System
> * Using the Galaxy LIMS functionality at NGS sequencing facilities
> * Visualizing Data without leaving Galaxy
> * Performing reproducible research
> * Performing and sharing complex analyses with Workflows
> * An "Introduction to Galaxy" session, offered on May 24, for Galaxy
> newcomers.
>
> *Registration:*
> The conference fee is €100 on or before April 24, and €120 after that. The
> meeting is being held at the Conference Centre De Werelt in Lunteren, The
> Netherlands, which is also the conference hotel. You are encouraged to
> register early, as space at the hotel (and at the "Intro to Galaxy" session)
> is limited and is likely to fill up before the conference itself does. See
> http://galaxy.psu.edu/gcc2011/Register.html
>
> *Abstract Submission:*
> Abstracts are now being accepted for short oral presentations. Proposals
> on any topic of interest to the Galaxy community are welcome and
> encouraged. The abstract submission deadline is the end of February 28.
> See http://galaxy.psu.edu/gcc2011/Abstracts.html
>
> *Sponsors*
> The 2011 Galaxy Community Conference is co-sponsored by the US National
> Science Foundation (NSF, http://www.nsf.gov/), and the Netherlands
> Bioinformatics Centre (NBIC, http://www.nbic.nl/). NBIC is a
> collaborative institute of the bioinformatics groups in the Netherlands.
> Together, these groups perform cutting-edge research, develop novel tools
> and support platforms, create an e-science infrastructure and educate the
> next generations of bioinformaticians.
>
> We are looking forward to a great conference and hope to see you in the
> Netherlands!
> > The Galaxy and NBIC Teams > > -- > http://galaxy.psu.edu/gcc2011/ > http://getgalaxy.org > http://usegalaxy.org/ > -- http://galaxy.psu.edu/gcc2011/ http://getgalaxy.org http://usegalaxy.org/ From rojan at riken.jp Wed Feb 23 01:42:02 2011 From: rojan at riken.jp (Rojan Shrestha) Date: Wed, 23 Feb 2011 15:42:02 +0900 Subject: [Biopython] Biopython library for muliple sequence alignment Message-ID: <001501cbd324$c70a8570$551f9050$@jp> Hello: I want to do multiple sequence alignment using CLUSTW. Instead of standalone, I would like to use in my own program through biopython. I would like to know that whether biopython has clustw function or not. It would be very good if somebody gives information about this. Regards, Rojan From p.j.a.cock at googlemail.com Wed Feb 23 04:24:21 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 23 Feb 2011 09:24:21 +0000 Subject: [Biopython] Biopython library for muliple sequence alignment In-Reply-To: <001501cbd324$c70a8570$551f9050$@jp> References: <001501cbd324$c70a8570$551f9050$@jp> Message-ID: On Wed, Feb 23, 2011 at 6:42 AM, Rojan Shrestha wrote: > Hello: > > I want to do multiple sequence alignment using CLUSTW. Instead of > standalone, I would like to use in my own program through biopython. I would > like to know that whether biopython has clustw function or not. It would be > very good if somebody ?gives information about this. > > Regards, > > Rojan Hello Rojan, Biopython (and BioPerl too I believe) doesn't have any multiple sequence alignment code itself. Biopython does has pairwise sequence alignment code (with a fast implementation in C). Instead (again, like BioPerl) Biopython has a wrapper and parser for calling the ClustalW command line tool from within your script and loading its output. Similarly for other alignment tools like Muscle. If you really want to be able modify the multiple sequence alignment code itself, some of these command line tools are open source. 
Also, I *think* that BioJava has some code for this. I don't know what BioRuby does. Peter P.S. You only really need to ask this on the Biopython Discussion List. Since you included the OBF cross project list I have tried to comment on how the other projects handle this as well. From bnbowman at gmail.com Thu Feb 24 12:41:50 2011 From: bnbowman at gmail.com (Brett Bowman) Date: Thu, 24 Feb 2011 09:41:50 -0800 Subject: [Biopython] Strange Gaps when writing Multi-Fasta Message-ID: I'm trying to write my own script to parse multiple alignments from the new standalone PSI-Blast output, but when I try to write the results to a file, I get really odd results. I'm storing the sequence data in a dictionary, using the sequence ID as the key and the sequence itself as the value. However, when I try to write these results to file, one of the sequences comes out with an extra blank line in the middle, such as you see in sample_seq_1. This is despite the fact that if I print the sequence as a string, it comes out a single, contiguous line as you can see in sample_seq_2. I've tried both creating an array of seqs and writing them to file with SeqIO, as well as writing them out myself with a for loop such as: for i in range(0,length,60): print seq[i:i+60] and still I get the same problem. But what's weird is that, though which sequence shows the extra line is consistent with each method, they are different between methods. I.e., using SeqIO always throws an extra line into SeqA, but the for-loop above always throws an extra line into SeqB. What makes this doubly weird is that when I copy the Seq in question by hand to another file, the extra line disappears as if it was never there! Its almost as if that extra line is a glitch, or non-standard whitespace character or something. This has me completely stumped. Does anyone have any ideas as to why this is happening, or how to fix it? 
-Brett Bowman Senior Research Associate Cibus US LLC -------------- next part -------------- >YP_002749131 ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ ------------------------------------------------------------ -----------------------------------------GAGHVIQCLKKLGVTTVFG YPGGAILPVYDALY-E-SG----L----KHILTRHEQAAIHAAEGYARASGKVGVVFATS GPGATNLVTGLADAYMDSIPLVVITGQVATPLIGKDGFQEADVVGITVPVTKHNYQVRDV NQLSRIVQEAFYIAESGRPGPVLIDIPKD-V---Q----IE---K---V---T----S-- --F------Y--N---EV--I---E--I--P---G--Y------K---IED---MP---- D-S-M---K-L-KEV---AK---EISKAKRPLLY--IGGG--V--I--H----SGG--S- -D--E----LIKFAREHRI--PVVSTLMGLGAYPPG----------D-S-LFLGMLGMHG TYAANMAVTECNLLLALGVRFDDRVTGKLELFSPQS-K-KV-HIDIDSSEFHKNVTVEYP VVGDVK-NA----L----H---M-L---L------H-MPI-----D-T------------ -Q----T-----D----E----W---L---T----K----I----E---G-----WKEEY --------PLSY-N--QK-E---R-E-LKPQHVI-SLV-SE-L---T-N-G--E----AI -VTTEVGQHQMWAAHFYKAKNPRTFLTSGGLGTMGFGFPAAIGAQLA------KEEQLVI CIAGDASFQMNIQELQTVAENNIPVKVFIINNKFLGMVRQWQEMFYENRLSESKI----- ------G--------------S-------------------------------------- -------------------------------P-DFVKVAEAYGVKGLRATNSTEAK-QVM ---LEAFA-HE-G-PVVVDFCVEEG-------------EYVFPMVPPNKGNNEMIMK--- ------- -------------- next part -------------- >YP_002749131 
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------GAGHVIQCLKKLGVTTVFGYPGGAILPVYDALY-E-SG----L----KHILTRHEQAAIHAAEGYARASGKVGVVFATSGPGATNLVTGLADAYMDSIPLVVITGQVATPLIGKDGFQEADVVGITVPVTKHNYQVRDVNQLSRIVQEAFYIAESGRPGPVLIDIPKD-V---Q----IE---K---V---T----S----F------Y--N---EV--I---E--I--P---G--Y------K---IED---MP----D-S-M---K-L-KEV---AK---EISKAKRPLLY--IGGG--V--I--H----SGG--S--D--E----LIKFAREHRI--PVVSTLMGLGAYPPG----------D-S-LFLGMLGMHGTYAANMAVTECNLLLALGVRFDDRVTGKLELFSPQS-K-KV-HIDIDSSEFHKNVTVEYPVVGDVK-NA----L----H---M-L---L------H-MPI-----D-T-------------Q----T-----D----E----W---L---T----K----I----E---G-----WKEEY--------PLSY-N--QK-E---R-E-LKPQHVI-SLV-SE-L---T-N-G--E----AI-VTTEVGQHQMWAAHFYKAKNPRTFLTSGGLGTMGFGFPAAIGAQLA------KEEQLVICIAGDASFQMNIQELQTVAENNIPVKVFIINNKFLGMVRQWQEMFYENRLSESKI-----------G--------------S---------------------------------------------------------------------P-DFVKVAEAYGVKGLRATNSTEAK-QVM---LEAFA-HE-G-PVVVDFCVEEG-------------EYVFPMVPPNKGNNEMIMK---------- From p.j.a.cock at googlemail.com Thu Feb 24 13:25:38 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 24 Feb 2011 18:25:38 +0000 Subject: [Biopython] Strange Gaps when writing Multi-Fasta In-Reply-To: References: Message-ID: On Thu, Feb 24, 2011 at 5:41 PM, Brett Bowman wrote: > I'm trying to write my own script to parse multiple alignments from > the new standalone PSI-Blast output, but when I try to write the > results to a file, I get really odd results. I'd guess you have a new line or carriage return in your string somehow (\n or \r). Try printing out repr(...) of you string. 
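
To make the repr() suggestion concrete, here is a small sketch of that kind of diagnostic. The sequence fragment and the embedded `\r` and `\n` are invented for illustration, and `find_hidden_whitespace` is a hypothetical helper name, not something from Biopython.

```python
def find_hidden_whitespace(seq):
    """Return (index, repr(char)) for every whitespace character in seq."""
    return [(i, repr(ch)) for i, ch in enumerate(seq) if ch.isspace()]


# Made-up fragment with a stray carriage return and a trailing newline.
raw_seq = "GAGHVIQ\rCLKKLG\n"

# repr() shows control characters as escapes instead of acting on them.
print(repr(raw_seq))

# Positions of anything that would break the line when written out.
print(find_hidden_whitespace(raw_seq))

# One way to drop all whitespace before writing the record.
clean_seq = "".join(raw_seq.split())
print(clean_seq)
```

If `find_hidden_whitespace` returns an empty list for every sequence, the strings themselves are clean and the extra line is more likely coming from whatever is being used to view the file.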
Peter From bnbowman at gmail.com Thu Feb 24 15:14:06 2011 From: bnbowman at gmail.com (Brett Bowman) Date: Thu, 24 Feb 2011 12:14:06 -0800 Subject: [Biopython] Strange Gaps when writing Multi-Fasta In-Reply-To: References: Message-ID: An excellent idea - so I just tried it. Using repr(raw_seq) does remove the extra-line, but it adds a single quotes (') around the string, making it useless for sequence alignments. So I decided to take that string and remove the quotes. However, when I take the representation of the string, and use re.sub() or string.replace() to remove the single quote chars, the extra blank line returns! Oy. Also for the record, I've tried re.sub('(\n|\r|\s)', '', raw_seq) and variations thereof with no success. Now I'm even more confused. Brett Bowman Senior Research Associate Cibus US LLC On Thu, Feb 24, 2011 at 10:25 AM, Peter Cock wrote: > On Thu, Feb 24, 2011 at 5:41 PM, Brett Bowman wrote: >> I'm trying to write my own script to parse multiple alignments from >> the new standalone PSI-Blast output, but when I try to write the >> results to a file, I get really odd results. > > I'd guess you have a new line or carriage return in your string > somehow (\n or \r). Try printing out repr(...) of you string. > > Peter > From p.j.a.cock at googlemail.com Thu Feb 24 15:20:58 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 24 Feb 2011 20:20:58 +0000 Subject: [Biopython] Strange Gaps when writing Multi-Fasta In-Reply-To: References:

Message-ID:

On Thu, Feb 24, 2011 at 8:14 PM, Brett Bowman wrote:
> An excellent idea - so I just tried it. Using repr(raw_seq) does
> remove the extra-line, but it adds a single quotes (') around the
> string, making it useless for sequence alignments.

I just meant use repr as a diagnostic for checking your string
objects. It wouldn't remove any new line, just display it as \n
(likewise any carriage return would be shown as \r).

> So I decided to
> take that string and remove the quotes. However, when I take the
> representation of the string, and use re.sub() or string.replace() to
> remove the single quote chars, the extra blank line returns! Oy.
>
> Also for the record, I've tried re.sub('(\n|\r|\s)', '', raw_seq) and
> variations thereof with no success.
>
> Now I'm even more confused.

Um. Can you put together a self contained example? e.g. a short
input file and the short Python script?

Peter

From p.j.a.cock at googlemail.com Thu Feb 24 17:39:28 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 24 Feb 2011 22:39:28 +0000
Subject: [Biopython] Strange Gaps when writing Multi-Fasta
In-Reply-To:
References:

Message-ID:

I got it - but a bit big for the mailing list maybe?

On Thu, Feb 24, 2011 at 9:50 PM, Brett Bowman wrote:
> Done. The script itself is still ~100 lines, but you can safely
> ignore the top 90 which parse the blast file. Its the bottom 20 lines
> that output everything that are confusing me.
>
> Simply download the script and the raw data file and run the following:
> python psi_parser_simple.py simple_test.txt
>
> This will output the sequence data in 3 ways:
> out_verA.txt - The Id printed in Fasta style, then the entire sequence
> printed on the next line. No problems here.

No line wrapping, so some text editors may break in odd
places on the gap characters (treating them as hyphens),
but seems fine.

> out_verB.txt - Regular Fasta format, written by me. Here CBI21345 has
> the weird blank line in the middle that I can't get rid of.

I don't see any blank line in the CBI21345 record for out_verB.txt

> out_verC.txt - Regular Fasta format, written by Biopython's SeqIO.
> Here it is YP_002749131 that has the weird blank line that I also
> can't get rid of.

I don't see any blank line in the YP_002749131 record for out_verC.txt

> How is it possible to get the same weird artifact or not, and in
> different places, when all of the data is processed with the same
> For-loop?

Often funny things with new lines are due to OS differences in
the new line (CR + LF on Windows, LF only on Unix). That is
unlikely to be the issue here.

How are you looking at the output files? I just used gedit on Linux,
and double checked at the terminal, e.g.

grep -C 30 CBI21345 out_verB.txt
grep -C 20 YP_002749131 out_verC.txt

I'll send you the three output files (off the mailing list) so you can
compare them to what you get on your machine.

Peter

From p.j.a.cock at googlemail.com Fri Feb 25 07:56:19 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 25 Feb 2011 12:56:19 +0000
Subject: [Biopython] Strange Gaps when writing Multi-Fasta
In-Reply-To:
References:

Message-ID:

On Fri, Feb 25, 2011 at 12:42 AM, Brett Bowman wrote:
> I believe I've isolated the issue. Using universal mode doesn't
> resolve the problem, nor do a few other 'universal' text tricks I've
> found via google. Yet viewing the current files with either WordPad
> on Windows, or gEdit on Linux shows no problems with any of the text
> files.
>
> By process of elimination, I almost have to assume that the problem
> lies within the Windows implimentation of gEdit - the one place I
> never even thought to look. What a day.
>
> I'll forward our text files to the gEdit folks to see if they can isolate it.
>
> Sincerely,
> Brett

Thanks Brett - I'm glad you could solve this. Curious!

I've CC'd the mailing list again so anyone else interested
(now or later via the archives) sees the resolution.

Peter

From chapmanb at 50mail.com Tue Feb 1 16:03:04 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 1 Feb 2011 11:03:04 -0500
Subject: [Biopython] internal function to convert illumina quality scores to phred
In-Reply-To:
References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu> <97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu>
Message-ID: <20110201160304.GH17835@sobchak.mgh.harvard.edu>

Alan and Peter;
Alan, nice suggestions on conversion from phred. On the barcode
sorting side there was just some discussion of this on the
development list; I have a script that does barcode sorting
and trimming with mismatches using Biopython:

https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py

It does not use qualities, but this might be a framework you could
build off to add that support.

Peter, how hard do you think it would be to have SeqIO only convert
from the fastq encoding to phred scores on demand? Most of the time
when dealing with fastq I do not need any conversion at all and use
the FastqGeneralIterator to just pull out the name, sequence and
quality.
You've done a lot of nice work with the correct conversions and it would be great to expose that directly though on-demand conversion as Alan is suggesting. Ideally you would use SeqIO as normal with fastq files, but the quality score would not be converted to solexa during parsing using letter_annotations["solexa_quality"] was accessed. Another option would just be to expose a function so folks could do: convert_fastq_illumina_to_quality(illumina_encoded_string) to get the phred quality scores for a string they were interested in. This way you could use FastqGeneralIterator for no SeqRecord/Seq overhead, but still make use of your conversion work. Brad From biopython at maubp.freeserve.co.uk Tue Feb 1 16:16:18 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Feb 2011 16:16:18 +0000 Subject: [Biopython] internal function to convert illumina quality scores to phred In-Reply-To: <20110201160304.GH17835@sobchak.mgh.harvard.edu> References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu> <97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu> <20110201160304.GH17835@sobchak.mgh.harvard.edu> Message-ID: On Tue, Feb 1, 2011 at 4:03 PM, Brad Chapman wrote: > > Alan and Peter; > Alan, nice suggestions on conversion from phred. On the barcode > sorting side there was just some discussion of this on the > development list; I have a script that does barcode sorting > and trimming with mismatches using Biopython: > > https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py > > It does not use qualities, but this might be a framework you could > build off to add that support. > > Peter, how hard do you think it would be to have SeqIO only convert > from the fastq encoding to phred scores on demand? Most of the time > when dealing with fastq I do not need any conversion at all and use > the FastqGeneralIterator to just pull out the name, sequence and > quality. 
> > You've done a lot of nice work with the correct conversions and it > would be great to expose that directly though on-demand conversion > as Alan is suggesting. Ideally you would use SeqIO as normal with > fastq files, but the quality score would not be converted to solexa > during parsing using letter_annotations["solexa_quality"] was > accessed. I actually implemented a proof of concept that does that. In order to not alter the SeqRecord behaviour, it was a new object which acted like a list of integers in many respects. The data is held as a FASTQ encoded string, and decoded (and then cached) on demand only. On output if it was already in the right encoding the string could be used as is, otherwise the conversion could be done very quickly with a precomputed table and the string translate() method (without having to go via a list of integers). It seemed to work, but I wasn't convinced about the benefits (given the complexity). I'd really want some real world FASTQ benchmarks to try it on... something you might have in the form of your scripts and the real data they were written for? I'm pretty sure this code is in a local git branch on one of my machines (probably at home), but I don't think I pushed it to github. I should do that... > Another option would just be to expose a function so folks > could do: > > convert_fastq_illumina_to_quality(illumina_encoded_string) > > to get the phred quality scores for a string they were interested > in. This way you could use FastqGeneralIterator for no > SeqRecord/Seq overhead, but still make use of your > conversion work. Yeah, three or four helper functions for the three decoding would be sensible. It looks like there is demand for it then... 
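
A minimal sketch of what such helper functions might look like for one of the encodings is given below. It assumes Illumina 1.3+ FASTQ (PHRED scores, ASCII offset 64) and standard Sanger FASTQ (PHRED scores, ASCII offset 33); the older Solexa-scaled scores need a non-linear conversion and are deliberately not covered. The function names are hypothetical and are not the Biopython API, although the second one uses the precomputed-table-plus-translate() idea described above.

```python
SANGER_OFFSET = 33    # standard Sanger FASTQ
ILLUMINA_OFFSET = 64  # Illumina 1.3+ FASTQ (assumed encoding for this sketch)

# Precomputed table: ordinal of each Illumina-encoded character mapped to the
# ordinal of the Sanger-encoded character for the same PHRED score.
_ILLUMINA_TO_SANGER = {
    code: code - ILLUMINA_OFFSET + SANGER_OFFSET
    for code in range(ILLUMINA_OFFSET, 127)
}


def illumina_string_to_phred(quality_string):
    """Decode an Illumina 1.3+ quality string into a list of PHRED scores."""
    return [ord(char) - ILLUMINA_OFFSET for char in quality_string]


def illumina_string_to_sanger_string(quality_string):
    """Re-encode an Illumina 1.3+ quality string as Sanger FASTQ.

    Goes straight from string to string via translate(), without building
    an intermediate list of integers.
    """
    return quality_string.translate(_ILLUMINA_TO_SANGER)


print(illumina_string_to_phred("@Bh"))
print(illumina_string_to_sanger_string("@Bh"))
```

Functions of this shape would slot in after FastqGeneralIterator, which hands back the title, sequence and quality as plain strings, so only the records that actually need numeric scores pay for the decoding.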
Peter From jp.verta at gmail.com Tue Feb 1 16:28:57 2011 From: jp.verta at gmail.com (Jukka-Pekka Verta) Date: Tue, 1 Feb 2011 11:28:57 -0500 Subject: [Biopython] Bio.Emboss.Primer3 -parser and Primer3 2.2.3 Message-ID: <249537ED-8F00-4AAF-B4BF-97BED5D64BB6@gmail.com> Hello, I'm having trouble parsing out the Bio.Emboss.Primer3Commandline output when using the Whitehead Institute Primer3 version 2.2.3 (and a compatible eprimer3 version). The parser (Bio.Emboss.Primer3) does not write out the reverse primer sequence i.e. the Primers -class member "reverse_seq" is empty. The parser worked fine with the 1.1.4 version of Primer3 and the compatible eprimer3. The output of the primer3-2.2.3 compatible eprimer3 -version looks nearly identical to the old distributions:

# EPRIMER3 RESULTS FOR GQ0197_O05.1
# FORWARD PRIMER STATISTICS:
# considered 5541
# GC content failed 14
# GC clamp failed 573
# low tm 4213
# high any compl 10
# high end compl 40
# long poly-x seq 8
# ok 683
# REVERSE PRIMER STATISTICS:
# considered 5629
# GC content failed 211
# GC clamp failed 564
# low tm 4100
# high end compl 10
# long poly-x seq 12
# ok 732
# PRIMER PAIR STATISTICS:
# considered 6607
# unacceptable product size 6551
# high end compl 20
# ok 36
#                 Start  Len  Tm     GC%    Sequence
FORWARD PRIMER    799    23   63.93  69.57  AGCCACCAGGGGGTGCTCTCCAG
REVERSE PRIMER    979    20   62.11  70.00  TGGCGACTCGGCCCATGCAC
FORWARD PRIMER    799    20   60.18  70.00  AGCCACCAGGGGGTGCTCTC
REVERSE PRIMER    979    20   62.11  70.00  TGGCGACTCGGCCCATGCAC
FORWARD PRIMER    800    22   63.01  72.73  GCCACCAGGGGGTGCTCTCCAG
REVERSE PRIMER    975    23   64.22  69.57  GGCGACTCGGCCCATGCACTGTC
FORWARD PRIMER    799    23   63.93  69.57  AGCCACCAGGGGGTGCTCTCCAG
REVERSE PRIMER    975    23   64.22  69.57  GGCGACTCGGCCCATGCACTGTC

...and that's why I'm kinda lost. Thank you for your help! 
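
As a way to check independently that the reverse primers are present in the raw output, a small stand-alone sketch like the one below could be used. It is not the Bio.Emboss.Primer3 parser and makes no claim about how that parser works; the function name is hypothetical, and it assumes data rows have exactly the seven whitespace-separated fields shown in the output above.

```python
def extract_primers(eprimer3_text):
    """Collect (direction, start, length, tm, gc, sequence) tuples.

    Only rows shaped like 'FORWARD PRIMER 799 23 63.93 69.57 AGCC...'
    are picked up; comment and statistics lines are skipped.
    """
    primers = []
    for line in eprimer3_text.splitlines():
        fields = line.split()
        if (
            len(fields) == 7
            and fields[0] in ("FORWARD", "REVERSE")
            and fields[1] == "PRIMER"
        ):
            primers.append(
                (
                    fields[0],
                    int(fields[2]),
                    int(fields[3]),
                    float(fields[4]),
                    float(fields[5]),
                    fields[6],
                )
            )
    return primers
```

If this finds the REVERSE rows while the parsed records have empty reverse_seq, the difference most likely lies in how the parser groups the rows rather than in the primer rows themselves.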
JP Verta From biopython at maubp.freeserve.co.uk Tue Feb 1 16:41:08 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Feb 2011 16:41:08 +0000 Subject: [Biopython] Bio.Emboss.Primer3 -parser and Primer3 2.2.3 In-Reply-To: <249537ED-8F00-4AAF-B4BF-97BED5D64BB6@gmail.com> References: <249537ED-8F00-4AAF-B4BF-97BED5D64BB6@gmail.com> Message-ID: On Tue, Feb 1, 2011 at 4:28 PM, Jukka-Pekka Verta wrote: > Hello, > > I'm having trouble parsing out the Bio.Emboss.Primer3Commandline ... Could you file a bug (including version numbers of Biopython, EMBOSS, etc), with a short bit of Python code showing how you parse the file, and then (after filing the bug you can) attach the example Primer3 file to it. http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython I could try cut and pasting the example from the email, but white space and line wrapping tends to make a mess of things. Could you also clarify what happens with the other reverse primer attributes, like reverse_tm? Thanks, Peter From jp.verta at gmail.com Tue Feb 1 18:10:49 2011 From: jp.verta at gmail.com (Jukka-Pekka Verta) Date: Tue, 1 Feb 2011 13:10:49 -0500 Subject: [Biopython] Bio.Emboss.Primer3 -parser and Primer3 2.2.3 In-Reply-To: References: <249537ED-8F00-4AAF-B4BF-97BED5D64BB6@gmail.com> Message-ID: Thanks, bug 3173. JP On 2011-02-01, at 11:41 AM, Peter wrote: > On Tue, Feb 1, 2011 at 4:28 PM, Jukka-Pekka Verta wrote: >> Hello, >> >> I'm having trouble parsing out the Bio.Emboss.Primer3Commandline ... > > Could you file a bug (including version numbers of Biopython, EMBOSS, > etc), with a short bit of Python code showing how you parse the file, and > then (after filing the bug you can) attach the example Primer3 file to it. > http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython > > I could try cut and pasting the example from the email, but white space > and line wrapping tends to make a mess of things. 
> > Could you also clarify what happens with the other reverse primer > attributes, like reverse_tm? > > Thanks, > > Peter From biopython at maubp.freeserve.co.uk Tue Feb 1 20:39:30 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 1 Feb 2011 20:39:30 +0000 Subject: [Biopython] internal function to convert illumina quality scores to phred In-Reply-To: References: <2CCFE5EE-0A98-4BA9-A853-B727978B29B7@stanford.edu> <97487661-DAB2-43F8-8CCF-4FC0AE252582@stanford.edu> <20110201160304.GH17835@sobchak.mgh.harvard.edu> Message-ID: On Tue, Feb 1, 2011 at 4:16 PM, Peter wrote: > On Tue, Feb 1, 2011 at 4:03 PM, Brad Chapman wrote: >> >> Peter, how hard do you think it would be to have SeqIO only convert >> from the fastq encoding to phred scores on demand? Most of the time >> when dealing with fastq I do not need any conversion at all and use >> the FastqGeneralIterator to just pull out the name, sequence and >> quality. >> >> You've done a lot of nice work with the correct conversions and it >> would be great to expose that directly through on-demand conversion >> as Alan is suggesting. Ideally you would use SeqIO as normal with >> fastq files, but the quality score would not be converted to solexa >> during parsing, only once letter_annotations["solexa_quality"] was >> accessed. > > I actually implemented a proof of concept that does that. In order > to not alter the SeqRecord behaviour, it was a new object which > acted like a list of integers in many respects. The data is held > as a FASTQ encoded string, and decoded (and then cached) on > demand only. On output if it was already in the right encoding > the string could be used as is, otherwise the conversion could > be done very quickly with a precomputed table and the string > translate() method (without having to go via a list of integers). > It seemed to work, but I wasn't convinced about the benefits > (given the complexity). I'd really want some real world FASTQ > benchmarks to try it on... 
something you might have in the form > of your scripts and the real data they were written for? > > I'm pretty sure this code is in a local git branch on one of my > machines (probably at home), but I don't think I pushed it to > github. I should do that... Found it and pushed it: https://github.com/peterjc/biopython/tree/fastq-tricks Note there are unit test failures (e.g. as currently implemented there is no range checking on the characters in the quality strings at parse time). We may want to continue this on the dev mailing list... Peter From rik at cogsci.ucsd.edu Tue Feb 1 23:02:27 2011 From: rik at cogsci.ucsd.edu (richard k belew) Date: Tue, 01 Feb 2011 15:02:27 -0800 Subject: [Biopython] Entrez.read(handle) AND handle.read() on the same handle? Message-ID: <4D489103.5020806@cogsci.ucsd.edu> there must be a simple way to do this, but i've not figured it out: i want to sniff at aspects of a record (using the dictionary returned by Entrez.read(handle)) and then cache the XML version (returned by handle.read()), only if it meets certain criteria. and without doing two separate efetch's! has anyone else bumped into this pattern? rik From p.j.a.cock at googlemail.com Tue Feb 1 23:20:07 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 1 Feb 2011 23:20:07 +0000 Subject: [Biopython] Entrez.read(handle) AND handle.read() on the same handle? In-Reply-To: <4D489103.5020806@cogsci.ucsd.edu> References: <4D489103.5020806@cogsci.ucsd.edu> Message-ID: On Tue, Feb 1, 2011 at 11:02 PM, richard k belew wrote: > there must be a simple way to do this, but i've > not figured it out: > > i want to sniff at aspects of a record (using the > dictionary returned by Entrez.read(handle)) and then > cache the XML version (returned by handle.read()), > only if it meets certain criteria. and without > doing two separate efetch's! > > has anyone else bumped into this pattern? > > 
rik The simplest solution is to use StringIO (or cStringIO) and cache it all in memory, then parse it:

    from StringIO import StringIO
    from Bio import Entrez

    # Keep the raw XML so it can be cached later if the record qualifies
    raw_data = Entrez.efetch(...).read()
    # Parse from an in-memory copy instead of fetching a second time
    record = Entrez.read(StringIO(raw_data))

Peter From brettpthomas at gmail.com Wed Feb 2 15:04:03 2011 From: brettpthomas at gmail.com (Brett Thomas) Date: Wed, 2 Feb 2011 10:04:03 -0500 Subject: [Biopython] Use biopython to create database of genome intervals? Message-ID: Hi all, I'm looking to create a database of genome variants of varying size: some single base and some not. It needs to provide efficient range queries, such as "get me all genome variants in region X". Has anybody used biopython for something like this? I think this will require an interval tree, or something like it. Are there any implementations of interval trees in Biopython? Thanks, Brett From chapmanb at 50mail.com Wed Feb 2 15:25:25 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 2 Feb 2011 10:25:25 -0500 Subject: [Biopython] Use biopython to create database of genome intervals? In-Reply-To: References: Message-ID: <20110202152525.GE2151@kunkel> Brett; > I'm looking to create a database of genome variants of varying size: some > single base and some not. It needs to provide efficient range queries, such > as "get me all genome variants in region X". Has anybody used biopython for > something like this? 
> > I think this will require an interval tree, I'd recommend using bx-python, which contains an excellent IntervalTree implementation: https://bitbucket.org/james_taylor/bx-python/wiki/Home If you search GitHub there are several scripts you can use as examples to get started: https://github.com/search?langOverride=&language=python&q=intervaltree+bx&repo=&start_value=1&type=Code&x=0&y=0 But the basic usage is:

    import collections
    from bx.intervals.intersection import IntervalTree

    # build an interval tree
    itree = collections.defaultdict(IntervalTree)
    for chrom, start, end, data_dict in your_intervals:
        itree[chrom].insert(start, end, data_dict)

    # query the tree
    for chrom, start, end in regions_of_interest:
        overlaps = itree[chrom].find(start, end)

Hope this helps, Brad From bnbowman at gmail.com Wed Feb 2 23:54:38 2011 From: bnbowman at gmail.com (Brett Bowman) Date: Wed, 2 Feb 2011 15:54:38 -0800 Subject: [Biopython] Multiple Sequence Alignment Conversion: A2M and A3M from Fasta Message-ID: I'm writing a Biopython script to pipeline the following process: 1) Parse Fasta From File 2) Blast it against NCBI and pull down a range of solid hits 3) Align the sequence with Muscle or ClustalW 4) Build a HMM profile of the alignment with HHmake 1-3 I've got down pat, it's step 4 that seems to be the problem. In particular, HHmake appears to prefer A2M or A3M format alignments, and produces inferior results when fed an Aligned Fasta (*.AFA). Both alignment programs output to Fasta or ClustalW, but not A2M or A3M, and in addition I can't seem to find a definition for either format online anywhere. So: Does anyone know if there is a way to convert to A2M or A3M with Biopython? They do not appear supported by AlignIO. Otherwise, does anyone know where I could find a definition for the formats online so that I can write my own conversion? 
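
For anyone attempting such a conversion by hand, the following is a rough sketch only, based on the commonly described conventions rather than on anything in Biopython: in A2M, match columns hold upper case residues with "-" for deletions, insert columns hold lower case residues with "." for gaps, and A3M is the same with the "." characters dropped. The function names are hypothetical, and the choice of match columns (every column where the first sequence has a residue) is an assumption that should be checked against the HHsearch guide.

```python
def aligned_to_a2m(aligned_seqs):
    """Convert equal-length gapped sequences ('-' gaps) to A2M-style strings.

    Assumption: columns where the first sequence has a residue are match
    columns; all other columns are treated as insert columns.
    """
    reference = aligned_seqs[0]
    is_match = [residue != "-" for residue in reference]
    converted = []
    for seq in aligned_seqs:
        chars = []
        for residue, match in zip(seq, is_match):
            if match:
                # Match column: upper case residue, or '-' for a deletion
                chars.append(residue.upper())
            elif residue == "-":
                # Insert column with no residue: '.' placeholder
                chars.append(".")
            else:
                # Insert column with a residue: lower case
                chars.append(residue.lower())
        converted.append("".join(chars))
    return converted


def a2m_to_a3m(a2m_seqs):
    """Drop the '.' insert-gap placeholders to get A3M-style strings."""
    return [seq.replace(".", "") for seq in a2m_seqs]
```

The FASTA headers would be carried over unchanged; only the sequence lines differ between the formats under these assumptions.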
Brett Bowman Woelk Lab UCSD Medical School From biopython at maubp.freeserve.co.uk Thu Feb 3 00:10:02 2011 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 3 Feb 2011 00:10:02 +0000 Subject: [Biopython] Multiple Sequence Alignment Conversion: A2M and A3M from Fasta In-Reply-To: References: Message-ID: On Wed, Feb 2, 2011 at 11:54 PM, Brett Bowman wrote: > I'm writing a Biopython script to pipeline the following process: > 1) Parse Fasta From File > 2) Blast it against NCBI and pull down a range of solid hits > 3) Align the sequence with Muscle or ClustalW > 4) Build a HMM profile of the alignment with HHmake > > 1-3 I've got down pat, it's step 4 that seems to be the problem. > In particular, HHmake appears to prefer A2M or A3M format alignments, > and produces inferior results when fed an Aligned Fasta (*.AFA). Both > alignment programs output to Fasta or ClustalW, but not A2M or A3M, and in > addition I can't seem to find a definition for either format online > anywhere. > > So: Does anyone know if there is a way to convert to A2M or A3M > with Biopython? They do not appear supported by AlignIO. Otherwise, does > anyone know where I could find a definition for the formats online so that I > can write my own conversion? Have you seen this HHMake manual: ftp://toolkit.lmb.uni-muenchen.de/HHsearch/HHsearch1.5.1/HHsearch-guide.pdf This describes the A2M and A3M formats, which I had not heard of before. I suspect these are file formats specific to HHmake. It also says HHmake comes with a perl script reformat.pl which can be used to convert Clustal (or Stockholm format) to A3M - so just use that instead? Peter From p.j.a.cock at googlemail.com Thu Feb 3 16:40:31 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 3 Feb 2011 16:40:31 +0000 Subject: [Biopython] help possible? 
error while installing In-Reply-To: <4DDEB591-27B7-4D80-BF64-9CFFBE2206DF@med.umcg.nl> References: <4DDEB591-27B7-4D80-BF64-9CFFBE2206DF@med.umcg.nl> Message-ID: On Thu, Feb 3, 2011 at 4:35 PM, Ruben Mars wrote: > Hi there, > > I'm a biologist trying to get Biopython to work but I am getting the following error while installing and can't find anything about it in the help pages: > > mw159059:~ Rubenmars$ python /Users/Rubenmars/Downloads/biopython-1.56/setup.py install > Traceback (most recent call last): > File "/Users/Rubenmars/Downloads/biopython-1.56/setup.py", line 323, in <module> > for line in open('Bio/__init__.py'): > IOError: [Errno 2] No such file or directory: 'Bio/__init__.py' > > It would be great if you could help me with this. > > I am using terminal from macOSX development tools. > > Best wishes, > > Ruben > Hi Ruben, You should first change to the directory then run setup:

    cd /Users/Rubenmars/Downloads/biopython-1.56
    python setup.py install

Note you need to send your emails to biopython at lists.open-bio.org (or biopython at biopython.org), not biopython-owner at lists.open-bio.org which just goes to a handful of people who help look after the mailing list. Peter From laserson at mit.edu Thu Feb 3 16:49:52 2011 From: laserson at mit.edu (Uri Laserson) Date: Thu, 3 Feb 2011 11:49:52 -0500 Subject: [Biopython] Output SeqRecord as XML? Message-ID: Does anyone have any experience or code that will write SeqRecord objects to XML (and also parse them)? Uri ................................................................................... Uri Laserson Graduate Student, Biomedical Engineering Harvard-MIT Division of Health Sciences and Technology M +1 917 742 8019 laserson at mit.edu From p.j.a.cock at googlemail.com Thu Feb 3 16:59:18 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 3 Feb 2011 16:59:18 +0000 Subject: [Biopython] Output SeqRecord as XML? 
In-Reply-To: References: Message-ID: On Thu, Feb 3, 2011 at 4:49 PM, Uri Laserson wrote: > Does anyone have any experience or code that will write SeqRecord objects to > XML (and also parse them)? > > Uri What kind of XML do you have in mind? INSDC, UniProt, TinySeq, ... Peter From p.j.a.cock at googlemail.com Thu Feb 3 18:40:45 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 3 Feb 2011 18:40:45 +0000 Subject: [Biopython] Output SeqRecord as XML? In-Reply-To: References: Message-ID: On Thu, Feb 3, 2011 at 6:15 PM, Uri Laserson wrote: > I am not familiar with those particular XML standards. > Basically, something that is clean/simple and that doesn't have some of the > arbitrary limitations of GenBank/EMBL format (like length of qualifiers, > etc.). What would you recommend? (If something is already supported in > biopython, that would obviously be a big plus.) > Uri There is nothing built into Biopython's SeqIO yet beyond the UniProt XML parser, which you are familiar with - right? ;). I suspect that either that or the INSDC XML format (INSDSeq, an XML version of GenBank/EMBL text files) might be good choices - although they too may have field limitations. I certainly wouldn't encourage you to invent yet another file format. Peter From clementsgalaxy at gmail.com Fri Feb 4 01:01:01 2011 From: clementsgalaxy at gmail.com (Dave Clements) Date: Thu, 3 Feb 2011 17:01:01 -0800 Subject: [Biopython] Galaxy Community Conference, May 25-26, Lunteren, The Netherlands Message-ID: We are pleased to announce the *2011 Galaxy Community Conference*, being held *May 25-26 in Lunteren, The Netherlands*. The meeting will feature two full days of presentations and discussion on extending Galaxy to use new tools and data sources, deploying Galaxy at your organization, and best practices for using Galaxy to further your own and your community's research. See http://galaxy.psu.edu/gcc2011/* for complete details. 
* *About Galaxy: *Galaxy is an open, web-based platform for *accessible, reproducible, and transparent* computational biomedical research. - *Accessibility:* Galaxy enables users without programming experience to easily specify parameters and run tools and workflows. - *Reproducibility:* Galaxy captures all information necessary so that any user can repeat and understand a complete computational analysis. - *Transparency:* Galaxy enables users to share and publish analyses via the web and create Pages--interactive, web-based documents that describe a complete analysis. Galaxy is open source for all organizations. The public Galaxy service ( http://usegalaxy.org) makes analysis tools, genomic data, tutorial demonstrations, persistent workspaces, and publication services available to any scientist that has access to the Internet. Local Galaxy servers can be set up by downloading the Galaxy application and customizing it to meet particular needs. *Conference Overview: * This event aims to engage a broader community of developers, data producers, tool creators, and core facility and other research hub staff to become an active part of the Galaxy community. We'll cover defining resources in the Galaxy framework, increasing their visibility and making them easier to use and integrate with other resources, how to extend Galaxy to use custom data sources and custom tools, and best practices for using Galaxy in your organization. 
Additional topics include, but are not limited to: * Talks submitted by the Galaxy community * Integration of tools (including NGS analysis tools) and distributed job management * Deployment of Galaxy instances on local resources and on the Cloud * Management of large datasets with the Galaxy Library System * Using the Galaxy LIMS functionality at NGS sequencing facilities * Visualizing Data without leaving Galaxy * Performing reproducible research * Performing and sharing complex analyses with Workflows * An "Introduction to Galaxy" session, offered on May 24, for Galaxy newcomers. *Registration: * The conference fee is €100 on or before April 24, and €120 after that. The meeting is being held at the Conference Centre De Werelt in Lunteren, The Netherlands, which is also the conference hotel. You are encouraged to register early, as space at the hotel (and at the "Intro to Galaxy" session) is limited and is likely to fill up before the conference itself does. See http://galaxy.psu.edu/gcc2011/Register.html * Abstract Submission: * Abstracts are now being accepted for short oral presentations. Proposals on any topic of interest to the Galaxy community are welcome and encouraged. The abstract submission deadline is the end of February 28. See http://galaxy.psu.edu/gcc2011/Abstracts.html * * *Sponsors * The 2011 Galaxy Community Conference is co-sponsored by the US National Science Foundation (NSF, http://www.nsf.gov/), and the Netherlands Bioinformatics Centre (NBIC, http://www.nbic.nl/). NBIC is a collaborative institute of the bioinformatics groups in the Netherlands. Together, these groups perform cutting-edge research, develop novel tools and support platforms, create an e-science infrastructure and educate the next generations of bioinformaticians. We are looking forward to a great conference and hope to see you in the Netherlands! 
The Galaxy and NBIC Teams -- http://galaxy.psu.edu/gcc2011/ http://getgalaxy.org http://usegalaxy.org/ From bnbowman at gmail.com Sat Feb 5 19:10:37 2011 From: bnbowman at gmail.com (Brett Bowman) Date: Sat, 5 Feb 2011 11:10:37 -0800 Subject: [Biopython] Bad Gateway IOerror when querying Entrez Message-ID: I'm running a script that takes a multi-fasta file and runs a local PSI-Blast on each seq, then queries NCBI for the full record of each one of those hits. It works fine most of the time, but occasionally it throws an IOError telling me that there is a "Bad Gateway". Curiously, this causes the script to crash sometimes but not others, and sometimes it gives me a warning of a "J" in the sequence when there are no Js in any of my files, except in the Fasta header. Does anyone know what causes this or how to stop it? Or do I just need to trap IO exceptions from every web query that I run, with appropriate code to retry failed queries? Brett Bowman Woelk Lab UCSD Medical School

*** WARNING *** Assuming Amino (see -seqtype option), invalid letters found: J
Traceback (most recent call last):
  File "AlignmentFromBlast.py", line 246, in <module>
    seq_data = fetch_seqs_from_entrez(len(gi_nums), webenv, query_key)
  File "AlignmentFromBlast.py", line 125, in fetch_seqs_from_entrez
    webenv=webenv, query_key=query_key)
  File "/usr/lib/pymodules/python2.6/Bio/Entrez/__init__.py", line 109, in efetch
    return _open(cgi, variables)
  File "/usr/lib/pymodules/python2.6/Bio/Entrez/__init__.py", line 353, in _open
    raise IOError("Bad Gateway!")
IOError: Bad Gateway!
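
The "trap IO exceptions and retry" approach raised in the question could be sketched roughly as below. This is a generic helper, not part of Biopython; the name retry_call and its parameters are made up for illustration. Any real use should also respect the NCBI usage guidelines on request rates.

```python
import time


def retry_call(func, attempts=3, delay=1.0, exceptions=(IOError,)):
    """Call func() and retry on the given exceptions.

    Waits delay * attempt_number seconds between tries and re-raises the
    last error once all attempts have failed.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions as err:
            last_error = err
            if attempt < attempts:
                # Back off a little longer after each failure
                time.sleep(delay * attempt)
    raise last_error
```

A web query would then be wrapped in a small function or lambda and passed in, for example retry_call(lambda: fetch_seqs_from_entrez(len(gi_nums), webenv, query_key)), using the function name from the traceback above.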
From p.j.a.cock at googlemail.com Sat Feb 5 21:48:38 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 5 Feb 2011 21:48:38 +0000 Subject: [Biopython] Bad Gateway IOerror when querying Entrez In-Reply-To: References: Message-ID: On Sat, Feb 5, 2011 at 7:10 PM, Brett Bowman wrote: > I'm running a script that takes a multi-fasta file and runs a local > PSI-Blast on each seq, then queries NCBI for the full record of each one of > those hits. It works fine most of the time, but occasionally it throws an > IOError telling me that there is a "Bad Gateway". Curiously, this causes > the script to crash sometimes but not others, and sometimes it gives me a > warning of a "J" in the sequence when there are no Js in any of my files, > except in the Fasta header. > > Does anyone know what causes this or how to stop it? Or do I just need to > trap IO exceptions from every web query that I run, with appropriate code to > retry failed queries? > > Brett Bowman Hi Brett, I've never had this 'J' problem (is this for PSI-BLAST?)... not sure what is going on there. However, a "Bad Gateway" or other IOErrors are almost to be expected with a long Entrez script (especially if run during peak times - don't forget to check the NCBI usage guidelines). I recommend you add a try/except block and some sensible retry code. Peter From hlapp at drycafe.net Sat Feb 5 23:45:47 2011 From: hlapp at drycafe.net (Hilmar Lapp) Date: Sat, 5 Feb 2011 18:45:47 -0500 Subject: [Biopython] NESCent Seeks Hackathon Whitepapers In-Reply-To: <0D7D89E4-C0D4-4347-A94C-21800E927746@ad.unc.edu> References:

<0D7D89E4-C0D4-4347-A94C-21800E927746@ad.unc.edu> Message-ID: <066A2391-6041-408C-B26E-9B867DE785C7@drycafe.net> The National Evolutionary Synthesis Center (NESCent), in keeping with its objective to promote collaborative development of open-source, reusable, and standards-supporting informatics resources, sponsors highly collaborative, face-to-face software development events, called "hackathons" (see [1]). To ensure that this program continues to be responsive to user needs and to tap into the expertise and creativity of the evolutionary biology community, NESCent is soliciting short whitepapers (2-6 pages) [2] on potential target areas for future hackathons. To further encourage submissions, we have now distilled specific guidelines for proposing hackathon events, based on the experiences gained from events we have sponsored in the past: http://informatics.nescent.org/wiki/Hackathon_Whitepaper_Guidelines The Center's Call for Informatics Whitepapers [3] includes not only hackathons, but also a large spectrum of other initiatives to be undertaken by the Center, including training, software development, collaborative ontology development, and coordination of data standards. Whitepapers are accepted at any time and reviewed on an on- going basis. 
URLs: [1] Collaborative cyberinfrastructure events and programs organized by NESCent: http://informatics.nescent.org/wiki/Main_Page [2] NESCent Call for Informatics Whitepapers http://www.nescent.org/informatics/whitepapers.php [3] Hackathon Whitepaper Guidelines: http://informatics.nescent.org/wiki/Hackathon_Whitepaper_Guidelines [4] Past NESCent-sponsored hackathons: http://informatics.nescent.org/wiki/Main_Page#Hackathons From dimitrakopoul at gmail.com Sun Feb 6 20:37:01 2011 From: dimitrakopoul at gmail.com (chris dimitrakopoulos) Date: Sun, 6 Feb 2011 22:37:01 +0200 Subject: [Biopython] Feature selection techniques modules Message-ID: Hello everyone, I am an msc student in University of Patras, Greece, in the research field of Bioinformatics. I recently become a member of the OBF and i appreciate the open source work of your OBF project. I had a discussion with Mr. Robert Buels about this year gsoc, cause i look forward to make an application and i found that OBF would be the organization most suitable for me. Generally, i was idling in the projects announced on previous years and i found them very interesting. As this year's potential projects have not been announced yet, i wanted to express to you an idea of mine, say briefly what I am thinking of doing, and ask you if you think it is a good idea and it is worth to make an application with this subject after March 28. Well, I think that feature selection techniques have become a very important issue in many bioinformatics implementations. In many cases (like protein interactions prediction), you have to find a way to collect the best set of features that leads to the best classification performance. I looked in Biopython libraries and i didn't find something relative about FS techniques implementation to a dataset of features (like t-test, ANOVA, Wilcoxon, CFS etc... ). Hence, i think that the creation of a library focused on FS techniques would be a good idea. 
Moreover, that library can have an hierarchical structure as there are different types of FS techniques, like filter, wrapper and embedded techniques. Furthermore, each type of them is divided into more groups, (f.e. filter methods are divided into univariate and multivariate methods, according to the consideration of feature dependencies) etc... Only some of the methods i am thinking of implementing are: T-test, ANOVA, Gamma, bivariate methods, CFS, MRMR which are some known filter feature selection techniques. In wrapper and embedded methods, the classifiers are been used in the process of feature selection, so we have techniques based on Genetic algorithms, Random forests, logistic regression, Decision Tree Learners, Bayesian Classifiers, etc.. In this case, the existing Biopython modules Bio.LogisticRegression, Bio.GA and Bio.NaiveBayes could be used. More information on the techniques I describe can be found on the following links: http://bioinformatics.oxfordjournals.org/content/23/19/2507.full.pdf+html http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=3570EDE4C7E11AAE7CA5F727800DC58A?doi=10.1.1.37.4643&rep=rep1&type=pdf New functions computing the above measures can be created. The calculation can be done between vectors of features, between a feature vector and the output vector, or even if in large datasets (with many features) been readen from a file, in which we want to implement feature selections. I send to you this email in order to express briefly my idea. Please let me know what do you think about it and if it is worth been proposed as one of my student applications in gsoc 2011, to open bioinformatics foundation. If you want me to tell you any further details about my thinking just ask me! 
:-) Look forward to hearing from you, Chris Dim From sdavis2 at mail.nih.gov Sun Feb 6 21:35:09 2011 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Sun, 6 Feb 2011 16:35:09 -0500 Subject: [Biopython] Feature selection techniques modules In-Reply-To: References: Message-ID: On Sun, Feb 6, 2011 at 3:37 PM, chris dimitrakopoulos < dimitrakopoul at gmail.com> wrote: > Hello everyone, > > I am an msc student in University of Patras, Greece, in the research field > of Bioinformatics. I recently become a member of the OBF and i appreciate > the open source work of your OBF project. > > I had a discussion with Mr. Robert Buels about this year gsoc, cause i look > forward to make an application and i found that OBF would be the > organization most suitable for me. Generally, i was idling in the projects > announced on previous years and i found them very interesting. As this > year's potential projects have not been announced yet, i wanted to express > to you an idea of mine, say briefly what I am thinking of doing, and ask > you > if you think it is a good idea and it is worth to make an application with > this subject after March 28. > > Well, I think that feature selection techniques have become a very > important > issue in many bioinformatics implementations. In many cases (like protein > interactions prediction), you have to find a way to collect the best set of > features that leads to the best classification performance. I looked in > Biopython libraries and i didn't find something relative about FS > techniques > implementation to a dataset of features (like t-test, ANOVA, Wilcoxon, CFS > etc... ). Hence, i think that the creation of a library focused on FS > techniques would be a good idea. Moreover, that library can have an > hierarchical structure as there are different types of FS techniques, like > filter, wrapper and embedded techniques. Furthermore, each type of them is > divided into more groups, (f.e. 
filter methods are divided into univariate > and multivariate methods, according to the consideration of feature > dependencies) etc... > > Only some of the methods i am thinking of implementing are: > > T-test, ANOVA, Gamma, bivariate methods, CFS, MRMR which are some known > filter feature selection techniques. > In wrapper and embedded methods, the classifiers are been used in the > process of feature selection, so we have techniques based on Genetic > algorithms, Random forests, logistic regression, Decision Tree Learners, > Bayesian Classifiers, etc.. In this case, the existing Biopython modules > Bio.LogisticRegression, Bio.GA and Bio.NaiveBayes could be used. > > Hi, Chris. You might want to look at the Rpy project. All of the above machine learning and feature selection algorithms (and many more) are implemented in R and can be wrapped fairly easily in python using Rpy. Sean > More information on the techniques I describe can be found on the following > links: > > http://bioinformatics.oxfordjournals.org/content/23/19/2507.full.pdf+html > > http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=3570EDE4C7E11AAE7CA5F727800DC58A?doi=10.1.1.37.4643&rep=rep1&type=pdf > > New functions computing the above measures can be created. The calculation > can be done between vectors of features, between a feature vector and the > output vector, or even if in large datasets (with many features) been > readen > from a file, in which we want to implement feature selections. > > I send to you this email in order to express briefly my idea. Please let me > know what do you think about it and if it is worth been proposed as one of > my student applications in gsoc 2011, to open bioinformatics foundation. If > you want me to tell you any further details about my thinking just ask me! 
> :-) > > Look forward to hearing from you, > Chris Dim > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Sun Feb 6 22:05:49 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 6 Feb 2011 22:05:49 +0000 Subject: [Biopython] Feature selection techniques modules In-Reply-To: References: Message-ID: On Sun, Feb 6, 2011 at 8:37 PM, chris dimitrakopoulos wrote: > Hello everyone, > > I am an msc student in University of Patras, Greece, in the research field > of Bioinformatics. I recently become a member of the OBF and i appreciate > the open source work of your OBF project. > > I had a discussion with Mr. Robert Buels about this year gsoc, cause i look > forward to make an application and i found that OBF would be the > organization most suitable for me. Generally, i was idling in the projects > announced on previous years and i found them very interesting. As this > year's potential projects have not been announced yet, i wanted to express > to you an idea of mine, say briefly what I am thinking of doing, and ask you > if you think it is a good idea and it is worth to make an application with > this subject after March 28. > > Well, I think that feature selection techniques have become a very important > issue in many bioinformatics implementations. In many cases (like protein > interactions prediction), you have to find a way to collect the best set of > features that leads to the best classification performance. I looked in > Biopython libraries and i didn't find something relative about FS techniques > implementation to a dataset of features (like t-test, ANOVA, Wilcoxon, CFS > etc... ). Hence, i think that the creation of a library focused on FS > techniques would be a good idea. 
Moreover, that library can have an
> hierarchical structure as there are different types of FS techniques, like
> filter, wrapper and embedded techniques. Furthermore, each type of them is
> divided into more groups, (f.e. filter methods are divided into univariate
> and multivariate methods, according to the consideration of feature
> dependencies) etc...
>
> Only some of the methods i am thinking of implementing are:
>
> T-test, ANOVA, Gamma, bivariate methods, CFS, MRMR which are some known
> filter feature selection techniques.
> In wrapper and embedded methods, the classifiers are been used in the
> process of feature selection, so we have techniques based on Genetic
> algorithms, Random forests, logistic regression, Decision Tree Learners,
> Bayesian Classifiers, etc.. In this case, the existing Biopython modules
> Bio.LogisticRegression, Bio.GA and Bio.NaiveBayes could be used.
>
> More information on the techniques I describe can be found on the following
> links:
>
> http://bioinformatics.oxfordjournals.org/content/23/19/2507.full.pdf+html
> http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=3570EDE4C7E11AAE7CA5F727800DC58A?doi=10.1.1.37.4643&rep=rep1&type=pdf
>
> New functions computing the above measures can be created. The calculation
> can be done between vectors of features, between a feature vector and the
> output vector, or even if in large datasets (with many features) been readen
> from a file, in which we want to implement feature selections.
>
> I send to you this email in order to express briefly my idea. Please let me
> know what do you think about it and if it is worth been proposed as one of
> my student applications in gsoc 2011, to open bioinformatics foundation. If
> you want me to tell you any further details about my thinking just ask me!
> :-)
>
> Look forward to hearing from you,
> Chris Dim

Hello Chris,

This sounds interesting - and provided we can find some suitable mentors it could turn into a Google Summer of Code project.
Something you could start with (now or as one of the first tasks if you write up a GSoC proposal) could be to understand the existing code in Biopython in this area (Bio.LogisticRegression, Bio.GA, Bio.NaiveBayes etc) and perhaps to write extra documentation for them (they are not covered in the tutorial at all), and perhaps some more unit tests too.

One thing I would suggest checking is how much of the statistical code you mention is already written in other Python libraries (e.g. SciPy). For something as complicated as statistical testing there is no point reimplementing it. Tiago has previously said there are statistics routines in SciPy he may want to use in his Biopython code for population genetics. So, check out SciPy: http://scipy.org/

Regards,

Peter

From bnbowman at gmail.com Mon Feb 7 22:30:19 2011
From: bnbowman at gmail.com (Brett Bowman)
Date: Mon, 7 Feb 2011 14:30:19 -0800
Subject: [Biopython] Pulling Alignment From PSI-Blast Output
Message-ID:

I'm trying to use the PSI-Blast results from a series of proteins to detect distant homologues, using HMMs of various sorts. Currently I'm pulling down the sequence IDs with PSI-Blast, downloading the full sequences from NCBI, then aligning everything with ClustalW or Muscle. However this is eating up way more processor time than I have to spare, so I want to just pull the full multi-sequence alignment from the PSI-blast results if possible (OUTFMT option #3 or 4), for use in building the HMMs. But it doesn't look like AlignIO has a module for reading the peculiar format that PSI-Blast generates...

Has this been done before, or will I need to write my own parser?
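
One possible way to approach the question above, without a dedicated AlignIO parser and without re-running ClustalW or Muscle, is to stack the pairwise query/hit alignments that BLAST already reports into a query-anchored alignment. The sketch below is only an illustration under stated assumptions: `query_anchored_rows` and its tuple layout are hypothetical names, not part of Biopython. It assumes each hit is available as a 1-based query start plus the two gapped strings (the HSP objects returned by Bio.Blast.NCBIXML carry these as `query_start`, `query` and `sbjct`). Hit residues aligned to a gap in the query are dropped, so this is a simplification and not a true multiple alignment.

```python
def query_anchored_rows(query, hsps):
    """Stack pairwise query/hit alignments into rows as long as the query.

    query: the ungapped query sequence as a string.
    hsps:  iterable of (hit_id, query_start, aligned_query, aligned_hit)
           tuples (a hypothetical layout chosen for this sketch), where
           query_start is 1-based and both aligned strings use '-' for gaps.
    Returns a list of (name, row) pairs, the query first.
    """
    rows = [("query", query)]
    for hit_id, query_start, aligned_query, aligned_hit in hsps:
        if len(aligned_query) != len(aligned_hit):
            raise ValueError("aligned strings differ in length for %s" % hit_id)
        # Positions of the query not covered by this hit stay as gaps.
        row = ["-"] * len(query)
        pos = query_start - 1  # convert to a 0-based query index
        for q_char, h_char in zip(aligned_query, aligned_hit):
            if q_char == "-":
                # Insertion relative to the query: no query column, so drop it.
                continue
            row[pos] = h_char
            pos += 1
        rows.append((hit_id, "".join(row)))
    return rows


if __name__ == "__main__":
    # Toy data, invented purely for illustration.
    toy_query = "MKTAYIAKQR"
    toy_hsps = [
        ("hit1", 2, "KTA-YIA", "KSAGYLA"),
        ("hit2", 5, "YIAKQR", "Y-AKQR"),
    ]
    for name, row in query_anchored_rows(toy_query, toy_hsps):
        print(">%s\n%s" % (name, row))
```

Every row has the same length as the query, so the output can be written as aligned FASTA. Whether such an alignment is acceptable input for a given HMM builder would need to be checked separately.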
Brett Bowman
Woelk Lab
UCSD School of Medicine

From mjldehoon at yahoo.com Tue Feb 8 01:20:09 2011
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Mon, 7 Feb 2011 17:20:09 -0800 (PST)
Subject: [Biopython] Pulling Alignment From PSI-Blast Output
In-Reply-To:
Message-ID: <216797.39164.qm@web161211.mail.bf1.yahoo.com>

One option you could try is to let PSI-Blast generate its output in XML and check if the information you need is present in the XML. If it is, you can parse the XML with the read() function in Bio.Entrez. You may find that Bio.Entrez needs an additional DTD file to be able to parse the PSI-Blast XML output (Bio.Entrez will tell you which one and where to store it). If so, please let us know, so we can include the required DTDs in the next release of Biopython.

--Michiel.

--- On Mon, 2/7/11, Brett Bowman wrote:

> From: Brett Bowman
> Subject: [Biopython] Pulling Alignment From PSI-Blast Output
> To: biopython at biopython.org
> Date: Monday, February 7, 2011, 5:30 PM
> I'm trying to use the PSI-Blast
> results from a series of proteins to detect
> distant homologues, using HMMs of various sorts.
> Currently I'm pulling down
> the sequence IDs with PSI-Blast, downloading the full
> sequences from NCBI,
> then aligning everything with ClustalW or Muscle.
> However this is eating up
> way more processor time than I have to spare, so I want to
> just pull the
> full multi-sequence alignment from the PSI-blast results if
> possible (OUTFMT
> option #3 or 4), for use in building the HMMs. But it
> doesn't look like
> AlignIO has a module for reading the peculiar format that
> PSI-Blast
> generates...
>
> Has this been done before, or will I need to write my own
> parser?
>
> Brett Bowman
> Woelk Lab
> UCSD School of Medicine
> _______________________________________________
> Biopython mailing list -
Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From bnbowman at gmail.com Tue Feb 8 07:40:10 2011
From: bnbowman at gmail.com (Brett Bowman)
Date: Mon, 7 Feb 2011 23:40:10 -0800
Subject: [Biopython] Pulling Alignment From PSI-Blast Output
In-Reply-To: <216797.39164.qm@web161211.mail.bf1.yahoo.com>
References: <216797.39164.qm@web161211.mail.bf1.yahoo.com>
Message-ID:

I thought about that, but there doesn't appear to be any multiple-alignment data in the XML file - just pair-wise alignments of the query with each hit. In addition, when I parse the output file with NCBIXML I get a Bio.Blast.Record.Blast object, instead of a Bio.Blast.Record.PSIBlast object. The Biopython cookbook describes how to work with a PSIBlast object, but it doesn't really cover how to make one...

On Mon, Feb 7, 2011 at 5:20 PM, Michiel de Hoon wrote:

> One option you could try is to let PSI-Blast generate its output in XML and
> check if the information you need is present in the XML. If it is, you can
> parse the XML with the read() function in Bio.Entrez. You may find that
> Bio.Entrez needs an additional DTD file to be able to parse the PSI-Blast
> XML output (Bio.Entrez will tell you which one and where to store it). If
> so, please let us know, so we can include the required DTDs in the next
> release of Biopython.
>
> --Michiel.
>
> --- On Mon, 2/7/11, Brett Bowman wrote:
>
> > From: Brett Bowman
> > Subject: [Biopython] Pulling Alignment From PSI-Blast Output
> > To: biopython at biopython.org
> > Date: Monday, February 7, 2011, 5:30 PM
> > I'm trying to use the PSI-Blast
> > results from a series of proteins to detect
> > distant homologues, using HMMs of various sorts.
> > Currently I'm pulling down
> > the sequence IDs with PSI-Blast, downloading the full
> > sequences from NCBI,
> > then aligning everything with ClustalW or Muscle.
> > However this is eating up
> > way more processor time than I have to spare, so I want to
> > just pull the
> > full multi-sequence alignment from the PSI-blast results if
> > possible (OUTFMT
> > option #3 or 4), for use in building the HMMs. But it
> > doesn't look like
> > AlignIO has a module for reading the peculiar format that
> > PSI-Blast
> > generates...
> >
> > Has this been done before, or will I need to write my own
> > parser?
> >
> > Brett Bowman
> > Woelk Lab
> > UCSD School of Medicine
> > _______________________________________________
> > Biopython mailing list - Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >
> >
> >

From bnbowman at gmail.com Tue Feb 8 07:44:08 2011
From: bnbowman at gmail.com (Brett Bowman)
Date: Mon, 7 Feb 2011 23:44:08 -0800
Subject: [Biopython] Pulling Alignment From PSI-Blast Output
In-Reply-To:
References:

Message-ID:

I had never heard of jackHMMER until now, so I'll look into it. However the outline of my current project is essentially "This paper used X to find Y, so I want you to use X to find a homologue of Y in this other background", so I'm not sure how much wiggle room I have to change the methodologies used. The source paper used HHmake to create HMMs from the output of PSI-Blast, so I am trying to do the same if possible.

-Brett

On Mon, Feb 7, 2011 at 9:41 PM, Ruchira Datta wrote:

> If you're using HMMs anyway, why not use jackhmmer? It's been shown to be
> more sensitive than PSI-BLAST at the same number of iterations, and with an
> option it will output the alignment.
>
> Note that its alignment is in Stockholm format though, and if you want
> something else, BioPython's Stockholm parsing is very slow.
>
> --Ruchira
> On Feb 7, 2011 2:31 PM, "Brett Bowman" wrote:
> > I'm trying to use the PSI-Blast results from a series of proteins to
> detect
> > distant homologues, using HMMs of various sorts. Currently I'm pulling
> down
> > the sequence IDs with PSI-Blast, downloading the full sequences from
> NCBI,
> > then aligning everything with ClustalW or Muscle. However this is eating
> up
> > way more processor time than I have to spare, so I want to just pull the
> > full multi-sequence alignment from the PSI-blast results if possible
> (OUTFMT
> > option #3 or 4), for use in building the HMMs. But it doesn't look like
> > AlignIO has a module for reading the peculiar format that PSI-Blast
> > generates...
> >
> > Has this been done before, or will I need to write my own parser?
> >
> > Brett Bowman
> > Woelk Lab
> > UCSD School of Medicine
> > _______________________________________________
> > Biopython mailing list - Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
>

From ruchira.datta at gmail.com Tue Feb 8 09:22:20 2011
From: ruchira.datta at gmail.com (Ruchira Datta)
Date: Tue, 8 Feb 2011 01:22:20 -0800
Subject: [Biopython] Pulling Alignment From PSI-Blast Output
In-Reply-To:
References: