From kellrott at gmail.com Mon Nov 2 15:06:37 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Mon, 2 Nov 2009 12:06:37 -0800 Subject: [Biopython] Using SeqLocation to extract subsequence Message-ID: This should be a relatively simple question, but I didn't find any google hits... I'm parsing a genbank file of a chromosome, and I want to take the FeatureLocation data from a SeqFeature and extract the referenced DNA. Basically take a 'CDS' feature and get the gene DNA that coded it. Is there a function that I can pass the location data from a feature record and it will extract the DNA, including doing segment joining and reverse translation? I could write this myself, but it seems like a better idea to use something that has been well tested. Kyle From biopython at maubp.freeserve.co.uk Mon Nov 2 15:24:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 2 Nov 2009 20:24:36 +0000 Subject: [Biopython] Using SeqLocation to extract subsequence In-Reply-To: References: Message-ID: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com> On Mon, Nov 2, 2009 at 8:06 PM, Kyle Ellrott wrote: > This should be a relatively simple question, but I didn't find any google > hits... > > I'm parsing a genbank file of a chromosome, and I want to take the > FeatureLocation data from a SeqFeature and extract the referenced DNA. > Basically take a 'CDS' feature and get the gene DNA that coded it. ?Is there > a function that I can pass the location data from a feature record and it > will extract the DNA, including doing segment joining and reverse > translation? > > I could write this myself, but it seems like a better idea to use something > that has been well tested. You missed this thread earlier this month: http://lists.open-bio.org/pipermail/biopython/2009-October/005695.html Are you on the dev mailing list? I was hoping to get a little discussion going there, before moving over to the discussion list for more general comment. The code mentioned there is the best tested bit of code I can suggest for now: http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006922.html Note there is no such thing as a SeqLocation object. There is a FeatureLocation, but you need the strand information - hence my code requires a SeqFeature object to fully describe the location. Peter From kellrott at gmail.com Mon Nov 2 16:31:28 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Mon, 2 Nov 2009 13:31:28 -0800 Subject: [Biopython] Using SeqLocation to extract subsequence In-Reply-To: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com> References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com> Message-ID: > > You missed this thread earlier this month: > http://lists.open-bio.org/pipermail/biopython/2009-October/005695.html > > Are you on the dev mailing list? I was hoping to get a little discussion > going there, before moving over to the discussion list for more general > comment. I didn't need to do it when the original discussion came through, so it got 'filtered' ;-) I guess if multiple people are asking the same question independently, it's probably a timely issue. I'll probably go ahead and pull the SeqRecord fork into my git fork and start playing around with it. 
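A rough sketch of the kind of manual extraction being discussed, assuming a SeqRecord parsed with Bio.SeqIO from a GenBank file (the file name and the helper name feature_to_seq are invented for illustration; it only copes with non-fuzzy locations and simple single-strand joins, and it is not the well-tested code referred to above):

from Bio import SeqIO

def feature_to_seq(feature, parent_seq):
    """Return the nucleotide sequence described by a SeqFeature (sketch only)."""
    if feature.sub_features:
        # A join (e.g. exons, or a ribosomal slippage): take each segment from
        # the forward strand in the order listed in the location string...
        parts = [parent_seq[f.location.nofuzzy_start:f.location.nofuzzy_end]
                 for f in feature.sub_features]
        seq = parts[0]
        for part in parts[1:]:
            seq += part
    else:
        seq = parent_seq[feature.location.nofuzzy_start:feature.location.nofuzzy_end]
    # ...then reverse complement the whole thing for a minus strand feature,
    # as written in a GenBank complement(join(...)) location.
    if feature.strand == -1:
        seq = seq.reverse_complement()
    return seq

record = SeqIO.read(open("NC_000932.gb"), "genbank")
for f in record.features:
    if f.type == "CDS":
        cds_seq = feature_to_seq(f, record.seq)
        print f.qualifiers.get("locus_tag", ["?"])[0], len(cds_seq)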
Kyle From biopython at maubp.freeserve.co.uk Mon Nov 2 17:30:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 2 Nov 2009 22:30:37 +0000 Subject: [Biopython] Using SeqLocation to extract subsequence In-Reply-To: References: <320fb6e00911021224k5ddc5e3blb1b92b3cbc103355@mail.gmail.com> Message-ID: <320fb6e00911021430n5704cecaue2c1d1010f7e91cc@mail.gmail.com> On Mon, Nov 2, 2009 at 9:31 PM, Kyle Ellrott wrote: >> >> You missed this thread earlier this month: >> http://lists.open-bio.org/pipermail/biopython/2009-October/005695.html >> >> Are you on the dev mailing list? I was hoping to get a little discussion >> going there, before moving over to the discussion list for more general >> comment. > > I didn't need to do it when the original discussion came through, so it got > 'filtered' ;-) ?I guess if multiple people are asking the same question > independently, it's probably a timely issue. > > I'll probably go ahead and pull the SeqRecord fork into my git fork and > start playing around with it. Cool - sorry if the previous email was brusque - I was in the middle of dinner preparation and shouldn't have been checking emails. If you just want to try the sequence extraction for a SeqFeature, the code is on the trunk (as noted, as a function in a unit test). My SeqRecord github branch is looking at other issues. Peter From biopython at maubp.freeserve.co.uk Tue Nov 3 07:52:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 3 Nov 2009 12:52:54 +0000 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <737542.47267.qm@web62401.mail.re1.yahoo.com> References: <1FA32B70-84CC-4012-97F5-9B67D56BADC6@gmail.com> <737542.47267.qm@web62401.mail.re1.yahoo.com> Message-ID: <320fb6e00911030452o616c0b4cycceb53f3561e6a42@mail.gmail.com> On Fri, Oct 16, 2009 at 1:04 AM, Michiel de Hoon wrote: > > Last time I checked (which was a few weeks ago), a multiple-query PSIBlast > search gives a file consisting of concatenated XML files. The problem is in > the design of Blast XML output. For a single-query PSIBlast, the fields under > are used to store the output of the PSIBlast iterations. > For multiple-query regular Blast, the same fields are used to store the search > results of each query. With multiple-query PSIBlast, there is then no way to > store the output in the current XML format. I've been meaning to write to NCBI > about this, but I haven't gotten round to it yet. Will do so this weekend. > > --Michiel. Did you get any reply? Peter From mjldehoon at yahoo.com Tue Nov 3 07:56:23 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Tue, 3 Nov 2009 04:56:23 -0800 (PST) Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <320fb6e00911030452o616c0b4cycceb53f3561e6a42@mail.gmail.com> Message-ID: <234359.27566.qm@web62401.mail.re1.yahoo.com> > Did you get any reply? > Yes, but just that they'll look into it. Nothing concrete yet, but I guess changing the Blast XML output is something that needs to be done very carefully, so it may take a while. Will keep you guys posted if I get a reply. --Michiel. --- On Tue, 11/3/09, Peter wrote: > From: Peter > Subject: Re: [Biopython] Problems parsing with PSIBlastParser > To: "Michiel de Hoon" > Cc: "Biopython Mailing List" > Date: Tuesday, November 3, 2009, 7:52 AM > On Fri, Oct 16, 2009 at 1:04 AM, > Michiel de Hoon > wrote: > > > > Last time I checked (which was a few weeks ago), a > multiple-query PSIBlast > > search gives a file consisting of concatenated XML > files. 
The problem is in > > the design of Blast XML output. For a single-query > PSIBlast, the fields under > > are used to store the > output of the PSIBlast iterations. > > For multiple-query regular Blast, the same fields are > used to store the search > > results of each query. With multiple-query PSIBlast, > there is then no way to > > store the output in the current XML format. I've been > meaning to write to NCBI > > about this, but I haven't gotten round to it yet. Will > do so this weekend. > > > > --Michiel. > > Did you get any reply? > > Peter > From biopython at maubp.freeserve.co.uk Tue Nov 3 08:32:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 3 Nov 2009 13:32:55 +0000 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <87FF6F4A-E169-44D2-8DD1-549F6BA07574@illinois.edu> References: <1FA32B70-84CC-4012-97F5-9B67D56BADC6@gmail.com> <737542.47267.qm@web62401.mail.re1.yahoo.com> <320fb6e00911030452o616c0b4cycceb53f3561e6a42@mail.gmail.com> <87FF6F4A-E169-44D2-8DD1-549F6BA07574@illinois.edu> Message-ID: <320fb6e00911030532r38e2fb2bjf7fffe0aa3686cc8@mail.gmail.com> On Tue, Nov 3, 2009 at 1:16 PM, Chris Fields wrote: > > We had the same problem w/ the BioPerl XML parser and ended up preprocessing > the data into separate XML files, carrying over the relevant information > into each file (yes, there is a better way, but it essentially involves a > redesign of the XML parser and related objects). > > BTW, the same thing happens if one runs multiple queries in the same file. > ?All individual report XML are in one single XML file, and information > relevant to all reports is only found into the first report. ?I think this > has been known for a while. ?I've repeatedly tried contacting NCBI but > haven't had a response re: this problem. > > chris Hi Chris, Old versions of blastall (also) used to produce concatenated XML files for multiple queries, but from about 2.2.14 they started (ab)using the iteration fields originally for PSI-BLAST output to hold multiple queries (there was some discussion of this on Biopython Bugs 1933 and 1970 - Biopython *should* cope with either). Apparently pgpblast was left producing concatenated XML files. The upshot of this is multi-query BLASTP etc XML files look just like single query multi-round PSI-BLAST XML files. This means having a single BLAST XML parser that automatically treats the two differently is tricky. Does that fit with your experience? Peter From cjfields at illinois.edu Tue Nov 3 08:16:02 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 3 Nov 2009 07:16:02 -0600 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <320fb6e00911030452o616c0b4cycceb53f3561e6a42@mail.gmail.com> References: <1FA32B70-84CC-4012-97F5-9B67D56BADC6@gmail.com> <737542.47267.qm@web62401.mail.re1.yahoo.com> <320fb6e00911030452o616c0b4cycceb53f3561e6a42@mail.gmail.com> Message-ID: <87FF6F4A-E169-44D2-8DD1-549F6BA07574@illinois.edu> Peter, On Nov 3, 2009, at 6:52 AM, Peter wrote: > On Fri, Oct 16, 2009 at 1:04 AM, Michiel de Hoon > wrote: >> >> Last time I checked (which was a few weeks ago), a multiple-query >> PSIBlast >> search gives a file consisting of concatenated XML files. The >> problem is in >> the design of Blast XML output. For a single-query PSIBlast, the >> fields under >> are used to store the output of the >> PSIBlast iterations. >> For multiple-query regular Blast, the same fields are used to store >> the search >> results of each query. 
With multiple-query PSIBlast, there is then >> no way to >> store the output in the current XML format. I've been meaning to >> write to NCBI >> about this, but I haven't gotten round to it yet. Will do so this >> weekend. >> >> --Michiel. > > Did you get any reply? > > Peter We had the same problem w/ the BioPerl XML parser and ended up preprocessing the data into separate XML files, carrying over the relevant information into each file (yes, there is a better way, but it essentially involves a redesign of the XML parser and related objects). BTW, the same thing happens if one runs multiple queries in the same file. All individual report XML are in one single XML file, and information relevant to all reports is only found into the first report. I think this has been known for a while. I've repeatedly tried contacting NCBI but haven't had a response re: this problem. chris From cjfields at illinois.edu Tue Nov 3 08:40:53 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 3 Nov 2009 07:40:53 -0600 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <320fb6e00911030532r38e2fb2bjf7fffe0aa3686cc8@mail.gmail.com> References: <1FA32B70-84CC-4012-97F5-9B67D56BADC6@gmail.com> <737542.47267.qm@web62401.mail.re1.yahoo.com> <320fb6e00911030452o616c0b4cycceb53f3561e6a42@mail.gmail.com> <87FF6F4A-E169-44D2-8DD1-549F6BA07574@illinois.edu> <320fb6e00911030532r38e2fb2bjf7fffe0aa3686cc8@mail.gmail.com> Message-ID: <93C54341-8762-4EA6-8D0F-3388E5005A5C@illinois.edu> On Nov 3, 2009, at 7:32 AM, Peter wrote: > On Tue, Nov 3, 2009 at 1:16 PM, Chris Fields > wrote: >> >> We had the same problem w/ the BioPerl XML parser and ended up >> preprocessing >> the data into separate XML files, carrying over the relevant >> information >> into each file (yes, there is a better way, but it essentially >> involves a >> redesign of the XML parser and related objects). >> >> BTW, the same thing happens if one runs multiple queries in the >> same file. >> All individual report XML are in one single XML file, and >> information >> relevant to all reports is only found into the first report. I >> think this >> has been known for a while. I've repeatedly tried contacting NCBI >> but >> haven't had a response re: this problem. >> >> chris > > Hi Chris, > > Old versions of blastall (also) used to produce concatenated XML > files for > multiple queries, but from about 2.2.14 they started (ab)using the > iteration > fields originally for PSI-BLAST output to hold multiple queries > (there was > some discussion of this on Biopython Bugs 1933 and 1970 - Biopython > *should* cope with either). > > Apparently pgpblast was left producing concatenated XML files. > The upshot of this is multi-query BLASTP etc XML files look just like > single query multi-round PSI-BLAST XML files. This means having a > single BLAST XML parser that automatically treats the two differently > is tricky. > > Does that fit with your experience? > > Peter Yes, pretty much. Ours now handles both report types w/o problems. We have a pluggable XML parser that is switched out based on whether one expects normal BLAST XML (the default) or PSI-BLAST XML (has to be indicated). With text reports we can determine this on the fly b/c the blast type should indicate whether it is PSI BLAST or not, but IIRC this wasn't the case with XML. I haven't checked to see if this has been fixed yet on NCBI's end, but I'm assuming it hasn't. 
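A rough Python sketch of the preprocessing work-around described above (splitting the concatenated output into one XML file per query before parsing); it assumes each concatenated report starts with its own "<?xml" declaration, and the function name and filename prefix are invented for the example:

def split_concatenated_blast_xml(filename, prefix="psiblast_query"):
    """Write each concatenated XML report to its own file, returning the names."""
    text = open(filename).read()
    names = []
    # Each report begins with an XML declaration, so split on it and then
    # re-attach the declaration to every chunk.
    for i, chunk in enumerate(text.split("<?xml ")[1:]):
        name = "%s_%03i.xml" % (prefix, i + 1)
        out = open(name, "w")
        out.write("<?xml " + chunk)
        out.close()
        names.append(name)
    return names

Each resulting file can then be parsed on its own as a normal single-query report.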
chris From biopython at maubp.freeserve.co.uk Tue Nov 3 08:52:20 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 3 Nov 2009 13:52:20 +0000 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <93C54341-8762-4EA6-8D0F-3388E5005A5C@illinois.edu> References: <1FA32B70-84CC-4012-97F5-9B67D56BADC6@gmail.com> <737542.47267.qm@web62401.mail.re1.yahoo.com> <320fb6e00911030452o616c0b4cycceb53f3561e6a42@mail.gmail.com> <87FF6F4A-E169-44D2-8DD1-549F6BA07574@illinois.edu> <320fb6e00911030532r38e2fb2bjf7fffe0aa3686cc8@mail.gmail.com> <93C54341-8762-4EA6-8D0F-3388E5005A5C@illinois.edu> Message-ID: <320fb6e00911030552g5913f641obe7d8075e6c15d2b@mail.gmail.com> On Tue, Nov 3, 2009 at 1:40 PM, Chris Fields wrote: > > On Nov 3, 2009, at 7:32 AM, Peter wrote: >> ... >> The upshot of this is multi-query BLASTP etc XML files look just like >> single query multi-round PSI-BLAST XML files. This means having a >> single BLAST XML parser that automatically treats the two differently >> is tricky. >> >> Does that fit with your experience? >> >> Peter > > Yes, pretty much. Ours now handles both report types w/o problems. We have > a pluggable XML parser that is switched out based on whether one expects > normal BLAST XML (the default) or PSI-BLAST XML (has to be indicated). With > text reports we can determine this on the fly b/c the blast type should > indicate whether it is PSI BLAST or not, but IIRC this wasn't the case with > XML. I haven't checked to see if this has been fixed yet on NCBI's end, but > I'm assuming it hasn't. Certainly with 2.2.18 (where I have an example handy), the XML from blastpgp is practically identical to that from blastall. You *may* be able to infer this from looking at the complete file (e.g. any iteration messages). Having the user specify if they are expecting PSI-BLAST output (as you do in BioPerl) seems like the best option. We might do this via an optional argument to the existing Bio.Blast.NCBIXML parser, or add a second PSI-BLAST specific parser. The latter might be best for dealing with multi-query PSI-BLAST XML files, and using the same PSI-BLAST specific objects as the old plain text parser. For plain text output, the Biopython user must already explicitly choose our PSI-BLAST parser over the default parser. Peter From kellrott at gmail.com Wed Nov 4 18:06:40 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Wed, 4 Nov 2009 15:06:40 -0800 Subject: [Biopython] Biopython on Jython In-Reply-To: <320fb6e00910220215u323dd237r28e2b8a15b651e43@mail.gmail.com> References: <320fb6e00910211555m4379dd34wfecdd9373cfd0498@mail.gmail.com> <320fb6e00910220215u323dd237r28e2b8a15b651e43@mail.gmail.com> Message-ID: > >> You probably noticed I merged some of your fixes to get (the non C and > >> non NumPy bits of) Biopython to work on Jython, but not all. Could you > >> update your github branch to the trunk at some point? That would help > >> in picking up more of your fixes. > > > > I've tried to keep my branch up to speed with the mainline. But I didn't > > branch my work from master, so it may be harder to extract... > > True, but I can probably manage. > I just rebased my Jython related work into a separate fork. So it should be easier to pull out patches now. I think there is still some work in Bio/Data/CodonTable.py, Bio/SubsMat/MatrixInfo.py and some of the unit tests that should make Jython work a bit better. 
Kyle From stran104 at chapman.edu Wed Nov 4 19:25:35 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Wed, 4 Nov 2009 16:25:35 -0800 Subject: [Biopython] Get Organism Given Bio.Blast.Record.Blast Object Message-ID: <2a63cc350911041625y1d7fe1eamd2dc8a51cc9d5074@mail.gmail.com> Dear users, Given a Bio.Blast.Record.Blast object is it possible to recover the organism name without using Entrez to query the NCBI servers? Often the organism is listed in Bio.Blast.Record.Alignment.title but I do not see a way to reliably extract it from this data. I have reviewed the API and the UML diagram in the cookbook: http://www.biopython.org/DIST/docs/api/Bio.Blast.Record-module.html and http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#fig:blastrecord respectively. Input is appreciated, --Matthew Strand From biopython at maubp.freeserve.co.uk Thu Nov 5 05:44:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 5 Nov 2009 10:44:54 +0000 Subject: [Biopython] Get Organism Given Bio.Blast.Record.Blast Object In-Reply-To: <2a63cc350911041625y1d7fe1eamd2dc8a51cc9d5074@mail.gmail.com> References: <2a63cc350911041625y1d7fe1eamd2dc8a51cc9d5074@mail.gmail.com> Message-ID: <320fb6e00911050244y5f1c2dcbs53b2a1e34a07c01b@mail.gmail.com> On Thu, Nov 5, 2009 at 12:25 AM, Matthew Strand wrote: > Dear users, > Given a Bio.Blast.Record.Blast object is it possible to recover the organism > name without using Entrez to query the NCBI servers? > > Often the organism is listed in Bio.Blast.Record.Alignment.title but I do > not see a way to reliably extract it from this data. The organism is not explicitly given in BLAST results. This is nothing to do with the Biopython parser. However, ... The NCBI tend to encode the organism in the match title within square brackets (and where redundant sequences have been merged, you can probably have two organisms). You might rely on this. Alternatively, most (all?) of the NCBI BLAST databases will use GI numbers, so you could use that to map to the organism. This can be done via Entrez (online), or offline by downloading the mapping. See: http://lists.open-bio.org/pipermail/biopython/2009-June/005304.html If you are using a custom BLAST database, then it all depends on how the database was created. Peter From konrad.koehler at mac.com Thu Nov 5 08:21:39 2009 From: konrad.koehler at mac.com (Konrad Koehler) Date: Thu, 05 Nov 2009 14:21:39 +0100 Subject: [Biopython] Bio.PDB: parsing PDB files for ATOM records Message-ID: <153781524001004023681590133497734615609-Webmail@me.com> Hello everyone, I wanted to use Bio.PDB to retrieve the atom element symbol from columns 77-78 of the PDB file. This is apparently not possible with the latest version of Biopython, 1.52. Some time ago, Macro Zhu posted the following fix: http://osdir.com/ml/python.bio.general/2008-04/msg00038.html which I have tried to implement in the current 1.52 version, however I cannot seem to get this to work. Is there any way to retrieve the element symbol using the current version of Biopython? If not, I would like to request that this functionality be added to Bio.PDB. 
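Until something along those lines is built into Bio.PDB, one minimal work-around is to read the element symbols directly from columns 77-78 of the ATOM/HETATM records; a sketch (the function name and the (chain, residue number, atom name) key are just for illustration):

def pdb_element_symbols(pdb_filename):
    """Map (chain id, residue number, atom name) to the element symbol."""
    elements = {}
    for line in open(pdb_filename):
        if line[0:6] in ("ATOM  ", "HETATM"):
            atom_name = line[12:16].strip()
            chain_id = line[21]
            res_seq = int(line[22:26])
            element = line[76:78].strip()  # columns 77-78, one-based numbering
            elements[(chain_id, res_seq, atom_name)] = element
    return elements

Old PDB files that pre-date the element column simply give an empty string here, in which case the element usually has to be guessed from the atom name.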
Best regards, Konrad From konrad.koehler at mac.com Thu Nov 5 10:51:08 2009 From: konrad.koehler at mac.com (Konrad Koehler) Date: Thu, 05 Nov 2009 16:51:08 +0100 Subject: [Biopython] Fwd: Bio.PDB: parsing PDB files for ATOM records Message-ID: <13397911330439330334250671630377437584-Webmail@me.com> Contray to my first post, the modifications to Bio.PDB outlined below: http://osdir.com/ml/python.bio.general/2008-04/msg00038.html do work with the lastest version of Bio.PDB. (I must have introduced a typo in my first try, on the second try it worked perfectly). I would however request that these changes be incorporated into the production version of Bio.PDB. Best regards, Konrad >From: "Konrad Koehler" >To: >Date: November 05, 2009 03:23:05 PM CET >Subject: [Biopython] Bio.PDB: parsing PDB files for ATOM records > >Hello everyone, > >I wanted to use Bio:PDB to retrieve the atom element symbol from columns 77-78 of the PDB file. This is apparently not possible with the lastest version of Biopython 1.52. > >Some time ago, Macro Zhu posted the following fix: > >http://osdir.com/ml/python.bio.general/2008-04/msg00038.html > >which I have tried to implement in the current 1.52 version, however I cannot seem to get this to work. > >Is there any way to retrieve the element symbol using the current version of Biopython? If not, I would like to request that this functionality be added to Bio.PDB. > >Best regards, > >Konrad > >_______________________________________________ >Biopython mailing list - Biopython at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython > > From jon.brate at bio.uio.no Thu Nov 5 19:30:14 2009 From: jon.brate at bio.uio.no (=?ISO-8859-1?Q?Jon_Br=E5te?=) Date: Fri, 6 Nov 2009 01:30:14 +0100 Subject: [Biopython] Parsing Blast results in XML format Message-ID: <4AAC53FE-86C2-4DAC-880C-A45D270B9C57@bio.uio.no> Dear all, I have a Blast output file in xml format generated by qBlast done through biopython. The Blast was performed with 22 query sequences and 50 hits were returned for each query. The result is in one single xml file. I want to extract all the sequence IDs for all the hits (22x50) and I have been checking out the BioPython cookbook page 53. I am using this code, but I am only getting the 50 hits for the 1st query sequence: from Bio.Blast import NCBIXM blast_records = NCBIXML.parse(result_handle) blast_record = blast_records.next() save_file = open("/Users/jonbra/Desktop/my_fasta_seq.fasta", 'w') for alignment in blast_record.alignments: for hsp in alignment.hsps: save_file.write('>%s\n' % (alignment.title,)) save_file.close() Can anyone help me to retrieve all the hits for all the query sequences? Best wishes Jon Br?te ------------------------------------------------------------- Jon Br?te PhD student Microbial Evolution Research Group (MERG) Department of Biology University of Oslo P.b. 1066 Blindern N-0316 Oslo Norway Phone: +47 22855083 From jon.brate at bio.uio.no Thu Nov 5 19:29:39 2009 From: jon.brate at bio.uio.no (=?ISO-8859-1?Q?Jon_Br=E5te?=) Date: Fri, 6 Nov 2009 01:29:39 +0100 Subject: [Biopython] Parsing Blast results in XML format Message-ID: Dear all, I have a Blast output file in xml format generated by qBlast done through biopython. The Blast was performed with 22 query sequences and 50 hits were returned for each query. The result is in one single xml file. I want to extract all the sequence IDs for all the hits (22x50) and I have been checking out the BioPython cookbook page 53. 
I am using this code, but I am only getting the 50 hits for the 1st query sequence: from Bio.Blast import NCBIXM blast_records = NCBIXML.parse(result_handle) blast_record = blast_records.next() save_file = open("/Users/jonbra/Desktop/my_fasta_seq.fasta", 'w') for alignment in blast_record.alignments: for hsp in alignment.hsps: save_file.write('>%s\n' % (alignment.title,)) save_file.close() Can anyone help me to retrieve all the hits for all the query sequences? Best wishes Jon Br?te ------------------------------------------------------------- Jon Br?te PhD student Microbial Evolution Research Group (MERG) Department of Biology University of Oslo P.b. 1066 Blindern N-0316 Oslo Norway Phone: +47 22855083 From mjldehoon at yahoo.com Thu Nov 5 23:22:17 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 5 Nov 2009 20:22:17 -0800 (PST) Subject: [Biopython] Parsing Blast results in XML format In-Reply-To: Message-ID: <820813.8529.qm@web62407.mail.re1.yahoo.com> > blast_record = blast_records.next() You're only pulling out the first Blast record. If you call blast_records.next() again, it will give you the second Blast record. And so on. Easiest solution is to have a for-loop: blast_records = NCBIXML.parse(result_handle) for blast_record in blast_records: # get the information you need from blast_record. --Michiel. --- On Thu, 11/5/09, Jon Br?te wrote: > From: Jon Br?te > Subject: [Biopython] Parsing Blast results in XML format > To: biopython at lists.open-bio.org > Date: Thursday, November 5, 2009, 7:29 PM > Dear all, > > I have a Blast output file in xml format generated by > qBlast done through biopython. The Blast was performed with > 22 query sequences and 50 hits were returned for each query. > The result is in one single xml file. I want to extract all > the sequence IDs for all the hits (22x50) and I have been > checking out the BioPython cookbook page 53. > > I am using this code, but I am only getting the 50 hits for > the 1st query sequence: > from Bio.Blast import NCBIXM > blast_records = NCBIXML.parse(result_handle) > blast_record = blast_records.next() > > save_file = > open("/Users/jonbra/Desktop/my_fasta_seq.fasta", 'w') > > for alignment in blast_record.alignments: > ? ? for hsp in alignment.hsps: > ? ? ? ? ? ? > save_file.write('>%s\n' % (alignment.title,)) > save_file.close() > Can anyone help me to retrieve all the hits for all the > query sequences? > > Best wishes > > Jon Br?te > > > ------------------------------------------------------------- > Jon Br?te > PhD student > > Microbial Evolution Research Group (MERG) > Department of Biology > University of Oslo > P.b. 1066 Blindern > N-0316 Oslo > Norway > Phone: +47 22855083 > > > _______________________________________________ > Biopython mailing list? -? Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Fri Nov 6 07:22:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 6 Nov 2009 12:22:03 +0000 Subject: [Biopython] Getting the sequence for a SeqFeature Message-ID: <320fb6e00911060422u2d2742d5r7b5b1db98c991df5@mail.gmail.com> Hi all, I am planing to add a new method to the SeqFeature object, but would like a little feedback first. This email is really just the background - I'll write up a few examples later to try and make this a bit clearer... A task that comes up every so often on the mailing lists, which I have needed to do myself in the past, is getting the nucleotide sequences for features in a GenBank file, e.g. 
http://lists.open-bio.org/pipermail/biopython/2007-September/003703.html http://lists.open-bio.org/pipermail/biopython/2009-October/005695.html http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005991.html http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005997.html http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006958.html Often, once you have the nucleotide sequence, you'll want to translate it, e.g. CDS features or mat_peptides as here: http://lists.open-bio.org/pipermail/bioperl-l/2009-October/031493.html If you parse a GenBank file (or an EMBL file etc) with SeqIO, you typically get a SeqRecord object with the the full nucleotide sequence (as record.seq, a Seq object) and a list of features (as record.features, a list of SeqFeature objects). For most prokaryotic features, things are fairly easy - you just need the (non fuzzy) start and end positions of the SeqFeature, and the strand. Then slice the parent sequence, and take the reverse complement if required. However, there are also rare cases like joins to consider (e.g. a ribosomal slippage), but joins are common if you deal with eukaryotes since intron/exon splicing is normal. Here you need to look at the subfeatures, and their locations - and indeed their strands, as there are a few mixed strand features in real GenBank files. In the above examples I have been thinking about genomes, or any nucleic sequence - but the same applies to proteins where the features might be the positions of domains. All the same issues apply except for strands. As noted in the linked threads, I have some working code currently on a github branch with unit tests which seems to handle all this. I would like to include this in Biopython, but first would like a little feedback on the proposed interface. What I am proposing is adding a method to the SeqFeature object taking the parent sequence (as a Seq like object, or even a string) as a required argument. This would return the region of the parent sequence described by the feature location and strands (including any subfeatures for joins). This could instead be done as a stand alone function, or as a method of the Seq object (as I suggested back in 2007). However, on reflection, I think the SeqFeature is most appropriate. http://lists.open-bio.org/pipermail/biopython/2007-September/003706.html With this basic functionality in place, it would then be much easier to take a parent SeqRecord and a child SeqFeature, and build a child SeqRecord taking the sequence from the parent SeqRecord (using the above new code), and annotation from the SeqFeature. This could (later) be added to Biopython as well, perhaps as a method of the SeqRecord. As this email is already very long, I'll delay giving any examples. Peter From biopython at maubp.freeserve.co.uk Fri Nov 6 07:47:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 6 Nov 2009 12:47:42 +0000 Subject: [Biopython] Getting the sequence for a SeqFeature In-Reply-To: <320fb6e00911060422u2d2742d5r7b5b1db98c991df5@mail.gmail.com> References: <320fb6e00911060422u2d2742d5r7b5b1db98c991df5@mail.gmail.com> Message-ID: <320fb6e00911060447g779f2ac2i7739a28c3f4a4077@mail.gmail.com> On Fri, Nov 6, 2009 at 12:22 PM, Peter wrote: > Hi all, > > I am planing to add a new method to the SeqFeature object, but > would like a little feedback first. This email is really just the > background - I'll write up a few examples later to try and make > this a bit clearer... 
OK, here is a non-trivial example - the first CDS feature in the GenBank file NC_000932.gb (included as a Biopython unit test), which is a three part join on the reverse strand. In this case, the GenBank file includes the protein translation for the CDS features so we can use it to check our results. We can parse this GenBank file into a SeqRecord with: from Bio import SeqIO record = SeqIO.read(open("../biopython/Tests/GenBank/NC_000932.gb"), "gb") Let's have a look at the first CDS feature (index 2): f = record.features[2] print f.type, f.location, f.strand, f.location_operator for sub_f in f.sub_features : print " - ", sub_f.location, sub_f.strand table = f.qualifiers.get("transl_table",[1])[0] # List of one int print "Table", table Giving: CDS [97998:69724] -1 join - [97998:98024] -1 - [98561:98793] -1 - [69610:69724] -1 Table 11 Looking at the raw GenBank file, this feature has location string: complement(join(97999..98024,98562..98793,69611..69724)) i.e. To get the sequence you need to do this (note zero based Python counting as in the output above): print (record.seq[97998:98024] + record.seq[98561:98793] + record.seq[69610:69724]).reverse_complement() And then translate it using NCBI genetic code table 11, print "Manual translation:" print (record.seq[97998:98024] + record.seq[98561:98793] + record.seq[69610:69724]).reverse_complement().translate(table=11, cds=True) print "Given translation:" print f.qualifiers["translation"][0] # List of one string print "Biopython translation (with proposed code):" print f.extract(record.seq).translate(table, cds=True) And the output, together with the provided translation in the feature annotation, and the shortcut with the new code I am proposing to include in Biopython: Manual translation: MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKKPNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVGVKDRQQGRSKYGVKKPK Given translation: MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKKPNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVGVKDRQQGRSKYGVKKPK Biopython translation: MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKKPNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVGVKDRQQGRSKYGVKKPK The point of all this was with the proposed new extract method, you just need: feature_seq = f.extract(record.seq) instead of: feature_seq = (record.seq[97998:98024] + record.seq[98561:98793] + record.seq[69610:69724]).reverse_complement() which is in itself a slight simplification since you'd need to get the those coordinates from the sub features, worry about strands, etc. Peter From biopython at maubp.freeserve.co.uk Mon Nov 9 06:21:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 9 Nov 2009 11:21:21 +0000 Subject: [Biopython] Biopython & p3d In-Reply-To: <7278B893-A99D-45A9-B96E-F20653EB25AD@uni-muenster.de> References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de> <320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com> <905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de> <320fb6e00910210401l737252deg78de117143395279@mail.gmail.com> <320fb6e00910211514r952732fw6c6d48f9be0e46c9@mail.gmail.com> <7278B893-A99D-45A9-B96E-F20653EB25AD@uni-muenster.de> Message-ID: <320fb6e00911090321w2af06272x18d59615942d8dae@mail.gmail.com> On Mon, Nov 9, 2009 at 10:57 AM, Christian Fufezan wrote: > > back ! 
:) > > lets get back into the discussion (or sum it up) > > The consensus was > a) both packages (biopython.pdb and p3d) have advantages > b) possibly merge both modules while keeping the best of both of them could > be an interesting step forward. Hi Christian - thanks for getting back to us. That seems like a fair summary. For those that missed it, the thread is archived here: http://lists.open-bio.org/pipermail/biopython/2009-October/005721.html > On 22 Oct 2009, at 00:14, Peter wrote: > >> On Wed, Oct 21, 2009 at 7:22 PM, Christian Fufezan wrote: >>>> >>>> Biopython might be improved by defining an atom >>>> property (list or iterator?) instead of the get_atoms() method. >>> >>> agree. ?I would argue that p3d's atom/vector class seems the way to go. >> >> We can probably have similar things for chains etc. Any other >> views on this? I never liked the get_* and set_* methods in >> Bio.PDB myself, and using Python properties seem more >> natural here (they may not have existing when Bio.PDB was >> first started - I'd have to check). >> >> [We should probably break out specific suggestions like this >> into new mailing list threads, and CC people like Thomas H.] I must do that... without looking into the details, it seems like a relatively straightforward addition which should make Bio.PDB easier to use. >> The drill down is great for selecting a particular residue or >> chain (or for NMR, a particular model). It is also good for >> looping over these structures - e.g. to process psi/phi >> angles along a protein backbone. > > cannot really see an advantage here. If one can directly access all the > atoms one's interested in with one line and then just collect phi,psi > angles, why would one need to drill down through the structures? > > Looping over structure elements is even more refined with the natural > human language interface: > imagine: residues_of_interest = protein.query('alpha and residue > 12..51 and model 2') > > if you like looking you can also do for model in models: > protein.query('alpha and residue 12..51 and model',model) > > or > > for residue in range (12,51): > ?protein.query('alpha and residue' , residue , 'and model 2') > > but looping over each residue and then do a conditional check if the residue > is in range (12-51) and if atom type is alpha carbon seems for me a bit of > an overhead. In fact that's one of the point I like about p3d most. one can > define the query in a way that nested loops are rarely need. Imagine you > want to collect chi1 angles of all His... In psuedo code, I would picture something like this: [residue.chi1 for residue in model.residues if residue.name="His"] (That almost certainly won't work as is with Bio.PDB, I'm just tying to convey how I would expect to be able to tackle the problem with a list comprehension) > from the following (I chopped some bits ... ), I can read that biopythons > pdb module (with numpy) works similar to p3d - or to be more correct > p3d works like biopython in combination with numpy, in the sense that one > can use atoms as vectors. That seems like a fair summary. In p3d, the atoms are (also) vector like objects, while in Biopython, the atoms have a numpy coord property. As long as you are happy with numpy, this allows fast and efficient vector operations. >>> so writing an structural alignment script is straight forward >>> (see e.g. http://p3d.fufezan.net/index.php?title=alignByATP). >> >> Structural alignment is not so different in Biopython - just the details. >> e.g. 
>> http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ >> > very nice - like the Bio.PDB.Superimposer(). It does all the vector > operations needed to align structures, nice. Involvement of numpy certainly > makes it powerful. Indeed - numpy is *very* powerful. > The nested loops to find all alpha carbons is a biopython.pdb classic ;) I would probably write that with a list comprehension nowadays, but they are essentially just syntactic sugar for (nested) loops. > to round thinks up: > p3ds strength comes with the natural human user interface that allows the > combination of sets and the spatial information (less nested loops). > However, I am not sure if the biopython's community wants such an extension. > Biopython.pdb has a long history, it works like it is and users are > comfortable with it, so maybe there is not much to merge after all. That seems fair, although that doesn't mean there aren't things we can improve in Bio.PDB (moving from get/set methods to properties for example). My personal view (and I did not write Bio.PDB and have only made relatively light usage of it) is that working with the nested structures (of the flattened lists) it provides is fairly natural with Python lists, or list comprehensions. The p3d "natural language" interface is an interesting abstraction, and may be easier for some, but to me is just another layer on top of the raw functionality - and another query syntax to learn. That said, it probably would be possible to layer something like this on top of the existing Bio.PDB objects (but I personally have no interest in doing this, and no need for it - keeping on top of the sequence side of things in Biopython is enough to keep me busy!). I would be delighted if other people on the people on the mailing list who *do* work with PDB files could comment. e.g. Thomas and Kristian, cc'd. Peter From fufezan at uni-muenster.de Mon Nov 9 05:57:13 2009 From: fufezan at uni-muenster.de (Christian Fufezan) Date: Mon, 9 Nov 2009 11:57:13 +0100 Subject: [Biopython] Biopython & p3d In-Reply-To: <320fb6e00910211514r952732fw6c6d48f9be0e46c9@mail.gmail.com> References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de> <320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com> <905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de> <320fb6e00910210401l737252deg78de117143395279@mail.gmail.com> <320fb6e00910211514r952732fw6c6d48f9be0e46c9@mail.gmail.com> Message-ID: <7278B893-A99D-45A9-B96E-F20653EB25AD@uni-muenster.de> back ! :) lets get back into the discussion (or sum it up) The consensus was a) both packages (biopython.pdb and p3d) have advantages b) possibly merge both modules while keeping the best of both of them could be an interesting step forward. On 22 Oct 2009, at 00:14, Peter wrote: > On Wed, Oct 21, 2009 at 7:22 PM, Christian Fufezan wrote: >>> Biopython might be improved by defining an atom >>> property (list or iterator?) instead of the get_atoms() method. >> >> agree. I would argue that p3d's atom/vector class seems the way to >> go. > > We can probably have similar things for chains etc. Any other > views on this? I never liked the get_* and set_* methods in > Bio.PDB myself, and using Python properties seem more > natural here (they may not have existing when Bio.PDB was > first started - I'd have to check). > > [We should probably break out specific suggestions like this > into new mailing list threads, and CC people like Thomas H.] 
> >>> One might also ask for x, y and z properties on the atom object >>> to provide direct access to the three coordinates as floats. Do >>> you think this sort of little thing would help improve Bio.PDB? >>> >> yes indeed, that is _the_ information a pdb module should offer >> without any addition. Better would be even if the atoms are >> treatable as vectors (see below). p3d has a series of atom >> object attributes that are convenient. > > I would argue that the x-y-z triple (which Biopython has) is > more important that separate x, y, and z floats. We seem > to agree here. > What I meant is that I think the most important thing a pdb module should offer is the possibility to do vector operations directly with atom objects, i.e. before translating them. Whether the values are stored in three attributes (.x,.y,.z, p3d) or as a tuple (biopython), seems not really important as long simple vector operations are possible. > The Biopython atom's coord property is an x-y-z triple (as a > one dimensional numpy array). The Bio.PDB code also > defines its own vector objects on top of this, but my memory > of the details is hazy here. As I recall, I personally stuck > with the numpy objects in my scripts using Bio.PDB. > The version I used, one had to convert the entity into a vector. But that's already some time ago, I guess. >>> Yes, it should be possible to offer nice nested access and nice flat >>> access from the same objects. Internally the current Biopython PDB >>> structure could perhaps be handled as filtered views of a complete >>> list of all the atoms (using sets and trees or a database or >>> whatever). >>> That might make some things faster too. >> >> I agree to some extent. As above, I can only say that I >> cannot see the advantage of a nested data structure. >> Maybe you can explain with an example where drilling >> through the nested structure could come in handy. > > The drill down is great for selecting a particular residue or > chain (or for NMR, a particular model). It is also good for > looping over these structures - e.g. to process psi/phi > angles along a protein backbone. cannot really see an advantage here. If one can directly access all the atoms one's interested in with one line and then just collect phi,psi angles, why would one need to drill down through the structures? Looping over structure elements is even more refined with the natural human language interface: imagine: residues_of_interest = protein.query('alpha and residue 12..51 and model 2') if you like looking you can also do for model in models: protein.query('alpha and residue 12..51 and model',model) or for residue in range (12,51): protein.query('alpha and residue' , residue , 'and model 2') but looping over each residue and then do a conditional check if the residue is in range (12-51) and if atom type is alpha carbon seems for me a bit of an overhead. In fact that's one of the point I like about p3d most. one can define the query in a way that nested loops are rarely need. Imagine you want to collect chi1 angles of all His... > >>>> Yes that was one thing that we were really missing. Also the fact >>>> that >>>> biopython requires the unfolded entity to be converted to vectors >>>> and so >>>> forth was a bit complex and we needed fast and direct access to the >>>> coordinates, the very essence of pdb files. >>> >>> I'm not quite sure what you mean here by "vectors". Could you >>> be a little more specific? Do you want NumPy style objects or >>> something else? 
>> >> In p3d the atom objects are vectors, > > I don't immediately see what the intention is here. What does > "adding" or "subtracting" two atom/vector objects give you? A > new non-atom vector would be my guess? What about > multiplying by a scaler? Again, getting a non-atom vector > object back makes most sense. > Yes, right one gets a vector back. This vector can then be used in the query function. Imagine you want to survey residues that span a membrane along a given path. With p3d you can easily generate a series of vectors and more importantly, one can use these vectors in the query function. for c in [k/10.0 * (startVector-endVector) for k in range(1,10)]: pdb.query('protein and within 3 of ' c) to visualize the path in e.g. VMD one can also print those vectors in a pdb format. from the following (I chopped some bits ... ), I can read that biopythons pdb module (with numpy) works similar to p3d - or to be more correct p3d works like biopython in combination with numpy, in the sense that one can use atoms as vectors. >> so writing an structural alignment script is straight forward >> (see e.g. http://p3d.fufezan.net/index.php?title=alignByATP). > > Structural alignment is not so different in Biopython - just the > details. e.g. > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > very nice - like the Bio.PDB.Superimposer(). It does all the vector operations needed to align structures, nice. Involvement of numpy certainly makes it powerful. The nested loops to find all alpha carbons is a biopython.pdb classic ;) to round thinks up: p3ds strength comes with the natural human user interface that allows the combination of sets and the spatial information (less nested loops). However, I am not sure if the biopython's community wants such an extension. Biopython.pdb has a long history, it works like it is and users are comfortable with it, so maybe there is not much to merge after all. From ap12 at sanger.ac.uk Mon Nov 9 10:29:20 2009 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Mon, 9 Nov 2009 15:29:20 +0000 Subject: [Biopython] fasta-m10 al_start and al_end? In-Reply-To: <320fb6e00910260717r50974e1epb7ee94d41ff5aa6c@mail.gmail.com> References: <82A8DBBC-6964-4433-AF9C-EF049160A2DF@sanger.ac.uk> <320fb6e00910231140w46243f9bp9751b6476e3e55b0@mail.gmail.com> <1E613DD7-3CF3-4F12-8C61-D12440F5AE1D@sanger.ac.uk> <320fb6e00910260304j417fa20ep2b353b6e74ca055d@mail.gmail.com> <320fb6e00910260717r50974e1epb7ee94d41ff5aa6c@mail.gmail.com> Message-ID: Hi Peter, Thanks for adding these private variables. They are called _al_start and _al_stop. While testing the code today, I found a little bug. For the match record: alignment.add_sequence(match_descr, match_align_seq) record = alignment.get_all_seqs()[-1] assert record.id == match_descr or record.description == match_descr #assert record.seq.tostring() == match_align_seq record.id = match_descr.split(None,1)[0].strip(",") record.name = "match" record.annotations["original_length"] = int(match_annotation["sq_len"]) #TODO - handle start/end coordinates properly. Short term hack for now: record._al_start = int(query_annotation["al_start"]) record._al_stop = int(query_annotation["al_stop"]) the al_start and al_stop should be taken from match_annotation instead of query_annotation, I think. Kind regards, Anne. On 26 Oct 2009, at 14:17, Peter wrote: > On Mon, Oct 26, 2009 at 10:04 AM, Peter > wrote: >> On Fri, Oct 23, 2009 at 11:00 PM, Anne Pajon >> wrote: >>> >>> Hi Peter, >>> >>> Thanks for your fast answer. 
>>> >>> I've already discovered the _annotations and I am prepared to >>> update my >>> code as soon as a better solution is provided. >> >> Good. >> >>> Concerning the al_start and al_end, I am looking for a solution >>> very soon, >>> as I am working on an annotation pipeline prototype in python. >>> What would be >>> your recommendation? Writing a parser myself, using another tool >>> (but which >>> one?), or helping storing this information in SeqRecord in >>> biopython as it >>> is almost there. Thanks to let me know. >> >> I would rather not add them directly to the SeqRecord annotations >> dictionary because that will make doing something meaningful with >> slicing (the SeqRecord, or in future the Alignment) much harder. I >> think the best way to handle these is in the Alignment object, but >> this isn't really supported at the moment. >> >> Are you happy to run a development version of Biopython, or at least >> to update the file Bio/AlignIO/FastaIO.py? I'm thinking in the short >> term we can record these bits of information as private properties of >> the SeqRecord, i.e. _al_start and _al_end > > Make that _al_start and _al_end (to match the field names used in > the FASTA output). This change is in the repository now, which you > can grab via github. See http://www.biopython.org/wiki/SourceCode > > As with any "private" variables (leading underscore), they are not > really intended for public use, but should at least solve your > immediate requirement for now. > > Peter -- Dr Anne Pajon - Pathogen Genomics Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From biopython at maubp.freeserve.co.uk Mon Nov 9 10:46:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 9 Nov 2009 15:46:05 +0000 Subject: [Biopython] fasta-m10 al_start and al_end? In-Reply-To: References: <82A8DBBC-6964-4433-AF9C-EF049160A2DF@sanger.ac.uk> <320fb6e00910231140w46243f9bp9751b6476e3e55b0@mail.gmail.com> <1E613DD7-3CF3-4F12-8C61-D12440F5AE1D@sanger.ac.uk> <320fb6e00910260304j417fa20ep2b353b6e74ca055d@mail.gmail.com> <320fb6e00910260717r50974e1epb7ee94d41ff5aa6c@mail.gmail.com> Message-ID: <320fb6e00911090746udc6cfb3l5cfbc72a4cf190c8@mail.gmail.com> On Mon, Nov 9, 2009 at 3:29 PM, Anne Pajon wrote: > > Hi Peter, > > Thanks for adding these private variables. They are called _al_start and > _al_stop. > > While testing the code today, I found a little bug. For the match record: > .. > the al_start and al_stop should be taken from match_annotation instead of > query_annotation, I think. > > Kind regards, > Anne. Yes, you are absolutely right. Sorry about that - fixed now. Peter From biopython at maubp.freeserve.co.uk Thu Nov 12 07:04:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 12 Nov 2009 12:04:32 +0000 Subject: [Biopython] Additions to the SeqRecord Message-ID: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> Hello all, Something we added in Biopython 1.50 was the ability to slice a SeqRecord, which tries to do something sensible with all the annotation - in particular per-letter-annotation (like quality scores) and features (which have locations) are handled as you would naturally expect. 
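As a small made-up illustration of that slicing behaviour (the record, quality values and feature below are invented purely for the example):

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

record = SeqRecord(Seq("ACGTACGTAC"), id="demo")
record.letter_annotations["phred_quality"] = [40, 40, 38, 35, 30, 30, 28, 25, 20, 20]
record.features = [SeqFeature(FeatureLocation(2, 6), type="misc_feature")]

sub = record[2:8]
print sub.seq                                  # GTACGT
print sub.letter_annotations["phred_quality"]  # [38, 35, 30, 30, 28, 25]
print len(sub.features)                        # 1, with its location shifted to [0:4]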
Something you can look forward to in our next release (assuming no major issues crop up in testing) is adding SeqRecord objects together. Again, this will try and do something unambiguous with the annotation. I have two motivational examples in mind which combine slicing and addition of SeqRecord objects to edit a record while preserving as much annotation as possible. For example, removing a section of sequence, say letters from 100 to 200: from Bio import SeqIO record = SeqIO.read(...) deletion_mutant = record[:100] + record[200:] (The above would make sense for both protein and nucleotide records). Or, for a circular nucleotide sequence (like a plasmid or many small genomes), you might want to shift the origin, e.g. by 150 bases: shifted = record[150:] + record[:150] You can already do both these examples with the latest (unreleased) code. However, the situation with the annotation isn't ideal. When slicing a record, for non-location based annotation there is no way to know for sure if the annotation still applies to the daughter sequence. Therefore in the face of this ambiguity, when we added SeqRecord slicing in Biopython 1.50, we did not copy the dbxrefs and annotations dictionary to the daughter record. i.e. You currently have to do this manually (if required), for example: deletion_mutant = record[:100] + record[200:] deletion_mutant.dbxrefs = record.dbxrefs[:] deletion_mutant.annotations = record.annotations.copy() I would like to propose changing the SeqRecord slice behaviour to blindly copy the dbxrefs list and annotations dict to the daughter record (just like the id, name and description are already blindly copied even though they may not make sense for the daughter record). Then these slicing+addition examples will "just work" without the user having to explicitly copy the dbxrefs and annotations dict. This is a non-backwards compatible change, but with hindsight is perhaps a more natural behaviour. We would of course highlight this in the release notes (maybe with some worked examples on the blog). Does changing SeqRecord slicing like this seem like a good idea? Peter P.S. The code changes required are very small (two extra lines), see this commit on my experimental branch on github for details - most of the changes are documentation and unit tests for this work: http://github.com/peterjc/biopython/commit/41e944f338476a79bd7f8998196df21a1c06d4f7 From lpritc at scri.ac.uk Thu Nov 12 08:47:20 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 12 Nov 2009 13:47:20 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> Message-ID: Hi, To avoid issues with the inadvertent propagation of inappropriate annotation, I'd be more comfortable with it being an optional feature of the slice - to be used when appropriate and with caution - than the default behaviour. One counterexample I can think of is the slicing of a sequence for which a feature or annotation applies only to a subregion of the SeqRecord. This is not an uncommon property of modular proteins. If I were to slice the N-terminal domains of a set of sequences with distinct N- and C-terminal domains, I would not want to carry through annotation for the C-terminal domains. If I did this without noticing, there may be a danger of, say, downstream use inferring inappropriate class membership if I wanted to generate a set of sequences containing that C-terminal domain, and I did this automatically based on the annotation of a SeqRecord. 
Another counterexample would be propagated inappropriate class membership for annotations that require a complete sequence for context. For example, many bacterial CDS annotations feature reports of BLAST matches to other databases. These are results derived from the full length feature, and the BLAST match obtained from the slice result is likely to differ. Having seen first-hand the propagation of faulty annotations (e.g. presence of a signal peptide and other functionally-related motifs) through to cloning - and the resultant waste of time, money and other resources - I would seek to avoid this kind of behaviour. As it is, the propagation of sequence ID and description without modification to indicate that a copy and potential change has been done is potentially dangerous, and needs to be done with some care to avoid 'poisoning the well'. The behaviour you describe makes most sense in the context of per-letter-annotation (as this is the natural granularity of the changes), and for relatively small changes to a large sequence containing multiple features whose annotations are reasonably self-contained. I too would like to be able to treat these specially on occasion, conserving much of the annotation. However, I think the potential pitfalls are pretty significant and would not want this to be default behaviour. A third way might be only to include those annotations with location data where the region covered by the annotation is not disrupted by the slicing. For example, a slice/addition that removed sites 200-300 would retain features/annotations that ran from 120-199 and 301-350, but not carry forward features that ran from 120-201, or from 250-301. Features and annotations that span the full record length would not be carried forward under this proposal. Best, L. On 12/11/2009 12:04, "Peter" wrote: > Hello all, > > Something we added in Biopython 1.50 was the ability to slice a SeqRecord, > which tries to do something sensible with all the annotation - in particular > per-letter-annotation (like quality scores) and features (which have > locations) > are handled as you would naturally expect. > > Something you can look forward to in our next release (assuming no > major issues crop up in testing) is adding SeqRecord objects together. > Again, this will try and do something unambiguous with the annotation. > > I have two motivational examples in mind which combine slicing and > addition of SeqRecord objects to edit a record while preserving as much > annotation as possible. For example, removing a section of sequence, > say letters from 100 to 200: > > from Bio import SeqIO > record = SeqIO.read(...) > deletion_mutant = record[:100] + record[200:] > > (The above would make sense for both protein and nucleotide records). > Or, for a circular nucleotide sequence (like a plasmid or many small > genomes), you might want to shift the origin, e.g. by 150 bases: > > shifted = record[150:] + record[:150] > > You can already do both these examples with the latest (unreleased) code. > However, the situation with the annotation isn't ideal. When slicing a record, > for non-location based annotation there is no way to know for sure if the > annotation still applies to the daughter sequence. Therefore in the face of > this ambiguity, when we added SeqRecord slicing in Biopython 1.50, we > did not copy the dbxrefs and annotations dictionary to the daughter record. > i.e. 
You currently have to do this manually (if required), for example: > > deletion_mutant = record[:100] + record[200:] > deletion_mutant.dbxrefs = record.dbxrefs[:] > deletion_mutant.annotations = record.annotations.copy() > > I would like to propose changing the SeqRecord slice behaviour to > blindly copy the dbxrefs list and annotations dict to the daughter record > (just like the id, name and description are already blindly copied even > though they may not make sense for the daughter record). Then these > slicing+addition examples will "just work" without the user having to > explicitly copy the dbxrefs and annotations dict. > > This is a non-backwards compatible change, but with hindsight is > perhaps a more natural behaviour. We would of course highlight this > in the release notes (maybe with some worked examples on the blog). > > Does changing SeqRecord slicing like this seem like a good idea? > > Peter > > P.S. The code changes required are very small (two extra lines), see > this commit on my experimental branch on github for details - most > of the changes are documentation and unit tests for this work: > http://github.com/peterjc/biopython/commit/41e944f338476a79bd7f8998196df21a1c0 > 6d4f7 > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > ______________________________________________________________________ > This email has been scanned by the MessageLabs Email Security System. > For more information please visit http://www.messagelabs.com/email > ______________________________________________________________________ -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
______________________________________________________ From biopython at maubp.freeserve.co.uk Thu Nov 12 09:08:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 12 Nov 2009 14:08:46 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> Message-ID: <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> On Thu, Nov 12, 2009 at 1:47 PM, Leighton Pritchard wrote: > > Hi, > > To avoid issues with the inadvertent propagation of inappropriate > annotation, I'd be more comfortable with it being an optional feature of the > slice - to be used when appropriate and with caution - than the default > behaviour. Better safe than sorry? > One counterexample I can think of is the slicing of a sequence for which a > feature or annotation applies only to a subregion of the SeqRecord. ?This is > not an uncommon property of modular proteins. ?If I were to slice the > N-terminal domains of a set of sequences with distinct N- and C-terminal > domains, I would not want to carry through annotation for the C-terminal > domains. If I did this without noticing, there may be a danger of, say, > downstream use inferring inappropriate class membership if I wanted to > generate a set of sequences containing that C-terminal domain, and I did > this automatically based on the annotation of a SeqRecord. > > Another counterexample would be propagated inappropriate class membership > for annotations that require a complete sequence for context. ?For example, > many bacterial CDS annotations feature reports of BLAST matches to other > databases. ?These are results derived from the full length feature, and the > BLAST match obtained from the slice result is likely to differ. Both good examples. > Having seen first-hand the propagation of faulty annotations (e.g. presence > of a signal peptide and other functionally-related motifs) through to > cloning - and the resultant waste of time, money and other resources - I > would seek to avoid this kind of behaviour. ?As it is, the propagation of > sequence ID and description without modification to indicate that a copy and > potential change has been done is potentially dangerous, and needs to be > done with some care to avoid 'poisoning the well'. Yes - as already noted in the documentation, the id/name/description may not apply to the sliced record, and some caution is advisable. > The behaviour you describe makes most sense in the context of > per-letter-annotation (as this is the natural granularity of the changes), > and for relatively small changes to a large sequence containing multiple > features whose annotations are reasonably self-contained. I too would like > to be able to treat these specially on occasion, conserving much of the > annotation. ?However, I think the potential pitfalls are pretty significant > and would not want this to be default behaviour. OK. So the current behaviour on the trunk is acceptable (for annotation where we know the location), but the proposed change for location-less annotation is too risky. > A third way might be only to include those annotations with location data > where the region covered by the annotation is not disrupted by the slicing. > For example, a slice/addition that removed sites 200-300 would retain > features/annotations that ran from 120-199 and 301-350, but not carry > forward features that ran from 120-201, or from 250-301. 
?Features and > annotations that span the full record length would not be carried forward > under this proposal. Exactly - SeqFeatures entirely within the sliced region are kept. Those outside the sliced region (or crossing the boundary) are lost. As a result, because GenBank-style source feature span the whole sequence, they are lost on slicing to a sub-sequence. This is the current behaviour and I wasn't suggesting any changes. General annotation in the SeqRecord's annotation dictionary has no location information - it may apply to the whole sequence (e.g from organism X) or just part (e.g. a text note it contains XXX domain). Likewise the database cross reference list. The dbxref list and annotations dict are thus the hardest to handle - the only practical automatic actions on slicing are to discard them (the current behaviour on Biopython 1.50 to date), or keep them all as per my suggestion (which as you stress, is risky). In light of Leighton's valid concerns, and weighing this against the limited benefits which only apply in special cases like the examples I gave, let's leave things as they are. i.e. Explicit is better than implicit (Zen of Python), if you want to propagate the annotations dict and dbxrefs to a sliced record, you must continue do it explicity. Thanks for the feedback! Peter From biopython at maubp.freeserve.co.uk Thu Nov 12 11:53:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 12 Nov 2009 16:53:31 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <65d4b7fc0911120837v1a3f2a41scd128adbd2be615e@mail.gmail.com> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <65d4b7fc0911120837v1a3f2a41scd128adbd2be615e@mail.gmail.com> Message-ID: <320fb6e00911120853r32612646s7b80e0e3d320097c@mail.gmail.com> On Thu, Nov 12, 2009 at 4:37 PM, Carlos Javier Borroto wrote: > > On Thu, Nov 12, 2009 at 7:04 AM, Peter wrote: >> You can already do both these examples with the latest (unreleased) code. > > I'll love to test this unreleased code, is there any documentation on > how to install from git? Yes, first grab the source code from git, or via the github download link: http://biopython.org/wiki/SourceCode Then install from source - just like you would from a zip or tarball. There are instructions for this on the download page: http://biopython.org/wiki/Download#Installation_Instructions Peter From villahozbale at wisc.edu Thu Nov 12 14:51:11 2009 From: villahozbale at wisc.edu (ANGEL VILLAHOZ-BALETA) Date: Thu, 12 Nov 2009 13:51:11 -0600 Subject: [Biopython] ExPASy.get_sprot_raw giving a FASTA record instead of a Swiss-Prot/UniprotKB record? Message-ID: <6fc0f5c82c743.4afc12cf@wiscmail.wisc.edu> Hi to all, I am using Biopython 1.5.1 and it seems that I have met with a strange situation... When using ExPASy.get_sprot_raw, it gives me a FASTA record instead of a Swiss-Prot/UniProtKB record... Anyone has met the same situation? You can test the following example: from Bio import ExPASy from Bio import SeqIO handle = ExPASy.get_sprot_raw("O23729") seq_record = SeqIO.read(handle, "swiss") handle.close() print seq_record.id print seq_record.name print seq_record.description print repr(seq_record.seq) print "Length %i" % len(seq_record) print seq_record.annotations["keywords"] and write me its result... Thanks very much, Angel Villahoz-Baleta. 
From biopython at maubp.freeserve.co.uk Thu Nov 12 18:43:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 12 Nov 2009 23:43:35 +0000 Subject: [Biopython] ExPASy.get_sprot_raw giving a FASTA record instead of a Swiss-Prot/UniprotKB record? In-Reply-To: <6fc0f5c82c743.4afc12cf@wiscmail.wisc.edu> References: <6fc0f5c82c743.4afc12cf@wiscmail.wisc.edu> Message-ID: <320fb6e00911121543s65167605kb5541ac6f8fdd449@mail.gmail.com> On Thu, Nov 12, 2009 at 7:51 PM, ANGEL VILLAHOZ-BALETA wrote: > Hi to all, > > I am using Biopython 1.5.1 and it seems that I have met with > a strange situation... When using ExPASy.get_sprot_raw, it > gives me a FASTA record instead of a Swiss-Prot/UniProtKB > record... > > Anyone has met the same situation? I hadn't tried this recently, but you are right. It looks like ExPASy/UniProt have broken this :( The URL which Biopython requests is: http://www.expasy.ch/cgi-bin/get-sprot-raw.pl?O23729 You can check via a tool like wget that this is now being redirected to a URL giving the FASTA file: http://www.uniprot.org/uniprot/O23729.fasta Please contact ExPASy/uniprot to alert them that they have broken this old URL redirection, and ask them nicely to fix it to point here in order to get the swiss format: http://www.uniprot.org/uniprot/O23729.txt Thanks! Peter P.S. Perhaps we should also update our URLs, but that won't help people using the current version of Biopython. From chapmanb at 50mail.com Fri Nov 13 08:23:46 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 13 Nov 2009 08:23:46 -0500 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> Message-ID: <20091113132346.GB48178@sobchak.mgh.harvard.edu> Hi Peter; [...Discussion on what to do with full length features and annotations when slicing SeqRecords...] > Exactly - SeqFeatures entirely within the sliced region are kept. Those > outside the sliced region (or crossing the boundary) are lost. As a result, > because GenBank-style source feature span the whole sequence, they > are lost on slicing to a sub-sequence. This is the current behaviour and > I wasn't suggesting any changes. > > General annotation in the SeqRecord's annotation dictionary has no > location information - it may apply to the whole sequence (e.g from > organism X) or just part (e.g. a text note it contains XXX domain). > Likewise the database cross reference list. > > The dbxref list and annotations dict are thus the hardest to handle - > the only practical automatic actions on slicing are to discard them > (the current behaviour on Biopython 1.50 to date), or keep them all > as per my suggestion (which as you stress, is risky). Good discussion. Agreed that copying may be confusing. One hybrid approach is to provide a function make makes copying them easy if someone does want to save the annotations, dbxrefs and full length feature sources: sliced = rec[:100] sliced.set_full_length_features(rec) where set_full_length_features copied over the annotations and dbxrefs, ala your code example: deletion_mutant.dbxrefs = record.dbxrefs[:] deletion_mutant.annotations = record.annotations.copy() and perhaps also added any whole sequence sequence features from the original SeqRecord. This would help with discoverability for people who do want to retain all of the source and other high level information when they slice. 
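As a rough illustration of that idea (no such method exists in Biopython - the name and behaviour below are hypothetical, sketched as a standalone function rather than a SeqRecord method):

def set_full_length_features(sliced, original):
    """Hypothetical helper copying record-level annotation onto a slice.

    Copies the annotations dict and dbxrefs list, plus any feature that
    spans the whole parent sequence (e.g. a GenBank style source feature).
    """
    sliced.annotations = original.annotations.copy()
    sliced.dbxrefs = original.dbxrefs[:]
    for feature in original.features:
        if (feature.location.nofuzzy_start == 0
            and feature.location.nofuzzy_end == len(original)):
            # Note the copied feature still carries the parent's
            # coordinates, one of the open questions in this thread.
            sliced.features.append(feature)
    return sliced

# Usage, with record being a SeqRecord parsed from e.g. a GenBank file:
# deletion_mutant = set_full_length_features(record[:100] + record[200:], record)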
Brad From biopython at maubp.freeserve.co.uk Fri Nov 13 08:51:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 13 Nov 2009 13:51:48 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <20091113132346.GB48178@sobchak.mgh.harvard.edu> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> <20091113132346.GB48178@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00911130551h21f5c438gfbb68cc25bd9ea7b@mail.gmail.com> On Fri, Nov 13, 2009 at 1:23 PM, Brad Chapman wrote: > Hi Peter; > > [...Discussion on what to do with full length features and annotations > ?when slicing SeqRecords...] > > Good discussion. Agreed that copying may be confusing. One hybrid > approach is to provide a function make makes copying them easy if > someone does want to save the annotations, dbxrefs and full length > feature sources: > > sliced = rec[:100] > sliced.set_full_length_features(rec) > > where set_full_length_features copied over the annotations and > dbxrefs, ala your code example: > > deletion_mutant.dbxrefs = record.dbxrefs[:] > deletion_mutant.annotations = record.annotations.copy() > > and perhaps also added any whole sequence sequence features from the > original SeqRecord. This would help with discoverability for people > who do want to retain all of the source and other high level information > when they slice. > > Brad Hi Brad. Interesting idea - but I'm not sure about that name (maybe something like copy_annotation would be better?) and personally don't think it is actually any clearer than the two lines: deletion_mutant.dbxrefs = record.dbxrefs[:] deletion_mutant.annotations = record.annotations.copy() [We should in the meantime add those line to the relevant examples in the docstring and Tutorial in the repository.] Regarding the special case of the source feature in GenBank files, for tasks like removing part of the record, or doing an origin shift, you may want to recreate a new source feature reusing the old source feature annotation (e.g. NCBI taxon ID). However, the location would have to reflect the new modified sequence length. I have another idea to "solve" this problem: I am actually be tempted to remove the source SeqFeature, and instead handle it via the annotations dict. To me this seems more natural than having it as an entry in the feature table - a GenBank file format choice I never really understood. My guess is they didn't want to introduce a record level extensible annotation header block, which is what the source feature could be regarded as handling. i.e. When parsing a GenBank (or EMBL) file, the source feature information could get stored in the SeqRecord annotations dictionary. When writing to GenBank (or in future EMBL) format, if the annotations dictionary contained relevant fields, we would generate a source feature for the full sequence. Does that make sense? It requires looking at the source feature not as a feature which happens to span the whole sequence, but as annotation for the whole sequence (which happens to be in the GenBank features table due to a historical choice or accident). Peter From yvan.strahm at bccs.uib.no Fri Nov 13 10:00:42 2009 From: yvan.strahm at bccs.uib.no (Yvan Strahm) Date: Fri, 13 Nov 2009 16:00:42 +0100 Subject: [Biopython] fetch random id Message-ID: <4AFD749A.7050504@bccs.uib.no> Hello List, I have to crash test a webservice so was wondering if any one knows a way to get random sequence id from swissprot or genbank? 
Thank for your help. yvan From biopython at maubp.freeserve.co.uk Fri Nov 13 11:10:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 13 Nov 2009 16:10:41 +0000 Subject: [Biopython] fetch random id In-Reply-To: <4AFD749A.7050504@bccs.uib.no> References: <4AFD749A.7050504@bccs.uib.no> Message-ID: <320fb6e00911130810x417aca71id03845b0af985968@mail.gmail.com> On Fri, Nov 13, 2009 at 3:00 PM, Yvan Strahm wrote: > Hello List, > > I have to crash test a webservice so was wondering if any one knows a way to > get random sequence id from swissprot or genbank? > Thank for your help. > > yvan GI identifiers are numbers, any I would expect most 8 digit GI numbers to be valid IDs. So you could try just using random integers. Of course, some will have been deprecated etc so they may trigger real failures. Alternatively, download a list of valid IDs from the FTP site (or compile a list via an Entrez search and save this to disk), and pick a random entry to use each time in the test. Peter From chapmanb at 50mail.com Fri Nov 13 12:20:33 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 13 Nov 2009 12:20:33 -0500 Subject: [Biopython] fetch random id In-Reply-To: <320fb6e00911130810x417aca71id03845b0af985968@mail.gmail.com> References: <4AFD749A.7050504@bccs.uib.no> <320fb6e00911130810x417aca71id03845b0af985968@mail.gmail.com> Message-ID: <20091113172033.GG48178@sobchak.mgh.harvard.edu> Hi Yvan; > I have to crash test a webservice so was wondering if any one knows a way to > get random sequence id from swissprot or genbank? ExPASy can give you a random SwissProt entry: http://www.expasy.org/cgi-bin/get-random-entry.pl?S See the ExPASy documentation for all of their URLs: http://ca.expasy.org/expasy_urls.html You can use this to get the UniProt ID from the redirect: >>> import urllib2 >>> u = urllib2.urlopen("http://www.expasy.org/cgi-bin/get-random-entry.pl?S") >>> u.geturl() 'http://www.uniprot.org/uniprot/Q824C8' Hope this helps, Brad From yvan.strahm at bccs.uib.no Fri Nov 13 14:20:26 2009 From: yvan.strahm at bccs.uib.no (Yvan Strahm) Date: Fri, 13 Nov 2009 20:20:26 +0100 Subject: [Biopython] fetch random id In-Reply-To: <20091113172033.GG48178@sobchak.mgh.harvard.edu> References: <4AFD749A.7050504@bccs.uib.no> <320fb6e00911130810x417aca71id03845b0af985968@mail.gmail.com> <20091113172033.GG48178@sobchak.mgh.harvard.edu> Message-ID: <4AFDB17A.3020504@bccs.uib.no> Hello Brad and Peter, Thanks a lot for the pointers especially the expasy links Really great Cheers, yvan Brad Chapman wrote: > Hi Yvan; > >> I have to crash test a webservice so was wondering if any one knows a way to >> get random sequence id from swissprot or genbank? 
> > ExPASy can give you a random SwissProt entry: > > http://www.expasy.org/cgi-bin/get-random-entry.pl?S > > See the ExPASy documentation for all of their URLs: > > http://ca.expasy.org/expasy_urls.html > > You can use this to get the UniProt ID from the redirect: > >>>> import urllib2 >>>> u = urllib2.urlopen("http://www.expasy.org/cgi-bin/get-random-entry.pl?S") >>>> u.geturl() > 'http://www.uniprot.org/uniprot/Q824C8' > > Hope this helps, > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From han.chen1986 at gmail.com Sat Nov 14 08:25:21 2009 From: han.chen1986 at gmail.com (Han Chen) Date: Sat, 14 Nov 2009 21:25:21 +0800 Subject: [Biopython] About bioseqI0 Message-ID: <73a1f0de0911140525x4add0b61g7e5786d3b25c8703@mail.gmail.com> Hi, List of helpful people, could your please offer me some help about bioseqIo? here is the error message when run "python setup.py test": ====================================================================== ERROR: test_SeqIO_online ---------------------------------------------------------------------- Traceback (most recent call last): File "run_tests.py", line 248, in runTest suite = unittest.TestLoader().loadTestsFromName(name) File "/usr/local/lib/python2.6/unittest.py", line 576, in loadTestsFromName module = __import__('.'.join(parts_copy)) File "test_SeqIO_online.py", line 42, in records = list(SeqIO.parse(handle, "swiss")) File "/home/ch/biopython-1.51/build/lib.linux-x86_64-2.6/Bio/SeqIO/SwissIO.py", line 39, in SwissIterator for swiss_record in swiss_records: File "/home/ch/biopython-1.51/build/lib.linux-x86_64-2.6/Bio/SwissProt/__init__.py", line 113, in parse record = _read(handle) File "/home/ch/biopython-1.51/build/lib.linux-x86_64-2.6/Bio/SwissProt/__init__.py", line 240, in _read raise ValueError("Unknown keyword '%s' found" % key) ValueError: Unknown keyword '>s' found ---------------------------------------------------------------------- Ran 124 tests in 60.180 seconds FAILED (failures = 1) Is there anything wrong with SeqIO? I meet the following error when using other package: DeprecationWarning: Bio.Fasta is deprecated. Please use the "fasta" support in Bio.SeqIO (or Bio.AlignIO) instead could you please help me about this?? thank you very much! sincerely yours, Han From biopython at maubp.freeserve.co.uk Sat Nov 14 08:38:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 14 Nov 2009 13:38:45 +0000 Subject: [Biopython] About bioseqI0 In-Reply-To: <73a1f0de0911140525x4add0b61g7e5786d3b25c8703@mail.gmail.com> References: <73a1f0de0911140525x4add0b61g7e5786d3b25c8703@mail.gmail.com> Message-ID: <320fb6e00911140538w3edbc7efr491c5aa9420c0ac4@mail.gmail.com> 2009/11/14 Han Chen : > Hi, List of helpful people, > > could your please offer me some help about bioseqIo? > > here is the error message when run "python setup.py test": > > ====================================================================== > ERROR: test_SeqIO_online > ---------------------------------------------------------------------- > Traceback (most recent call last): > ... > ValueError: Unknown keyword '>s' found > > Is there anything wrong with SeqIO? That is due to the ExPAYs website problem just recently reported: http://lists.open-bio.org/pipermail/biopython/2009-November/005823.html We can update Biopython to use the new URL, but it would be nice if ExPASy can fix their redirection as well. 
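In the meantime, one possible workaround (a sketch only - it bypasses Bio.ExPASy completely and fetches the plain text record straight from UniProt, using the URL mentioned above) would be:

import urllib
from Bio import SeqIO

def fetch_sprot_record(accession):
    # Hypothetical helper: go to UniProt directly, avoiding the broken
    # ExPASy redirect, then parse the plain text ("swiss") format record.
    handle = urllib.urlopen("http://www.uniprot.org/uniprot/%s.txt" % accession)
    return SeqIO.read(handle, "swiss")

record = fetch_sprot_record("O23729")
print record.id, record.description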
> I meet the following error when using other package: > > DeprecationWarning: Bio.Fasta is deprecated. Please use the "fasta" support > in Bio.SeqIO (or Bio.AlignIO) instead > > could you please help me about this?? thank you very much! Which bit of Biopython are you trying to use? As the message says, Bio.Fasta is deprecated. This is just a warning message for now, but Bio.Fasta will one day be removed. Peter From mitlox at op.pl Sat Nov 14 23:26:38 2009 From: mitlox at op.pl (xyz) Date: Sun, 15 Nov 2009 14:26:38 +1000 Subject: [Biopython] SeqIO.convert Message-ID: <4AFF82FE.404@op.pl> Hello, I have to convert fastq to fasta and to trim the sequence. I have found SeqIO.convert: from Bio import SeqIO count = SeqIO.convert("a.fastq", "fastq", "a.fasta", "fasta") But I do not know how can I trim the sequence. Thank you in advance. Best regards, From chapmanb at 50mail.com Sun Nov 15 09:38:55 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 15 Nov 2009 09:38:55 -0500 Subject: [Biopython] SeqIO.convert In-Reply-To: <4AFF82FE.404@op.pl> References: <4AFF82FE.404@op.pl> Message-ID: <20091115143826.GA2712@kunkel> Hello; > I have to convert fastq to fasta and to trim the sequence. I have found > SeqIO.convert: > > from Bio import SeqIO > count = SeqIO.convert("a.fastq", "fastq", "a.fasta", "fasta") > > But I do not know how can I trim the sequence. SeqIO.convert is a format converter only, but you can use it along with other Biopython modules to trim adaptors. Here's a description: http://bcbio.wordpress.com/2009/08/09/trimming-adaptors-from-short-read-sequences/ along with code: http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py Hope this helps, Brad From biopython at maubp.freeserve.co.uk Sun Nov 15 09:55:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 15 Nov 2009 14:55:48 +0000 Subject: [Biopython] SeqIO.convert In-Reply-To: <20091115143826.GA2712@kunkel> References: <4AFF82FE.404@op.pl> <20091115143826.GA2712@kunkel> Message-ID: <320fb6e00911150655p53afb9b2y35086efbb2f355a5@mail.gmail.com> On Sun, Nov 15, 2009 at 2:38 PM, Brad Chapman wrote: > Hello; > >> I have to convert fastq to fasta and to trim the sequence. I have found >> SeqIO.convert: >> >> from Bio import SeqIO >> count = SeqIO.convert("a.fastq", "fastq", "a.fasta", "fasta") >> >> But I do not know how can I trim the sequence. > > SeqIO.convert is a format converter only, but you can use it along > with other Biopython modules to trim adaptors. ... It all depends on what you mean by "trim" the sequence. In addition to Brad's examples, there are some simpler ones in the Tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-primer http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-adaptor Peter From biopython at maubp.freeserve.co.uk Mon Nov 16 05:23:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 16 Nov 2009 10:23:16 +0000 Subject: [Biopython] ExPASy.get_sprot_raw giving a FASTA record instead of a Swiss-Prot/UniprotKB record? In-Reply-To: <320fb6e00911121543s65167605kb5541ac6f8fdd449@mail.gmail.com> References: <6fc0f5c82c743.4afc12cf@wiscmail.wisc.edu> <320fb6e00911121543s65167605kb5541ac6f8fdd449@mail.gmail.com> Message-ID: <320fb6e00911160223q6a3eb5a3l49229903a9de482@mail.gmail.com> On Thu, Nov 12, 2009 at 11:43 PM, Peter wrote: > > P.S. Perhaps we should also update our URLs, but that > won't help people using the current version of Biopython. 
> I checked the ExPASy page http://www.expasy.ch/expasy_urls.html then updated our code to use the currently recommended URL, http://www.uniprot.org/uniprot/XXX.txt instead of the old URL, http://www.expasy.ch/cgi-bin/get-sprot-raw.pl?XXX If anyone is curious about the details, see: http://github.com/biopython/biopython/commit/6689bf8657d9515965d63f9c77e6348233472046 This means the next release of Biopython will not depend on the old ExPASy URL, but it would still be ideal if ExPASy/Uniprot could fix that for the benefit of users of older versions of Biopython and other scripts. Did you try and contact them about this yet? Thanks, Peter From chapmanb at 50mail.com Tue Nov 17 08:24:17 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 17 Nov 2009 08:24:17 -0500 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <320fb6e00911130551h21f5c438gfbb68cc25bd9ea7b@mail.gmail.com> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> <20091113132346.GB48178@sobchak.mgh.harvard.edu> <320fb6e00911130551h21f5c438gfbb68cc25bd9ea7b@mail.gmail.com> Message-ID: <20091117132417.GE68691@sobchak.mgh.harvard.edu> Hi Peter; > > [...Discussion on what to do with full length features and annotations > > ?when slicing SeqRecords...] > > [...Proposal to have a function that does the copying...] > > Interesting idea - but I'm not sure about that name (maybe something like > copy_annotation would be better?) and personally don't think it is actually > any clearer than the two lines: > > deletion_mutant.dbxrefs = record.dbxrefs[:] > deletion_mutant.annotations = record.annotations.copy() Yes, I am terrible at thinking up function names -- copy_annotation is great. Here I'm not as worried about clarity as I am about discoverability. It's another way for people to realize that the annotations were not copied. > Regarding the special case of the source feature in GenBank files, for > tasks like removing part of the record, or doing an origin shift, you may > want to recreate a new source feature reusing the old source feature > annotation (e.g. NCBI taxon ID). However, the location would have to > reflect the new modified sequence length. > > I have another idea to "solve" this problem: > > I am actually be tempted to remove the source SeqFeature, and instead > handle it via the annotations dict. To me this seems more natural than > having it as an entry in the feature table - a GenBank file format choice I > never really understood. My guess is they didn't want to introduce a record > level extensible annotation header block, which is what the source feature > could be regarded as handling. > > i.e. When parsing a GenBank (or EMBL) file, the source feature information > could get stored in the SeqRecord annotations dictionary. When writing to > GenBank (or in future EMBL) format, if the annotations dictionary contained > relevant fields, we would generate a source feature for the full sequence. > > Does that make sense? It requires looking at the source feature not as > a feature which happens to span the whole sequence, but as annotation > for the whole sequence (which happens to be in the GenBank features > table due to a historical choice or accident). I like that. You're right that those full length features are really annotations in disguise. Instead of removing the source SeqFeature, I would suggest making it available in both places. 
This way you mimic what GenBank is doing, but also make it available in a more accessible and natural place. So for something like: source 1..4411532 /organism="Mycobacterium tuberculosis H37Rv" /mol_type="genomic DNA" /strain="H37Rv" /db_xref="taxon:83332" you would have the source SeqFeature, but also the organism, mol_type and strain in the annotations dictionary, and the cross reference in dbxrefs. Nice idea. Brad From biopython at maubp.freeserve.co.uk Tue Nov 17 09:53:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Nov 2009 14:53:44 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <20091117132417.GE68691@sobchak.mgh.harvard.edu> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> <20091113132346.GB48178@sobchak.mgh.harvard.edu> <320fb6e00911130551h21f5c438gfbb68cc25bd9ea7b@mail.gmail.com> <20091117132417.GE68691@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00911170653p223b671aj68459868348fe524@mail.gmail.com> Peter wrote: >> >> Regarding the special case of the source feature in GenBank files, for >> tasks like removing part of the record, or doing an origin shift, you may >> want to recreate a new source feature reusing the old source feature >> annotation (e.g. NCBI taxon ID). However, the location would have to >> reflect the new modified sequence length. >> >> I have another idea to "solve" this problem: >> >> I am actually be tempted to remove the source SeqFeature, and instead >> handle it via the annotations dict. To me this seems more natural than >> having it as an entry in the feature table - a GenBank file format choice I >> never really understood. My guess is they didn't want to introduce a record >> level extensible annotation header block, which is what the source feature >> could be regarded as handling. >> >> i.e. When parsing a GenBank (or EMBL) file, the source feature information >> could get stored in the SeqRecord annotations dictionary. When writing to >> GenBank (or in future EMBL) format, if the annotations dictionary contained >> relevant fields, we would generate a source feature for the full sequence. >> >> Does that make sense? It requires looking at the source feature not as >> a feature which happens to span the whole sequence, but as annotation >> for the whole sequence (which happens to be in the GenBank features >> table due to a historical choice or accident). Brad Chapman wrote: > > I like that. You're right that those full length features are really > annotations in disguise. Good :) > Instead of removing the source SeqFeature, > I would suggest making it available in both places. This way you > mimic what GenBank is doing, but also make it available in a more > accessible and natural place. So for something like: > > ? ? source ? ? ? ? ?1..4411532 > ? ? ? ? ? ? ? ? ? ? /organism="Mycobacterium tuberculosis H37Rv" > ? ? ? ? ? ? ? ? ? ? /mol_type="genomic DNA" > ? ? ? ? ? ? ? ? ? ? /strain="H37Rv" > ? ? ? ? ? ? ? ? ? ? /db_xref="taxon:83332" > > you would have the source SeqFeature, but also the organism, > mol_type and strain in the annotations dictionary, and the cross > reference in dbxrefs. Nice idea. Good point about the dbxrefs - that makes sense :) Interesting idea about having the parser record the source feature in both the SeqFeature (as it does now) and the SeqRecord annotations dict (as I suggested). 
That would certainly make sense in the short term for a transition period, but in the long term we should deprecate using a source SeqFeature. After all, for accessing this information "There should be one-- and preferably only one -- obvious way to do it" (Zen of Python). This also applies to the code for writing out GenBank files - if the information is in two places, which takes priority? Peter From biopython at maubp.freeserve.co.uk Tue Nov 17 11:55:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Nov 2009 16:55:15 +0000 Subject: [Biopython] ExPASy.get_sprot_raw giving a FASTA record instead of a Swiss-Prot/UniprotKB record? In-Reply-To: <6fa094f832a30.4b027fef@wiscmail.wisc.edu> References: <6fc0f5c82c743.4afc12cf@wiscmail.wisc.edu> <320fb6e00911121543s65167605kb5541ac6f8fdd449@mail.gmail.com> <320fb6e00911160223q6a3eb5a3l49229903a9de482@mail.gmail.com> <6fa094f832a30.4b027fef@wiscmail.wisc.edu> Message-ID: <320fb6e00911170855y614a6cd3oad794d6314bc1512@mail.gmail.com> On Tue, Nov 17, 2009 at 4:50 PM, ANGEL VILLAHOZ-BALETA wrote: > > Yes, Peter, I isolated it from my source code and I chose another programming way since I preferred to be a bit less dependent from the ExPASy server. > > Anyway, I have just emailed all this information to the help desk of ExPASy to get a potential benefit for our Biopython community. > > Thanks, > > Angel Villahoz-Baleta > Bioinformatics Programmer > University of Wisconsin-Madison Thanks, Peter From animesh.agrawal at anu.edu.au Wed Nov 18 03:19:20 2009 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Wed, 18 Nov 2009 19:19:20 +1100 Subject: [Biopython] Divergent sequence data set Message-ID: <000001ca6827$d4174af0$7c45e0d0$@agrawal@anu.edu.au> Hi, I have been trying to develop a divergent sequence data set for a phylogenetic analysis. Do we have something in Biopython, where for a given set of sequences we can choose identity threshold to reduce redundancy in the dataset. Cheers, Animesh From biopython at maubp.freeserve.co.uk Wed Nov 18 05:24:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Nov 2009 10:24:48 +0000 Subject: [Biopython] Divergent sequence data set In-Reply-To: <2207210305477723158@unknownmsgid> References: <2207210305477723158@unknownmsgid> Message-ID: <320fb6e00911180224u4de6e30bsa121b11ac60c0ce3@mail.gmail.com> On Wed, Nov 18, 2009 at 8:19 AM, Animesh Agrawal wrote: > > Hi, > > I have been trying to develop a divergent sequence data set for a > phylogenetic analysis. Do we have something in Biopython, where for a given > set of ?sequences we can choose identity threshold to reduce redundancy in > the dataset. > > Cheers, > > Animesh Hi Animesh, There are probably 100s of ways to do this. I think you should consult the literature as the the best approach (in terms of the algorithm), or talk to a phylogeneticist. Once you have an algorithm in mind, it can probably be done with python. For example, you could do pairwise BLAST alignments (e.g. using the NCBI standalone tools) or maybe pairwise Needleman-Wunsch global alignment (e.g. using the EMBOSS needle tool) and construct a distance matrix in terms of percentage identity. You could build a rough phylogenetic tree (perhaps using NJ if your starting dataset is very large), and use that to sample the nodes to get a fairly uniform distribution w.r.t. the phylogenetic space. These are just rough ideas - I am not a phylogenetics specialist. 
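For a small data set, a crude greedy filter along those lines can be sketched with Bio.pairwise2 (the file name and the 70% identity threshold below are arbitrary, the identity measure is very rough, and for anything large a dedicated clustering tool would be far faster):

from Bio import SeqIO, pairwise2

def percent_identity(a, b):
    # Simple global alignment with no gap penalties; for globalxx the
    # score is just the number of identical aligned positions.
    aln = pairwise2.align.globalxx(str(a), str(b), one_alignment_only=True)[0]
    return 100.0 * aln[2] / len(aln[0])

def reduce_redundancy(records, threshold=70.0):
    # Greedy filter: keep a record only if it stays below the identity
    # threshold against every record already kept.
    kept = []
    for rec in records:
        if all(percent_identity(rec.seq, k.seq) < threshold for k in kept):
            kept.append(rec)
    return kept

unique = reduce_redundancy(list(SeqIO.parse(open("example.fasta"), "fasta")))
print "Kept %i sequences" % len(unique)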
I have a vague recollection that one of the sequence alignment tools includes an option to do something like this for you... but I can't remember the details. Peter From biopython at maubp.freeserve.co.uk Wed Nov 18 06:31:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Nov 2009 11:31:40 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <320fb6e00911170653p223b671aj68459868348fe524@mail.gmail.com> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> <20091113132346.GB48178@sobchak.mgh.harvard.edu> <320fb6e00911130551h21f5c438gfbb68cc25bd9ea7b@mail.gmail.com> <20091117132417.GE68691@sobchak.mgh.harvard.edu> <320fb6e00911170653p223b671aj68459868348fe524@mail.gmail.com> Message-ID: <320fb6e00911180331o53ddfc4bi4ccdb0f1972221f5@mail.gmail.com> Peter wrote: >>> Regarding the special case of the source feature in GenBank files, for >>> tasks like removing part of the record, or doing an origin shift, you may >>> want to recreate a new source feature reusing the old source feature >>> annotation (e.g. NCBI taxon ID). However, the location would have to >>> reflect the new modified sequence length. >>> >>> I have another idea to "solve" this problem: >>> >>> I am actually be tempted to remove the source SeqFeature, and instead >>> handle it via the annotations dict. To me this seems more natural than >>> having it as an entry in the feature table - a GenBank file format choice I >>> never really understood. My guess is they didn't want to introduce a record >>> level extensible annotation header block, which is what the source feature >>> could be regarded as handling. >>> >>> i.e. When parsing a GenBank (or EMBL) file, the source feature information >>> could get stored in the SeqRecord annotations dictionary. When writing to >>> GenBank (or in future EMBL) format, if the annotations dictionary contained >>> relevant fields, we would generate a source feature for the full sequence. >>> >>> Does that make sense? It requires looking at the source feature not as >>> a feature which happens to span the whole sequence, but as annotation >>> for the whole sequence (which happens to be in the GenBank features >>> table due to a historical choice or accident). Let's call that idea Plan(B). I've started a thread on the BioSQL mailing list, as this possible change would have implications for Biopython's use of BioSQL for storing this information. Unless we put some special case handling code in our BioSQL wrapper, it would mean Biopython would treat the "source" features differently to all the other Bio* interfaces for BioSQL. That would be bad. http://lists.open-bio.org/pipermail/biosql-l/2009-November/001642.html In thinking about this, perhaps there is another less invasive change, which I'm going to call Plan(C): We expect (and could even enforce this assumption) there to be at most one "source" feature in a GenBank/EMBL file, and that it should span the full length of the sequence. Taking this a special case, when slicing a SeqRecord, we could also slice the "source" SeqFeature to match the new reduced sequence. Furthermore, when adding two SeqRecord objects, we would try to combine the two "source" SeqFeatures - taking only common annotation information. 
And I'll use Plan(A) for leaving things as they stand, pros and cons: * pro - no code changes at all * con - "source" annotation remains a bit hidden * con - still lose "source" features on slicing Plan(B) pros and cons ("source" as top level annotation): * pro - elegant handling of "source" annotation * pro - no changes in SeqRecord * con - special case code in GenBank/EMBL input/output * con - may need special case code in BioSQL wrapper * con - fairly big break to backwards compatibility (affecting any scripts accessing or creating "source" features), depending on how such a transition was made. Place(C) pros and cons (special "source" slicing/adding): * con - "source" annotation remains a bit hidden * con - special case code in SeqRecord * pro - no changes in GenBank/EMBL input/output * pro - no changes in BioSQL wrapper * pro - minor break to backwards compatibility (affecting slicing of "source" features only - remember SeqRecord addition hasn't been released yet). Any thoughts? I've probably missed some advantages and disadvantages, and alternative ideas are welcome. This new idea to just special case slicing/adding of the "source" feature (Plan C) lacks the elegance of moving the "source" annotation to the top level (Plan B). However, it is much less invasive and looks quite practical and intuitive. Peter From schafer at rostlab.org Wed Nov 18 08:14:06 2009 From: schafer at rostlab.org (=?ISO-8859-1?Q?Christian_Sch=E4fer?=) Date: Wed, 18 Nov 2009 08:14:06 -0500 Subject: [Biopython] Divergent sequence data set In-Reply-To: <000001ca6827$d4174af0$7c45e0d0$@agrawal@anu.edu.au> References: <000001ca6827$d4174af0$7c45e0d0$@agrawal@anu.edu.au> Message-ID: <4B03F31E.9040901@rostlab.org> There are stand-alone tools out there like cd-hit or uniqueProt for the purpose of creating sequence-unique subsets on particular thresholds. If you want to access them from within your python code, it's easy to do so via commands.getoutput() or similar means and then parsing the result. Chris Animesh Agrawal wrote: > Hi, > > I have been trying to develop a divergent sequence data set for a > phylogenetic analysis. Do we have something in Biopython, where for a given > set of sequences we can choose identity threshold to reduce redundancy in > the dataset. > > > > Cheers, > > Animesh > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Wed Nov 18 08:30:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Nov 2009 13:30:35 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <320fb6e00911180331o53ddfc4bi4ccdb0f1972221f5@mail.gmail.com> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> <20091113132346.GB48178@sobchak.mgh.harvard.edu> <320fb6e00911130551h21f5c438gfbb68cc25bd9ea7b@mail.gmail.com> <20091117132417.GE68691@sobchak.mgh.harvard.edu> <320fb6e00911170653p223b671aj68459868348fe524@mail.gmail.com> <320fb6e00911180331o53ddfc4bi4ccdb0f1972221f5@mail.gmail.com> Message-ID: <320fb6e00911180530h54544fbby1b7b284cabc874da@mail.gmail.com> Peter wrote: > In thinking about this, perhaps there is another less invasive change, > which I'm going to call Plan(C): > > We expect (and could even enforce this assumption) there to be at > most one "source" feature in a GenBank/EMBL file, and that it should > span the full length of the sequence. 
Taking this a special case, when > slicing a SeqRecord, we could also slice the "source" SeqFeature to > match the new reduced sequence. Furthermore, when adding two > SeqRecord objects, we would try to combine the two "source" > SeqFeatures - taking only common annotation information. Here is an outline of what I have in mind here (incomplete, but does the basics). If we want to talk about the implementation, perhaps we should move this to the dev list... http://github.com/peterjc/biopython/commit/a074919b9925cb908935abf3161a50758f21f607 However, the point is that "Plan C" looks possible, and seems to have potential for dealing with SeqRecord slicing and addition where there is a "source" SeqFeature fairly nicely (i.e. preserving it for things like removing part of a sequence, or doing an origin shift). Peter From biopython at maubp.freeserve.co.uk Wed Nov 18 08:40:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Nov 2009 13:40:18 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <320fb6e00911180530h54544fbby1b7b284cabc874da@mail.gmail.com> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> <20091113132346.GB48178@sobchak.mgh.harvard.edu> <320fb6e00911130551h21f5c438gfbb68cc25bd9ea7b@mail.gmail.com> <20091117132417.GE68691@sobchak.mgh.harvard.edu> <320fb6e00911170653p223b671aj68459868348fe524@mail.gmail.com> <320fb6e00911180331o53ddfc4bi4ccdb0f1972221f5@mail.gmail.com> <320fb6e00911180530h54544fbby1b7b284cabc874da@mail.gmail.com> Message-ID: <320fb6e00911180540y4bd82f09l5f6fbf5eed9e8ce1@mail.gmail.com> Hi all, Over on the BioSQL mailing list, Chris Fields just made an interesting point - there are real GenBank files with multiple source features: Chris Fields wrote: > > Just to note, there are a few cases where there are two or more > source features. This pops up mainly with chimeric sequences, > for example: > > http://www.ncbi.nlm.nih.gov/nuccore/21727885 > > We have run into this a couple of times on the bioperl list. In this > case, each feature is limited to specific locations on the sequence > and doesn't pertain to the entire sequence. NCBI only notes the > first source on the ORGANISM line; last time I checked, EMBL > used both. > > chris At very least, this will make an excellent example of the unit tests! Peter From biopython at maubp.freeserve.co.uk Wed Nov 18 10:47:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Nov 2009 15:47:43 +0000 Subject: [Biopython] Fwd: Bio.PDB: parsing PDB files for ATOM records In-Reply-To: <13397911330439330334250671630377437584-Webmail@me.com> References: <13397911330439330334250671630377437584-Webmail@me.com> Message-ID: <320fb6e00911180747s5ab221c7tef1a6c83a749ab75@mail.gmail.com> On Thu, Nov 5, 2009 at 3:51 PM, Konrad Koehler wrote: > Contray to my first post, the modifications to Bio.PDB outlined below: > > http://osdir.com/ml/python.bio.general/2008-04/msg00038.html > > do work with the lastest version of Bio.PDB. ?(I must have introduced a > typo in my first try, on the second try it worked perfectly). > > I would however request that these changes be incorporated into the > production version of Bio.PDB. > > Best regards, > > Konrad I just found your email in my spam folder :( This was filed as Bug 2495, http://bugzilla.open-bio.org/show_bug.cgi?id=2495 Peter From Jose.Lacal at OpenPHI.com Wed Nov 18 18:19:09 2009 From: Jose.Lacal at OpenPHI.com (Jose C. 
Lacal) Date: Wed, 18 Nov 2009 18:19:09 -0500 Subject: [Biopython] Parsing records off PubMed vs. PubMedCentral. Message-ID: <1258586349.25095.52.camel@DESK01> Greetings: I'm just starting to use BioPython and this may be a dumb question. I've been following the excellent tutorial at http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc88 My question refers to section 8.11.1 a.) I am able to query, retrieve and parse files from db="pubmed" as per the code below. This works. from Bio import Entrez, Medline Entrez.email = "Jose.Lacal at OpenPHI.com" handle = handle = Entrez.esearch(db="pubmed", term="hypertension[all]&George+Mason+University[affl]", rettype="medline", retmode="text") record = Entrez.read(handle) print record["IdList"] idlist = record["IdList"] handle = Entrez.efetch(db="pubmed",id=idlist,rettype="medline",retmode="text") records = Medline.parse(handle) for record in records: print record["AU"] b.) But when I change db="pubmed" to db="pmc" I get an error message: KeyError: 'AU' It looks like "pmc" does not have the same keys as "pubmed" And I've been unable to find the equivalent format to parse files downloaded from "pmc" Pointers and suggestions most appreciated. regards. -- ----- ----- ----- Jose C. Lacal, Founder & Chief Vision Officer Open Personalized Health Informatics "OpenPHI" 15625 NW 15th Avenue; Suite 15 Miami, FL 33169-5601 USA www.OpenPHI.com [M] +1 (954) 553-1984 Jose.Lacal at OpenPHI.com OpenPHI is an information management company. We acquire, compile, and manage mailing lists in the global academic & bio-medical spaces. See: http://www.openphi.com/healthmining.html From biopython at maubp.freeserve.co.uk Thu Nov 19 06:03:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Nov 2009 11:03:04 +0000 Subject: [Biopython] Parsing records off PubMed vs. PubMedCentral. In-Reply-To: <1258586349.25095.52.camel@DESK01> References: <1258586349.25095.52.camel@DESK01> Message-ID: <320fb6e00911190303y54ee4957q7a59217deea47c1f@mail.gmail.com> On Wed, Nov 18, 2009 at 11:19 PM, Jose C. Lacal wrote: > Greetings: > > I'm just starting to use BioPython and this may be a dumb question. > > I've been following the excellent tutorial at > http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc88 > > My question refers to section 8.11.1 > > > a.) I am able to query, retrieve and parse files from db="pubmed" as per > the code below. This works. > > > from Bio import Entrez, Medline > Entrez.email = "Jose.Lacal at OpenPHI.com" > > handle = handle = Entrez.esearch(db="pubmed", > term="hypertension[all]&George+Mason+University[affl]", > rettype="medline", retmode="text") > > record = Entrez.read(handle) > print record["IdList"] > > idlist = record["IdList"] > handle = > Entrez.efetch(db="pubmed",id=idlist,rettype="medline",retmode="text") > > records = Medline.parse(handle) > for record in records: > ? ? ? ?print record["AU"] > OK, good :) > b.) But when I change db="pubmed" to db="pmc" I get an error message: > KeyError: 'AU' > > It looks like "pmc" does not have the same keys as "pubmed" And I've > been unable to find the equivalent format to parse files downloaded from > "pmc" > > Pointers and suggestions most appreciated. regards. Correct - PubMed and PubMedCentral are different databases and use different identifiers. You can use Entrez ELink to map between them. e.g. The Biopython application note has PMID 19304878, but its PMCID is 2682512. 
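That ELink lookup can be scripted - a minimal sketch (the e-mail address is a placeholder, and the nested dictionary keys are those produced by Entrez.read for ELink output):

from Bio import Entrez
Entrez.email = "your.name@example.com"  # placeholder - use your own address

handle = Entrez.elink(dbfrom="pubmed", db="pmc", id="19304878")
result = Entrez.read(handle)
# Each LinkSetDb entry is one link type; each Link dict holds an "Id"
for linkset in result[0]["LinkSetDb"]:
    print linkset["LinkName"], [link["Id"] for link in linkset["Link"]]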
>>> from Bio import Entrez >>> print Entrez.efetch(db="pubmed",id="19304878",rettype="medline",retmode="text").read() PMID- 19304878 OWN - NLM STAT- MEDLINE DA - 20090515 DCOM- 20090709 LR - 20091104 IS - 1367-4811 (Electronic) VI - 25 IP - 11 DP - 2009 Jun 1 TI - Biopython: freely available Python tools for computational molecular biology and bioinformatics. PG - 1422-3 ... Now, according to the documentation for EFetch, PMC should support rettype="medline" (just like PubMed): http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html >>> print Entrez.efetch(db="pmc",id="2682512", retmode="medline", rettype="text").read()

Error occurred: Report 'text' not found in 'pmc' presentation
    ... Odd. I also tried the XML from EFetch for PMC, but it fails to validate. I wonder if this in an NCBI glitch? I have emailed them about this. In the meantime, I would suggest you just use PubMed not PMC - it covers more journals but in less depth. Peter From cmckay at u.washington.edu Thu Nov 19 18:42:12 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Thu, 19 Nov 2009 15:42:12 -0800 Subject: [Biopython] allow ambiguities is sequence matching? Message-ID: Hello all, Apologies if this is covered in the tutorial anywhere, if so I didn't see it. I am trying to test whether sequence A appears anywhere in sequence B. The catch is I want to allow n mismatches. Right now my code looks like: #record is a SeqRecord #query_seq is a string if query_seq in record.seq: do something If I want query_seq to match despite n nucleotide mismatches, how should I do that? It seems like something that would be pretty common for people working with DNA probes. Is this even a biopython problem? Or is it just a general python problem? thanks, Cedar From biopython at maubp.freeserve.co.uk Fri Nov 20 05:03:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Nov 2009 10:03:15 +0000 Subject: [Biopython] allow ambiguities is sequence matching? In-Reply-To: References: Message-ID: <320fb6e00911200203t1bc09df6jc0dac7d619a644f0@mail.gmail.com> On Thu, Nov 19, 2009 at 11:42 PM, Cedar McKay wrote: > Hello all, > Apologies if this is covered in the tutorial anywhere, if so I didn't see > it. > > I am trying to test whether sequence A appears anywhere in sequence B. The > catch is I want to allow n mismatches. Right now my code looks like: > > #record is a SeqRecord > #query_seq is a string > if query_seq in record.seq: > ? ? ? ?do something > > > If I want query_seq to match despite n nucleotide mismatches, how should I > do that? It seems like something that would be pretty common for people > working with DNA probes. Is this even a biopython problem? Or is it just a > general python problem? We have in general tried to keep the Seq object API as much like that of the Python string as is reasonable, for example the find, startswith and endswith methos. Likewise, the "in" operator on the Seq object also works like a python string, it uses plain string matching (see Bug 2853, this was added in Biopython 1.51). It sounds like you want some kind of fuzzy find... one solution would be regular expressions, another might be to use the Bio.Motif module. There have been similar discussions on the mailing list before, but no clear consensus - see for example Bug 2601. Peter From biopython at maubp.freeserve.co.uk Fri Nov 20 05:49:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Nov 2009 10:49:09 +0000 Subject: [Biopython] Parsing records off PubMed vs. PubMedCentral. In-Reply-To: <320fb6e00911190303y54ee4957q7a59217deea47c1f@mail.gmail.com> References: <1258586349.25095.52.camel@DESK01> <320fb6e00911190303y54ee4957q7a59217deea47c1f@mail.gmail.com> Message-ID: <320fb6e00911200249o4c5c736bia43dd0b586c32ccd@mail.gmail.com> On Thu, Nov 19, 2009 at 11:03 AM, Peter wrote: > > Now, according to the documentation for EFetch, PMC should support > rettype="medline" (just like PubMed): > http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html > >>>> print Entrez.efetch(db="pmc",id="2682512", retmode="medline", rettype="text").read() > > >

> Error occurred: Report 'text' not found in 'pmc' presentation
      > ... > > > Odd. I also tried the XML from EFetch for PMC, but it fails to > validate. I wonder if this in an NCBI glitch? I have emailed them > about this. > I had a reply from someone at the NCBI, who had also noticed a problem, and has reported this to the EFetch developers. > In the meantime, I would suggest you just use PubMed not PMC - it > covers more journals but in less depth. Peter From biopython at maubp.freeserve.co.uk Fri Nov 20 09:29:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Nov 2009 14:29:25 +0000 Subject: [Biopython] Seq object ungap method In-Reply-To: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com> References: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com> Message-ID: <320fb6e00911200629s3fe49d0di256d5a0dbc24ffc0@mail.gmail.com> Hi all, Something we discussed last year was adding an ungap method to the Seq object. e.g. http://lists.open-bio.org/pipermail/biopython/2008-September/004523.html http://lists.open-bio.org/pipermail/biopython/2008-September/004527.html As mentioned earlier this month on the dev mailing list, http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006983.html I actually made the time to implement this, and posted it on a github branch - you can see the updated Bio/Seq.py file here: http://github.com/peterjc/biopython/blob/ungap/Bio/Seq.py I've included a copy of the proposed docstring for the new Seq object ungap method at the end of this email, which tries to illustrate how this would be used. I'd like some comments - is this worth including in Biopython? Thanks, Peter -- This is the proposed docstring for the new Seq object ungap method, the examples double as doctest unit tests: Return a copy of the sequence without the gap character(s). The gap character can be specified in two ways - as an explicit argument, or via the sequence's alphabet. For example: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import generic_dna >>> my_dna = Seq("ATA--TGAAAT-TTGAAAA", generic_dna) >>> my_dna Seq('ATA--TGAAAT-TTGAAAA', DNAAlphabet()) >>> my_dna.ungap("-") Seq('ATATGAAATTTGAAAA', DNAAlphabet()) If the gap character is not given as an argument, it will be taken from the sequence's alphabet (if defined). Notice that the returned sequence's alphabet is adjusted since it no longer requires a gapped alphabet: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC, Gapped, HasStopCodon >>> my_pro = Seq("MVVLE=AD*", HasStopCodon(Gapped(IUPAC.protein, "="))) >>> my_pro Seq('MVVLE=AD*', HasStopCodon(Gapped(IUPACProtein(), '='), '*')) >>> my_pro.ungap() Seq('MVVLEAD*', HasStopCodon(IUPACProtein(), '*')) Or, with a simpler gapped DNA example: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC, Gapped >>> my_seq = Seq("CGGGTAG=AAAAAA", Gapped(IUPAC.unambiguous_dna, "=")) >>> my_seq Seq('CGGGTAG=AAAAAA', Gapped(IUPACUnambiguousDNA(), '=')) >>> my_seq.ungap() Seq('CGGGTAGAAAAAA', IUPACUnambiguousDNA()) As long as it is consistent with the alphabet, although it is redundant, you can stil supply the gap character as an argument to this method: >>> my_seq Seq('CGGGTAG=AAAAAA', Gapped(IUPACUnambiguousDNA(), '=')) >>> my_seq.ungap("=") Seq('CGGGTAGAAAAAA', IUPACUnambiguousDNA()) However, if the gap character given as the argument disagrees with that declared in the alphabet, an exception is raised: >>> my_seq Seq('CGGGTAG=AAAAAA', Gapped(IUPACUnambiguousDNA(), '=')) >>> my_seq.ungap("-") Traceback (most recent call last): ... 
ValueError: Gap '-' does not match '=' from alphabet Finally, if a gap character is not supplied, and the alphabet does not define one, an exception is raised: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import generic_dna >>> my_dna = Seq("ATA--TGAAAT-TTGAAAA", generic_dna) >>> my_dna Seq('ATA--TGAAAT-TTGAAAA', DNAAlphabet()) >>> my_dna.ungap() Traceback (most recent call last): ... ValueError: Gap character not given and not defined in alphabet From schafer at rostlab.org Fri Nov 20 11:55:58 2009 From: schafer at rostlab.org (Christian Schaefer) Date: Fri, 20 Nov 2009 11:55:58 -0500 Subject: [Biopython] allow ambiguities is sequence matching? In-Reply-To: References: Message-ID: <4B06CA1E.1010000@rostlab.org> Hey Cedar, I'm currently doing something similar on protein sequences. A simple brute force method could work like this: Slide the short sequence 'underneath' the long sequence. After each step translate the current overlap into a bit-string where 1 indicates a match and 0 a mismatch. Now you can easily apply a regex on this bit-string to look for particular patterns like 'n mismatches allowed'. Hope that helps. Chris Cedar McKay wrote: > Hello all, > Apologies if this is covered in the tutorial anywhere, if so I didn't > see it. > > I am trying to test whether sequence A appears anywhere in sequence B. > The catch is I want to allow n mismatches. Right now my code looks like: > > #record is a SeqRecord > #query_seq is a string > if query_seq in record.seq: > do something > > > If I want query_seq to match despite n nucleotide mismatches, how should > I do that? It seems like something that would be pretty common for > people working with DNA probes. Is this even a biopython problem? Or is > it just a general python problem? > > thanks, > Cedar > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From anaryin at gmail.com Mon Nov 23 04:02:06 2009 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 23 Nov 2009 01:02:06 -0800 Subject: [Biopython] SeqIO.parse Question Message-ID: Dear all, This is merely a suggestion. I've been using SeqIO.parse on some user input I receive from a server. I'm using the following code: for num, record in enumerate(SeqIO.parse(StringIO(FASTA_sequence), 'fasta')): req_seq = record.seq.tostring() req_name = record.id Since I have no clue what the user might introduce, regarding the number of sequences, I have to user parse, instead of read. If I introduce only one sequence and it is a valid FASTA sequence, it does its work flawlessly. If I insert several FASTA sequences and one of them is wrongly formatted, it won't complain at all. If I insert a single wrong sequence, it doesn't complain either. Is there a convenient way for me to check FASTA formats? The usual startswith('>') doesn't work for multiple sequences. And the user might have spaces in the sequence so a split('\n') is also ruled out to split the sequences. At the moment, I'm checking if the first sequence of the input starts with '>', and if it does, the parser kicks in and for every req_seq object I check if there is any character that is not valid (a number or an otherwise weird character). If I get a mis-formatted sequence in there it will complain because spaces, newlines, and numbers ( often found in sequence names ) are not in my allowed list. However, if there's an easier way, it would save me some if checks and for loops :) Suggestions? 
Best regards to all, Jo?o [...] Rodrigues @ http://stanford.edu/~joaor/ From biopython at maubp.freeserve.co.uk Mon Nov 23 05:18:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 10:18:24 +0000 Subject: [Biopython] SeqIO.parse Question In-Reply-To: References: Message-ID: <320fb6e00911230218i346d104cr81fb46710d4fb8a4@mail.gmail.com> On Mon, Nov 23, 2009 at 9:02 AM, Jo?o Rodrigues wrote: > Dear all, > > This is merely a suggestion. I've been using SeqIO.parse on some user input > I receive from a server. > > I'm using the following code: > > for num, record in enumerate(SeqIO.parse(StringIO(FASTA_sequence), > 'fasta')): > > ? ?req_seq = record.seq.tostring() > ? ?req_name = record.id > > Since I have no clue what the user might introduce, regarding the number of > sequences, I have to user parse, instead of read. If I introduce only one > sequence and it is a valid FASTA sequence, it does its work flawlessly. If I > insert several FASTA sequences and one of them is wrongly formatted, it > won't complain at all. If I insert a single wrong sequence, it doesn't > complain either. Can you give us an example? > Is there a convenient way for me to check FASTA formats? The usual > startswith('>') doesn't work for multiple sequences. And the user might have > spaces in the sequence so a split('\n') is also ruled out to split the > sequences. You could do something like ("\n"+FASTA_sequence).count("\n>") to get the number of records. > At the moment, I'm checking if the first sequence of the input starts with > '>', and if it does, the parser kicks in and for every req_seq object I > check if there is any character that is not valid (a number or an otherwise > weird character). If I get a mis-formatted sequence in there it will > complain because spaces, newlines, and numbers ( often found in sequence > names ) are not in my allowed list. > > However, if there's an easier way, it would save me some if checks and for > loops :) Suggestions? I'm not 100% sure what you are tying to do - some examples should help. Peter From anaryin at gmail.com Mon Nov 23 05:49:14 2009 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 23 Nov 2009 02:49:14 -0800 Subject: [Biopython] SeqIO.parse Question In-Reply-To: <320fb6e00911230218i346d104cr81fb46710d4fb8a4@mail.gmail.com> References: <320fb6e00911230218i346d104cr81fb46710d4fb8a4@mail.gmail.com> Message-ID: Sorry for the clouded explanation :x I'll try to show you an example: I have a server that runs BLAST queries from user deposited sequences. Those sequences have to in FASTA format. 4 Users deposit their sequences User 1: >SequenceName AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA User2: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA User3: >Sequence1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Sequence2 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB User4: >SequenceOops AAAAAAAAAAAAAA1AAAAAAAAAAAAAAAAAAAAAAA Now, if I run this through a python script that has simply something like this: user_input = getInput() # Gets input from the user (can be single or multiple sequences) for record in SeqIO.parse(StringIO(user_input),'fasta'): # Parses each sequence on at a time print record.id print "Parsed" This will happen for each of the users up there: User1 will run smoothly and 'SequenceName' will be displayed. 'Parsed' will also be displayed. User2 will be shown 'Parsed'. Despite his sequence is not in FASTA format, the parser didn't throw an exception saying so. It just skips the for loop ( maybe treats the SeqIO.parse as None ). 
User3 will be shown 'Sequence1' and 'Parsed', although his second sequence is not correctly formatted. User4 will be shown 'SequenceOops' and 'Parsed', although there is a 1 in the sequence ( which is not a valid character for any sequence ). My question is basically: is there a way to do a sanity check to a file to see if it really contains proper FASTA sequences? The way I'm doing it works ok but it seems to be a bit too messy to be the best solution. I'm first checking if the first character of the user input is a '>'. If it is, I'm then passing the whole input to the Biopython parser. For each record the parser consumes, I get the sequence back, or what the parser thinks is a sequence, and then I check to see if there are any numbers, blankspaces, etc, in the sequence. If there are, I'll raise an exception. With those 4 examples: User 1 passes everything ok User 2 fails the first check. User 3 and 4 fail the second check because of blank spaces and numbers. This might sound a bit stupid on my part, and I apologize in advance, but this way I don't see much of a use in SeqIO.parse function. I'd do almost the same with user_input.split('\n>'). Is this clearer? My code is here: http://pastebin.com/m4d993239 From biopython at maubp.freeserve.co.uk Mon Nov 23 06:19:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 11:19:46 +0000 Subject: [Biopython] SeqIO.parse Question In-Reply-To: References: <320fb6e00911230218i346d104cr81fb46710d4fb8a4@mail.gmail.com> Message-ID: <320fb6e00911230319n661c3075y664f848e4f14d271@mail.gmail.com> Thanks for clarifying Jo?o :) On Mon, Nov 23, 2009 at 10:49 AM, Jo?o Rodrigues wrote: > Sorry for the clouded explanation :x I'll try to show you an example: > > I have a server that runs BLAST queries from user deposited sequences. Those > sequences have to in FASTA format. 4 Users deposit their sequences > > User 1: >>SequenceName > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Valid record, fine. > User2: > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Missing ">" header, this contains no FASTA records. > User3: >>Sequence1 > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > Sequence2 > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB Assuming you don't mind numbers in your sequence (which do get used in some situations), this is a valid FASTA file with a single record, equivalent to identifier "Sequence1" and sequence: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAASequence2BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB > User4: >>SequenceOops > AAAAAAAAAAAAAA1AAAAAAAAAAAAAAAAAAAAAAA As in example 1, assuming you don't mind numbers in your sequence, this is a valid FASTA file. > Now, if I run this through a python script that has simply something like > this: > > user_input = getInput() # Gets input from the user (can be single or > multiple sequences) > > for record in SeqIO.parse(StringIO(user_input),'fasta'): # Parses each > sequence on at a time > ? print record.id > print "Parsed" > > This will happen for each of the users up there: > > User1 will run smoothly and 'SequenceName' will be displayed. 'Parsed' will > also be displayed. > > User2 will be shown 'Parsed'. Despite his sequence is not in FASTA format, > the parser didn't throw an exception saying so. It just skips the for loop ( > maybe treats the SeqIO.parse as None ). > > User3 will be shown 'Sequence1' and 'Parsed', although his second sequence > is not correctly formatted. 
> > User4 will be shown 'SequenceOops' and 'Parsed', although there is a 1 in > the sequence ( which is not a valid character for any sequence ). > > My question is basically: is there a way to do a sanity check to a file to > see if it really contains proper FASTA sequences? The way I'm doing it works > ok but it seems to be a bit too messy to be the best solution. > > I'm first checking if the first character of the user input is a '>'. If it > is, I'm then passing the whole input to the Biopython parser. I probably do something similar - but I would first strip the white space. After all, "\n\n\n>ID\nACGT\n\n\n" is a valid FASTA file with one record. If the sequence lacks the ">", then I would either raise an error, or add something like ">Default\n" to the start automatically. Do whatever the BLAST webpage does to make it consistent for your users. > For each > record the parser consumes, I get the sequence back, or what the parser > thinks is a sequence, and then I check to see if there are any numbers, > blankspaces, etc, in the sequence. If there are, I'll raise an exception. Again, I might do the same (but see below). > With those 4 examples: > > User 1 passes everything ok > User 2 fails the first check. > User 3 and 4 fail the second check because of blank spaces and numbers. > > This might sound a bit stupid on my part, and I apologize in advance, but > this way I don't see much of a use in SeqIO.parse function. I'd do almost > the same with user_input.split('\n>'). > > Is this clearer? My code is here: http://pastebin.com/m4d993239 The problem is your definition of "valid FASTA" and Biopython's differ. This is largely because the FASTA file format has never been strictly defined. You'll find lots of differences in different tools (e.g. some like ClustalW can't cope with long description lines; some tools allow comment lines; in some cases characters like "." and "*" are allowed but not all). Also, you appear to want something very narrow - protein FASTA files with a limited character set (some but not all of the full IUPAC set) plus the minus sign (as a gap). Bio.SeqIO is not trying to do file format validation - it is trying to do file parsing, and for your needs it is being too tolerant. In this situation then yes, doing your own validation (without using Biopython) might be simplest. How I would like to "fix" this is to implement Bug 2597 (strict alphabet checking in the Seq object). Then, when you call Bio.SeqIO.parse, include the expected alphabet which should specify the allowed letters (and exclude numbers etc). See: http://bugzilla.open-bio.org/show_bug.cgi?id=2597 Peter P.S. In your code, using a set should be faster for checking membership: allowed = set('ABCDEFGHIKLMNPQRSTUVWYZX-') In fact, I would make the allowed list include both cases, then you don't have to make all those calls to upper. 
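Putting those ideas together, a rough and untested sketch of such a check might look like the following (the check_fasta name and the exact allowed letters are only for illustration - adjust them to whatever BLASTP really accepts):

from StringIO import StringIO
from Bio import SeqIO

# Allow both cases up front, so there is no need to call .upper() on each letter:
ALLOWED = set("ABCDEFGHIKLMNPQRSTUVWYZX-" + "abcdefghiklmnpqrstuvwyzx-")

def check_fasta(text):
    """Parse text as protein FASTA, raising ValueError if it looks invalid."""
    text = text.strip()  # ignore leading/trailing blank lines
    if not text.startswith(">"):
        raise ValueError("Input does not start with a '>' header line")
    records = list(SeqIO.parse(StringIO(text), "fasta"))
    for record in records:
        bad = set(str(record.seq)) - ALLOWED
        if bad:
            raise ValueError("Invalid character(s) %s in record %s"
                             % ("".join(sorted(bad)), record.id))
    return records

With the four examples above, a check like this would accept the first, reject the second for the missing header, and reject the third and fourth because of the stray digits that end up in the sequences.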
I would also double check to see if the latest version of BLAST does in fact accept O (Pyrrolysine) or J (Leucine or Isoleucine), and if need be contact the NCBI to update this webpage: http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml Peter From anaryin at gmail.com Mon Nov 23 06:34:53 2009 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 23 Nov 2009 03:34:53 -0800 Subject: [Biopython] SeqIO.parse Question In-Reply-To: <320fb6e00911230319n661c3075y664f848e4f14d271@mail.gmail.com> References: <320fb6e00911230218i346d104cr81fb46710d4fb8a4@mail.gmail.com> <320fb6e00911230319n661c3075y664f848e4f14d271@mail.gmail.com> Message-ID: My definition of FASTA is actually what BLASTp requires. It's quite a picky tool :) I had already understood that FASTA is quite... lax. But I thought I was missing something, thus asking the list. Is the alphabet patch already included? Thanks for the tip on the leading white space, had missed that :) Jo?o [...] Rodrigues @ http://stanford.edu/~joaor/ On Mon, Nov 23, 2009 at 3:19 AM, Peter wrote: > Thanks for clarifying Jo?o :) > > On Mon, Nov 23, 2009 at 10:49 AM, Jo?o Rodrigues > wrote: > > Sorry for the clouded explanation :x I'll try to show you an example: > > > > I have a server that runs BLAST queries from user deposited sequences. > Those > > sequences have to in FASTA format. 4 Users deposit their sequences > > > > User 1: > >>SequenceName > > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > > Valid record, fine. > > > User2: > > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > > Missing ">" header, this contains no FASTA records. > > > User3: > >>Sequence1 > > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > > Sequence2 > > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB > > Assuming you don't mind numbers in your sequence (which > do get used in some situations), this is a valid FASTA file > with a single record, equivalent to identifier "Sequence1" > and sequence: > > > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAASequence2BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB > > > User4: > >>SequenceOops > > AAAAAAAAAAAAAA1AAAAAAAAAAAAAAAAAAAAAAA > > As in example 1, assuming you don't mind numbers in your > sequence, this is a valid FASTA file. > > > Now, if I run this through a python script that has simply something like > > this: > > > > user_input = getInput() # Gets input from the user (can be single or > > multiple sequences) > > > > for record in SeqIO.parse(StringIO(user_input),'fasta'): # Parses each > > sequence on at a time > > print record.id > > print "Parsed" > > > > This will happen for each of the users up there: > > > > User1 will run smoothly and 'SequenceName' will be displayed. 'Parsed' > will > > also be displayed. > > > > User2 will be shown 'Parsed'. Despite his sequence is not in FASTA > format, > > the parser didn't throw an exception saying so. It just skips the for > loop ( > > maybe treats the SeqIO.parse as None ). > > > > User3 will be shown 'Sequence1' and 'Parsed', although his second > sequence > > is not correctly formatted. > > > > User4 will be shown 'SequenceOops' and 'Parsed', although there is a 1 in > > the sequence ( which is not a valid character for any sequence ). > > > > My question is basically: is there a way to do a sanity check to a file > to > > see if it really contains proper FASTA sequences? The way I'm doing it > works > > ok but it seems to be a bit too messy to be the best solution. > > > > I'm first checking if the first character of the user input is a '>'. 
If > it > > is, I'm then passing the whole input to the Biopython parser. > > I probably do something similar - but I would first strip the white space. > After all, "\n\n\n>ID\nACGT\n\n\n" is a valid FASTA file with one record. > > If the sequence lacks the ">", then I would either raise an error, or > add something like ">Default\n" to the start automatically. Do whatever > the BLAST webpage does to make it consistent for your users. > > > For each > > record the parser consumes, I get the sequence back, or what the parser > > thinks is a sequence, and then I check to see if there are any numbers, > > blankspaces, etc, in the sequence. If there are, I'll raise an exception. > > Again, I might do the same (but see below). > > > With those 4 examples: > > > > User 1 passes everything ok > > User 2 fails the first check. > > User 3 and 4 fail the second check because of blank spaces and numbers. > > > > This might sound a bit stupid on my part, and I apologize in advance, but > > this way I don't see much of a use in SeqIO.parse function. I'd do almost > > the same with user_input.split('\n>'). > > > > Is this clearer? My code is here: http://pastebin.com/m4d993239 > > The problem is your definition of "valid FASTA" and Biopython's differ. > This is largely because the FASTA file format has never been strictly > defined. You'll find lots of differences in different tools (e.g. some like > ClustalW can't cope with long description lines; some tools allow > comment lines; in some cases characters like "." and "*" are allowed > but not all). > > Also, you appear to want something very narrow - protein FASTA > files with a limited character set (some but not all of the full IUPAC > set) plus the minus sign (as a gap). > > Bio.SeqIO is not trying to do file format validation - it is trying to do > file parsing, and for your needs it is being too tolerant. In this > situation > then yes, doing your own validation (without using Biopython) might > be simplest. > > How I would like to "fix" this is to implement Bug 2597 (strict alphabet > checking in the Seq object). Then, when you call Bio.SeqIO.parse, > include the expected alphabet which should specify the allowed > letters (and exclude numbers etc). See: > http://bugzilla.open-bio.org/show_bug.cgi?id=2597 > > Peter > > P.S. In your code, using a set should be faster for checking membership: > > allowed = set('ABCDEFGHIKLMNPQRSTUVWYZX-') > > In fact, I would make the allowed list include both cases, then > you don't have to make all those calls to upper. > > I would also double check to see if the latest version of BLAST > does in fact accept O (Pyrrolysine) or J (Leucine or Isoleucine), > and if need be contact the NCBI to update this webpage: > http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml > > Peter > From biopython at maubp.freeserve.co.uk Mon Nov 23 06:40:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 11:40:56 +0000 Subject: [Biopython] SeqIO.parse Question In-Reply-To: References: <320fb6e00911230218i346d104cr81fb46710d4fb8a4@mail.gmail.com> <320fb6e00911230319n661c3075y664f848e4f14d271@mail.gmail.com> Message-ID: <320fb6e00911230340k71c57338v84e9d832a71dae99@mail.gmail.com> On Mon, Nov 23, 2009 at 11:34 AM, Jo?o Rodrigues wrote: > My definition of FASTA is actually what BLASTp requires. It's quite a picky > tool :) I had already understood that FASTA is quite... lax. But I thought I > was missing something, thus asking the list. Is the alphabet patch already > included? 
No, the strict alphabet checking in the Seq object is not merged (yet). This is potentially a contentious issue, and may break existing scripts which rely on the current lax behaviour. I am wondering about making this trigger a warning in the next release of Biopython as a step towards making the strict check the default, but this needs further debate. > Thanks for the tip on the leading white space, had missed that :) Sure. Peter From biopython at maubp.freeserve.co.uk Mon Nov 23 11:53:20 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 16:53:20 +0000 Subject: [Biopython] Seq object ungap method In-Reply-To: <320fb6e00911200629s3fe49d0di256d5a0dbc24ffc0@mail.gmail.com> References: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com> <320fb6e00911200629s3fe49d0di256d5a0dbc24ffc0@mail.gmail.com> Message-ID: <320fb6e00911230853r4a3f95dbk49a830e9e16c9246@mail.gmail.com> On Fri, Nov 20, 2009 at 2:29 PM, Peter wrote: > Hi all, > > Something we discussed last year was adding an ungap method > to the Seq object. e.g. > http://lists.open-bio.org/pipermail/biopython/2008-September/004523.html > http://lists.open-bio.org/pipermail/biopython/2008-September/004527.html > > As mentioned earlier this month on the dev mailing list, > http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006983.html > I actually made the time to implement this, and posted it on a > github branch - you can see the updated Bio/Seq.py file here: > http://github.com/peterjc/biopython/blob/ungap/Bio/Seq.py > > I've included a copy of the proposed docstring for the new Seq object > ungap method at the end of this email, which tries to illustrate how this > would be used. In the absence of any further comments (thanks Eric for your reply on the dev list), I've made an executive decision to check this into the trunk. This will make it much easier for people to test the new ungap method. I remain open to feedback (e.g. naming of the method) and we can even remove this before the next release if that turns out to be the consensus. Peter From jblanca at btc.upv.es Tue Nov 24 05:32:55 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 24 Nov 2009 11:32:55 +0100 Subject: [Biopython] Subclassing Seq and SeqRecord Message-ID: <200911241132.55922.jblanca@btc.upv.es> Hi: I'm "Biopythoning" my utilities. I want to subclass Seq and SeqRecord to modify their behaviour a little. For instance I'm doing: from Bio.Seq import Seq as BioSeq class Seq(BioSeq): 'A biopython Seq with some extra functionality' def __eq__(self, seq): 'It checks if the given seq is equal to this one' return str(self) == str(seq) The problem is that to modify this behaviour I have to copy a lot of Seq methods because these methods create new Seq instances to return. These instances are created like: return Seq(str(self).replace('T','U').replace('t','u'), alphabet) Would it be possible to change that to: return self.__class__(str(self).replace('T','U').replace('t','u'), alphabet) In that way the new instance would be created using the subclassed class and not the Seq class. Is that a reasonable change? In that case I could prepare a patch for Seq and SeqRecord. Regards, -- Jose M.
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From p.j.a.cock at googlemail.com Tue Nov 24 05:53:40 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Nov 2009 10:53:40 +0000 Subject: [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <200911241132.55922.jblanca@btc.upv.es> References: <200911241132.55922.jblanca@btc.upv.es> Message-ID: <320fb6e00911240253g10986fcfj311c8a2adc12afd5@mail.gmail.com> On Tue, Nov 24, 2009 at 10:32 AM, Jose Blanca wrote: > Hi: > I'm "Biopythoniing" my utilities. I want to subclass Seq and SeqRecord to > modify a little its behaviour. for instance I'm doing: > > from Bio.Seq import Seq as BioSeq > > class Seq(BioSeq): > ? ?'A biopython Seq with some extra functionality' > ? ?def __eq__(self, seq): > ? ? ? ?'It checks if the given seq is equal to this one' > ? ? ? ?return str(self) == str(seq) That is something I have been meaning to bring up on the list. I started chatting to Brad about this at BOSC2009. The details get quite hairy with hashes and dictionaries and so on, so I will leave it to another email. > The problem is that to modify this behaviour I have to copy a lot of Seq > methods because this methods create new Seq instances to return. This > instances are created like: > return Seq(str(self).replace('T','U').replace('t','u'), alphabet) > > would it be possible to change that to: > return self.__class__(str(self).replace('T','U').replace('t','u'), alphabet) > > In that way the new instance would be created using the subclassed > class and not the Seq class. Is that a reasonable change? In that > case I could prepare a patch for Seq and SeqRecord. It is a reasonable change, but ONLY if all the subclasses support the same __init__ method, which isn't true. For example, the Bio.Seq.UnknownSeq subclasses Seq and uses a different __init__ method signature. This means any change would at a minimum have to include lots of fixes to the UnknownSeq From cmckay at u.washington.edu Tue Nov 24 18:12:08 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Tue, 24 Nov 2009 15:12:08 -0800 Subject: [Biopython] allow ambiguities is sequence matching? In-Reply-To: <320fb6e00911200203t1bc09df6jc0dac7d619a644f0@mail.gmail.com> References: <320fb6e00911200203t1bc09df6jc0dac7d619a644f0@mail.gmail.com> Message-ID: <7577CD69-1CD0-428A-B271-DA39F8718EA9@u.washington.edu> Thanks for the advice, I'll check out that bug, and see what I see. best, Cedar On Nov 20, 2009, at 2:03 AM, Peter wrote: > On Thu, Nov 19, 2009 at 11:42 PM, Cedar McKay > wrote: >> Hello all, >> Apologies if this is covered in the tutorial anywhere, if so I >> didn't see >> it. >> >> I am trying to test whether sequence A appears anywhere in sequence >> B. The >> catch is I want to allow n mismatches. Right now my code looks like: >> >> #record is a SeqRecord >> #query_seq is a string >> if query_seq in record.seq: >> do something >> >> >> If I want query_seq to match despite n nucleotide mismatches, how >> should I >> do that? It seems like something that would be pretty common for >> people >> working with DNA probes. Is this even a biopython problem? Or is it >> just a >> general python problem? > > We have in general tried to keep the Seq object API as much like > that of > the Python string as is reasonable, for example the find, startswith > and > endswith methos. 
Likewise, the "in" operator on the Seq object also > works > like a python string, it uses plain string matching (see Bug 2853, > this was > added in Biopython 1.51). > > It sounds like you want some kind of fuzzy find... one solution would > be regular expressions, another might be to use the Bio.Motif module. > There have been similar discussions on the mailing list before, but no > clear consensus - see for example Bug 2601. > > Peter From mitlox at op.pl Wed Nov 25 19:49:15 2009 From: mitlox at op.pl (xyz) Date: Thu, 26 Nov 2009 10:49:15 +1000 Subject: [Biopython] fastq-solexa index Message-ID: <4B0DD08B.6070607@op.pl> Hello, On this page ( http://www.biopython.org/wiki/SeqIO ) I have found that biopython can use fastq-solexa index. What does it means and are there any examples? Thank you in advance. Best regards, From anaryin at gmail.com Wed Nov 25 21:24:04 2009 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 25 Nov 2009 18:24:04 -0800 Subject: [Biopython] Turning PDBConstructionWarning off Message-ID: Dear All, Is there a way to make the PDBParser not to display Warnings when it reads structures? Like a flag that we pass somewhere? Regards! Jo?o [...] Rodrigues @ http://stanford.edu/~joaor/ From rodrigo_faccioli at uol.com.br Thu Nov 26 05:41:26 2009 From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli) Date: Thu, 26 Nov 2009 08:41:26 -0200 Subject: [Biopython] Turning PDBConstructionWarning off In-Reply-To: References: Message-ID: <3715adb70911260241x48678fe9ma80a246b7d73033b@mail.gmail.com> You can use the command line: python -O file.py I could execute the biopython without warning message when I put that command line. I understood that these warning messages show because we execute the biopython in debug mode. If you look the source code, you'll see: if __debug__: warning message So, I thought if set false in this variable (__debug__) I'll not warning message. I don't know if there is other way. Regards, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 On Thu, Nov 26, 2009 at 12:24 AM, Jo?o Rodrigues wrote: > Dear All, > > Is there a way to make the PDBParser not to display Warnings when it reads > structures? Like a flag that we pass somewhere? > > Regards! > > Jo?o [...] Rodrigues > @ http://stanford.edu/~joaor/ > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Thu Nov 26 05:42:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Nov 2009 10:42:59 +0000 Subject: [Biopython] Turning PDBConstructionWarning off In-Reply-To: References: Message-ID: <320fb6e00911260242y6d989df8wd5271cda71f3ed08@mail.gmail.com> On Thu, Nov 26, 2009 at 2:24 AM, Jo?o Rodrigues wrote: > Dear All, > > Is there a way to make the PDBParser not to display Warnings when it reads > structures? Like a flag that we pass somewhere? > > Regards! 
Yes, use the Python warnings module to ignore PDBConstructionWarning, see: http://docs.python.org/library/warnings.html Peter From biopython at maubp.freeserve.co.uk Thu Nov 26 05:48:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Nov 2009 10:48:49 +0000 Subject: [Biopython] fastq-solexa index In-Reply-To: <4B0DD08B.6070607@op.pl> References: <4B0DD08B.6070607@op.pl> Message-ID: <320fb6e00911260248w1f6a29b1ucc0bfecec897c67b@mail.gmail.com> On Thu, Nov 26, 2009 at 12:49 AM, xyz wrote: > Hello, > On this page ( http://www.biopython.org/wiki/SeqIO ) I have found that > biopython can use fastq-solexa index. What does it means and are there any > examples? > > Thank you in advance. In Bio.SeqIO we give each file format a name, in this case "fastq-solexa" means the old Solexa FASTQ files (also used by Illumina up to and including pipeline 1.2) which use Solexa scores with an ASCII offset of 64 (not PHRED scores). The table on the SeqIO wiki page tries to summarise this. See also: http://en.wikipedia.org/wiki/FASTQ_format The "index" column on that table on the SeqIO wiki page indicates if each file format can be used with the Bio.SeqIO.index(...) function included in Biopython 1.52 onwards. See: http://news.open-bio.org/news/2009/09/biopython-seqio-index/ There are also examples in the main Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf And in the Bio.SeqIO module's built in help, online here: http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html >From within Python: >>> from Bio import SeqIO >>> help(SeqIO) ... >>> help(SeqIO.index) ... Peter From anaryin at gmail.com Thu Nov 26 05:49:26 2009 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 26 Nov 2009 02:49:26 -0800 Subject: [Biopython] Turning PDBConstructionWarning off In-Reply-To: <320fb6e00911260242y6d989df8wd5271cda71f3ed08@mail.gmail.com> References: <320fb6e00911260242y6d989df8wd5271cda71f3ed08@mail.gmail.com> Message-ID: Thanks Rodrigo for the tip. I'd tried to set __debug__ to False as well but it didn't work as I wanted (unless I actually edited the module file). Peter's suggestion is what I wanted. I was completely unaware of the "warning" module so I thought it was a BioPython thing. Thanks! Jo?o [...] Rodrigues @ http://stanford.edu/~joaor/ From biopython at maubp.freeserve.co.uk Thu Nov 26 08:10:33 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Nov 2009 13:10:33 +0000 Subject: [Biopython] Biopython and Twitter followers Message-ID: <320fb6e00911260510w77fbce0dsbbd76ad4d4892221@mail.gmail.com> Hi all, We've had a Biopython twitter account over six months now, and it seems to be a nice extra channel for promoting the project and keeping people up to date. Right now Biopython has 123 twitter followers: http://twitter.com/Biopython [Don't forget we have RSS and Atom news feeds too - see http://biopython.org/wiki/News for links] Right now, Biopython only follows the OBF, BioPerl and Guido van Rossum. I'm happy to add other related projects like BioRuby or BioJava if/when they setup twitter accounts. Given we have quite a few Biopython developers and regular contributors on twitter now - should we be following them too? Leighton had some valid reservations: http://lists.open-bio.org/pipermail/biopython-dev/2009-March/005626.html Any thoughts? 
Peter From lpritc at scri.ac.uk Thu Nov 26 10:32:32 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 26 Nov 2009 15:32:32 +0000 Subject: [Biopython] Biopython and Twitter followers In-Reply-To: <320fb6e00911260510w77fbce0dsbbd76ad4d4892221@mail.gmail.com> Message-ID: Howdy, I'm less concerned about following individuals now, than I was. It was a new mode of communication to me, and I might have been being a bit oversensitive to some comments on [bip] and blogs ;) Whatever makes the community happy is fine by me so, as long as we don't end up looking like a Masonic cult, I think following individuals is fair game. Cheers, L. On 26/11/2009 13:10, "Peter" wrote: > Hi all, > > We've had a Biopython twitter account over six months now, > and it seems to be a nice extra channel for promoting the > project and keeping people up to date. Right now Biopython > has 123 twitter followers: http://twitter.com/Biopython > > [Don't forget we have RSS and Atom news feeds too - see > http://biopython.org/wiki/News for links] > > Right now, Biopython only follows the OBF, BioPerl and > Guido van Rossum. I'm happy to add other related > projects like BioRuby or BioJava if/when they setup > twitter accounts. Given we have quite a few Biopython > developers and regular contributors on twitter now - > should we be following them too? Leighton had some > valid reservations: > http://lists.open-bio.org/pipermail/biopython-dev/2009-March/005626.html > > Any thoughts? > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > ______________________________________________________________________ > This email has been scanned by the MessageLabs Email Security System. > For more information please visit http://www.messagelabs.com/email > ______________________________________________________________________ -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
______________________________________________________ From rodrigo_faccioli at uol.com.br Thu Nov 26 11:01:10 2009 From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli) Date: Thu, 26 Nov 2009 13:01:10 -0300 Subject: [Biopython] Turning PDBConstructionWarning off In-Reply-To: References: <320fb6e00911260242y6d989df8wd5271cda71f3ed08@mail.gmail.com> Message-ID: <3715adb70911260801r2c897c8ev35c3396df700251f@mail.gmail.com> Thanks Peter. Your suggestion was good. I?ll try it. Regards, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 On Thu, Nov 26, 2009 at 7:49 AM, Jo?o Rodrigues wrote: > Thanks Rodrigo for the tip. I'd tried to set __debug__ to False as well but > it didn't work as I wanted (unless I actually edited the module file). > > Peter's suggestion is what I wanted. I was completely unaware of the > "warning" module so I thought it was a BioPython thing. > > Thanks! > > Jo?o [...] Rodrigues > @ http://stanford.edu/~joaor/ > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Thu Nov 26 11:02:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Nov 2009 16:02:50 +0000 Subject: [Biopython] Fwd: [DAS] DAS workshop 7th-9th April 2010 In-Reply-To: References: Message-ID: <320fb6e00911260802wb98b28fic8a193c125e29d9c@mail.gmail.com> This might be of interest to some of you. Peter ---------- Forwarded message ---------- From: Jonathan Warren Date: Thu, Nov 26, 2009 at 2:57 PM Subject: [DAS] DAS workshop 7th-9th April 2010 To: das at biodas.org, das_registry_announce at sanger.ac.uk, biojava-dev , BioJava , BioPerl , all at sanger.ac.uk, all at ebi.ac.uk, ensembldev We are considering running a Distributed Annotation System workshop here at the Sanger/EBI in the UK subject to decent demand. The workshop will be held from Wednesday 7th-Friday 9th April 2010. If you would be interested in attending either to present or just take part then please email me jw12 at sanger.ac.uk The format of the workshop is likely to be similar to last years (1st day for beginners, 2nd for both beginners and advanced users, 3rd day for advanced), information for which can be found here: http://www.dasregistry.org/course.jsp If you would like to present then please send a short summary of what you would like to talk about. Thanks Jonathan. Jonathan Warren Senior Developer and DAS coordinator jw12 at sanger.ac.uk -- The Wellcome Trust Sanger Institute is operated by Genome ResearchLimited, a charity registered in England with number 1021457 and acompany registered in England with number 2742969, whose registeredoffice is 215 Euston Road, London, NW1 2BE._______________________________________________ DAS mailing list DAS at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/das From eric.talevich at gmail.com Thu Nov 26 15:31:40 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 26 Nov 2009 15:31:40 -0500 Subject: [Biopython] Turning PDBConstructionWarning off Message-ID: <3f6baf360911261231q712933a2g834025ce4690d4e6@mail.gmail.com> From: Jo?o Rodrigues : > > Thanks Rodrigo for the tip. 
I'd tried to set __debug__ to False as well but > it didn't work as I wanted (unless I actually edited the module file). > > Peter's suggestion is what I wanted. I was completely unaware of the > "warning" module so I thought it was a BioPython thing. > > Thanks! > > Jo?o [...] Rodrigues > @ http://stanford.edu/~joaor/ > The __debug__ check in Biopython's source code isn't really necessary; it's set internally by Python. By default it's True, but running Python with optimizations on (-O on the command line) sets it to False and automatically skips all warnings. As Peter suggested, the usual way to hide specific warnings in your applications is with the warnings module's simplefilter(). Cheers, Eric From pengyu.ut at gmail.com Fri Nov 27 17:56:17 2009 From: pengyu.ut at gmail.com (Peng Yu) Date: Fri, 27 Nov 2009 16:56:17 -0600 Subject: [Biopython] How to get intron/exon boundaries? Message-ID: <366c6f340911271456t594e3256i93d184f72215e6e7@mail.gmail.com> I'm wondering how to get intron exon boundaires for all the genes. Could somebody show me what functions I should use? From biopython at maubp.freeserve.co.uk Fri Nov 27 18:03:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Nov 2009 23:03:12 +0000 Subject: [Biopython] How to get intron/exon boundaries? In-Reply-To: <366c6f340911271456t594e3256i93d184f72215e6e7@mail.gmail.com> References: <366c6f340911271456t594e3256i93d184f72215e6e7@mail.gmail.com> Message-ID: <320fb6e00911271503j273fbb9bs865eaed9ea92af80@mail.gmail.com> On Fri, Nov 27, 2009 at 10:56 PM, Peng Yu wrote: > I'm wondering how to get intron exon boundaires for all the genes. > Could somebody show me what functions I should use? What do you want to know? The co-ordinates of the intron/exons, or just to get the coding sequence? What kind of data are you looking at? For GenBank or EMBL files this is encoded in the CDS feature locations. For GFF files I think this information is given explicitly, Peter From pengyu.ut at gmail.com Fri Nov 27 19:18:06 2009 From: pengyu.ut at gmail.com (Peng Yu) Date: Fri, 27 Nov 2009 18:18:06 -0600 Subject: [Biopython] How to get intron/exon boundaries? In-Reply-To: <320fb6e00911271503j273fbb9bs865eaed9ea92af80@mail.gmail.com> References: <366c6f340911271456t594e3256i93d184f72215e6e7@mail.gmail.com> <320fb6e00911271503j273fbb9bs865eaed9ea92af80@mail.gmail.com> Message-ID: <366c6f340911271618u4ff965f0u729fa0a1f70bd028@mail.gmail.com> On Fri, Nov 27, 2009 at 5:03 PM, Peter wrote: > On Fri, Nov 27, 2009 at 10:56 PM, Peng Yu wrote: >> I'm wondering how to get intron exon boundaires for all the genes. >> Could somebody show me what functions I should use? > > What do you want to know? The co-ordinates of the intron/exons, > or just to get the coding sequence? I want the co-ordinates. > What kind of data are you looking at? For GenBank or EMBL > files this is encoded in the CDS feature locations. For GFF > files I think this information is given explicitly, Would you please let me know how to get the CDS feature locations from GenBank and EMBL? What are GFF files? From sdavis2 at mail.nih.gov Sun Nov 29 19:55:03 2009 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Sun, 29 Nov 2009 19:55:03 -0500 Subject: [Biopython] How to get intron/exon boundaries? 
In-Reply-To: <366c6f340911271618u4ff965f0u729fa0a1f70bd028@mail.gmail.com> References: <366c6f340911271456t594e3256i93d184f72215e6e7@mail.gmail.com> <320fb6e00911271503j273fbb9bs865eaed9ea92af80@mail.gmail.com> <366c6f340911271618u4ff965f0u729fa0a1f70bd028@mail.gmail.com> Message-ID: <264855a00911291655u6a9cf012v75e02d398c89c833@mail.gmail.com> On Fri, Nov 27, 2009 at 7:18 PM, Peng Yu wrote: > On Fri, Nov 27, 2009 at 5:03 PM, Peter wrote: >> On Fri, Nov 27, 2009 at 10:56 PM, Peng Yu wrote: >>> I'm wondering how to get intron exon boundaires for all the genes. >>> Could somebody show me what functions I should use? >> >> What do you want to know? The co-ordinates of the intron/exons, >> or just to get the coding sequence? > > I want the co-ordinates. You are talking about coordinates in genomic space or on the transcript? What organism? And what annotation system do you want to use--Ensembl, UCSC, or NCBI? >> What kind of data are you looking at? For GenBank or EMBL >> files this is encoded in the CDS feature locations. For GFF >> files I think this information is given explicitly, > > Would you please let me know how to get the CDS feature locations from > GenBank and EMBL? What are GFF files? For GFF, google will get you a long way ("GFF format"). Sean From pengyu.ut at gmail.com Sun Nov 29 22:30:52 2009 From: pengyu.ut at gmail.com (Peng Yu) Date: Sun, 29 Nov 2009 21:30:52 -0600 Subject: [Biopython] How to get intron/exon boundaries? In-Reply-To: <264855a00911291655u6a9cf012v75e02d398c89c833@mail.gmail.com> References: <366c6f340911271456t594e3256i93d184f72215e6e7@mail.gmail.com> <320fb6e00911271503j273fbb9bs865eaed9ea92af80@mail.gmail.com> <366c6f340911271618u4ff965f0u729fa0a1f70bd028@mail.gmail.com> <264855a00911291655u6a9cf012v75e02d398c89c833@mail.gmail.com> Message-ID: <366c6f340911291930x5ba0b7co1ae84809d5b72b9b@mail.gmail.com> On Sun, Nov 29, 2009 at 6:55 PM, Sean Davis wrote: > On Fri, Nov 27, 2009 at 7:18 PM, Peng Yu wrote: >> On Fri, Nov 27, 2009 at 5:03 PM, Peter wrote: >>> On Fri, Nov 27, 2009 at 10:56 PM, Peng Yu wrote: >>>> I'm wondering how to get intron exon boundaires for all the genes. >>>> Could somebody show me what functions I should use? >>> >>> What do you want to know? The co-ordinates of the intron/exons, >>> or just to get the coding sequence? >> >> I want the co-ordinates. > > You are talking about coordinates in genomic space or on the > transcript? ?What organism? ?And what annotation system do you want to > use--Ensembl, UCSC, or NCBI? The coordinates in genomic space. Mouse. UCSC. >>> What kind of data are you looking at? For GenBank or EMBL >>> files this is encoded in the CDS feature locations. For GFF >>> files I think this information is given explicitly, >> >> Would you please let me know how to get the CDS feature locations from >> GenBank and EMBL? What are GFF files? > > For GFF, google will get you a long way ("GFF format"). > > Sean > From sdavis2 at mail.nih.gov Mon Nov 30 08:59:35 2009 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 30 Nov 2009 08:59:35 -0500 Subject: [Biopython] How to get intron/exon boundaries? 
In-Reply-To: <366c6f340911291930x5ba0b7co1ae84809d5b72b9b@mail.gmail.com> References: <366c6f340911271456t594e3256i93d184f72215e6e7@mail.gmail.com> <320fb6e00911271503j273fbb9bs865eaed9ea92af80@mail.gmail.com> <366c6f340911271618u4ff965f0u729fa0a1f70bd028@mail.gmail.com> <264855a00911291655u6a9cf012v75e02d398c89c833@mail.gmail.com> <366c6f340911291930x5ba0b7co1ae84809d5b72b9b@mail.gmail.com> Message-ID: <264855a00911300559i7fe3ce88x5e2e339182a7ef36@mail.gmail.com> On Sun, Nov 29, 2009 at 10:30 PM, Peng Yu wrote: > On Sun, Nov 29, 2009 at 6:55 PM, Sean Davis wrote: >> On Fri, Nov 27, 2009 at 7:18 PM, Peng Yu wrote: >>> On Fri, Nov 27, 2009 at 5:03 PM, Peter wrote: >>>> On Fri, Nov 27, 2009 at 10:56 PM, Peng Yu wrote: >>>>> I'm wondering how to get intron exon boundaires for all the genes. >>>>> Could somebody show me what functions I should use? >>>> >>>> What do you want to know? The co-ordinates of the intron/exons, >>>> or just to get the coding sequence? >>> >>> I want the co-ordinates. >> >> You are talking about coordinates in genomic space or on the >> transcript? ?What organism? ?And what annotation system do you want to >> use--Ensembl, UCSC, or NCBI? > > The coordinates in genomic space. Mouse. UCSC. http://genome.ucsc.edu/cgi-bin/hgTables?org=Mouse Choose the track that you like. UCSC Known Genes is the typical default. There are numerous output format options. Again, choose whatever you think is convenient. The outputs are almost all tab-delimited text, so you should be able to use them easily in the scripting language of your choice. If you prefer gff, then consider GTF format. If you have questions for the UCSC folks, they have their own mailing list accessible from the top of the page on their website. Sean >>>> What kind of data are you looking at? For GenBank or EMBL >>>> files this is encoded in the CDS feature locations. For GFF >>>> files I think this information is given explicitly, >>> >>> Would you please let me know how to get the CDS feature locations from >>> GenBank and EMBL? What are GFF files? >> >> For GFF, google will get you a long way ("GFF format"). >> >> Sean >> > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From bassbabyface at yahoo.com Sun Nov 1 00:58:10 2009 From: bassbabyface at yahoo.com (Ben O'Loghlin) Date: Sun, 1 Nov 2009 11:58:10 +1100 Subject: [Biopython] Entrez.read return value is typed as a string?? In-Reply-To: <320fb6e00910290837w5861226dsb0a9f4a9fb4acd1f@mail.gmail.com> References: <001901ca5846$96f69d60$c4e3d820$@com> <109726.94290.qm@web62408.mail.re1.yahoo.com> <005001ca58a8$75a41cc0$60ec5640$@com> <320fb6e00910290837w5861226dsb0a9f4a9fb4acd1f@mail.gmail.com> Message-ID: <016101ca5a8e$633f0210$29bd0630$@com> Thanks Peter, another small step up the learning curve! Ben -----Original Message----- From: p.j.a.cock at googlemail.com [mailto:p.j.a.cock at googlemail.com] On Behalf Of Peter Sent: Friday, 30 October 2009 2:37 AM To: Ben O'Loghlin Cc: Michiel de Hoon; biopython at biopython.org Subject: Re: [Biopython] Entrez.read return value is typed as a string?? On Thu, Oct 29, 2009 at 2:59 PM, Ben O'Loghlin wrote: > Thanks Michiel. > > What is the function of the 'u' in the string discussed below? > That's the bit that's got me confused. > > Best regards, > Ben > > p.s. assistance on this list is fast and useful. Nice! Again, its a bit of Python basics rather than anything Biopython specific. 
The u is for unicode, thus "fred" gives a normal string while u"fred" gives a unicode string. Unless you are messing about with odd foreign characters (e.g. letters with accents) you won't have to worry about this. Python 3 gets rid of the dichotomy by using unicode for all strings. Peter
From biopython at maubp.freeserve.co.uk Tue Nov 3 13:32:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 3 Nov 2009 13:32:55 +0000 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <87FF6F4A-E169-44D2-8DD1-549F6BA07574@illinois.edu> References: <1FA32B70-84CC-4012-97F5-9B67D56BADC6@gmail.com> <737542.47267.qm@web62401.mail.re1.yahoo.com> <320fb6e00911030452o616c0b4cycceb53f3561e6a42@mail.gmail.com> <87FF6F4A-E169-44D2-8DD1-549F6BA07574@illinois.edu> Message-ID: <320fb6e00911030532r38e2fb2bjf7fffe0aa3686cc8@mail.gmail.com> On Tue, Nov 3, 2009 at 1:16 PM, Chris Fields wrote: > > We had the same problem w/ the BioPerl XML parser and ended up preprocessing > the data into separate XML files, carrying over the relevant information > into each file (yes, there is a better way, but it essentially involves a > redesign of the XML parser and related objects). > > BTW, the same thing happens if one runs multiple queries in the same file. > ?All individual report XML are in one single XML file, and information > relevant to all reports is only found into the first report. ?I think this > has been known for a while. ?I've repeatedly tried contacting NCBI but > haven't had a response re: this problem. > > chris Hi Chris, Old versions of blastall (also) used to produce concatenated XML files for multiple queries, but from about 2.2.14 they started (ab)using the iteration fields originally for PSI-BLAST output to hold multiple queries (there was some discussion of this on Biopython Bugs 1933 and 1970 - Biopython *should* cope with either). Apparently pgpblast was left producing concatenated XML files. The upshot of this is multi-query BLASTP etc XML files look just like single query multi-round PSI-BLAST XML files. This means having a single BLAST XML parser that automatically treats the two differently is tricky. Does that fit with your experience? Peter From cjfields at illinois.edu Tue Nov 3 13:16:02 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 3 Nov 2009 07:16:02 -0600 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <320fb6e00911030452o616c0b4cycceb53f3561e6a42@mail.gmail.com> References: <1FA32B70-84CC-4012-97F5-9B67D56BADC6@gmail.com> <737542.47267.qm@web62401.mail.re1.yahoo.com> <320fb6e00911030452o616c0b4cycceb53f3561e6a42@mail.gmail.com> Message-ID: <87FF6F4A-E169-44D2-8DD1-549F6BA07574@illinois.edu> Peter, On Nov 3, 2009, at 6:52 AM, Peter wrote: > On Fri, Oct 16, 2009 at 1:04 AM, Michiel de Hoon > wrote: >> >> Last time I checked (which was a few weeks ago), a multiple-query >> PSIBlast >> search gives a file consisting of concatenated XML files. The >> problem is in >> the design of Blast XML output. For a single-query PSIBlast, the >> fields under >> are used to store the output of the >> PSIBlast iterations. >> For multiple-query regular Blast, the same fields are used to store >> the search >> results of each query.
With multiple-query PSIBlast, there is then >> no way to >> store the output in the current XML format. I've been meaning to >> write to NCBI >> about this, but I haven't gotten round to it yet. Will do so this >> weekend. >> >> --Michiel. > > Did you get any reply? > > Peter We had the same problem w/ the BioPerl XML parser and ended up preprocessing the data into separate XML files, carrying over the relevant information into each file (yes, there is a better way, but it essentially involves a redesign of the XML parser and related objects). BTW, the same thing happens if one runs multiple queries in the same file. All individual report XML are in one single XML file, and information relevant to all reports is only found into the first report. I think this has been known for a while. I've repeatedly tried contacting NCBI but haven't had a response re: this problem. chris From cjfields at illinois.edu Tue Nov 3 13:40:53 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 3 Nov 2009 07:40:53 -0600 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <320fb6e00911030532r38e2fb2bjf7fffe0aa3686cc8@mail.gmail.com> References: <1FA32B70-84CC-4012-97F5-9B67D56BADC6@gmail.com> <737542.47267.qm@web62401.mail.re1.yahoo.com> <320fb6e00911030452o616c0b4cycceb53f3561e6a42@mail.gmail.com> <87FF6F4A-E169-44D2-8DD1-549F6BA07574@illinois.edu> <320fb6e00911030532r38e2fb2bjf7fffe0aa3686cc8@mail.gmail.com> Message-ID: <93C54341-8762-4EA6-8D0F-3388E5005A5C@illinois.edu> On Nov 3, 2009, at 7:32 AM, Peter wrote: > On Tue, Nov 3, 2009 at 1:16 PM, Chris Fields > wrote: >> >> We had the same problem w/ the BioPerl XML parser and ended up >> preprocessing >> the data into separate XML files, carrying over the relevant >> information >> into each file (yes, there is a better way, but it essentially >> involves a >> redesign of the XML parser and related objects). >> >> BTW, the same thing happens if one runs multiple queries in the >> same file. >> All individual report XML are in one single XML file, and >> information >> relevant to all reports is only found into the first report. I >> think this >> has been known for a while. I've repeatedly tried contacting NCBI >> but >> haven't had a response re: this problem. >> >> chris > > Hi Chris, > > Old versions of blastall (also) used to produce concatenated XML > files for > multiple queries, but from about 2.2.14 they started (ab)using the > iteration > fields originally for PSI-BLAST output to hold multiple queries > (there was > some discussion of this on Biopython Bugs 1933 and 1970 - Biopython > *should* cope with either). > > Apparently pgpblast was left producing concatenated XML files. > The upshot of this is multi-query BLASTP etc XML files look just like > single query multi-round PSI-BLAST XML files. This means having a > single BLAST XML parser that automatically treats the two differently > is tricky. > > Does that fit with your experience? > > Peter Yes, pretty much. Ours now handles both report types w/o problems. We have a pluggable XML parser that is switched out based on whether one expects normal BLAST XML (the default) or PSI-BLAST XML (has to be indicated). With text reports we can determine this on the fly b/c the blast type should indicate whether it is PSI BLAST or not, but IIRC this wasn't the case with XML. I haven't checked to see if this has been fixed yet on NCBI's end, but I'm assuming it hasn't. 
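To make the preprocessing idea above concrete, a rough (untested) Python sketch of splitting a file of concatenated BLAST XML reports into separate files before parsing might look like this - the file naming scheme is made up purely for illustration:

def split_concatenated_xml(filename, prefix="report"):
    # Start a new output file each time an XML declaration line is seen
    count = 0
    handle = None
    for line in open(filename):
        if line.startswith("<?xml"):
            if handle is not None:
                handle.close()
            count += 1
            handle = open("%s_%03i.xml" % (prefix, count), "w")
        if handle is not None:
            handle.write(line)
    if handle is not None:
        handle.close()
    return count

Each output file should then be a single well formed XML document which the normal single-report parser can deal with.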
chris From biopython at maubp.freeserve.co.uk Tue Nov 3 13:52:20 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 3 Nov 2009 13:52:20 +0000 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <93C54341-8762-4EA6-8D0F-3388E5005A5C@illinois.edu> References: <1FA32B70-84CC-4012-97F5-9B67D56BADC6@gmail.com> <737542.47267.qm@web62401.mail.re1.yahoo.com> <320fb6e00911030452o616c0b4cycceb53f3561e6a42@mail.gmail.com> <87FF6F4A-E169-44D2-8DD1-549F6BA07574@illinois.edu> <320fb6e00911030532r38e2fb2bjf7fffe0aa3686cc8@mail.gmail.com> <93C54341-8762-4EA6-8D0F-3388E5005A5C@illinois.edu> Message-ID: <320fb6e00911030552g5913f641obe7d8075e6c15d2b@mail.gmail.com> On Tue, Nov 3, 2009 at 1:40 PM, Chris Fields wrote: > > On Nov 3, 2009, at 7:32 AM, Peter wrote: >> ... >> The upshot of this is multi-query BLASTP etc XML files look just like >> single query multi-round PSI-BLAST XML files. This means having a >> single BLAST XML parser that automatically treats the two differently >> is tricky. >> >> Does that fit with your experience? >> >> Peter > > Yes, pretty much. ?Ours now handles both report types w/o problems. ?We have > a pluggable XML parser that is switched out based on whether one expects > normal BLAST XML (the default) or PSI-BLAST XML (has to be indicated). ?With > text reports we can determine this on the fly b/c the blast type should > indicate whether it is PSI BLAST or not, but IIRC this wasn't the case with > XML. ?I haven't checked to see if this has been fixed yet on NCBI's end, but > I'm assuming it hasn't. Certainly with 2.2.18 (where I have an example handy), the XML from pgpblast is practically identical to that from blastall. You *may* be able to infer this from looking at the complete file (e.g. any iteration messages). Having the user specify if they are expecting PSI-BLAST output (as you do in BioPerl) seems like the best option. We might do this via an optional argument to the existing Bio.Blast.NCBIXML parser, or add a second PSI-Blast specific parser. The later might be best for dealing with multi-query PSI-BLAST XML files, and using the same PSI BLAST specific objects as the old plain text parser. For plain text output, the Biopython use must already explicitly choose our PSI-BLAST parser over the default parser. Peter From kellrott at gmail.com Wed Nov 4 23:06:40 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Wed, 4 Nov 2009 15:06:40 -0800 Subject: [Biopython] Biopython on Jython In-Reply-To: <320fb6e00910220215u323dd237r28e2b8a15b651e43@mail.gmail.com> References: <320fb6e00910211555m4379dd34wfecdd9373cfd0498@mail.gmail.com> <320fb6e00910220215u323dd237r28e2b8a15b651e43@mail.gmail.com> Message-ID: > >> You probably noticed I merged some of your fixes to get (the non C and > >> non NumPy bits of) Biopython to work on Jython, but not all. Could you > >> update your github branch to the trunk at some point? That would help > >> in picking up more of your fixes. > > > > I've tried to keep my branch up to speed with the mainline. But I didn't > > branch my work from master, so it may harder to extract... > > True, but I can probably manage. > I just rebased my jython related work into a seperate fork. So it should be easier to pull out patches now. I think there is still some work in Bio/Data/CodonTable.py, Bio/SubsMat/MatrixInfo.py and some of the unit tests that should make jython work a bit better. 
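As a quick way of seeing how far a given install gets, a rough smoke test along these lines can be run under Jython - the module list is purely illustrative rather than exhaustive, and this is untested:

for name in ["Bio.Seq", "Bio.SeqIO", "Bio.Data.CodonTable", "Bio.SubsMat.MatrixInfo"]:
    try:
        __import__(name)
        print name, "imports OK"
    except Exception, err:
        print name, "failed:", err

Anything depending on the C extensions or NumPy is of course still expected to fail under Jython.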
Kyle From stran104 at chapman.edu Thu Nov 5 00:25:35 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Wed, 4 Nov 2009 16:25:35 -0800 Subject: [Biopython] Get Organism Given Bio.Blast.Record.Blast Object Message-ID: <2a63cc350911041625y1d7fe1eamd2dc8a51cc9d5074@mail.gmail.com> Dear users, Given a Bio.Blast.Record.Blast object is it possible to recover the organism name without using entrez to query the NCBI servers? Often the organism is listed in Bio.Blast.Record.Alignment.title but I do not see a way to reliably extract it from this data. I have reviewed the API and the UML diagram in the cookbook: http://www.biopython.org/DIST/docs/api/Bio.Blast.Record-module.html and http://www.biopython.org/DIST/docs/tutorial/Tutorial.html#fig:blastrecordrespectively Input is appreciated, --Matthew Strand From biopython at maubp.freeserve.co.uk Thu Nov 5 10:44:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 5 Nov 2009 10:44:54 +0000 Subject: [Biopython] Get Organism Given Bio.Blast.Record.Blast Object In-Reply-To: <2a63cc350911041625y1d7fe1eamd2dc8a51cc9d5074@mail.gmail.com> References: <2a63cc350911041625y1d7fe1eamd2dc8a51cc9d5074@mail.gmail.com> Message-ID: <320fb6e00911050244y5f1c2dcbs53b2a1e34a07c01b@mail.gmail.com> On Thu, Nov 5, 2009 at 12:25 AM, Matthew Strand wrote: > Dear users, > Given a Bio.Blast.Record.Blast object is it possible to recover the organism > name without using entrez to query the NCBI servers? > > Often the organism is listed in Bio.Blast.Record.Alignment.title but I do > not see a way to reliably extract it from this data. The organism is not explicitly given in BLAST results. This is nothing to do with the Biopython parser. However, ... The NCBI tend to encode the organism in the match title within square brackets (and where redundant sequences have been merged, you probably can have two organisms). You might rely on this. Alternatively, most (all?) of the NCBI BLAST databases will use GI numbers, so you could use that to map to the organism. This can be done via Entrez (online), or offline by downloading the mapping. See: http://lists.open-bio.org/pipermail/biopython/2009-June/005304.html If you are using a custom BLAST database, then it all depends on how the database was created. Peter From konrad.koehler at mac.com Thu Nov 5 13:21:39 2009 From: konrad.koehler at mac.com (Konrad Koehler) Date: Thu, 05 Nov 2009 14:21:39 +0100 Subject: [Biopython] Bio.PDB: parsing PDB files for ATOM records Message-ID: <153781524001004023681590133497734615609-Webmail@me.com> Hello everyone, I wanted to use Bio:PDB to retrieve the atom element symbol from columns 77-78 of the PDB file. This is apparently not possible with the lastest version of Biopython 1.52. Some time ago, Macro Zhu posted the following fix: http://osdir.com/ml/python.bio.general/2008-04/msg00038.html which I have tried to implement in the current 1.52 version, however I cannot seem to get this to work. Is there any way to retrieve the element symbol using the current version of Biopython? If not, I would like to request that this functionality be added to Bio.PDB. 
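In the meantime, if only the element symbol is needed, one workaround is to read it directly from the fixed width columns of the ATOM/HETATM records without going through Bio.PDB at all. A minimal, untested sketch (the file name is just an example):

elements = []
for line in open("example.pdb"):
    if line.startswith("ATOM") or line.startswith("HETATM"):
        elements.append(line[76:78].strip())

Columns 77-78 of the PDB format hold the element symbol, which is the [76:78] slice with Python's zero based counting. Obviously this is no substitute for having Bio.PDB expose it properly.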
Best regards, Konrad From konrad.koehler at mac.com Thu Nov 5 15:51:08 2009 From: konrad.koehler at mac.com (Konrad Koehler) Date: Thu, 05 Nov 2009 16:51:08 +0100 Subject: [Biopython] Fwd: Bio.PDB: parsing PDB files for ATOM records Message-ID: <13397911330439330334250671630377437584-Webmail@me.com> Contray to my first post, the modifications to Bio.PDB outlined below: http://osdir.com/ml/python.bio.general/2008-04/msg00038.html do work with the lastest version of Bio.PDB. (I must have introduced a typo in my first try, on the second try it worked perfectly). I would however request that these changes be incorporated into the production version of Bio.PDB. Best regards, Konrad >From: "Konrad Koehler" >To: >Date: November 05, 2009 03:23:05 PM CET >Subject: [Biopython] Bio.PDB: parsing PDB files for ATOM records > >Hello everyone, > >I wanted to use Bio:PDB to retrieve the atom element symbol from columns 77-78 of the PDB file. This is apparently not possible with the lastest version of Biopython 1.52. > >Some time ago, Macro Zhu posted the following fix: > >http://osdir.com/ml/python.bio.general/2008-04/msg00038.html > >which I have tried to implement in the current 1.52 version, however I cannot seem to get this to work. > >Is there any way to retrieve the element symbol using the current version of Biopython? If not, I would like to request that this functionality be added to Bio.PDB. > >Best regards, > >Konrad > >_______________________________________________ >Biopython mailing list - Biopython at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython > > From jon.brate at bio.uio.no Fri Nov 6 00:30:14 2009 From: jon.brate at bio.uio.no (=?ISO-8859-1?Q?Jon_Br=E5te?=) Date: Fri, 6 Nov 2009 01:30:14 +0100 Subject: [Biopython] Parsing Blast results in XML format Message-ID: <4AAC53FE-86C2-4DAC-880C-A45D270B9C57@bio.uio.no> Dear all, I have a Blast output file in xml format generated by qBlast done through biopython. The Blast was performed with 22 query sequences and 50 hits were returned for each query. The result is in one single xml file. I want to extract all the sequence IDs for all the hits (22x50) and I have been checking out the BioPython cookbook page 53. I am using this code, but I am only getting the 50 hits for the 1st query sequence: from Bio.Blast import NCBIXM blast_records = NCBIXML.parse(result_handle) blast_record = blast_records.next() save_file = open("/Users/jonbra/Desktop/my_fasta_seq.fasta", 'w') for alignment in blast_record.alignments: for hsp in alignment.hsps: save_file.write('>%s\n' % (alignment.title,)) save_file.close() Can anyone help me to retrieve all the hits for all the query sequences? Best wishes Jon Br?te ------------------------------------------------------------- Jon Br?te PhD student Microbial Evolution Research Group (MERG) Department of Biology University of Oslo P.b. 1066 Blindern N-0316 Oslo Norway Phone: +47 22855083 From jon.brate at bio.uio.no Fri Nov 6 00:29:39 2009 From: jon.brate at bio.uio.no (=?ISO-8859-1?Q?Jon_Br=E5te?=) Date: Fri, 6 Nov 2009 01:29:39 +0100 Subject: [Biopython] Parsing Blast results in XML format Message-ID: Dear all, I have a Blast output file in xml format generated by qBlast done through biopython. The Blast was performed with 22 query sequences and 50 hits were returned for each query. The result is in one single xml file. I want to extract all the sequence IDs for all the hits (22x50) and I have been checking out the BioPython cookbook page 53. 
I am using this code, but I am only getting the 50 hits for the 1st query sequence: from Bio.Blast import NCBIXM blast_records = NCBIXML.parse(result_handle) blast_record = blast_records.next() save_file = open("/Users/jonbra/Desktop/my_fasta_seq.fasta", 'w') for alignment in blast_record.alignments: for hsp in alignment.hsps: save_file.write('>%s\n' % (alignment.title,)) save_file.close() Can anyone help me to retrieve all the hits for all the query sequences? Best wishes Jon Br?te ------------------------------------------------------------- Jon Br?te PhD student Microbial Evolution Research Group (MERG) Department of Biology University of Oslo P.b. 1066 Blindern N-0316 Oslo Norway Phone: +47 22855083 From mjldehoon at yahoo.com Fri Nov 6 04:22:17 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 5 Nov 2009 20:22:17 -0800 (PST) Subject: [Biopython] Parsing Blast results in XML format In-Reply-To: Message-ID: <820813.8529.qm@web62407.mail.re1.yahoo.com> > blast_record = blast_records.next() You're only pulling out the first Blast record. If you call blast_records.next() again, it will give you the second Blast record. And so on. Easiest solution is to have a for-loop: blast_records = NCBIXML.parse(result_handle) for blast_record in blast_records: # get the information you need from blast_record. --Michiel. --- On Thu, 11/5/09, Jon Br?te wrote: > From: Jon Br?te > Subject: [Biopython] Parsing Blast results in XML format > To: biopython at lists.open-bio.org > Date: Thursday, November 5, 2009, 7:29 PM > Dear all, > > I have a Blast output file in xml format generated by > qBlast done through biopython. The Blast was performed with > 22 query sequences and 50 hits were returned for each query. > The result is in one single xml file. I want to extract all > the sequence IDs for all the hits (22x50) and I have been > checking out the BioPython cookbook page 53. > > I am using this code, but I am only getting the 50 hits for > the 1st query sequence: > from Bio.Blast import NCBIXM > blast_records = NCBIXML.parse(result_handle) > blast_record = blast_records.next() > > save_file = > open("/Users/jonbra/Desktop/my_fasta_seq.fasta", 'w') > > for alignment in blast_record.alignments: > ? ? for hsp in alignment.hsps: > ? ? ? ? ? ? > save_file.write('>%s\n' % (alignment.title,)) > save_file.close() > Can anyone help me to retrieve all the hits for all the > query sequences? > > Best wishes > > Jon Br?te > > > ------------------------------------------------------------- > Jon Br?te > PhD student > > Microbial Evolution Research Group (MERG) > Department of Biology > University of Oslo > P.b. 1066 Blindern > N-0316 Oslo > Norway > Phone: +47 22855083 > > > _______________________________________________ > Biopython mailing list? -? Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Fri Nov 6 12:22:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 6 Nov 2009 12:22:03 +0000 Subject: [Biopython] Getting the sequence for a SeqFeature Message-ID: <320fb6e00911060422u2d2742d5r7b5b1db98c991df5@mail.gmail.com> Hi all, I am planing to add a new method to the SeqFeature object, but would like a little feedback first. This email is really just the background - I'll write up a few examples later to try and make this a bit clearer... A task that comes up every so often on the mailing lists, which I have needed to do myself in the past, is getting the nucleotide sequences for features in a GenBank file, e.g. 
http://lists.open-bio.org/pipermail/biopython/2007-September/003703.html http://lists.open-bio.org/pipermail/biopython/2009-October/005695.html http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005991.html http://lists.open-bio.org/pipermail/biopython-dev/2009-May/005997.html http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006958.html Often, once you have the nucleotide sequence, you'll want to translate it, e.g. CDS features or mat_peptides as here: http://lists.open-bio.org/pipermail/bioperl-l/2009-October/031493.html If you parse a GenBank file (or an EMBL file etc) with SeqIO, you typically get a SeqRecord object with the the full nucleotide sequence (as record.seq, a Seq object) and a list of features (as record.features, a list of SeqFeature objects). For most prokaryotic features, things are fairly easy - you just need the (non fuzzy) start and end positions of the SeqFeature, and the strand. Then slice the parent sequence, and take the reverse complement if required. However, there are also rare cases like joins to consider (e.g. a ribosomal slippage), but joins are common if you deal with eukaryotes since intron/exon splicing is normal. Here you need to look at the subfeatures, and their locations - and indeed their strands, as there are a few mixed strand features in real GenBank files. In the above examples I have been thinking about genomes, or any nucleic sequence - but the same applies to proteins where the features might be the positions of domains. All the same issues apply except for strands. As noted in the linked threads, I have some working code currently on a github branch with unit tests which seems to handle all this. I would like to include this in Biopython, but first would like a little feedback on the proposed interface. What I am proposing is adding a method to the SeqFeature object taking the parent sequence (as a Seq like object, or even a string) as a required argument. This would return the region of the parent sequence described by the feature location and strands (including any subfeatures for joins). This could instead be done as a stand alone function, or as a method of the Seq object (as I suggested back in 2007). However, on reflection, I think the SeqFeature is most appropriate. http://lists.open-bio.org/pipermail/biopython/2007-September/003706.html With this basic functionality in place, it would then be much easier to take a parent SeqRecord and a child SeqFeature, and build a child SeqRecord taking the sequence from the parent SeqRecord (using the above new code), and annotation from the SeqFeature. This could (later) be added to Biopython as well, perhaps as a method of the SeqRecord. As this email is already very long, I'll delay giving any examples. Peter From biopython at maubp.freeserve.co.uk Fri Nov 6 12:47:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 6 Nov 2009 12:47:42 +0000 Subject: [Biopython] Getting the sequence for a SeqFeature In-Reply-To: <320fb6e00911060422u2d2742d5r7b5b1db98c991df5@mail.gmail.com> References: <320fb6e00911060422u2d2742d5r7b5b1db98c991df5@mail.gmail.com> Message-ID: <320fb6e00911060447g779f2ac2i7739a28c3f4a4077@mail.gmail.com> On Fri, Nov 6, 2009 at 12:22 PM, Peter wrote: > Hi all, > > I am planing to add a new method to the SeqFeature object, but > would like a little feedback first. This email is really just the > background - I'll write up a few examples later to try and make > this a bit clearer... 
OK, here is a non-trivial example - the first CDS feature in the GenBank file NC_000932.gb (included as a Biopython unit test), which is a three part join on the reverse strand. In this case, the GenBank file includes the protein translation for the CDS features so we can use it to check our results. We can parse this GenBank file into a SeqRecord with: from Bio import SeqIO record = SeqIO.read(open("../biopython/Tests/GenBank/NC_000932.gb"), "gb") Let's have a look at the first CDS feature (index 2): f = record.features[2] print f.type, f.location, f.strand, f.location_operator for sub_f in f.sub_features : print " - ", sub_f.location, sub_f.strand table = f.qualifiers.get("transl_table",[1])[0] # List of one int print "Table", table Giving: CDS [97998:69724] -1 join - [97998:98024] -1 - [98561:98793] -1 - [69610:69724] -1 Table 11 Looking at the raw GenBank file, this feature has location string: complement(join(97999..98024,98562..98793,69611..69724)) i.e. To get the sequence you need to do this (note zero based Python counting as in the output above): print (record.seq[97998:98024] + record.seq[98561:98793] + record.seq[69610:69724]).reverse_complement() And then translate it using NCBI genetic code table 11, print "Manual translation:" print (record.seq[97998:98024] + record.seq[98561:98793] + record.seq[69610:69724]).reverse_complement().translate(table=11, cds=True) print "Given translation:" print f.qualifiers["translation"][0] # List of one string print "Biopython translation (with proposed code):" print f.extract(record.seq).translate(table, cds=True) And the output, together with the provided translation in the feature annotation, and the shortcut with the new code I am proposing to include in Biopython: Manual translation: MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKKPNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVGVKDRQQGRSKYGVKKPK Given translation: MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKKPNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVGVKDRQQGRSKYGVKKPK Biopython translation: MPTIKQLIRNTRQPIRNVTKSPALRGCPQRRGTCTRVYTITPKKPNSALRKVARVRLTSGFEITAYIPGIGHNLQEHSVVLVRGGRVKDLPGVRYHIVRGTLDAVGVKDRQQGRSKYGVKKPK The point of all this was with the proposed new extract method, you just need: feature_seq = f.extract(record.seq) instead of: feature_seq = (record.seq[97998:98024] + record.seq[98561:98793] + record.seq[69610:69724]).reverse_complement() which is in itself a slight simplification since you'd need to get the those coordinates from the sub features, worry about strands, etc. Peter From biopython at maubp.freeserve.co.uk Mon Nov 9 11:21:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 9 Nov 2009 11:21:21 +0000 Subject: [Biopython] Biopython & p3d In-Reply-To: <7278B893-A99D-45A9-B96E-F20653EB25AD@uni-muenster.de> References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de> <320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com> <905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de> <320fb6e00910210401l737252deg78de117143395279@mail.gmail.com> <320fb6e00910211514r952732fw6c6d48f9be0e46c9@mail.gmail.com> <7278B893-A99D-45A9-B96E-F20653EB25AD@uni-muenster.de> Message-ID: <320fb6e00911090321w2af06272x18d59615942d8dae@mail.gmail.com> On Mon, Nov 9, 2009 at 10:57 AM, Christian Fufezan wrote: > > back ! 
:) > > lets get back into the discussion (or sum it up) > > The consensus was > a) both packages (biopython.pdb and p3d) have advantages > b) possibly merge both modules while keeping the best of both of them could > be an interesting step forward. Hi Christian - thanks for getting back to us. That seems like a fair summary. For those that missed it, the thread is archived here: http://lists.open-bio.org/pipermail/biopython/2009-October/005721.html > On 22 Oct 2009, at 00:14, Peter wrote: > >> On Wed, Oct 21, 2009 at 7:22 PM, Christian Fufezan wrote: >>>> >>>> Biopython might be improved by defining an atom >>>> property (list or iterator?) instead of the get_atoms() method. >>> >>> agree. ?I would argue that p3d's atom/vector class seems the way to go. >> >> We can probably have similar things for chains etc. Any other >> views on this? I never liked the get_* and set_* methods in >> Bio.PDB myself, and using Python properties seem more >> natural here (they may not have existing when Bio.PDB was >> first started - I'd have to check). >> >> [We should probably break out specific suggestions like this >> into new mailing list threads, and CC people like Thomas H.] I must do that... without looking into the details, it seems like a relatively straightforward addition which should make Bio.PDB easier to use. >> The drill down is great for selecting a particular residue or >> chain (or for NMR, a particular model). It is also good for >> looping over these structures - e.g. to process psi/phi >> angles along a protein backbone. > > cannot really see an advantage here. If one can directly access all the > atoms one's interested in with one line and then just collect phi,psi > angles, why would one need to drill down through the structures? > > Looping over structure elements is even more refined with the natural > human language interface: > imagine: residues_of_interest = protein.query('alpha and residue > 12..51 and model 2') > > if you like looking you can also do for model in models: > protein.query('alpha and residue 12..51 and model',model) > > or > > for residue in range (12,51): > ?protein.query('alpha and residue' , residue , 'and model 2') > > but looping over each residue and then do a conditional check if the residue > is in range (12-51) and if atom type is alpha carbon seems for me a bit of > an overhead. In fact that's one of the point I like about p3d most. one can > define the query in a way that nested loops are rarely need. Imagine you > want to collect chi1 angles of all His... In psuedo code, I would picture something like this: [residue.chi1 for residue in model.residues if residue.name="His"] (That almost certainly won't work as is with Bio.PDB, I'm just tying to convey how I would expect to be able to tackle the problem with a list comprehension) > from the following (I chopped some bits ... ), I can read that biopythons > pdb module (with numpy) works similar to p3d - or to be more correct > p3d works like biopython in combination with numpy, in the sense that one > can use atoms as vectors. That seems like a fair summary. In p3d, the atoms are (also) vector like objects, while in Biopython, the atoms have a numpy coord property. As long as you are happy with numpy, this allows fast and efficient vector operations. >>> so writing an structural alignment script is straight forward >>> (see e.g. http://p3d.fufezan.net/index.php?title=alignByATP). >> >> Structural alignment is not so different in Biopython - just the details. >> e.g. 
>> http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ >> > very nice - like the Bio.PDB.Superimposer(). It does all the vector > operations needed to align structures, nice. Involvement of numpy certainly > makes it powerful. Indeed - numpy is *very* powerful. > The nested loops to find all alpha carbons is a biopython.pdb classic ;) I would probably write that with a list comprehension nowadays, but they are essentially just syntactic sugar for (nested) loops. > to round thinks up: > p3ds strength comes with the natural human user interface that allows the > combination of sets and the spatial information (less nested loops). > However, I am not sure if the biopython's community wants such an extension. > Biopython.pdb has a long history, it works like it is and users are > comfortable with it, so maybe there is not much to merge after all. That seems fair, although that doesn't mean there aren't things we can improve in Bio.PDB (moving from get/set methods to properties for example). My personal view (and I did not write Bio.PDB and have only made relatively light usage of it) is that working with the nested structures (of the flattened lists) it provides is fairly natural with Python lists, or list comprehensions. The p3d "natural language" interface is an interesting abstraction, and may be easier for some, but to me is just another layer on top of the raw functionality - and another query syntax to learn. That said, it probably would be possible to layer something like this on top of the existing Bio.PDB objects (but I personally have no interest in doing this, and no need for it - keeping on top of the sequence side of things in Biopython is enough to keep me busy!). I would be delighted if other people on the people on the mailing list who *do* work with PDB files could comment. e.g. Thomas and Kristian, cc'd. Peter From fufezan at uni-muenster.de Mon Nov 9 10:57:13 2009 From: fufezan at uni-muenster.de (Christian Fufezan) Date: Mon, 9 Nov 2009 11:57:13 +0100 Subject: [Biopython] Biopython & p3d In-Reply-To: <320fb6e00910211514r952732fw6c6d48f9be0e46c9@mail.gmail.com> References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de> <320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com> <905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de> <320fb6e00910210401l737252deg78de117143395279@mail.gmail.com> <320fb6e00910211514r952732fw6c6d48f9be0e46c9@mail.gmail.com> Message-ID: <7278B893-A99D-45A9-B96E-F20653EB25AD@uni-muenster.de> back ! :) lets get back into the discussion (or sum it up) The consensus was a) both packages (biopython.pdb and p3d) have advantages b) possibly merge both modules while keeping the best of both of them could be an interesting step forward. On 22 Oct 2009, at 00:14, Peter wrote: > On Wed, Oct 21, 2009 at 7:22 PM, Christian Fufezan wrote: >>> Biopython might be improved by defining an atom >>> property (list or iterator?) instead of the get_atoms() method. >> >> agree. I would argue that p3d's atom/vector class seems the way to >> go. > > We can probably have similar things for chains etc. Any other > views on this? I never liked the get_* and set_* methods in > Bio.PDB myself, and using Python properties seem more > natural here (they may not have existing when Bio.PDB was > first started - I'd have to check). > > [We should probably break out specific suggestions like this > into new mailing list threads, and CC people like Thomas H.] 
> >>> One might also ask for x, y and z properties on the atom object >>> to provide direct access to the three coordinates as floats. Do >>> you think this sort of little thing would help improve Bio.PDB? >>> >> yes indeed, that is _the_ information a pdb module should offer >> without any addition. Better would be even if the atoms are >> treatable as vectors (see below). p3d has a series of atom >> object attributes that are convenient. > > I would argue that the x-y-z triple (which Biopython has) is > more important that separate x, y, and z floats. We seem > to agree here. > What I meant is that I think the most important thing a pdb module should offer is the possibility to do vector operations directly with atom objects, i.e. before translating them. Whether the values are stored in three attributes (.x,.y,.z, p3d) or as a tuple (biopython), seems not really important as long simple vector operations are possible. > The Biopython atom's coord property is an x-y-z triple (as a > one dimensional numpy array). The Bio.PDB code also > defines its own vector objects on top of this, but my memory > of the details is hazy here. As I recall, I personally stuck > with the numpy objects in my scripts using Bio.PDB. > The version I used, one had to convert the entity into a vector. But that's already some time ago, I guess. >>> Yes, it should be possible to offer nice nested access and nice flat >>> access from the same objects. Internally the current Biopython PDB >>> structure could perhaps be handled as filtered views of a complete >>> list of all the atoms (using sets and trees or a database or >>> whatever). >>> That might make some things faster too. >> >> I agree to some extent. As above, I can only say that I >> cannot see the advantage of a nested data structure. >> Maybe you can explain with an example where drilling >> through the nested structure could come in handy. > > The drill down is great for selecting a particular residue or > chain (or for NMR, a particular model). It is also good for > looping over these structures - e.g. to process psi/phi > angles along a protein backbone. cannot really see an advantage here. If one can directly access all the atoms one's interested in with one line and then just collect phi,psi angles, why would one need to drill down through the structures? Looping over structure elements is even more refined with the natural human language interface: imagine: residues_of_interest = protein.query('alpha and residue 12..51 and model 2') if you like looking you can also do for model in models: protein.query('alpha and residue 12..51 and model',model) or for residue in range (12,51): protein.query('alpha and residue' , residue , 'and model 2') but looping over each residue and then do a conditional check if the residue is in range (12-51) and if atom type is alpha carbon seems for me a bit of an overhead. In fact that's one of the point I like about p3d most. one can define the query in a way that nested loops are rarely need. Imagine you want to collect chi1 angles of all His... > >>>> Yes that was one thing that we were really missing. Also the fact >>>> that >>>> biopython requires the unfolded entity to be converted to vectors >>>> and so >>>> forth was a bit complex and we needed fast and direct access to the >>>> coordinates, the very essence of pdb files. >>> >>> I'm not quite sure what you mean here by "vectors". Could you >>> be a little more specific? Do you want NumPy style objects or >>> something else? 
>> >> In p3d the atom objects are vectors, > > I don't immediately see what the intention is here. What does > "adding" or "subtracting" two atom/vector objects give you? A > new non-atom vector would be my guess? What about > multiplying by a scaler? Again, getting a non-atom vector > object back makes most sense. > Yes, right one gets a vector back. This vector can then be used in the query function. Imagine you want to survey residues that span a membrane along a given path. With p3d you can easily generate a series of vectors and more importantly, one can use these vectors in the query function. for c in [k/10.0 * (startVector-endVector) for k in range(1,10)]: pdb.query('protein and within 3 of ' c) to visualize the path in e.g. VMD one can also print those vectors in a pdb format. from the following (I chopped some bits ... ), I can read that biopythons pdb module (with numpy) works similar to p3d - or to be more correct p3d works like biopython in combination with numpy, in the sense that one can use atoms as vectors. >> so writing an structural alignment script is straight forward >> (see e.g. http://p3d.fufezan.net/index.php?title=alignByATP). > > Structural alignment is not so different in Biopython - just the > details. e.g. > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > very nice - like the Bio.PDB.Superimposer(). It does all the vector operations needed to align structures, nice. Involvement of numpy certainly makes it powerful. The nested loops to find all alpha carbons is a biopython.pdb classic ;) to round thinks up: p3ds strength comes with the natural human user interface that allows the combination of sets and the spatial information (less nested loops). However, I am not sure if the biopython's community wants such an extension. Biopython.pdb has a long history, it works like it is and users are comfortable with it, so maybe there is not much to merge after all. From ap12 at sanger.ac.uk Mon Nov 9 15:29:20 2009 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Mon, 9 Nov 2009 15:29:20 +0000 Subject: [Biopython] fasta-m10 al_start and al_end? In-Reply-To: <320fb6e00910260717r50974e1epb7ee94d41ff5aa6c@mail.gmail.com> References: <82A8DBBC-6964-4433-AF9C-EF049160A2DF@sanger.ac.uk> <320fb6e00910231140w46243f9bp9751b6476e3e55b0@mail.gmail.com> <1E613DD7-3CF3-4F12-8C61-D12440F5AE1D@sanger.ac.uk> <320fb6e00910260304j417fa20ep2b353b6e74ca055d@mail.gmail.com> <320fb6e00910260717r50974e1epb7ee94d41ff5aa6c@mail.gmail.com> Message-ID: Hi Peter, Thanks for adding these private variables. They are called _al_start and _al_stop. While testing the code today, I found a little bug. For the match record: alignment.add_sequence(match_descr, match_align_seq) record = alignment.get_all_seqs()[-1] assert record.id == match_descr or record.description == match_descr #assert record.seq.tostring() == match_align_seq record.id = match_descr.split(None,1)[0].strip(",") record.name = "match" record.annotations["original_length"] = int(match_annotation["sq_len"]) #TODO - handle start/end coordinates properly. Short term hack for now: record._al_start = int(query_annotation["al_start"]) record._al_stop = int(query_annotation["al_stop"]) the al_start and al_stop should be taken from match_annotation instead of query_annotation, I think. Kind regards, Anne. On 26 Oct 2009, at 14:17, Peter wrote: > On Mon, Oct 26, 2009 at 10:04 AM, Peter > wrote: >> On Fri, Oct 23, 2009 at 11:00 PM, Anne Pajon >> wrote: >>> >>> Hi Peter, >>> >>> Thanks for your fast answer. 
>>> >>> I've already discovered the _annotations and I am prepared to >>> update my >>> code as soon as a better solution is provided. >> >> Good. >> >>> Concerning the al_start and al_end, I am looking for a solution >>> very soon, >>> as I am working on an annotation pipeline prototype in python. >>> What would be >>> your recommendation? Writing a parser myself, using another tool >>> (but which >>> one?), or helping storing this information in SeqRecord in >>> biopython as it >>> is almost there. Thanks to let me know. >> >> I would rather not add them directly to the SeqRecord annotations >> dictionary because that will make doing something meaningful with >> slicing (the SeqRecord, or in future the Alignment) much harder. I >> think the best way to handle these is in the Alignment object, but >> this isn't really supported at the moment. >> >> Are you happy to run a development version of Biopython, or at least >> to update the file Bio/AlignIO/FastaIO.py? I'm thinking in the short >> term we can record these bits of information as private properties of >> the SeqRecord, i.e. _al_start and _al_end > > Make that _al_start and _al_end (to match the field names used in > the FASTA output). This change is in the repository now, which you > can grab via github. See http://www.biopython.org/wiki/SourceCode > > As with any "private" variables (leading underscore), they are not > really intended for public use, but should at least solve your > immediate requirement for now. > > Peter -- Dr Anne Pajon - Pathogen Genomics Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From biopython at maubp.freeserve.co.uk Mon Nov 9 15:46:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 9 Nov 2009 15:46:05 +0000 Subject: [Biopython] fasta-m10 al_start and al_end? In-Reply-To: References: <82A8DBBC-6964-4433-AF9C-EF049160A2DF@sanger.ac.uk> <320fb6e00910231140w46243f9bp9751b6476e3e55b0@mail.gmail.com> <1E613DD7-3CF3-4F12-8C61-D12440F5AE1D@sanger.ac.uk> <320fb6e00910260304j417fa20ep2b353b6e74ca055d@mail.gmail.com> <320fb6e00910260717r50974e1epb7ee94d41ff5aa6c@mail.gmail.com> Message-ID: <320fb6e00911090746udc6cfb3l5cfbc72a4cf190c8@mail.gmail.com> On Mon, Nov 9, 2009 at 3:29 PM, Anne Pajon wrote: > > Hi Peter, > > Thanks for adding these private variables. They are called _al_start and > _al_stop. > > While testing the code today, I found a little bug. For the match record: > .. > the al_start and al_stop should be taken from match_annotation instead of > query_annotation, I think. > > Kind regards, > Anne. Yes, you are absolutely right. Sorry about that - fixed now. Peter From biopython at maubp.freeserve.co.uk Thu Nov 12 12:04:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 12 Nov 2009 12:04:32 +0000 Subject: [Biopython] Additions to the SeqRecord Message-ID: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> Hello all, Something we added in Biopython 1.50 was the ability to slice a SeqRecord, which tries to do something sensible with all the annotation - in particular per-letter-annotation (like quality scores) and features (which have locations) are handled as you would naturally expect. 
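For example (file name made up, but this is the sort of thing the slicing supports), per-letter-annotation such as FASTQ quality scores gets sliced in step with the sequence:

from Bio import SeqIO
record = SeqIO.parse(open("example.fastq"), "fastq").next()
sub = record[10:30]
print len(sub), len(sub.letter_annotations["phred_quality"])  # both 20, assuming the read is long enough

and any SeqFeature falling entirely inside the sliced region is kept, with its location shifted to match.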
Something you can look forward to in our next release (assuming no major issues crop up in testing) is adding SeqRecord objects together. Again, this will try and do something unambiguous with the annotation. I have two motivational examples in mind which combine slicing and addition of SeqRecord objects to edit a record while preserving as much annotation as possible. For example, removing a section of sequence, say letters from 100 to 200: from Bio import SeqIO record = SeqIO.read(...) deletion_mutant = record[:100] + record[200:] (The above would make sense for both protein and nucleotide records). Or, for a circular nucleotide sequence (like a plasmid or many small genomes), you might want to shift the origin, e.g. by 150 bases: shifted = record[150:] + record[:150] You can already do both these examples with the latest (unreleased) code. However, the situation with the annotation isn't ideal. When slicing a record, for non-location based annotation there is no way to know for sure if the annotation still applies to the daughter sequence. Therefore in the face of this ambiguity, when we added SeqRecord slicing in Biopython 1.50, we did not copy the dbxrefs and annotations dictionary to the daughter record. i.e. You currently have to do this manually (if required), for example: deletion_mutant = record[:100] + record[200:] deletion_mutant.dbxrefs = record.dbxrefs[:] deletion_mutant.annotations = record.annotations.copy() I would like to propose changing the SeqRecord slice behaviour to blindly copy the dbxrefs list and annotations dict to the daughter record (just like the id, name and description are already blindly copied even though they may not make sense for the daughter record). Then these slicing+addition examples will "just work" without the user having to explicitly copy the dbxrefs and annotations dict. This is a non-backwards compatible change, but with hindsight is perhaps a more natural behaviour. We would of course highlight this in the release notes (maybe with some worked examples on the blog). Does changing SeqRecord slicing like this seem like a good idea? Peter P.S. The code changes required are very small (two extra lines), see this commit on my experimental branch on github for details - most of the changes are documentation and unit tests for this work: http://github.com/peterjc/biopython/commit/41e944f338476a79bd7f8998196df21a1c06d4f7 From lpritc at scri.ac.uk Thu Nov 12 13:47:20 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 12 Nov 2009 13:47:20 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> Message-ID: Hi, To avoid issues with the inadvertent propagation of inappropriate annotation, I'd be more comfortable with it being an optional feature of the slice - to be used when appropriate and with caution - than the default behaviour. One counterexample I can think of is the slicing of a sequence for which a feature or annotation applies only to a subregion of the SeqRecord. This is not an uncommon property of modular proteins. If I were to slice the N-terminal domains of a set of sequences with distinct N- and C-terminal domains, I would not want to carry through annotation for the C-terminal domains. If I did this without noticing, there may be a danger of, say, downstream use inferring inappropriate class membership if I wanted to generate a set of sequences containing that C-terminal domain, and I did this automatically based on the annotation of a SeqRecord. 
Another counterexample would be propagated inappropriate class membership for annotations that require a complete sequence for context. For example, many bacterial CDS annotations feature reports of BLAST matches to other databases. These are results derived from the full length feature, and the BLAST match obtained from the slice result is likely to differ. Having seen first-hand the propagation of faulty annotations (e.g. presence of a signal peptide and other functionally-related motifs) through to cloning - and the resultant waste of time, money and other resources - I would seek to avoid this kind of behaviour. As it is, the propagation of sequence ID and description without modification to indicate that a copy and potential change has been done is potentially dangerous, and needs to be done with some care to avoid 'poisoning the well'. The behaviour you describe makes most sense in the context of per-letter-annotation (as this is the natural granularity of the changes), and for relatively small changes to a large sequence containing multiple features whose annotations are reasonably self-contained. I too would like to be able to treat these specially on occasion, conserving much of the annotation. However, I think the potential pitfalls are pretty significant and would not want this to be default behaviour. A third way might be only to include those annotations with location data where the region covered by the annotation is not disrupted by the slicing. For example, a slice/addition that removed sites 200-300 would retain features/annotations that ran from 120-199 and 301-350, but not carry forward features that ran from 120-201, or from 250-301. Features and annotations that span the full record length would not be carried forward under this proposal. Best, L. On 12/11/2009 12:04, "Peter" wrote: > Hello all, > > Something we added in Biopython 1.50 was the ability to slice a SeqRecord, > which tries to do something sensible with all the annotation - in particular > per-letter-annotation (like quality scores) and features (which have > locations) > are handled as you would naturally expect. > > Something you can look forward to in our next release (assuming no > major issues crop up in testing) is adding SeqRecord objects together. > Again, this will try and do something unambiguous with the annotation. > > I have two motivational examples in mind which combine slicing and > addition of SeqRecord objects to edit a record while preserving as much > annotation as possible. For example, removing a section of sequence, > say letters from 100 to 200: > > from Bio import SeqIO > record = SeqIO.read(...) > deletion_mutant = record[:100] + record[200:] > > (The above would make sense for both protein and nucleotide records). > Or, for a circular nucleotide sequence (like a plasmid or many small > genomes), you might want to shift the origin, e.g. by 150 bases: > > shifted = record[150:] + record[:150] > > You can already do both these examples with the latest (unreleased) code. > However, the situation with the annotation isn't ideal. When slicing a record, > for non-location based annotation there is no way to know for sure if the > annotation still applies to the daughter sequence. Therefore in the face of > this ambiguity, when we added SeqRecord slicing in Biopython 1.50, we > did not copy the dbxrefs and annotations dictionary to the daughter record. > i.e. 
You currently have to do this manually (if required), for example: > > deletion_mutant = record[:100] + record[200:] > deletion_mutant.dbxrefs = record.dbxrefs[:] > deletion_mutant.annotations = record.annotations.copy() > > I would like to propose changing the SeqRecord slice behaviour to > blindly copy the dbxrefs list and annotations dict to the daughter record > (just like the id, name and description are already blindly copied even > though they may not make sense for the daughter record). Then these > slicing+addition examples will "just work" without the user having to > explicitly copy the dbxrefs and annotations dict. > > This is a non-backwards compatible change, but with hindsight is > perhaps a more natural behaviour. We would of course highlight this > in the release notes (maybe with some worked examples on the blog). > > Does changing SeqRecord slicing like this seem like a good idea? > > Peter > > P.S. The code changes required are very small (two extra lines), see > this commit on my experimental branch on github for details - most > of the changes are documentation and unit tests for this work: > http://github.com/peterjc/biopython/commit/41e944f338476a79bd7f8998196df21a1c0 > 6d4f7 > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > ______________________________________________________________________ > This email has been scanned by the MessageLabs Email Security System. > For more information please visit http://www.messagelabs.com/email > ______________________________________________________________________ -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
______________________________________________________ From biopython at maubp.freeserve.co.uk Thu Nov 12 14:08:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 12 Nov 2009 14:08:46 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> Message-ID: <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> On Thu, Nov 12, 2009 at 1:47 PM, Leighton Pritchard wrote: > > Hi, > > To avoid issues with the inadvertent propagation of inappropriate > annotation, I'd be more comfortable with it being an optional feature of the > slice - to be used when appropriate and with caution - than the default > behaviour. Better safe than sorry? > One counterexample I can think of is the slicing of a sequence for which a > feature or annotation applies only to a subregion of the SeqRecord. ?This is > not an uncommon property of modular proteins. ?If I were to slice the > N-terminal domains of a set of sequences with distinct N- and C-terminal > domains, I would not want to carry through annotation for the C-terminal > domains. If I did this without noticing, there may be a danger of, say, > downstream use inferring inappropriate class membership if I wanted to > generate a set of sequences containing that C-terminal domain, and I did > this automatically based on the annotation of a SeqRecord. > > Another counterexample would be propagated inappropriate class membership > for annotations that require a complete sequence for context. ?For example, > many bacterial CDS annotations feature reports of BLAST matches to other > databases. ?These are results derived from the full length feature, and the > BLAST match obtained from the slice result is likely to differ. Both good examples. > Having seen first-hand the propagation of faulty annotations (e.g. presence > of a signal peptide and other functionally-related motifs) through to > cloning - and the resultant waste of time, money and other resources - I > would seek to avoid this kind of behaviour. ?As it is, the propagation of > sequence ID and description without modification to indicate that a copy and > potential change has been done is potentially dangerous, and needs to be > done with some care to avoid 'poisoning the well'. Yes - as already noted in the documentation, the id/name/description may not apply to the sliced record, and some caution is advisable. > The behaviour you describe makes most sense in the context of > per-letter-annotation (as this is the natural granularity of the changes), > and for relatively small changes to a large sequence containing multiple > features whose annotations are reasonably self-contained. I too would like > to be able to treat these specially on occasion, conserving much of the > annotation. ?However, I think the potential pitfalls are pretty significant > and would not want this to be default behaviour. OK. So the current behaviour on the trunk is acceptable (for annotation where we know the location), but the proposed change for location-less annotation is too risky. > A third way might be only to include those annotations with location data > where the region covered by the annotation is not disrupted by the slicing. > For example, a slice/addition that removed sites 200-300 would retain > features/annotations that ran from 120-199 and 301-350, but not carry > forward features that ran from 120-201, or from 250-301. 
?Features and > annotations that span the full record length would not be carried forward > under this proposal. Exactly - SeqFeatures entirely within the sliced region are kept. Those outside the sliced region (or crossing the boundary) are lost. As a result, because GenBank-style source feature span the whole sequence, they are lost on slicing to a sub-sequence. This is the current behaviour and I wasn't suggesting any changes. General annotation in the SeqRecord's annotation dictionary has no location information - it may apply to the whole sequence (e.g from organism X) or just part (e.g. a text note it contains XXX domain). Likewise the database cross reference list. The dbxref list and annotations dict are thus the hardest to handle - the only practical automatic actions on slicing are to discard them (the current behaviour on Biopython 1.50 to date), or keep them all as per my suggestion (which as you stress, is risky). In light of Leighton's valid concerns, and weighing this against the limited benefits which only apply in special cases like the examples I gave, let's leave things as they are. i.e. Explicit is better than implicit (Zen of Python), if you want to propagate the annotations dict and dbxrefs to a sliced record, you must continue do it explicity. Thanks for the feedback! Peter From biopython at maubp.freeserve.co.uk Thu Nov 12 16:53:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 12 Nov 2009 16:53:31 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <65d4b7fc0911120837v1a3f2a41scd128adbd2be615e@mail.gmail.com> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <65d4b7fc0911120837v1a3f2a41scd128adbd2be615e@mail.gmail.com> Message-ID: <320fb6e00911120853r32612646s7b80e0e3d320097c@mail.gmail.com> On Thu, Nov 12, 2009 at 4:37 PM, Carlos Javier Borroto wrote: > > On Thu, Nov 12, 2009 at 7:04 AM, Peter wrote: >> You can already do both these examples with the latest (unreleased) code. > > I'll love to test this unreleased code, is there any documentation on > how to install from git? Yes, first grab the source code from git, or via the github download link: http://biopython.org/wiki/SourceCode Then install from source - just like you would from a zip or tarball. There are instructions for this on the download page: http://biopython.org/wiki/Download#Installation_Instructions Peter From villahozbale at wisc.edu Thu Nov 12 19:51:11 2009 From: villahozbale at wisc.edu (ANGEL VILLAHOZ-BALETA) Date: Thu, 12 Nov 2009 13:51:11 -0600 Subject: [Biopython] ExPASy.get_sprot_raw giving a FASTA record instead of a Swiss-Prot/UniprotKB record? Message-ID: <6fc0f5c82c743.4afc12cf@wiscmail.wisc.edu> Hi to all, I am using Biopython 1.5.1 and it seems that I have met with a strange situation... When using ExPASy.get_sprot_raw, it gives me a FASTA record instead of a Swiss-Prot/UniProtKB record... Anyone has met the same situation? You can test the following example: from Bio import ExPASy from Bio import SeqIO handle = ExPASy.get_sprot_raw("O23729") seq_record = SeqIO.read(handle, "swiss") handle.close() print seq_record.id print seq_record.name print seq_record.description print repr(seq_record.seq) print "Length %i" % len(seq_record) print seq_record.annotations["keywords"] and write me its result... Thanks very much, Angel Villahoz-Baleta. 
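As the reply that follows explains, the old ExPASy URL had started redirecting to the FASTA file. A possible stop-gap, assuming the plain text Swiss-Prot record really is available at www.uniprot.org/uniprot/<accession>.txt (the URL given in that reply), is to fetch it directly and parse it as before - a sketch of a workaround only, not the supported route via Bio.ExPASy:

import urllib2  # Python 2, as used elsewhere in this thread
from Bio import SeqIO

handle = urllib2.urlopen("http://www.uniprot.org/uniprot/O23729.txt")
seq_record = SeqIO.read(handle, "swiss")
handle.close()
print seq_record.id
print seq_record.description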
From biopython at maubp.freeserve.co.uk Thu Nov 12 23:43:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 12 Nov 2009 23:43:35 +0000 Subject: [Biopython] ExPASy.get_sprot_raw giving a FASTA record instead of a Swiss-Prot/UniprotKB record? In-Reply-To: <6fc0f5c82c743.4afc12cf@wiscmail.wisc.edu> References: <6fc0f5c82c743.4afc12cf@wiscmail.wisc.edu> Message-ID: <320fb6e00911121543s65167605kb5541ac6f8fdd449@mail.gmail.com> On Thu, Nov 12, 2009 at 7:51 PM, ANGEL VILLAHOZ-BALETA wrote: > Hi to all, > > I am using Biopython 1.5.1 and it seems that I have met with > a strange situation... When using ExPASy.get_sprot_raw, it > gives me a FASTA record instead of a Swiss-Prot/UniProtKB > record... > > Anyone has met the same situation? I hadn't tried this recently, but you are right. It looks like ExPASy/UniProt have broken this :( The URL which Biopython requests is: http://www.expasy.ch/cgi-bin/get-sprot-raw.pl?O23729 You can check via a tool like wget that this is now being redirected to a URL giving the FASTA file: http://www.uniprot.org/uniprot/O23729.fasta Please contact ExPASy/uniprot to alert them that they have broken this old URL redirection, and ask them nicely to fix it to point here in order to get the swiss format: http://www.uniprot.org/uniprot/O23729.txt Thanks! Peter P.S. Perhaps we should also update our URLs, but that won't help people using the current version of Biopython. From chapmanb at 50mail.com Fri Nov 13 13:23:46 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 13 Nov 2009 08:23:46 -0500 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> Message-ID: <20091113132346.GB48178@sobchak.mgh.harvard.edu> Hi Peter; [...Discussion on what to do with full length features and annotations when slicing SeqRecords...] > Exactly - SeqFeatures entirely within the sliced region are kept. Those > outside the sliced region (or crossing the boundary) are lost. As a result, > because GenBank-style source feature span the whole sequence, they > are lost on slicing to a sub-sequence. This is the current behaviour and > I wasn't suggesting any changes. > > General annotation in the SeqRecord's annotation dictionary has no > location information - it may apply to the whole sequence (e.g from > organism X) or just part (e.g. a text note it contains XXX domain). > Likewise the database cross reference list. > > The dbxref list and annotations dict are thus the hardest to handle - > the only practical automatic actions on slicing are to discard them > (the current behaviour on Biopython 1.50 to date), or keep them all > as per my suggestion (which as you stress, is risky). Good discussion. Agreed that copying may be confusing. One hybrid approach is to provide a function make makes copying them easy if someone does want to save the annotations, dbxrefs and full length feature sources: sliced = rec[:100] sliced.set_full_length_features(rec) where set_full_length_features copied over the annotations and dbxrefs, ala your code example: deletion_mutant.dbxrefs = record.dbxrefs[:] deletion_mutant.annotations = record.annotations.copy() and perhaps also added any whole sequence sequence features from the original SeqRecord. This would help with discoverability for people who do want to retain all of the source and other high level information when they slice. 
Brad From biopython at maubp.freeserve.co.uk Fri Nov 13 13:51:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 13 Nov 2009 13:51:48 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <20091113132346.GB48178@sobchak.mgh.harvard.edu> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> <20091113132346.GB48178@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00911130551h21f5c438gfbb68cc25bd9ea7b@mail.gmail.com> On Fri, Nov 13, 2009 at 1:23 PM, Brad Chapman wrote: > Hi Peter; > > [...Discussion on what to do with full length features and annotations > ?when slicing SeqRecords...] > > Good discussion. Agreed that copying may be confusing. One hybrid > approach is to provide a function make makes copying them easy if > someone does want to save the annotations, dbxrefs and full length > feature sources: > > sliced = rec[:100] > sliced.set_full_length_features(rec) > > where set_full_length_features copied over the annotations and > dbxrefs, ala your code example: > > deletion_mutant.dbxrefs = record.dbxrefs[:] > deletion_mutant.annotations = record.annotations.copy() > > and perhaps also added any whole sequence sequence features from the > original SeqRecord. This would help with discoverability for people > who do want to retain all of the source and other high level information > when they slice. > > Brad Hi Brad. Interesting idea - but I'm not sure about that name (maybe something like copy_annotation would be better?) and personally don't think it is actually any clearer than the two lines: deletion_mutant.dbxrefs = record.dbxrefs[:] deletion_mutant.annotations = record.annotations.copy() [We should in the meantime add those line to the relevant examples in the docstring and Tutorial in the repository.] Regarding the special case of the source feature in GenBank files, for tasks like removing part of the record, or doing an origin shift, you may want to recreate a new source feature reusing the old source feature annotation (e.g. NCBI taxon ID). However, the location would have to reflect the new modified sequence length. I have another idea to "solve" this problem: I am actually be tempted to remove the source SeqFeature, and instead handle it via the annotations dict. To me this seems more natural than having it as an entry in the feature table - a GenBank file format choice I never really understood. My guess is they didn't want to introduce a record level extensible annotation header block, which is what the source feature could be regarded as handling. i.e. When parsing a GenBank (or EMBL) file, the source feature information could get stored in the SeqRecord annotations dictionary. When writing to GenBank (or in future EMBL) format, if the annotations dictionary contained relevant fields, we would generate a source feature for the full sequence. Does that make sense? It requires looking at the source feature not as a feature which happens to span the whole sequence, but as annotation for the whole sequence (which happens to be in the GenBank features table due to a historical choice or accident). Peter From yvan.strahm at bccs.uib.no Fri Nov 13 15:00:42 2009 From: yvan.strahm at bccs.uib.no (Yvan Strahm) Date: Fri, 13 Nov 2009 16:00:42 +0100 Subject: [Biopython] fetch random id Message-ID: <4AFD749A.7050504@bccs.uib.no> Hello List, I have to crash test a webservice so was wondering if any one knows a way to get random sequence id from swissprot or genbank? 
Thank for your help. yvan From biopython at maubp.freeserve.co.uk Fri Nov 13 16:10:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 13 Nov 2009 16:10:41 +0000 Subject: [Biopython] fetch random id In-Reply-To: <4AFD749A.7050504@bccs.uib.no> References: <4AFD749A.7050504@bccs.uib.no> Message-ID: <320fb6e00911130810x417aca71id03845b0af985968@mail.gmail.com> On Fri, Nov 13, 2009 at 3:00 PM, Yvan Strahm wrote: > Hello List, > > I have to crash test a webservice so was wondering if any one knows a way to > get random sequence id from swissprot or genbank? > Thank for your help. > > yvan GI identifiers are numbers, any I would expect most 8 digit GI numbers to be valid IDs. So you could try just using random integers. Of course, some will have been deprecated etc so they may trigger real failures. Alternatively, download a list of valid IDs from the FTP site (or compile a list via an Entrez search and save this to disk), and pick a random entry to use each time in the test. Peter From chapmanb at 50mail.com Fri Nov 13 17:20:33 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 13 Nov 2009 12:20:33 -0500 Subject: [Biopython] fetch random id In-Reply-To: <320fb6e00911130810x417aca71id03845b0af985968@mail.gmail.com> References: <4AFD749A.7050504@bccs.uib.no> <320fb6e00911130810x417aca71id03845b0af985968@mail.gmail.com> Message-ID: <20091113172033.GG48178@sobchak.mgh.harvard.edu> Hi Yvan; > I have to crash test a webservice so was wondering if any one knows a way to > get random sequence id from swissprot or genbank? ExPASy can give you a random SwissProt entry: http://www.expasy.org/cgi-bin/get-random-entry.pl?S See the ExPASy documentation for all of their URLs: http://ca.expasy.org/expasy_urls.html You can use this to get the UniProt ID from the redirect: >>> import urllib2 >>> u = urllib2.urlopen("http://www.expasy.org/cgi-bin/get-random-entry.pl?S") >>> u.geturl() 'http://www.uniprot.org/uniprot/Q824C8' Hope this helps, Brad From yvan.strahm at bccs.uib.no Fri Nov 13 19:20:26 2009 From: yvan.strahm at bccs.uib.no (Yvan Strahm) Date: Fri, 13 Nov 2009 20:20:26 +0100 Subject: [Biopython] fetch random id In-Reply-To: <20091113172033.GG48178@sobchak.mgh.harvard.edu> References: <4AFD749A.7050504@bccs.uib.no> <320fb6e00911130810x417aca71id03845b0af985968@mail.gmail.com> <20091113172033.GG48178@sobchak.mgh.harvard.edu> Message-ID: <4AFDB17A.3020504@bccs.uib.no> Hello Brad and Peter, Thanks a lot for the pointers especially the expasy links Really great Cheers, yvan Brad Chapman wrote: > Hi Yvan; > >> I have to crash test a webservice so was wondering if any one knows a way to >> get random sequence id from swissprot or genbank? 
> > ExPASy can give you a random SwissProt entry: > > http://www.expasy.org/cgi-bin/get-random-entry.pl?S > > See the ExPASy documentation for all of their URLs: > > http://ca.expasy.org/expasy_urls.html > > You can use this to get the UniProt ID from the redirect: > >>>> import urllib2 >>>> u = urllib2.urlopen("http://www.expasy.org/cgi-bin/get-random-entry.pl?S") >>>> u.geturl() > 'http://www.uniprot.org/uniprot/Q824C8' > > Hope this helps, > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From han.chen1986 at gmail.com Sat Nov 14 13:25:21 2009 From: han.chen1986 at gmail.com (Han Chen) Date: Sat, 14 Nov 2009 21:25:21 +0800 Subject: [Biopython] About bioseqI0 Message-ID: <73a1f0de0911140525x4add0b61g7e5786d3b25c8703@mail.gmail.com> Hi, List of helpful people, could your please offer me some help about bioseqIo? here is the error message when run "python setup.py test": ====================================================================== ERROR: test_SeqIO_online ---------------------------------------------------------------------- Traceback (most recent call last): File "run_tests.py", line 248, in runTest suite = unittest.TestLoader().loadTestsFromName(name) File "/usr/local/lib/python2.6/unittest.py", line 576, in loadTestsFromName module = __import__('.'.join(parts_copy)) File "test_SeqIO_online.py", line 42, in records = list(SeqIO.parse(handle, "swiss")) File "/home/ch/biopython-1.51/build/lib.linux-x86_64-2.6/Bio/SeqIO/SwissIO.py", line 39, in SwissIterator for swiss_record in swiss_records: File "/home/ch/biopython-1.51/build/lib.linux-x86_64-2.6/Bio/SwissProt/__init__.py", line 113, in parse record = _read(handle) File "/home/ch/biopython-1.51/build/lib.linux-x86_64-2.6/Bio/SwissProt/__init__.py", line 240, in _read raise ValueError("Unknown keyword '%s' found" % key) ValueError: Unknown keyword '>s' found ---------------------------------------------------------------------- Ran 124 tests in 60.180 seconds FAILED (failures = 1) Is there anything wrong with SeqIO? I meet the following error when using other package: DeprecationWarning: Bio.Fasta is deprecated. Please use the "fasta" support in Bio.SeqIO (or Bio.AlignIO) instead could you please help me about this? thank you very much! sincerely yours, Han From biopython at maubp.freeserve.co.uk Sat Nov 14 13:38:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 14 Nov 2009 13:38:45 +0000 Subject: [Biopython] About bioseqI0 In-Reply-To: <73a1f0de0911140525x4add0b61g7e5786d3b25c8703@mail.gmail.com> References: <73a1f0de0911140525x4add0b61g7e5786d3b25c8703@mail.gmail.com> Message-ID: <320fb6e00911140538w3edbc7efr491c5aa9420c0ac4@mail.gmail.com> 2009/11/14 Han Chen : > Hi, List of helpful people, > > could your please offer me some help about bioseqIo? > > here is the error message when run "python setup.py test": > > ====================================================================== > ERROR: test_SeqIO_online > ---------------------------------------------------------------------- > Traceback (most recent call last): > ... > ValueError: Unknown keyword '>s' found > > Is there anything wrong with SeqIO? That is due to the ExPAYs website problem just recently reported: http://lists.open-bio.org/pipermail/biopython/2009-November/005823.html We can update Biopython to use the new URL, but it would be nice if ExPASy can fix their redirection as well. 
> I meet the following error when using other package: > > DeprecationWarning: Bio.Fasta is deprecated. Please use the "fasta" support > in Bio.SeqIO (or Bio.AlignIO) instead > > could you please help me about this? thank you very much! Which bit of Biopython are you trying to use? As the message says, Bio.Fasta is deprecated. This is just a warning message for now, but Bio.Fasta will one day be removed. Peter From mitlox at op.pl Sun Nov 15 04:26:38 2009 From: mitlox at op.pl (xyz) Date: Sun, 15 Nov 2009 14:26:38 +1000 Subject: [Biopython] SeqIO.convert Message-ID: <4AFF82FE.404@op.pl> Hello, I have to convert fastq to fasta and to trim the sequence. I have found SeqIO.convert: from Bio import SeqIO count = SeqIO.convert("a.fastq", "fastq", "a.fasta", "fasta") But I do not know how can I trim the sequence. Thank you in advance. Best regards, From chapmanb at 50mail.com Sun Nov 15 14:38:55 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 15 Nov 2009 09:38:55 -0500 Subject: [Biopython] SeqIO.convert In-Reply-To: <4AFF82FE.404@op.pl> References: <4AFF82FE.404@op.pl> Message-ID: <20091115143826.GA2712@kunkel> Hello; > I have to convert fastq to fasta and to trim the sequence. I have found > SeqIO.convert: > > from Bio import SeqIO > count = SeqIO.convert("a.fastq", "fastq", "a.fasta", "fasta") > > But I do not know how can I trim the sequence. SeqIO.convert is a format converter only, but you can use it along with other Biopython modules to trim adaptors. Here's a description: http://bcbio.wordpress.com/2009/08/09/trimming-adaptors-from-short-read-sequences/ along with code: http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py Hope this helps, Brad From biopython at maubp.freeserve.co.uk Sun Nov 15 14:55:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 15 Nov 2009 14:55:48 +0000 Subject: [Biopython] SeqIO.convert In-Reply-To: <20091115143826.GA2712@kunkel> References: <4AFF82FE.404@op.pl> <20091115143826.GA2712@kunkel> Message-ID: <320fb6e00911150655p53afb9b2y35086efbb2f355a5@mail.gmail.com> On Sun, Nov 15, 2009 at 2:38 PM, Brad Chapman wrote: > Hello; > >> I have to convert fastq to fasta and to trim the sequence. I have found >> SeqIO.convert: >> >> from Bio import SeqIO >> count = SeqIO.convert("a.fastq", "fastq", "a.fasta", "fasta") >> >> But I do not know how can I trim the sequence. > > SeqIO.convert is a format converter only, but you can use it along > with other Biopython modules to trim adaptors. ... It all depends on what you mean by "trim" the sequence. In addition to Brad's examples, there are some simpler ones in the Tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-primer http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-adaptor Peter From biopython at maubp.freeserve.co.uk Mon Nov 16 10:23:16 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 16 Nov 2009 10:23:16 +0000 Subject: [Biopython] ExPASy.get_sprot_raw giving a FASTA record instead of a Swiss-Prot/UniprotKB record? In-Reply-To: <320fb6e00911121543s65167605kb5541ac6f8fdd449@mail.gmail.com> References: <6fc0f5c82c743.4afc12cf@wiscmail.wisc.edu> <320fb6e00911121543s65167605kb5541ac6f8fdd449@mail.gmail.com> Message-ID: <320fb6e00911160223q6a3eb5a3l49229903a9de482@mail.gmail.com> On Thu, Nov 12, 2009 at 11:43 PM, Peter wrote: > > P.S. Perhaps we should also update our URLs, but that > won't help people using the current version of Biopython. 
> I checked the ExPASy page http://www.expasy.ch/expasy_urls.html then updated our code to use the currently recommended URL, http://www.uniprot.org/uniprot/XXX.txt instead of the old URL, http://www.expasy.ch/cgi-bin/get-sprot-raw.pl?XXX If anyone is curious about the details, see: http://github.com/biopython/biopython/commit/6689bf8657d9515965d63f9c77e6348233472046 This means the next release of Biopython will not depend on the old ExPASy URL, but it would still be ideal if ExPASy/Uniprot could fix that for the benefit of users of older versions of Biopython and other scripts. Did you try and contact them about this yet? Thanks, Peter From chapmanb at 50mail.com Tue Nov 17 13:24:17 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 17 Nov 2009 08:24:17 -0500 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <320fb6e00911130551h21f5c438gfbb68cc25bd9ea7b@mail.gmail.com> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> <20091113132346.GB48178@sobchak.mgh.harvard.edu> <320fb6e00911130551h21f5c438gfbb68cc25bd9ea7b@mail.gmail.com> Message-ID: <20091117132417.GE68691@sobchak.mgh.harvard.edu> Hi Peter; > > [...Discussion on what to do with full length features and annotations > > ?when slicing SeqRecords...] > > [...Proposal to have a function that does the copying...] > > Interesting idea - but I'm not sure about that name (maybe something like > copy_annotation would be better?) and personally don't think it is actually > any clearer than the two lines: > > deletion_mutant.dbxrefs = record.dbxrefs[:] > deletion_mutant.annotations = record.annotations.copy() Yes, I am terrible at thinking up function names -- copy_annotation is great. Here I'm not as worried about clarity as I am about discoverability. It's another way for people to realize that the annotations were not copied. > Regarding the special case of the source feature in GenBank files, for > tasks like removing part of the record, or doing an origin shift, you may > want to recreate a new source feature reusing the old source feature > annotation (e.g. NCBI taxon ID). However, the location would have to > reflect the new modified sequence length. > > I have another idea to "solve" this problem: > > I am actually be tempted to remove the source SeqFeature, and instead > handle it via the annotations dict. To me this seems more natural than > having it as an entry in the feature table - a GenBank file format choice I > never really understood. My guess is they didn't want to introduce a record > level extensible annotation header block, which is what the source feature > could be regarded as handling. > > i.e. When parsing a GenBank (or EMBL) file, the source feature information > could get stored in the SeqRecord annotations dictionary. When writing to > GenBank (or in future EMBL) format, if the annotations dictionary contained > relevant fields, we would generate a source feature for the full sequence. > > Does that make sense? It requires looking at the source feature not as > a feature which happens to span the whole sequence, but as annotation > for the whole sequence (which happens to be in the GenBank features > table due to a historical choice or accident). I like that. You're right that those full length features are really annotations in disguise. Instead of removing the source SeqFeature, I would suggest making it available in both places. 
This way you mimic what GenBank is doing, but also make it available in a more accessible and natural place. So for something like: source 1..4411532 /organism="Mycobacterium tuberculosis H37Rv" /mol_type="genomic DNA" /strain="H37Rv" /db_xref="taxon:83332" you would have the source SeqFeature, but also the organism, mol_type and strain in the annotations dictionary, and the cross reference in dbxrefs. Nice idea. Brad From biopython at maubp.freeserve.co.uk Tue Nov 17 14:53:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Nov 2009 14:53:44 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <20091117132417.GE68691@sobchak.mgh.harvard.edu> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> <20091113132346.GB48178@sobchak.mgh.harvard.edu> <320fb6e00911130551h21f5c438gfbb68cc25bd9ea7b@mail.gmail.com> <20091117132417.GE68691@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00911170653p223b671aj68459868348fe524@mail.gmail.com> Peter wrote: >> >> Regarding the special case of the source feature in GenBank files, for >> tasks like removing part of the record, or doing an origin shift, you may >> want to recreate a new source feature reusing the old source feature >> annotation (e.g. NCBI taxon ID). However, the location would have to >> reflect the new modified sequence length. >> >> I have another idea to "solve" this problem: >> >> I am actually be tempted to remove the source SeqFeature, and instead >> handle it via the annotations dict. To me this seems more natural than >> having it as an entry in the feature table - a GenBank file format choice I >> never really understood. My guess is they didn't want to introduce a record >> level extensible annotation header block, which is what the source feature >> could be regarded as handling. >> >> i.e. When parsing a GenBank (or EMBL) file, the source feature information >> could get stored in the SeqRecord annotations dictionary. When writing to >> GenBank (or in future EMBL) format, if the annotations dictionary contained >> relevant fields, we would generate a source feature for the full sequence. >> >> Does that make sense? It requires looking at the source feature not as >> a feature which happens to span the whole sequence, but as annotation >> for the whole sequence (which happens to be in the GenBank features >> table due to a historical choice or accident). Brad Chapman wrote: > > I like that. You're right that those full length features are really > annotations in disguise. Good :) > Instead of removing the source SeqFeature, > I would suggest making it available in both places. This way you > mimic what GenBank is doing, but also make it available in a more > accessible and natural place. So for something like: > > ? ? source ? ? ? ? ?1..4411532 > ? ? ? ? ? ? ? ? ? ? /organism="Mycobacterium tuberculosis H37Rv" > ? ? ? ? ? ? ? ? ? ? /mol_type="genomic DNA" > ? ? ? ? ? ? ? ? ? ? /strain="H37Rv" > ? ? ? ? ? ? ? ? ? ? /db_xref="taxon:83332" > > you would have the source SeqFeature, but also the organism, > mol_type and strain in the annotations dictionary, and the cross > reference in dbxrefs. Nice idea. Good point about the dbxrefs - that makes sense :) Interesting idea about having the parser record the source feature in both the SeqFeature (as it does now) and the SeqRecord annotations dict (as I suggested). 
That would certainly make sense in the short term for a transition period, but in the long term we should deprecate using a source SeqFeature. After all, for accessing this information "There should be one-- and preferably only one -- obvious way to do it" (Zen of Python). This also applies to the code for writing out GenBank files - if the information is in two places, which takes priority? Peter From biopython at maubp.freeserve.co.uk Tue Nov 17 16:55:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 17 Nov 2009 16:55:15 +0000 Subject: [Biopython] ExPASy.get_sprot_raw giving a FASTA record instead of a Swiss-Prot/UniprotKB record? In-Reply-To: <6fa094f832a30.4b027fef@wiscmail.wisc.edu> References: <6fc0f5c82c743.4afc12cf@wiscmail.wisc.edu> <320fb6e00911121543s65167605kb5541ac6f8fdd449@mail.gmail.com> <320fb6e00911160223q6a3eb5a3l49229903a9de482@mail.gmail.com> <6fa094f832a30.4b027fef@wiscmail.wisc.edu> Message-ID: <320fb6e00911170855y614a6cd3oad794d6314bc1512@mail.gmail.com> On Tue, Nov 17, 2009 at 4:50 PM, ANGEL VILLAHOZ-BALETA wrote: > > Yes, Peter, I isolated it from my source code and I chose another programming way since I preferred to be a bit less dependent from the ExPASy server. > > Anyway, I have just emailed all this information to the help desk of ExPASy to get a potential benefit for our Biopython community. > > Thanks, > > Angel Villahoz-Baleta > Bioinformatics Programmer > University of Wisconsin-Madison Thanks, Peter From animesh.agrawal at anu.edu.au Wed Nov 18 08:19:20 2009 From: animesh.agrawal at anu.edu.au (Animesh Agrawal) Date: Wed, 18 Nov 2009 19:19:20 +1100 Subject: [Biopython] Divergent sequence data set Message-ID: <000001ca6827$d4174af0$7c45e0d0$@agrawal@anu.edu.au> Hi, I have been trying to develop a divergent sequence data set for a phylogenetic analysis. Do we have something in Biopython, where for a given set of sequences we can choose identity threshold to reduce redundancy in the dataset. Cheers, Animesh From biopython at maubp.freeserve.co.uk Wed Nov 18 10:24:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Nov 2009 10:24:48 +0000 Subject: [Biopython] Divergent sequence data set In-Reply-To: <2207210305477723158@unknownmsgid> References: <2207210305477723158@unknownmsgid> Message-ID: <320fb6e00911180224u4de6e30bsa121b11ac60c0ce3@mail.gmail.com> On Wed, Nov 18, 2009 at 8:19 AM, Animesh Agrawal wrote: > > Hi, > > I have been trying to develop a divergent sequence data set for a > phylogenetic analysis. Do we have something in Biopython, where for a given > set of ?sequences we can choose identity threshold to reduce redundancy in > the dataset. > > Cheers, > > Animesh Hi Animesh, There are probably 100s of ways to do this. I think you should consult the literature as the the best approach (in terms of the algorithm), or talk to a phylogeneticist. Once you have an algorithm in mind, it can probably be done with python. For example, you could do pairwise BLAST alignments (e.g. using the NCBI standalone tools) or maybe pairwise Needleman-Wunsch global alignment (e.g. using the EMBOSS needle tool) and construct a distance matrix in terms of percentage identity. You could build a rough phylogenetic tree (perhaps using NJ if your starting dataset is very large), and use that to sample the nodes to get a fairly uniform distribution w.r.t. the phylogenetic space. These are just rough ideas - I am not a phylogenetics specialist. 
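As a rough illustration of the pairwise percent-identity idea above (a sketch only - the greedy loop and the 90% threshold are arbitrary choices, the input filename is hypothetical, and Bio.pairwise2 will be slow on long sequences or large sets; the EMBOSS or BLAST route mentioned above would scale better):

from Bio import SeqIO, pairwise2

def percent_identity(seq1, seq2):
    # Simple global alignment (match/mismatch only); identity is counted
    # over the full alignment length, gaps included.
    aln = pairwise2.align.globalxx(str(seq1), str(seq2))[0]
    matches = sum(1 for a, b in zip(aln[0], aln[1]) if a == b)
    return 100.0 * matches / len(aln[0])

records = list(SeqIO.parse("input.fasta", "fasta"))  # hypothetical filename
kept = []
for record in records:
    # Keep a record only if it is below the identity threshold against
    # everything already kept.
    if all(percent_identity(record.seq, k.seq) < 90.0 for k in kept):
        kept.append(record)
SeqIO.write(kept, "non_redundant.fasta", "fasta")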
I have a vague recollection that one of the sequence alignment tools includes an option to do something like this for you... but I can't remember the details. Peter From biopython at maubp.freeserve.co.uk Wed Nov 18 11:31:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Nov 2009 11:31:40 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <320fb6e00911170653p223b671aj68459868348fe524@mail.gmail.com> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> <20091113132346.GB48178@sobchak.mgh.harvard.edu> <320fb6e00911130551h21f5c438gfbb68cc25bd9ea7b@mail.gmail.com> <20091117132417.GE68691@sobchak.mgh.harvard.edu> <320fb6e00911170653p223b671aj68459868348fe524@mail.gmail.com> Message-ID: <320fb6e00911180331o53ddfc4bi4ccdb0f1972221f5@mail.gmail.com> Peter wrote: >>> Regarding the special case of the source feature in GenBank files, for >>> tasks like removing part of the record, or doing an origin shift, you may >>> want to recreate a new source feature reusing the old source feature >>> annotation (e.g. NCBI taxon ID). However, the location would have to >>> reflect the new modified sequence length. >>> >>> I have another idea to "solve" this problem: >>> >>> I am actually be tempted to remove the source SeqFeature, and instead >>> handle it via the annotations dict. To me this seems more natural than >>> having it as an entry in the feature table - a GenBank file format choice I >>> never really understood. My guess is they didn't want to introduce a record >>> level extensible annotation header block, which is what the source feature >>> could be regarded as handling. >>> >>> i.e. When parsing a GenBank (or EMBL) file, the source feature information >>> could get stored in the SeqRecord annotations dictionary. When writing to >>> GenBank (or in future EMBL) format, if the annotations dictionary contained >>> relevant fields, we would generate a source feature for the full sequence. >>> >>> Does that make sense? It requires looking at the source feature not as >>> a feature which happens to span the whole sequence, but as annotation >>> for the whole sequence (which happens to be in the GenBank features >>> table due to a historical choice or accident). Let's call that idea Plan(B). I've started a thread on the BioSQL mailing list, as this possible change would have implications for Biopython's use of BioSQL for storing this information. Unless we put some special case handling code in our BioSQL wrapper, it would mean Biopython would treat the "source" features differently to all the other Bio* interfaces for BioSQL. That would be bad. http://lists.open-bio.org/pipermail/biosql-l/2009-November/001642.html In thinking about this, perhaps there is another less invasive change, which I'm going to call Plan(C): We expect (and could even enforce this assumption) there to be at most one "source" feature in a GenBank/EMBL file, and that it should span the full length of the sequence. Taking this a special case, when slicing a SeqRecord, we could also slice the "source" SeqFeature to match the new reduced sequence. Furthermore, when adding two SeqRecord objects, we would try to combine the two "source" SeqFeatures - taking only common annotation information. 
And I'll use Plan(A) for leaving things as they stand, pros and cons: * pro - no code changes at all * con - "source" annotation remains a bit hidden * con - still lose "source" features on slicing Plan(B) pros and cons ("source" as top level annotation): * pro - elegant handling of "source" annotation * pro - no changes in SeqRecord * con - special case code in GenBank/EMBL input/output * con - may need special case code in BioSQL wrapper * con - fairly big break to backwards compatibility (affecting any scripts accessing or creating "source" features), depending on how such a transition was made. Place(C) pros and cons (special "source" slicing/adding): * con - "source" annotation remains a bit hidden * con - special case code in SeqRecord * pro - no changes in GenBank/EMBL input/output * pro - no changes in BioSQL wrapper * pro - minor break to backwards compatibility (affecting slicing of "source" features only - remember SeqRecord addition hasn't been released yet). Any thoughts? I've probably missed some advantages and disadvantages, and alternative ideas are welcome. This new idea to just special case slicing/adding of the "source" feature (Plan C) lacks the elegance of moving the "source" annotation to the top level (Plan B). However, it is much less invasive and looks quite practical and intuitive. Peter From schafer at rostlab.org Wed Nov 18 13:14:06 2009 From: schafer at rostlab.org (=?ISO-8859-1?Q?Christian_Sch=E4fer?=) Date: Wed, 18 Nov 2009 08:14:06 -0500 Subject: [Biopython] Divergent sequence data set In-Reply-To: <000001ca6827$d4174af0$7c45e0d0$@agrawal@anu.edu.au> References: <000001ca6827$d4174af0$7c45e0d0$@agrawal@anu.edu.au> Message-ID: <4B03F31E.9040901@rostlab.org> There are stand-alone tools out there like cd-hit or uniqueProt for the purpose of creating sequence-unique subsets on particular thresholds. If you want to access them from within your python code, it's easy to do so via commands.getoutput() or similar means and then parsing the result. Chris Animesh Agrawal wrote: > Hi, > > I have been trying to develop a divergent sequence data set for a > phylogenetic analysis. Do we have something in Biopython, where for a given > set of sequences we can choose identity threshold to reduce redundancy in > the dataset. > > > > Cheers, > > Animesh > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Wed Nov 18 13:30:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Nov 2009 13:30:35 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <320fb6e00911180331o53ddfc4bi4ccdb0f1972221f5@mail.gmail.com> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> <20091113132346.GB48178@sobchak.mgh.harvard.edu> <320fb6e00911130551h21f5c438gfbb68cc25bd9ea7b@mail.gmail.com> <20091117132417.GE68691@sobchak.mgh.harvard.edu> <320fb6e00911170653p223b671aj68459868348fe524@mail.gmail.com> <320fb6e00911180331o53ddfc4bi4ccdb0f1972221f5@mail.gmail.com> Message-ID: <320fb6e00911180530h54544fbby1b7b284cabc874da@mail.gmail.com> Peter wrote: > In thinking about this, perhaps there is another less invasive change, > which I'm going to call Plan(C): > > We expect (and could even enforce this assumption) there to be at > most one "source" feature in a GenBank/EMBL file, and that it should > span the full length of the sequence. 
Taking this a special case, when > slicing a SeqRecord, we could also slice the "source" SeqFeature to > match the new reduced sequence. Furthermore, when adding two > SeqRecord objects, we would try to combine the two "source" > SeqFeatures - taking only common annotation information. Here is an outline of what I have in mind here (incomplete, but does the basics). If we want to talk about the implementation, perhaps we should move this to the dev list... http://github.com/peterjc/biopython/commit/a074919b9925cb908935abf3161a50758f21f607 However, the point is that "Plan C" looks possible, and seems to have potential for dealing with SeqRecord slicing and addition where there is a "source" SeqFeature fairly nicely (i.e. preserving it for things like removing part of a sequence, or doing an origin shift). Peter From biopython at maubp.freeserve.co.uk Wed Nov 18 13:40:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Nov 2009 13:40:18 +0000 Subject: [Biopython] Additions to the SeqRecord In-Reply-To: <320fb6e00911180530h54544fbby1b7b284cabc874da@mail.gmail.com> References: <320fb6e00911120404l79106ddbs350a06252ea0831a@mail.gmail.com> <320fb6e00911120608w235ff7e3qfbb5b07fea9d3148@mail.gmail.com> <20091113132346.GB48178@sobchak.mgh.harvard.edu> <320fb6e00911130551h21f5c438gfbb68cc25bd9ea7b@mail.gmail.com> <20091117132417.GE68691@sobchak.mgh.harvard.edu> <320fb6e00911170653p223b671aj68459868348fe524@mail.gmail.com> <320fb6e00911180331o53ddfc4bi4ccdb0f1972221f5@mail.gmail.com> <320fb6e00911180530h54544fbby1b7b284cabc874da@mail.gmail.com> Message-ID: <320fb6e00911180540y4bd82f09l5f6fbf5eed9e8ce1@mail.gmail.com> Hi all, Over on the BioSQL mailing list, Chris Fields just made an interesting point - there are real GenBank files with multiple source features: Chris Fields wrote: > > Just to note, there are a few cases where there are two or more > source features. This pops up mainly with chimeric sequences, > for example: > > http://www.ncbi.nlm.nih.gov/nuccore/21727885 > > We have run into this a couple of times on the bioperl list. In this > case, each feature is limited to specific locations on the sequence > and doesn't pertain to the entire sequence. NCBI only notes the > first source on the ORGANISM line; last time I checked, EMBL > used both. > > chris At very least, this will make an excellent example of the unit tests! Peter From biopython at maubp.freeserve.co.uk Wed Nov 18 15:47:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Nov 2009 15:47:43 +0000 Subject: [Biopython] Fwd: Bio.PDB: parsing PDB files for ATOM records In-Reply-To: <13397911330439330334250671630377437584-Webmail@me.com> References: <13397911330439330334250671630377437584-Webmail@me.com> Message-ID: <320fb6e00911180747s5ab221c7tef1a6c83a749ab75@mail.gmail.com> On Thu, Nov 5, 2009 at 3:51 PM, Konrad Koehler wrote: > Contray to my first post, the modifications to Bio.PDB outlined below: > > http://osdir.com/ml/python.bio.general/2008-04/msg00038.html > > do work with the lastest version of Bio.PDB. ?(I must have introduced a > typo in my first try, on the second try it worked perfectly). > > I would however request that these changes be incorporated into the > production version of Bio.PDB. > > Best regards, > > Konrad I just found your email in my spam folder :( This was filed as Bug 2495, http://bugzilla.open-bio.org/show_bug.cgi?id=2495 Peter From Jose.Lacal at OpenPHI.com Wed Nov 18 23:19:09 2009 From: Jose.Lacal at OpenPHI.com (Jose C. 
Lacal) Date: Wed, 18 Nov 2009 18:19:09 -0500 Subject: [Biopython] Parsing records off PubMed vs. PubMedCentral. Message-ID: <1258586349.25095.52.camel@DESK01> Greetings: I'm just starting to use BioPython and this may be a dumb question. I've been following the excellent tutorial at http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc88 My question refers to section 8.11.1 a.) I am able to query, retrieve and parse files from db="pubmed" as per the code below. This works. from Bio import Entrez, Medline Entrez.email = "Jose.Lacal at OpenPHI.com" handle = handle = Entrez.esearch(db="pubmed", term="hypertension[all]&George+Mason+University[affl]", rettype="medline", retmode="text") record = Entrez.read(handle) print record["IdList"] idlist = record["IdList"] handle = Entrez.efetch(db="pubmed",id=idlist,rettype="medline",retmode="text") records = Medline.parse(handle) for record in records: print record["AU"] b.) But when I change db="pubmed" to db="pmc" I get an error message: KeyError: 'AU' It looks like "pmc" does not have the same keys as "pubmed" And I've been unable to find the equivalent format to parse files downloaded from "pmc" Pointers and suggestions most appreciated. regards. -- ----- ----- ----- Jose C. Lacal, Founder & Chief Vision Officer Open Personalized Health Informatics "OpenPHI" 15625 NW 15th Avenue; Suite 15 Miami, FL 33169-5601 USA www.OpenPHI.com [M] +1 (954) 553-1984 Jose.Lacal at OpenPHI.com OpenPHI is an information management company. We acquire, compile, and manage mailing lists in the global academic & bio-medical spaces. See: http://www.openphi.com/healthmining.html From biopython at maubp.freeserve.co.uk Thu Nov 19 11:03:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 19 Nov 2009 11:03:04 +0000 Subject: [Biopython] Parsing records off PubMed vs. PubMedCentral. In-Reply-To: <1258586349.25095.52.camel@DESK01> References: <1258586349.25095.52.camel@DESK01> Message-ID: <320fb6e00911190303y54ee4957q7a59217deea47c1f@mail.gmail.com> On Wed, Nov 18, 2009 at 11:19 PM, Jose C. Lacal wrote: > Greetings: > > I'm just starting to use BioPython and this may be a dumb question. > > I've been following the excellent tutorial at > http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc88 > > My question refers to section 8.11.1 > > > a.) I am able to query, retrieve and parse files from db="pubmed" as per > the code below. This works. > > > from Bio import Entrez, Medline > Entrez.email = "Jose.Lacal at OpenPHI.com" > > handle = handle = Entrez.esearch(db="pubmed", > term="hypertension[all]&George+Mason+University[affl]", > rettype="medline", retmode="text") > > record = Entrez.read(handle) > print record["IdList"] > > idlist = record["IdList"] > handle = > Entrez.efetch(db="pubmed",id=idlist,rettype="medline",retmode="text") > > records = Medline.parse(handle) > for record in records: > ? ? ? ?print record["AU"] > OK, good :) > b.) But when I change db="pubmed" to db="pmc" I get an error message: > KeyError: 'AU' > > It looks like "pmc" does not have the same keys as "pubmed" And I've > been unable to find the equivalent format to parse files downloaded from > "pmc" > > Pointers and suggestions most appreciated. regards. Correct - PubMed and PubMedCentral are different databases and use different identifiers. You can use Entrez ELink to map between them. e.g. The Biopython application note has PMID 19304878, but its PMCID is 2682512. 
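A sketch of that ELink lookup (the nesting of the parsed result is written from memory and may need checking against the actual ELink XML):

from Bio import Entrez
Entrez.email = "your.name@example.com"  # placeholder, use your own address
handle = Entrez.elink(dbfrom="pubmed", db="pmc", id="19304878")
result = Entrez.read(handle)
handle.close()
pmc_ids = []
for linkset in result[0]["LinkSetDb"]:
    # The pubmed_pmc link set maps a PubMed ID to its PubMedCentral ID.
    if linkset["LinkName"] == "pubmed_pmc":
        pmc_ids = [link["Id"] for link in linkset["Link"]]
print pmc_ids  # should include '2682512'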
>>> from Bio import Entrez >>> print Entrez.efetch(db="pubmed",id="19304878",rettype="medline",retmode="text").read() PMID- 19304878 OWN - NLM STAT- MEDLINE DA - 20090515 DCOM- 20090709 LR - 20091104 IS - 1367-4811 (Electronic) VI - 25 IP - 11 DP - 2009 Jun 1 TI - Biopython: freely available Python tools for computational molecular biology and bioinformatics. PG - 1422-3 ... Now, according to the documentation for EFetch, PMC should support rettype="medline" (just like PubMed): http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html >>> print Entrez.efetch(db="pmc",id="2682512", retmode="medline", rettype="text").read()

Error occurred: Report 'text' not found in 'pmc' presentation
        ... Odd. I also tried the XML from EFetch for PMC, but it fails to validate. I wonder if this in an NCBI glitch? I have emailed them about this. In the meantime, I would suggest you just use PubMed not PMC - it covers more journals but in less depth. Peter From cmckay at u.washington.edu Thu Nov 19 23:42:12 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Thu, 19 Nov 2009 15:42:12 -0800 Subject: [Biopython] allow ambiguities is sequence matching? Message-ID: Hello all, Apologies if this is covered in the tutorial anywhere, if so I didn't see it. I am trying to test whether sequence A appears anywhere in sequence B. The catch is I want to allow n mismatches. Right now my code looks like: #record is a SeqRecord #query_seq is a string if query_seq in record.seq: do something If I want query_seq to match despite n nucleotide mismatches, how should I do that? It seems like something that would be pretty common for people working with DNA probes. Is this even a biopython problem? Or is it just a general python problem? thanks, Cedar From biopython at maubp.freeserve.co.uk Fri Nov 20 10:03:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Nov 2009 10:03:15 +0000 Subject: [Biopython] allow ambiguities is sequence matching? In-Reply-To: References: Message-ID: <320fb6e00911200203t1bc09df6jc0dac7d619a644f0@mail.gmail.com> On Thu, Nov 19, 2009 at 11:42 PM, Cedar McKay wrote: > Hello all, > Apologies if this is covered in the tutorial anywhere, if so I didn't see > it. > > I am trying to test whether sequence A appears anywhere in sequence B. The > catch is I want to allow n mismatches. Right now my code looks like: > > #record is a SeqRecord > #query_seq is a string > if query_seq in record.seq: > ? ? ? ?do something > > > If I want query_seq to match despite n nucleotide mismatches, how should I > do that? It seems like something that would be pretty common for people > working with DNA probes. Is this even a biopython problem? Or is it just a > general python problem? We have in general tried to keep the Seq object API as much like that of the Python string as is reasonable, for example the find, startswith and endswith methos. Likewise, the "in" operator on the Seq object also works like a python string, it uses plain string matching (see Bug 2853, this was added in Biopython 1.51). It sounds like you want some kind of fuzzy find... one solution would be regular expressions, another might be to use the Bio.Motif module. There have been similar discussions on the mailing list before, but no clear consensus - see for example Bug 2601. Peter From biopython at maubp.freeserve.co.uk Fri Nov 20 10:49:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Nov 2009 10:49:09 +0000 Subject: [Biopython] Parsing records off PubMed vs. PubMedCentral. In-Reply-To: <320fb6e00911190303y54ee4957q7a59217deea47c1f@mail.gmail.com> References: <1258586349.25095.52.camel@DESK01> <320fb6e00911190303y54ee4957q7a59217deea47c1f@mail.gmail.com> Message-ID: <320fb6e00911200249o4c5c736bia43dd0b586c32ccd@mail.gmail.com> On Thu, Nov 19, 2009 at 11:03 AM, Peter wrote: > > Now, according to the documentation for EFetch, PMC should support > rettype="medline" (just like PubMed): > http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html > >>>> print Entrez.efetch(db="pmc",id="2682512", retmode="medline", rettype="text").read() > > >

> Error occurred: Report 'text' not found in 'pmc' presentation
          > ... > > > Odd. I also tried the XML from EFetch for PMC, but it fails to > validate. I wonder if this in an NCBI glitch? I have emailed them > about this. > I had a reply from someone at the NCBI, who had also noticed a problem, and has reported this to the EFetch developers. > In the meantime, I would suggest you just use PubMed not PMC - it > covers more journals but in less depth. Peter From biopython at maubp.freeserve.co.uk Fri Nov 20 14:29:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 20 Nov 2009 14:29:25 +0000 Subject: [Biopython] Seq object ungap method In-Reply-To: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com> References: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com> Message-ID: <320fb6e00911200629s3fe49d0di256d5a0dbc24ffc0@mail.gmail.com> Hi all, Something we discussed last year was adding an ungap method to the Seq object. e.g. http://lists.open-bio.org/pipermail/biopython/2008-September/004523.html http://lists.open-bio.org/pipermail/biopython/2008-September/004527.html As mentioned earlier this month on the dev mailing list, http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006983.html I actually made the time to implement this, and posted it on a github branch - you can see the updated Bio/Seq.py file here: http://github.com/peterjc/biopython/blob/ungap/Bio/Seq.py I've included a copy of the proposed docstring for the new Seq object ungap method at the end of this email, which tries to illustrate how this would be used. I'd like some comments - is this worth including in Biopython? Thanks, Peter -- This is the proposed docstring for the new Seq object ungap method, the examples double as doctest unit tests: Return a copy of the sequence without the gap character(s). The gap character can be specified in two ways - as an explicit argument, or via the sequence's alphabet. For example: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import generic_dna >>> my_dna = Seq("ATA--TGAAAT-TTGAAAA", generic_dna) >>> my_dna Seq('ATA--TGAAAT-TTGAAAA', DNAAlphabet()) >>> my_dna.ungap("-") Seq('ATATGAAATTTGAAAA', DNAAlphabet()) If the gap character is not given as an argument, it will be taken from the sequence's alphabet (if defined). Notice that the returned sequence's alphabet is adjusted since it no longer requires a gapped alphabet: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC, Gapped, HasStopCodon >>> my_pro = Seq("MVVLE=AD*", HasStopCodon(Gapped(IUPAC.protein, "="))) >>> my_pro Seq('MVVLE=AD*', HasStopCodon(Gapped(IUPACProtein(), '='), '*')) >>> my_pro.ungap() Seq('MVVLEAD*', HasStopCodon(IUPACProtein(), '*')) Or, with a simpler gapped DNA example: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC, Gapped >>> my_seq = Seq("CGGGTAG=AAAAAA", Gapped(IUPAC.unambiguous_dna, "=")) >>> my_seq Seq('CGGGTAG=AAAAAA', Gapped(IUPACUnambiguousDNA(), '=')) >>> my_seq.ungap() Seq('CGGGTAGAAAAAA', IUPACUnambiguousDNA()) As long as it is consistent with the alphabet, although it is redundant, you can stil supply the gap character as an argument to this method: >>> my_seq Seq('CGGGTAG=AAAAAA', Gapped(IUPACUnambiguousDNA(), '=')) >>> my_seq.ungap("=") Seq('CGGGTAGAAAAAA', IUPACUnambiguousDNA()) However, if the gap character given as the argument disagrees with that declared in the alphabet, an exception is raised: >>> my_seq Seq('CGGGTAG=AAAAAA', Gapped(IUPACUnambiguousDNA(), '=')) >>> my_seq.ungap("-") Traceback (most recent call last): ... 
ValueError: Gap '-' does not match '=' from alphabet Finally, if a gap character is not supplied, and the alphabet does not define one, an exception is raised: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import generic_dna >>> my_dna = Seq("ATA--TGAAAT-TTGAAAA", generic_dna) >>> my_dna Seq('ATA--TGAAAT-TTGAAAA', DNAAlphabet()) >>> my_dna.ungap() Traceback (most recent call last): ... ValueError: Gap character not given and not defined in alphabet From schafer at rostlab.org Fri Nov 20 16:55:58 2009 From: schafer at rostlab.org (Christian Schaefer) Date: Fri, 20 Nov 2009 11:55:58 -0500 Subject: [Biopython] allow ambiguities is sequence matching? In-Reply-To: References: Message-ID: <4B06CA1E.1010000@rostlab.org> Hey Cedar, I'm currently doing something similar on protein sequences. A simple brute force method could work like this: Slide the short sequence 'underneath' the long sequence. After each step translate the current overlap into a bit-string where 1 indicates a match and 0 a mismatch. Now you can easily apply a regex on this bit-string to look for particular patterns like 'n mismatches allowed'. Hope that helps. Chris Cedar McKay wrote: > Hello all, > Apologies if this is covered in the tutorial anywhere, if so I didn't > see it. > > I am trying to test whether sequence A appears anywhere in sequence B. > The catch is I want to allow n mismatches. Right now my code looks like: > > #record is a SeqRecord > #query_seq is a string > if query_seq in record.seq: > do something > > > If I want query_seq to match despite n nucleotide mismatches, how should > I do that? It seems like something that would be pretty common for > people working with DNA probes. Is this even a biopython problem? Or is > it just a general python problem? > > thanks, > Cedar > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From anaryin at gmail.com Mon Nov 23 09:02:06 2009 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 23 Nov 2009 01:02:06 -0800 Subject: [Biopython] SeqIO.parse Question Message-ID: Dear all, This is merely a suggestion. I've been using SeqIO.parse on some user input I receive from a server. I'm using the following code: for num, record in enumerate(SeqIO.parse(StringIO(FASTA_sequence), 'fasta')): req_seq = record.seq.tostring() req_name = record.id Since I have no clue what the user might introduce, regarding the number of sequences, I have to user parse, instead of read. If I introduce only one sequence and it is a valid FASTA sequence, it does its work flawlessly. If I insert several FASTA sequences and one of them is wrongly formatted, it won't complain at all. If I insert a single wrong sequence, it doesn't complain either. Is there a convenient way for me to check FASTA formats? The usual startswith('>') doesn't work for multiple sequences. And the user might have spaces in the sequence so a split('\n') is also ruled out to split the sequences. At the moment, I'm checking if the first sequence of the input starts with '>', and if it does, the parser kicks in and for every req_seq object I check if there is any character that is not valid (a number or an otherwise weird character). If I get a mis-formatted sequence in there it will complain because spaces, newlines, and numbers ( often found in sequence names ) are not in my allowed list. However, if there's an easier way, it would save me some if checks and for loops :) Suggestions? 
Best regards to all, Jo?o [...] Rodrigues @ http://stanford.edu/~joaor/ From biopython at maubp.freeserve.co.uk Mon Nov 23 10:18:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 10:18:24 +0000 Subject: [Biopython] SeqIO.parse Question In-Reply-To: References: Message-ID: <320fb6e00911230218i346d104cr81fb46710d4fb8a4@mail.gmail.com> On Mon, Nov 23, 2009 at 9:02 AM, Jo?o Rodrigues wrote: > Dear all, > > This is merely a suggestion. I've been using SeqIO.parse on some user input > I receive from a server. > > I'm using the following code: > > for num, record in enumerate(SeqIO.parse(StringIO(FASTA_sequence), > 'fasta')): > > ? ?req_seq = record.seq.tostring() > ? ?req_name = record.id > > Since I have no clue what the user might introduce, regarding the number of > sequences, I have to user parse, instead of read. If I introduce only one > sequence and it is a valid FASTA sequence, it does its work flawlessly. If I > insert several FASTA sequences and one of them is wrongly formatted, it > won't complain at all. If I insert a single wrong sequence, it doesn't > complain either. Can you give us an example? > Is there a convenient way for me to check FASTA formats? The usual > startswith('>') doesn't work for multiple sequences. And the user might have > spaces in the sequence so a split('\n') is also ruled out to split the > sequences. You could do something like ("\n"+FASTA_sequence).count("\n>") to get the number of records. > At the moment, I'm checking if the first sequence of the input starts with > '>', and if it does, the parser kicks in and for every req_seq object I > check if there is any character that is not valid (a number or an otherwise > weird character). If I get a mis-formatted sequence in there it will > complain because spaces, newlines, and numbers ( often found in sequence > names ) are not in my allowed list. > > However, if there's an easier way, it would save me some if checks and for > loops :) Suggestions? I'm not 100% sure what you are tying to do - some examples should help. Peter From anaryin at gmail.com Mon Nov 23 10:49:14 2009 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 23 Nov 2009 02:49:14 -0800 Subject: [Biopython] SeqIO.parse Question In-Reply-To: <320fb6e00911230218i346d104cr81fb46710d4fb8a4@mail.gmail.com> References: <320fb6e00911230218i346d104cr81fb46710d4fb8a4@mail.gmail.com> Message-ID: Sorry for the clouded explanation :x I'll try to show you an example: I have a server that runs BLAST queries from user deposited sequences. Those sequences have to in FASTA format. 4 Users deposit their sequences User 1: >SequenceName AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA User2: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA User3: >Sequence1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Sequence2 BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB User4: >SequenceOops AAAAAAAAAAAAAA1AAAAAAAAAAAAAAAAAAAAAAA Now, if I run this through a python script that has simply something like this: user_input = getInput() # Gets input from the user (can be single or multiple sequences) for record in SeqIO.parse(StringIO(user_input),'fasta'): # Parses each sequence on at a time print record.id print "Parsed" This will happen for each of the users up there: User1 will run smoothly and 'SequenceName' will be displayed. 'Parsed' will also be displayed. User2 will be shown 'Parsed'. Despite his sequence is not in FASTA format, the parser didn't throw an exception saying so. It just skips the for loop ( maybe treats the SeqIO.parse as None ). 
User3 will be shown 'Sequence1' and 'Parsed', although his second sequence is not correctly formatted. User4 will be shown 'SequenceOops' and 'Parsed', although there is a 1 in the sequence ( which is not a valid character for any sequence ). My question is basically: is there a way to do a sanity check to a file to see if it really contains proper FASTA sequences? The way I'm doing it works ok but it seems to be a bit too messy to be the best solution. I'm first checking if the first character of the user input is a '>'. If it is, I'm then passing the whole input to the Biopython parser. For each record the parser consumes, I get the sequence back, or what the parser thinks is a sequence, and then I check to see if there are any numbers, blankspaces, etc, in the sequence. If there are, I'll raise an exception. With those 4 examples: User 1 passes everything ok User 2 fails the first check. User 3 and 4 fail the second check because of blank spaces and numbers. This might sound a bit stupid on my part, and I apologize in advance, but this way I don't see much of a use in SeqIO.parse function. I'd do almost the same with user_input.split('\n>'). Is this clearer? My code is here: http://pastebin.com/m4d993239 From biopython at maubp.freeserve.co.uk Mon Nov 23 11:19:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 11:19:46 +0000 Subject: [Biopython] SeqIO.parse Question In-Reply-To: References: <320fb6e00911230218i346d104cr81fb46710d4fb8a4@mail.gmail.com> Message-ID: <320fb6e00911230319n661c3075y664f848e4f14d271@mail.gmail.com> Thanks for clarifying Jo?o :) On Mon, Nov 23, 2009 at 10:49 AM, Jo?o Rodrigues wrote: > Sorry for the clouded explanation :x I'll try to show you an example: > > I have a server that runs BLAST queries from user deposited sequences. Those > sequences have to in FASTA format. 4 Users deposit their sequences > > User 1: >>SequenceName > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Valid record, fine. > User2: > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA Missing ">" header, this contains no FASTA records. > User3: >>Sequence1 > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > Sequence2 > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB Assuming you don't mind numbers in your sequence (which do get used in some situations), this is a valid FASTA file with a single record, equivalent to identifier "Sequence1" and sequence: AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAASequence2BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB > User4: >>SequenceOops > AAAAAAAAAAAAAA1AAAAAAAAAAAAAAAAAAAAAAA As in example 1, assuming you don't mind numbers in your sequence, this is a valid FASTA file. > Now, if I run this through a python script that has simply something like > this: > > user_input = getInput() # Gets input from the user (can be single or > multiple sequences) > > for record in SeqIO.parse(StringIO(user_input),'fasta'): # Parses each > sequence on at a time > ? print record.id > print "Parsed" > > This will happen for each of the users up there: > > User1 will run smoothly and 'SequenceName' will be displayed. 'Parsed' will > also be displayed. > > User2 will be shown 'Parsed'. Despite his sequence is not in FASTA format, > the parser didn't throw an exception saying so. It just skips the for loop ( > maybe treats the SeqIO.parse as None ). > > User3 will be shown 'Sequence1' and 'Parsed', although his second sequence > is not correctly formatted. 
> > User4 will be shown 'SequenceOops' and 'Parsed', although there is a 1 in > the sequence ( which is not a valid character for any sequence ). > > My question is basically: is there a way to do a sanity check to a file to > see if it really contains proper FASTA sequences? The way I'm doing it works > ok but it seems to be a bit too messy to be the best solution. > > I'm first checking if the first character of the user input is a '>'. If it > is, I'm then passing the whole input to the Biopython parser. I probably do something similar - but I would first strip the white space. After all, "\n\n\n>ID\nACGT\n\n\n" is a valid FASTA file with one record. If the sequence lacks the ">", then I would either raise an error, or add something like ">Default\n" to the start automatically. Do whatever the BLAST webpage does to make it consistent for your users. > For each > record the parser consumes, I get the sequence back, or what the parser > thinks is a sequence, and then I check to see if there are any numbers, > blankspaces, etc, in the sequence. If there are, I'll raise an exception. Again, I might do the same (but see below). > With those 4 examples: > > User 1 passes everything ok > User 2 fails the first check. > User 3 and 4 fail the second check because of blank spaces and numbers. > > This might sound a bit stupid on my part, and I apologize in advance, but > this way I don't see much of a use in SeqIO.parse function. I'd do almost > the same with user_input.split('\n>'). > > Is this clearer? My code is here: http://pastebin.com/m4d993239 The problem is your definition of "valid FASTA" and Biopython's differ. This is largely because the FASTA file format has never been strictly defined. You'll find lots of differences in different tools (e.g. some like ClustalW can't cope with long description lines; some tools allow comment lines; in some cases characters like "." and "*" are allowed but not all). Also, you appear to want something very narrow - protein FASTA files with a limited character set (some but not all of the full IUPAC set) plus the minus sign (as a gap). Bio.SeqIO is not trying to do file format validation - it is trying to do file parsing, and for your needs it is being too tolerant. In this situation then yes, doing your own validation (without using Biopython) might be simplest. How I would like to "fix" this is to implement Bug 2597 (strict alphabet checking in the Seq object). Then, when you call Bio.SeqIO.parse, include the expected alphabet which should specify the allowed letters (and exclude numbers etc). See: http://bugzilla.open-bio.org/show_bug.cgi?id=2597 Peter P.S. In your code, using a set should be faster for checking membership: allowed = set('ABCDEFGHIKLMNPQRSTUVWYZX-') In fact, I would make the allowed list include both cases, then you don't have to make all those calls to upper. 
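A rough sketch pulling together the pre-checks suggested in this thread (strip surrounding whitespace, require a leading '>', then restrict the parsed sequences to an allowed set of letters). The function name, error messages and choice of alphabet are illustrative rather than part of Biopython, and the StringIO import is the Python 2 form used elsewhere in the thread:

from StringIO import StringIO
from Bio import SeqIO

ALLOWED = set('ABCDEFGHIKLMNPQRSTUVWYZX-')

def check_protein_fasta(text, allowed=ALLOWED):
    # Leading/trailing blank lines are harmless in FASTA, so strip them first
    text = text.strip()
    if not text.startswith('>'):
        raise ValueError("Input does not start with a '>' header line")
    records = list(SeqIO.parse(StringIO(text), 'fasta'))
    if not records:
        raise ValueError("No FASTA records found")
    for rec in records:
        # Any character outside the allowed alphabet fails validation
        bad = set(str(rec.seq).upper()) - allowed
        if bad:
            raise ValueError("Record %s contains unexpected characters: %s"
                             % (rec.id, ', '.join(sorted(bad))))
    return records

Adding the lower-case letters to ALLOWED, as suggested, would avoid the upper() call inside the loop.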
I would also double check to see if the latest version of BLAST does in fact accept O (Pyrrolysine) or J (Leucine or Isoleucine), and if need be contact the NCBI to update this webpage: http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml Peter From anaryin at gmail.com Mon Nov 23 11:34:53 2009 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Mon, 23 Nov 2009 03:34:53 -0800 Subject: [Biopython] SeqIO.parse Question In-Reply-To: <320fb6e00911230319n661c3075y664f848e4f14d271@mail.gmail.com> References: <320fb6e00911230218i346d104cr81fb46710d4fb8a4@mail.gmail.com> <320fb6e00911230319n661c3075y664f848e4f14d271@mail.gmail.com> Message-ID: My definition of FASTA is actually what BLASTp requires. It's quite a picky tool :) I had already understood that FASTA is quite... lax. But I thought I was missing something, thus asking the list. Is the alphabet patch already included? Thanks for the tip on the leading white space, had missed that :) Jo?o [...] Rodrigues @ http://stanford.edu/~joaor/ On Mon, Nov 23, 2009 at 3:19 AM, Peter wrote: > Thanks for clarifying Jo?o :) > > On Mon, Nov 23, 2009 at 10:49 AM, Jo?o Rodrigues > wrote: > > Sorry for the clouded explanation :x I'll try to show you an example: > > > > I have a server that runs BLAST queries from user deposited sequences. > Those > > sequences have to in FASTA format. 4 Users deposit their sequences > > > > User 1: > >>SequenceName > > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > > Valid record, fine. > > > User2: > > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > > Missing ">" header, this contains no FASTA records. > > > User3: > >>Sequence1 > > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA > > Sequence2 > > BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB > > Assuming you don't mind numbers in your sequence (which > do get used in some situations), this is a valid FASTA file > with a single record, equivalent to identifier "Sequence1" > and sequence: > > > AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAASequence2BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB > > > User4: > >>SequenceOops > > AAAAAAAAAAAAAA1AAAAAAAAAAAAAAAAAAAAAAA > > As in example 1, assuming you don't mind numbers in your > sequence, this is a valid FASTA file. > > > Now, if I run this through a python script that has simply something like > > this: > > > > user_input = getInput() # Gets input from the user (can be single or > > multiple sequences) > > > > for record in SeqIO.parse(StringIO(user_input),'fasta'): # Parses each > > sequence on at a time > > print record.id > > print "Parsed" > > > > This will happen for each of the users up there: > > > > User1 will run smoothly and 'SequenceName' will be displayed. 'Parsed' > will > > also be displayed. > > > > User2 will be shown 'Parsed'. Despite his sequence is not in FASTA > format, > > the parser didn't throw an exception saying so. It just skips the for > loop ( > > maybe treats the SeqIO.parse as None ). > > > > User3 will be shown 'Sequence1' and 'Parsed', although his second > sequence > > is not correctly formatted. > > > > User4 will be shown 'SequenceOops' and 'Parsed', although there is a 1 in > > the sequence ( which is not a valid character for any sequence ). > > > > My question is basically: is there a way to do a sanity check to a file > to > > see if it really contains proper FASTA sequences? The way I'm doing it > works > > ok but it seems to be a bit too messy to be the best solution. > > > > I'm first checking if the first character of the user input is a '>'. 
If > it > > is, I'm then passing the whole input to the Biopython parser. > > I probably do something similar - but I would first strip the white space. > After all, "\n\n\n>ID\nACGT\n\n\n" is a valid FASTA file with one record. > > If the sequence lacks the ">", then I would either raise an error, or > add something like ">Default\n" to the start automatically. Do whatever > the BLAST webpage does to make it consistent for your users. > > > For each > > record the parser consumes, I get the sequence back, or what the parser > > thinks is a sequence, and then I check to see if there are any numbers, > > blankspaces, etc, in the sequence. If there are, I'll raise an exception. > > Again, I might do the same (but see below). > > > With those 4 examples: > > > > User 1 passes everything ok > > User 2 fails the first check. > > User 3 and 4 fail the second check because of blank spaces and numbers. > > > > This might sound a bit stupid on my part, and I apologize in advance, but > > this way I don't see much of a use in SeqIO.parse function. I'd do almost > > the same with user_input.split('\n>'). > > > > Is this clearer? My code is here: http://pastebin.com/m4d993239 > > The problem is your definition of "valid FASTA" and Biopython's differ. > This is largely because the FASTA file format has never been strictly > defined. You'll find lots of differences in different tools (e.g. some like > ClustalW can't cope with long description lines; some tools allow > comment lines; in some cases characters like "." and "*" are allowed > but not all). > > Also, you appear to want something very narrow - protein FASTA > files with a limited character set (some but not all of the full IUPAC > set) plus the minus sign (as a gap). > > Bio.SeqIO is not trying to do file format validation - it is trying to do > file parsing, and for your needs it is being too tolerant. In this > situation > then yes, doing your own validation (without using Biopython) might > be simplest. > > How I would like to "fix" this is to implement Bug 2597 (strict alphabet > checking in the Seq object). Then, when you call Bio.SeqIO.parse, > include the expected alphabet which should specify the allowed > letters (and exclude numbers etc). See: > http://bugzilla.open-bio.org/show_bug.cgi?id=2597 > > Peter > > P.S. In your code, using a set should be faster for checking membership: > > allowed = set('ABCDEFGHIKLMNPQRSTUVWYZX-') > > In fact, I would make the allowed list include both cases, then > you don't have to make all those calls to upper. > > I would also double check to see if the latest version of BLAST > does in fact accept O (Pyrrolysine) or J (Leucine or Isoleucine), > and if need be contact the NCBI to update this webpage: > http://www.ncbi.nlm.nih.gov/BLAST/blastcgihelp.shtml > > Peter > From biopython at maubp.freeserve.co.uk Mon Nov 23 11:40:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 11:40:56 +0000 Subject: [Biopython] SeqIO.parse Question In-Reply-To: References: <320fb6e00911230218i346d104cr81fb46710d4fb8a4@mail.gmail.com> <320fb6e00911230319n661c3075y664f848e4f14d271@mail.gmail.com> Message-ID: <320fb6e00911230340k71c57338v84e9d832a71dae99@mail.gmail.com> On Mon, Nov 23, 2009 at 11:34 AM, Jo?o Rodrigues wrote: > My definition of FASTA is actually what BLASTp requires. It's quite a picky > tool :) I had already understood that FASTA is quite... lax. But I thought I > was missing something, thus asking the list. Is the alphabet patch already > included? 
No, the strict alphabet checking in the Seq object is not merged (yet). This is potentially a contentious issue, and may break existing scripts which really on the current lax behaviour. I am wondering about making this trigger a warning in the next release of Biopython as a step towards making the strict check the default, but this needs further debate. > Thanks for the tip on the leading white space, had missed that :) Sure. Peter From biopython at maubp.freeserve.co.uk Mon Nov 23 16:53:20 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 23 Nov 2009 16:53:20 +0000 Subject: [Biopython] Seq object ungap method In-Reply-To: <320fb6e00911200629s3fe49d0di256d5a0dbc24ffc0@mail.gmail.com> References: <320fb6e00911061100g2122ef77se80f5bc6198cb030@mail.gmail.com> <320fb6e00911200629s3fe49d0di256d5a0dbc24ffc0@mail.gmail.com> Message-ID: <320fb6e00911230853r4a3f95dbk49a830e9e16c9246@mail.gmail.com> On Fri, Nov 20, 2009 at 2:29 PM, Peter wrote: > Hi all, > > Something we discussed last year was adding an ungap method > to the Seq object. e.g. > http://lists.open-bio.org/pipermail/biopython/2008-September/004523.html > http://lists.open-bio.org/pipermail/biopython/2008-September/004527.html > > As mentioned earlier this month on the dev mailing list, > http://lists.open-bio.org/pipermail/biopython-dev/2009-November/006983.html > I actually made the time to implement this, and posted it on a > github branch - you can see the updated Bio/Seq.py file here: > http://github.com/peterjc/biopython/blob/ungap/Bio/Seq.py > > I've included a copy of the proposed docstring for the new Seq object > ungap method at the end of this email, which tries to illustrate how this > would be used. In the absence of any further comments (thanks Eric for your reply on the dev list), I've made an executive decision to check this into the trunk. This will make it much easier for people to test the new ungap method. I remain open to feedback (e.g. naming of the method) and we can even remove this before the next release if that turns out to be the consensus. Peter From jblanca at btc.upv.es Tue Nov 24 10:32:55 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Tue, 24 Nov 2009 11:32:55 +0100 Subject: [Biopython] Subclassing Seq and SeqRecord Message-ID: <200911241132.55922.jblanca@btc.upv.es> Hi: I'm "Biopythoniing" my utilities. I want to subclass Seq and SeqRecord to modify a little its behaviour. for instance I'm doing: from Bio.Seq import Seq as BioSeq class Seq(BioSeq): 'A biopython Seq with some extra functionality' def __eq__(self, seq): 'It checks if the given seq is equal to this one' return str(self) == str(seq) The problem is that to modify this behaviour I have to copy a lot of Seq methods because this methods create new Seq instances to return. This instances are created like: return Seq(str(self).replace('T','U').replace('t','u'), alphabet) would it be possible to change that to: return self.__class__(str(self).replace('T','U').replace('t','u'), alphabet) In that way the new instance would be created using the subclassed class and not the Seq class. Is that a reasonable change? In that case I could prepare a patch for Seq and SeqRecord. Regards, -- Jose M. 
Blanca Postigo Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) From p.j.a.cock at googlemail.com Tue Nov 24 10:53:40 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 24 Nov 2009 10:53:40 +0000 Subject: [Biopython] Subclassing Seq and SeqRecord In-Reply-To: <200911241132.55922.jblanca@btc.upv.es> References: <200911241132.55922.jblanca@btc.upv.es> Message-ID: <320fb6e00911240253g10986fcfj311c8a2adc12afd5@mail.gmail.com> On Tue, Nov 24, 2009 at 10:32 AM, Jose Blanca wrote: > Hi: > I'm "Biopythoniing" my utilities. I want to subclass Seq and SeqRecord to > modify a little its behaviour. for instance I'm doing: > > from Bio.Seq import Seq as BioSeq > > class Seq(BioSeq): > ? ?'A biopython Seq with some extra functionality' > ? ?def __eq__(self, seq): > ? ? ? ?'It checks if the given seq is equal to this one' > ? ? ? ?return str(self) == str(seq) That is something I have been meaning to bring up on the list. I started chatting to Brad about this at BOSC2009. The details get quite hairy with hashes and dictionaries and so on, so I will leave it to another email. > The problem is that to modify this behaviour I have to copy a lot of Seq > methods because this methods create new Seq instances to return. This > instances are created like: > return Seq(str(self).replace('T','U').replace('t','u'), alphabet) > > would it be possible to change that to: > return self.__class__(str(self).replace('T','U').replace('t','u'), alphabet) > > In that way the new instance would be created using the subclassed > class and not the Seq class. Is that a reasonable change? In that > case I could prepare a patch for Seq and SeqRecord. It is a reasonable change, but ONLY if all the subclasses support the same __init__ method, which isn't true. For example, the Bio.Seq.UnknownSeq subclasses Seq and uses a different __init__ method signature. This means any change would at a minimum have to include lots of fixes to the UnknownSeq From cmckay at u.washington.edu Tue Nov 24 23:12:08 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Tue, 24 Nov 2009 15:12:08 -0800 Subject: [Biopython] allow ambiguities is sequence matching? In-Reply-To: <320fb6e00911200203t1bc09df6jc0dac7d619a644f0@mail.gmail.com> References: <320fb6e00911200203t1bc09df6jc0dac7d619a644f0@mail.gmail.com> Message-ID: <7577CD69-1CD0-428A-B271-DA39F8718EA9@u.washington.edu> Thanks for the advice, I'll check out that bug, and see what I see. best, Cedar On Nov 20, 2009, at 2:03 AM, Peter wrote: > On Thu, Nov 19, 2009 at 11:42 PM, Cedar McKay > wrote: >> Hello all, >> Apologies if this is covered in the tutorial anywhere, if so I >> didn't see >> it. >> >> I am trying to test whether sequence A appears anywhere in sequence >> B. The >> catch is I want to allow n mismatches. Right now my code looks like: >> >> #record is a SeqRecord >> #query_seq is a string >> if query_seq in record.seq: >> do something >> >> >> If I want query_seq to match despite n nucleotide mismatches, how >> should I >> do that? It seems like something that would be pretty common for >> people >> working with DNA probes. Is this even a biopython problem? Or is it >> just a >> general python problem? > > We have in general tried to keep the Seq object API as much like > that of > the Python string as is reasonable, for example the find, startswith > and > endswith methos. 
Likewise, the "in" operator on the Seq object also > works > like a python string, it uses plain string matching (see Bug 2853, > this was > added in Biopython 1.51). > > It sounds like you want some kind of fuzzy find... one solution would > be regular expressions, another might be to use the Bio.Motif module. > There have been similar discussions on the mailing list before, but no > clear consensus - see for example Bug 2601. > > Peter From mitlox at op.pl Thu Nov 26 00:49:15 2009 From: mitlox at op.pl (xyz) Date: Thu, 26 Nov 2009 10:49:15 +1000 Subject: [Biopython] fastq-solexa index Message-ID: <4B0DD08B.6070607@op.pl> Hello, On this page ( http://www.biopython.org/wiki/SeqIO ) I have found that biopython can use fastq-solexa index. What does it means and are there any examples? Thank you in advance. Best regards, From anaryin at gmail.com Thu Nov 26 02:24:04 2009 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 25 Nov 2009 18:24:04 -0800 Subject: [Biopython] Turning PDBConstructionWarning off Message-ID: Dear All, Is there a way to make the PDBParser not to display Warnings when it reads structures? Like a flag that we pass somewhere? Regards! Jo?o [...] Rodrigues @ http://stanford.edu/~joaor/ From rodrigo_faccioli at uol.com.br Thu Nov 26 10:41:26 2009 From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli) Date: Thu, 26 Nov 2009 08:41:26 -0200 Subject: [Biopython] Turning PDBConstructionWarning off In-Reply-To: References: Message-ID: <3715adb70911260241x48678fe9ma80a246b7d73033b@mail.gmail.com> You can use the command line: python -O file.py I could execute the biopython without warning message when I put that command line. I understood that these warning messages show because we execute the biopython in debug mode. If you look the source code, you'll see: if __debug__: warning message So, I thought if set false in this variable (__debug__) I'll not warning message. I don't know if there is other way. Regards, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 On Thu, Nov 26, 2009 at 12:24 AM, Jo?o Rodrigues wrote: > Dear All, > > Is there a way to make the PDBParser not to display Warnings when it reads > structures? Like a flag that we pass somewhere? > > Regards! > > Jo?o [...] Rodrigues > @ http://stanford.edu/~joaor/ > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Thu Nov 26 10:42:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Nov 2009 10:42:59 +0000 Subject: [Biopython] Turning PDBConstructionWarning off In-Reply-To: References: Message-ID: <320fb6e00911260242y6d989df8wd5271cda71f3ed08@mail.gmail.com> On Thu, Nov 26, 2009 at 2:24 AM, Jo?o Rodrigues wrote: > Dear All, > > Is there a way to make the PDBParser not to display Warnings when it reads > structures? Like a flag that we pass somewhere? > > Regards! 
Yes, use the Python warnings module to ignore PDBConstructionWarning, see: http://docs.python.org/library/warnings.html Peter From biopython at maubp.freeserve.co.uk Thu Nov 26 10:48:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Nov 2009 10:48:49 +0000 Subject: [Biopython] fastq-solexa index In-Reply-To: <4B0DD08B.6070607@op.pl> References: <4B0DD08B.6070607@op.pl> Message-ID: <320fb6e00911260248w1f6a29b1ucc0bfecec897c67b@mail.gmail.com> On Thu, Nov 26, 2009 at 12:49 AM, xyz wrote: > Hello, > On this page ( http://www.biopython.org/wiki/SeqIO ) I have found that > biopython can use fastq-solexa index. What does it means and are there any > examples? > > Thank you in advance. In Bio.SeqIO we give each file format a name, in this case "fastq-solexa" means the old Solexa FASTQ files (also used by Illumina up to and including pipeline 1.2) which use Solexa scores with an ASCII offset of 64 (not PHRED scores). The table on the SeqIO wiki page tries to summarise this. See also: http://en.wikipedia.org/wiki/FASTQ_format The "index" column on that table on the SeqIO wiki page indicates if each file format can be used with the Bio.SeqIO.index(...) function included in Biopython 1.52 onwards. See: http://news.open-bio.org/news/2009/09/biopython-seqio-index/ There are also examples in the main Tutorial, http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf And in the Bio.SeqIO module's built in help, online here: http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html >From within Python: >>> from Bio import SeqIO >>> help(SeqIO) ... >>> help(SeqIO.index) ... Peter From anaryin at gmail.com Thu Nov 26 10:49:26 2009 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 26 Nov 2009 02:49:26 -0800 Subject: [Biopython] Turning PDBConstructionWarning off In-Reply-To: <320fb6e00911260242y6d989df8wd5271cda71f3ed08@mail.gmail.com> References: <320fb6e00911260242y6d989df8wd5271cda71f3ed08@mail.gmail.com> Message-ID: Thanks Rodrigo for the tip. I'd tried to set __debug__ to False as well but it didn't work as I wanted (unless I actually edited the module file). Peter's suggestion is what I wanted. I was completely unaware of the "warning" module so I thought it was a BioPython thing. Thanks! Jo?o [...] Rodrigues @ http://stanford.edu/~joaor/ From biopython at maubp.freeserve.co.uk Thu Nov 26 13:10:33 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Nov 2009 13:10:33 +0000 Subject: [Biopython] Biopython and Twitter followers Message-ID: <320fb6e00911260510w77fbce0dsbbd76ad4d4892221@mail.gmail.com> Hi all, We've had a Biopython twitter account over six months now, and it seems to be a nice extra channel for promoting the project and keeping people up to date. Right now Biopython has 123 twitter followers: http://twitter.com/Biopython [Don't forget we have RSS and Atom news feeds too - see http://biopython.org/wiki/News for links] Right now, Biopython only follows the OBF, BioPerl and Guido van Rossum. I'm happy to add other related projects like BioRuby or BioJava if/when they setup twitter accounts. Given we have quite a few Biopython developers and regular contributors on twitter now - should we be following them too? Leighton had some valid reservations: http://lists.open-bio.org/pipermail/biopython-dev/2009-March/005626.html Any thoughts? 
Peter From lpritc at scri.ac.uk Thu Nov 26 15:32:32 2009 From: lpritc at scri.ac.uk (Leighton Pritchard) Date: Thu, 26 Nov 2009 15:32:32 +0000 Subject: [Biopython] Biopython and Twitter followers In-Reply-To: <320fb6e00911260510w77fbce0dsbbd76ad4d4892221@mail.gmail.com> Message-ID: Howdy, I'm less concerned about following individuals now, than I was. It was a new mode of communication to me, and I might have been being a bit oversensitive to some comments on [bip] and blogs ;) Whatever makes the community happy is fine by me so, as long as we don't end up looking like a Masonic cult, I think following individuals is fair game. Cheers, L. On 26/11/2009 13:10, "Peter" wrote: > Hi all, > > We've had a Biopython twitter account over six months now, > and it seems to be a nice extra channel for promoting the > project and keeping people up to date. Right now Biopython > has 123 twitter followers: http://twitter.com/Biopython > > [Don't forget we have RSS and Atom news feeds too - see > http://biopython.org/wiki/News for links] > > Right now, Biopython only follows the OBF, BioPerl and > Guido van Rossum. I'm happy to add other related > projects like BioRuby or BioJava if/when they setup > twitter accounts. Given we have quite a few Biopython > developers and regular contributors on twitter now - > should we be following them too? Leighton had some > valid reservations: > http://lists.open-bio.org/pipermail/biopython-dev/2009-March/005626.html > > Any thoughts? > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > ______________________________________________________________________ > This email has been scanned by the MessageLabs Email Security System. > For more information please visit http://www.messagelabs.com/email > ______________________________________________________________________ -- Dr Leighton Pritchard MRSC D131, Plant Pathology Programme, SCRI Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405 ______________________________________________________ SCRI, Invergowrie, Dundee, DD2 5DA. The Scottish Crop Research Institute is a charitable company limited by guarantee. Registered in Scotland No: SC 29367. Recognised by the Inland Revenue as a Scottish Charity No: SC 006662. DISCLAIMER: This email is from the Scottish Crop Research Institute, but the views expressed by the sender are not necessarily the views of SCRI and its subsidiaries. This email and any files transmitted with it are confidential to the intended recipient at the e-mail address to which it has been addressed. It may not be disclosed or used by any other than that addressee. If you are not the intended recipient you are requested to preserve this confidentiality and you must not use, disclose, copy, print or rely on this e-mail in any way. Please notify postmaster at scri.ac.uk quoting the name of the sender and delete the email from your system. Although SCRI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and the attachments (if any). 
______________________________________________________ From rodrigo_faccioli at uol.com.br Thu Nov 26 16:01:10 2009 From: rodrigo_faccioli at uol.com.br (Rodrigo faccioli) Date: Thu, 26 Nov 2009 13:01:10 -0300 Subject: [Biopython] Turning PDBConstructionWarning off In-Reply-To: References: <320fb6e00911260242y6d989df8wd5271cda71f3ed08@mail.gmail.com> Message-ID: <3715adb70911260801r2c897c8ev35c3396df700251f@mail.gmail.com> Thanks Peter. Your suggestion was good. I?ll try it. Regards, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 On Thu, Nov 26, 2009 at 7:49 AM, Jo?o Rodrigues wrote: > Thanks Rodrigo for the tip. I'd tried to set __debug__ to False as well but > it didn't work as I wanted (unless I actually edited the module file). > > Peter's suggestion is what I wanted. I was completely unaware of the > "warning" module so I thought it was a BioPython thing. > > Thanks! > > Jo?o [...] Rodrigues > @ http://stanford.edu/~joaor/ > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Thu Nov 26 16:02:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 26 Nov 2009 16:02:50 +0000 Subject: [Biopython] Fwd: [DAS] DAS workshop 7th-9th April 2010 In-Reply-To: References: Message-ID: <320fb6e00911260802wb98b28fic8a193c125e29d9c@mail.gmail.com> This might be of interest to some of you. Peter ---------- Forwarded message ---------- From: Jonathan Warren Date: Thu, Nov 26, 2009 at 2:57 PM Subject: [DAS] DAS workshop 7th-9th April 2010 To: das at biodas.org, das_registry_announce at sanger.ac.uk, biojava-dev , BioJava , BioPerl , all at sanger.ac.uk, all at ebi.ac.uk, ensembldev We are considering running a Distributed Annotation System workshop here at the Sanger/EBI in the UK subject to decent demand. The workshop will be held from Wednesday 7th-Friday 9th April 2010. If you would be interested in attending either to present or just take part then please email me jw12 at sanger.ac.uk The format of the workshop is likely to be similar to last years (1st day for beginners, 2nd for both beginners and advanced users, 3rd day for advanced), information for which can be found here: http://www.dasregistry.org/course.jsp If you would like to present then please send a short summary of what you would like to talk about. Thanks Jonathan. Jonathan Warren Senior Developer and DAS coordinator jw12 at sanger.ac.uk -- The Wellcome Trust Sanger Institute is operated by Genome ResearchLimited, a charity registered in England with number 1021457 and acompany registered in England with number 2742969, whose registeredoffice is 215 Euston Road, London, NW1 2BE._______________________________________________ DAS mailing list DAS at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/das From eric.talevich at gmail.com Thu Nov 26 20:31:40 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 26 Nov 2009 15:31:40 -0500 Subject: [Biopython] Turning PDBConstructionWarning off Message-ID: <3f6baf360911261231q712933a2g834025ce4690d4e6@mail.gmail.com> From: Jo?o Rodrigues : > > Thanks Rodrigo for the tip. 
I'd tried to set __debug__ to False as well but > it didn't work as I wanted (unless I actually edited the module file). > > Peter's suggestion is what I wanted. I was completely unaware of the > "warning" module so I thought it was a BioPython thing. > > Thanks! > > Jo?o [...] Rodrigues > @ http://stanford.edu/~joaor/ > The __debug__ check in Biopython's source code isn't really necessary; it's set internally by Python. By default it's True, but running Python with optimizations on (-O on the command line) sets it to False and automatically skips all warnings. As Peter suggested, the usual way to hide specific warnings in your applications is with the warnings module's simplefilter(). Cheers, Eric From pengyu.ut at gmail.com Fri Nov 27 22:56:17 2009 From: pengyu.ut at gmail.com (Peng Yu) Date: Fri, 27 Nov 2009 16:56:17 -0600 Subject: [Biopython] How to get intron/exon boundaries? Message-ID: <366c6f340911271456t594e3256i93d184f72215e6e7@mail.gmail.com> I'm wondering how to get intron exon boundaires for all the genes. Could somebody show me what functions I should use? From biopython at maubp.freeserve.co.uk Fri Nov 27 23:03:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 27 Nov 2009 23:03:12 +0000 Subject: [Biopython] How to get intron/exon boundaries? In-Reply-To: <366c6f340911271456t594e3256i93d184f72215e6e7@mail.gmail.com> References: <366c6f340911271456t594e3256i93d184f72215e6e7@mail.gmail.com> Message-ID: <320fb6e00911271503j273fbb9bs865eaed9ea92af80@mail.gmail.com> On Fri, Nov 27, 2009 at 10:56 PM, Peng Yu wrote: > I'm wondering how to get intron exon boundaires for all the genes. > Could somebody show me what functions I should use? What do you want to know? The co-ordinates of the intron/exons, or just to get the coding sequence? What kind of data are you looking at? For GenBank or EMBL files this is encoded in the CDS feature locations. For GFF files I think this information is given explicitly, Peter From pengyu.ut at gmail.com Sat Nov 28 00:18:06 2009 From: pengyu.ut at gmail.com (Peng Yu) Date: Fri, 27 Nov 2009 18:18:06 -0600 Subject: [Biopython] How to get intron/exon boundaries? In-Reply-To: <320fb6e00911271503j273fbb9bs865eaed9ea92af80@mail.gmail.com> References: <366c6f340911271456t594e3256i93d184f72215e6e7@mail.gmail.com> <320fb6e00911271503j273fbb9bs865eaed9ea92af80@mail.gmail.com> Message-ID: <366c6f340911271618u4ff965f0u729fa0a1f70bd028@mail.gmail.com> On Fri, Nov 27, 2009 at 5:03 PM, Peter wrote: > On Fri, Nov 27, 2009 at 10:56 PM, Peng Yu wrote: >> I'm wondering how to get intron exon boundaires for all the genes. >> Could somebody show me what functions I should use? > > What do you want to know? The co-ordinates of the intron/exons, > or just to get the coding sequence? I want the co-ordinates. > What kind of data are you looking at? For GenBank or EMBL > files this is encoded in the CDS feature locations. For GFF > files I think this information is given explicitly, Would you please let me know how to get the CDS feature locations from GenBank and EMBL? What are GFF files? From sdavis2 at mail.nih.gov Mon Nov 30 00:55:03 2009 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Sun, 29 Nov 2009 19:55:03 -0500 Subject: [Biopython] How to get intron/exon boundaries? 
In-Reply-To: <366c6f340911271618u4ff965f0u729fa0a1f70bd028@mail.gmail.com> References: <366c6f340911271456t594e3256i93d184f72215e6e7@mail.gmail.com> <320fb6e00911271503j273fbb9bs865eaed9ea92af80@mail.gmail.com> <366c6f340911271618u4ff965f0u729fa0a1f70bd028@mail.gmail.com> Message-ID: <264855a00911291655u6a9cf012v75e02d398c89c833@mail.gmail.com> On Fri, Nov 27, 2009 at 7:18 PM, Peng Yu wrote: > On Fri, Nov 27, 2009 at 5:03 PM, Peter wrote: >> On Fri, Nov 27, 2009 at 10:56 PM, Peng Yu wrote: >>> I'm wondering how to get intron exon boundaires for all the genes. >>> Could somebody show me what functions I should use? >> >> What do you want to know? The co-ordinates of the intron/exons, >> or just to get the coding sequence? > > I want the co-ordinates. You are talking about coordinates in genomic space or on the transcript? What organism? And what annotation system do you want to use--Ensembl, UCSC, or NCBI? >> What kind of data are you looking at? For GenBank or EMBL >> files this is encoded in the CDS feature locations. For GFF >> files I think this information is given explicitly, > > Would you please let me know how to get the CDS feature locations from > GenBank and EMBL? What are GFF files? For GFF, google will get you a long way ("GFF format"). Sean From pengyu.ut at gmail.com Mon Nov 30 03:30:52 2009 From: pengyu.ut at gmail.com (Peng Yu) Date: Sun, 29 Nov 2009 21:30:52 -0600 Subject: [Biopython] How to get intron/exon boundaries? In-Reply-To: <264855a00911291655u6a9cf012v75e02d398c89c833@mail.gmail.com> References: <366c6f340911271456t594e3256i93d184f72215e6e7@mail.gmail.com> <320fb6e00911271503j273fbb9bs865eaed9ea92af80@mail.gmail.com> <366c6f340911271618u4ff965f0u729fa0a1f70bd028@mail.gmail.com> <264855a00911291655u6a9cf012v75e02d398c89c833@mail.gmail.com> Message-ID: <366c6f340911291930x5ba0b7co1ae84809d5b72b9b@mail.gmail.com> On Sun, Nov 29, 2009 at 6:55 PM, Sean Davis wrote: > On Fri, Nov 27, 2009 at 7:18 PM, Peng Yu wrote: >> On Fri, Nov 27, 2009 at 5:03 PM, Peter wrote: >>> On Fri, Nov 27, 2009 at 10:56 PM, Peng Yu wrote: >>>> I'm wondering how to get intron exon boundaires for all the genes. >>>> Could somebody show me what functions I should use? >>> >>> What do you want to know? The co-ordinates of the intron/exons, >>> or just to get the coding sequence? >> >> I want the co-ordinates. > > You are talking about coordinates in genomic space or on the > transcript? ?What organism? ?And what annotation system do you want to > use--Ensembl, UCSC, or NCBI? The coordinates in genomic space. Mouse. UCSC. >>> What kind of data are you looking at? For GenBank or EMBL >>> files this is encoded in the CDS feature locations. For GFF >>> files I think this information is given explicitly, >> >> Would you please let me know how to get the CDS feature locations from >> GenBank and EMBL? What are GFF files? > > For GFF, google will get you a long way ("GFF format"). > > Sean > From sdavis2 at mail.nih.gov Mon Nov 30 13:59:35 2009 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Mon, 30 Nov 2009 08:59:35 -0500 Subject: [Biopython] How to get intron/exon boundaries? 
In-Reply-To: <366c6f340911291930x5ba0b7co1ae84809d5b72b9b@mail.gmail.com> References: <366c6f340911271456t594e3256i93d184f72215e6e7@mail.gmail.com> <320fb6e00911271503j273fbb9bs865eaed9ea92af80@mail.gmail.com> <366c6f340911271618u4ff965f0u729fa0a1f70bd028@mail.gmail.com> <264855a00911291655u6a9cf012v75e02d398c89c833@mail.gmail.com> <366c6f340911291930x5ba0b7co1ae84809d5b72b9b@mail.gmail.com> Message-ID: <264855a00911300559i7fe3ce88x5e2e339182a7ef36@mail.gmail.com> On Sun, Nov 29, 2009 at 10:30 PM, Peng Yu wrote: > On Sun, Nov 29, 2009 at 6:55 PM, Sean Davis wrote: >> On Fri, Nov 27, 2009 at 7:18 PM, Peng Yu wrote: >>> On Fri, Nov 27, 2009 at 5:03 PM, Peter wrote: >>>> On Fri, Nov 27, 2009 at 10:56 PM, Peng Yu wrote: >>>>> I'm wondering how to get intron exon boundaires for all the genes. >>>>> Could somebody show me what functions I should use? >>>> >>>> What do you want to know? The co-ordinates of the intron/exons, >>>> or just to get the coding sequence? >>> >>> I want the co-ordinates. >> >> You are talking about coordinates in genomic space or on the >> transcript? ?What organism? ?And what annotation system do you want to >> use--Ensembl, UCSC, or NCBI? > > The coordinates in genomic space. Mouse. UCSC. http://genome.ucsc.edu/cgi-bin/hgTables?org=Mouse Choose the track that you like. UCSC Known Genes is the typical default. There are numerous output format options. Again, choose whatever you think is convenient. The outputs are almost all tab-delimited text, so you should be able to use them easily in the scripting language of your choice. If you prefer gff, then consider GTF format. If you have questions for the UCSC folks, they have their own mailing list accessible from the top of the page on their website. Sean >>>> What kind of data are you looking at? For GenBank or EMBL >>>> files this is encoded in the CDS feature locations. For GFF >>>> files I think this information is given explicitly, >>> >>> Would you please let me know how to get the CDS feature locations from >>> GenBank and EMBL? What are GFF files? >> >> For GFF, google will get you a long way ("GFF format"). >> >> Sean >> > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython >
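To make Sean's suggestion concrete, here is a small sketch that pulls exon and intron coordinates out of a Table Browser download of the UCSC Known Genes track. It assumes the usual knownGene column order (name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...), with comma-separated exonStarts/exonEnds fields and 0-based half-open coordinates; check the table schema in the Table Browser before relying on these column indices, and note that the file name and function name below are only placeholders:

def exon_intron_boundaries(line):
    fields = line.rstrip('\n').split('\t')
    exon_starts = [int(x) for x in fields[8].rstrip(',').split(',')]
    exon_ends = [int(x) for x in fields[9].rstrip(',').split(',')]
    # Exons as (start, end) pairs in genomic coordinates
    exons = list(zip(exon_starts, exon_ends))
    # Each intron runs from the end of one exon to the start of the next
    introns = [(exon_ends[i], exon_starts[i + 1])
               for i in range(len(exon_starts) - 1)]
    return exons, introns

for line in open('knownGene.txt'):
    if line.startswith('#'):  # the Table Browser may prepend a header line
        continue
    exons, introns = exon_intron_boundaries(line)

For GenBank or EMBL input, the equivalent information would come from the SeqFeature locations mentioned earlier in the thread rather than from a tab-separated table.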