From biopython at maubp.freeserve.co.uk Thu Oct 1 04:06:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 1 Oct 2009 09:06:22 +0100
Subject: [Biopython] get back raw records with SeqIO?
In-Reply-To: <4995FD92-374F-4B4F-947B-CC9175B06BCD@u.washington.edu>
References: <44353069-545B-4248-BAD1-958363EE8F29@u.washington.edu>
 <320fb6e00909250250q417516ffu5f75ceea7732ff72@mail.gmail.com>
 <4995FD92-374F-4B4F-947B-CC9175B06BCD@u.washington.edu>
Message-ID: <320fb6e00910010106t4126292bs2c9fac1db85fbd32@mail.gmail.com>

On Thu, Oct 1, 2009 at 12:14 AM, Cedar McKay wrote:
>
>> Why do you want to do this? I'd like to understand the desired
>> usage.
>
> I didn't have a specific technical reason.

OK - if you come up with a good use case example, please let us know.

> It just seemed like everything was going towards using SeqIO and things
> like Bio.Fasta were being deprecated, so I wanted to get ahead of the
> curve there. But if Bio.Genbank is going to be around for a long time,
> I don't have any problem with doing it that way.

For more complicated file formats (e.g. GenBank, SwissProt, ACE,
PHRED, ...) mapping the data into SeqRecord objects isn't 100%
perfect. Here Bio.SeqIO really is just a unifying API sitting on top of
file format specific parsers (which live in other modules), which is
good enough for most tasks. Unless/until the SeqRecord objects are
a full mapping, any more file format specific data structure still has
its uses - and thus I see no immediate pressure to remove Bio.GenBank
etc.

Unlike some of the Bio.SeqIO parsers, for "fasta" we don't use an
underlying module (such as Bio.Fasta), and the SeqRecord can
capture all of the annotation in the raw file. One reason for this is
that at the time, Bio.Fasta still used Martel and was noticeably slower
than the pure Python code I adopted for FASTA files in SeqIO.
Since then Bio.Fasta has lost all the Martel dependencies (which meant
the loss of the old indexing code, indirectly leading to the
Bio.SeqIO.index() function as per our previous discussions). This means
that the remaining code in Bio.Fasta is now redundant.

Maybe we could have just left Bio.Fasta alone, sitting quietly but
tagged obsolete, but it is clearer to remove redundancy.

Peter

P.S. For the record, Bio.Fasta was declared obsolete in Biopython 1.48
(Sept 2008), and deprecated in Biopython 1.51 (Aug 2009).

From denzel.dz.li at gmail.com Mon Oct 5 13:38:38 2009
From: denzel.dz.li at gmail.com (Denzel Li)
Date: Mon, 5 Oct 2009 13:38:38 -0400
Subject: [Biopython] Combine nexus files but not concatenating them
Message-ID:

Hi all:
I notice there is a solution for combining nexus files that appears in the
cookbook (http://biopython.org/wiki/Concatenate_nexus). However, in the
example the alignments are concatenated. What I want is, for example, for
the following two files to be combined into one file as shown in
"combinedFile.nex".

# file1.nex
b1 GGG
b2 GGT

# file2.nex
b1 AAA
b2 AAT

# combinedFile.nex
begin data;
dimensions ntax=2 nchar=6
[alignment from file1.nex]
b1 GGG
b2 GGT
[alignment from file2.nex]
b1 AAA
b2 AAT
;end;

begin sets;
charset a1=1-3;
charset a2=4-6;
end;

Any suggestion is highly appreciated. Thank you.

Best,
Denzel

From biopython at maubp.freeserve.co.uk Mon Oct 5 15:42:48 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 5 Oct 2009 20:42:48 +0100
Subject: [Biopython] Combine nexus files but not concatenating them
In-Reply-To:
References:
Message-ID: <320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com>

On Mon, Oct 5, 2009 at 6:38 PM, Denzel Li wrote:
> Hi all:
> I notice there is a solution for combining nexus files that appears in the
> cookbook (http://biopython.org/wiki/Concatenate_nexus). However, in the
> example the alignments are concatenated.
> What I want is, for example, for the following
> two files to be combined into one file as shown in "combinedFile.nex".

I was under the impression that NEXUS files should only hold
one alignment matrix. Why do you need it done this way? Isn't
your example basically the same thing but interleaved?

Peter

From denzel.dz.li at gmail.com Mon Oct 5 16:00:06 2009
From: denzel.dz.li at gmail.com (Denzel Li)
Date: Mon, 5 Oct 2009 16:00:06 -0400
Subject: [Biopython] Combine nexus files but not concatenating them
In-Reply-To: <320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com>
References: <320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com>
Message-ID:

Hi Peter:
Yes, it is basically the same thing returned by "nexus.combine" but
"interleaved".
A further question: is it possible to split one nexus file into several
nexus files according to the charset (or partition) defined in the file?
As in the concatenation example (http://biopython.org/wiki/Concatenate_nexus),
split the combined file into btCOI.nex, btCOII.nex and btITS.nex.

Thanks,
Denzel

On Mon, Oct 5, 2009 at 3:42 PM, Peter wrote:

> On Mon, Oct 5, 2009 at 6:38 PM, Denzel Li wrote:
> > Hi all:
> > I notice there is a solution for combining nexus files that appears in the
> > cookbook (http://biopython.org/wiki/Concatenate_nexus). However, in the
> > example the alignments are concatenated. What I want is, for example, for
> > the following two files to be combined into one file as shown in
> > "combinedFile.nex".
>
> I was under the impression that NEXUS files should only hold
> one alignment matrix. Why do you need it done this way? Isn't
> your example basically the same thing but interleaved?
>
> Peter
>

From biopython at maubp.freeserve.co.uk Mon Oct 5 16:31:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 5 Oct 2009 21:31:53 +0100
Subject: [Biopython] Combine nexus files but not concatenating them
In-Reply-To:
References: <320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com>
Message-ID: <320fb6e00910051331u2f27c7ecmde744d29fb3ba2ab@mail.gmail.com>

On Mon, Oct 5, 2009 at 9:00 PM, Denzel Li wrote:
> Hi Peter:
> Yes, it is basically the same thing returned by "nexus.combine" but
> "interleaved".

Surely whether or not the data is interleaved is immaterial to the
meaning. Does the combined version following our wiki not work
for some 3rd party tool?

> A further question: is it possible to split one nexus file
> into several nexus files according to the charset (or partition)
> defined in the file? As in the concatenation example
> (http://biopython.org/wiki/Concatenate_nexus), split the
> combined file into btCOI.nex, btCOII.nex and btITS.nex.

Does the write_nexus_data_partitions() method of the Nexus
object do what you want?

Peter

From harekrishna at gmail.com Tue Oct 6 17:07:52 2009
From: harekrishna at gmail.com (Austin Davis-Richardson)
Date: Tue, 6 Oct 2009 17:07:52 -0400
Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary() results
Message-ID:

Howdy,

I'm using BioPython to generate a table of accession numbers and their
corresponding TaxIDs. The fastest way I can do this is 20 at a time
(20 per 3 seconds rather than 1 per 3 seconds).

However, this results in a problem.
Whenever my script receives a result from NCBI that is blank, such as
there being no value for TaxID, BioPython crashes with the error:

  File "taxcollector3.py", line 39, in getTaxID
    record = Entrez.read(handle)
  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py", line 259, in read
    record = handler.run(handle)
  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", line 90, in run
    self.parser.ParseFile(handle)
  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", line 191, in endElement
    value = IntegerElement(value)
ValueError: invalid literal for int() with base 10: ''

My code looks like this, where gids is a string of comma-separated GIDs
(I get the GIDs from the accession numbers using
Entrez.esearch(db="nucleotide", rettype="text", term=accessions)):

    handle = Entrez.esummary(db="nucleotide", id=gids)
    record = Entrez.read(handle)

The only solution I can come up with is searching one at a time, but
this is very slow. (I have about 300,000 accession numbers.)

Does anyone know perhaps a patch or a solution for this? Or maybe an
easier way to get a TaxID from an accession number?

Thanks,
Austin Davis-Richardson

From mjldehoon at yahoo.com Tue Oct 6 22:11:36 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 6 Oct 2009 19:11:36 -0700 (PDT)
Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary() results
In-Reply-To:
Message-ID: <362834.37683.qm@web62401.mail.re1.yahoo.com>

You could try the following (with biopython 1.52):

    from Bio import Entrez

    handle = Entrez.esummary(db="nucleotide", id=gids)
    records = Entrez.parse(handle)
    while True:
        try:
            # Pull one record at a time so a bad one can be skipped.
            record = next(records)
        except StopIteration:
            break
        except ValueError:
            # The empty "integer" value from the traceback ends up here.
            print("Skipping record")

We should probably modify Bio.Entrez so that empty "integer" values
are treated correctly.

--Michiel.
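
A possible middle ground between batches of 20 and one-at-a-time lookups, sketched under assumptions: query in batches, and only when a whole batch fails to parse, retry that batch one ID at a time. Here fetch_summaries is a hypothetical callable standing in for the Entrez.esummary plus Entrez.read pair (it is not a Biopython function), and fake_fetch is a made-up stand-in so the sketch runs without touching NCBI.

```python
def summarize_with_fallback(ids, fetch_summaries, batch_size=20):
    """Return {id: summary or None}, retrying failed batches one ID at a time.

    fetch_summaries is a hypothetical callable (an assumption, not part of
    Biopython): it takes a list of IDs and returns one summary per ID, or
    raises ValueError when a record cannot be parsed.
    """
    results = {}
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        try:
            # Fast path: one request for the whole batch.
            for gid, summary in zip(batch, fetch_summaries(batch)):
                results[gid] = summary
        except ValueError:
            # Slow path, only for this batch: isolate the bad record(s).
            for gid in batch:
                try:
                    results[gid] = fetch_summaries([gid])[0]
                except ValueError:
                    results[gid] = None  # mark as missing and keep going
    return results


def fake_fetch(batch):
    """Made-up stand-in for a real lookup; ID "3" simulates a bad record."""
    if "3" in batch:
        raise ValueError("invalid literal for int() with base 10: ''")
    return [{"TaxId": int(gid) * 10} for gid in batch]


demo = summarize_with_fallback(["1", "2", "3", "4", "5"], fake_fetch, batch_size=2)
print(demo)
```

With this shape, only the batches that contain a problem record pay the one-at-a-time cost, and every input ID still gets an entry in the result.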
From biopython at maubp.freeserve.co.uk Wed Oct 7 05:29:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 7 Oct 2009 10:29:36 +0100
Subject: [Biopython] Combine nexus files but not concatenating them
In-Reply-To:
References: <320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com>
 <320fb6e00910051331u2f27c7ecmde744d29fb3ba2ab@mail.gmail.com>
Message-ID: <320fb6e00910070229n1b78542dj82998de13cf7eed7@mail.gmail.com>

On Wed, Oct 7, 2009 at 4:22 AM, Denzel Li wrote:
> Hi Peter:
> Thank you for the help. Both functions work well. By the way, will
> "standard" datatype or "mixed" datatype be supported in Bio:Nexus:Nexus?
>
> Best,
> Denzel

Hi Denzel,

I CC'd the list - please try to keep replies sent there.

I'm glad Bio.Nexus is working well for you.

Regarding the finer details of the NEXUS file format and the Biopython
code, I am not an expert - we need Frank or Cymon to comment. If
you could give us a couple of examples of what you are asking for it
would probably be much clearer (to me at least).

Regards,

Peter

From biopython at maubp.freeserve.co.uk Wed Oct 7 07:17:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 7 Oct 2009 12:17:30 +0100
Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary() results
In-Reply-To: <362834.37683.qm@web62401.mail.re1.yahoo.com>
References: <362834.37683.qm@web62401.mail.re1.yahoo.com>
Message-ID: <320fb6e00910070417w26236a62ifece2e2610256609@mail.gmail.com>

On Wed, Oct 7, 2009 at 3:11 AM, Michiel de Hoon wrote:
>
> We should probably modify Bio.Entrez so that empty "integer" values
> are treated correctly.
>

Does "correctly" mean a default value? I see Brad has just committed a
change to use -1 in this case, but perhaps None is also a good choice?
Can we alternatively leave this bit of the data structure empty?
Peter

From chapmanb at 50mail.com Wed Oct 7 07:17:37 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 7 Oct 2009 07:17:37 -0400
Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary() results
In-Reply-To:
References:
Message-ID: <20091007111737.GC84267@sobchak.mgh.harvard.edu>

Hi Austin;

> I'm using BioPython to generate a table of accession numbers and their
> corresponding TaxIDs. The fastest way I can do this is 20 at a time
> (20 per 3 seconds rather than 1 per 3 seconds).
>
> However, this results in a problem.
>
> whenever my script receives a result from NCBI that is blank such as
> there being no value for TaxID, BioPython crashes with the error:
>
>   File "taxcollector3.py", line 39, in getTaxID
>     record = Entrez.read(handle)
>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py", line 259, in read
>     record = handler.run(handle)
>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", line 90, in run
>     self.parser.ParseFile(handle)
>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", line 191, in endElement
>     value = IntegerElement(value)
> ValueError: invalid literal for int() with base 10: ''

In addition to Michiel's workaround, I checked in a small change
which could at least circumvent the error you are reporting:

http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279

It affects only one file, so if you don't want to pull the latest
from GitHub, you can download just that file and replace it in your
Biopython library:

http://github.com/biopython/biopython/blob/master/Bio/Entrez/Parser.py

Ideally, we should have a test case to cover this. Could you let us
know specific GIs that are causing the problem? The group of 20 is
fine if you haven't narrowed it further than that. This'll also help
us check if there are any other problems with these records.
Thanks for reporting this,
Brad

From mjldehoon at yahoo.com Wed Oct 7 08:19:01 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 7 Oct 2009 05:19:01 -0700 (PDT)
Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary() results
In-Reply-To: <20091007111737.GC84267@sobchak.mgh.harvard.edu>
Message-ID: <826538.32828.qm@web62406.mail.re1.yahoo.com>

> In addition to Michiel's workaround, I checked in a small
> change which could at least circumvent the error you are
> reporting:
>
> http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279

Sorry, but that change introduces two bugs. First, we should be able
to distinguish between -1 and missing values. More importantly, we
want to be able to add attributes to value. Since -1 is a plain int
rather than an IntegerElement object, it won't allow that.

Can you revert this change?

--Michiel

From chapmanb at 50mail.com Wed Oct 7 08:32:27 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 7 Oct 2009 08:32:27 -0400
Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary() results
In-Reply-To: <826538.32828.qm@web62406.mail.re1.yahoo.com>
References: <20091007111737.GC84267@sobchak.mgh.harvard.edu>
 <826538.32828.qm@web62406.mail.re1.yahoo.com>
Message-ID: <20091007123227.GD84267@sobchak.mgh.harvard.edu>

Peter and Michiel;

> > In addition to Michiel's workaround, I checked in a small
> > change which could at least circumvent the error you are
> > reporting:
> >
> > http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279

Peter:
> Does "correctly" mean a default value?
I see Brad has just commited a change to > use -1 in this case, but perhaps None is also a good choice? Can we > alternatively > leave this bit of the data structure empty? Michiel: > Sorry, but that change introduces two bugs. First, we should be able > to distinguish between -1 and missing values. More importantly, we > want to be able to add attributes to value. Since -1 is an integer > instead of an object, it won't allow that. > > Can you revert this change? Thanks guys -- not the best choice. How do you feel about just passing it along as an empty string and only doing the integer conversion if we actually have data to convert? http://github.com/biopython/biopython/commit/1fff8038e4fa9e2643851a70118e3227ccbea44e So now missing values are empty strings, as passed, instead of any sort of integer interpretation of them. Brad From harekrishna at gmail.com Wed Oct 7 16:11:03 2009 From: harekrishna at gmail.com (Austin Davis-Richardson) Date: Wed, 7 Oct 2009 16:11:03 -0400 Subject: [Biopython] Biopython Digest, Vol 82, Issue 3 In-Reply-To: References: Message-ID: I'm confused now. In the latest version http://github.com/biopython/biopython/commit/1fff8038e4fa9e2643851a70118e3227ccbea44e Missing values are empty strings so if I did something like record = Entrez.read(handle) for item in record: myList.append += item['TaxId'] myList should be something like : [ '1234', '2434', '', '9970' ] where myList[2] is the result of a missing value However, when I run my script. I find no blank spaces despite knowing that there are some that should have missing values. Which screws things up later when I zip tax ID's with their corresponding accession number: zip (accessions, taxids) I'm all for using '1' (root) or '-1' for missing values. 2009/10/7 : > Send Biopython mailing list submissions to > ? ? ? ?biopython at lists.open-bio.org > > To subscribe or unsubscribe via the World Wide Web, visit > ? ? ? 
?http://lists.open-bio.org/mailman/listinfo/biopython > or, via email, send a message with subject or body 'help' to > ? ? ? ?biopython-request at lists.open-bio.org > > You can reach the person managing the list at > ? ? ? ?biopython-owner at lists.open-bio.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Biopython digest..." > > > Today's Topics: > > ? 1. Skipping over blank/erroneous Entrez.esummary() results > ? ? ?(Austin Davis-Richardson) > ? 2. Re: Skipping over blank/erroneous Entrez.esummary() ? ? ? results > ? ? ?(Michiel de Hoon) > ? 3. Re: Combine nexus files but not concatenating them (Peter) > ? 4. Re: Skipping over blank/erroneous Entrez.esummary() ? ? ? results > ? ? ?(Peter) > ? 5. Re: Skipping over blank/erroneous Entrez.esummary() ? ? ? results > ? ? ?(Brad Chapman) > ? 6. Re: Skipping over blank/erroneous Entrez.esummary() ? ? ? results > ? ? ?(Michiel de Hoon) > ? 7. Re: Skipping over blank/erroneous Entrez.esummary() ? ? ? results > ? ? ?(Brad Chapman) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Tue, 6 Oct 2009 17:07:52 -0400 > From: Austin Davis-Richardson > Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary() > ? ? ? ?results > To: biopython at lists.open-bio.org > Message-ID: > ? ? ? ? > Content-Type: text/plain; charset=ISO-8859-1 > > Howdy, > > I'm using BioPython to generate a table of accession numbers and their > corresponding TaxIDs. ?The fastest way I can do this is 20 at a time > (20 per 3 seconds rather than 1 per 3 seconds). > > However, this results in a problem. > > whenever my script receives a result from NCBI that is blank such as > there being no value for TaxID, BioPython crashes with the error: > > ?File "taxcollector3.py", line 39, in getTaxID > ? 
?record = Entrez.read(handle) > ?File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py", > line 259, in read > ? ?record = handler.run(handle) > ?File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", > line 90, in run > ? ?self.parser.ParseFile(handle) > ?File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", > line 191, in endElement > ? ?value = IntegerElement(value) > ValueError: invalid literal for int() with base 10: '' > > > my code looks like this: ?Where gids is a string of comma-separated GIDs > (I get the GIDs from the accession numbers using > eEntrez.esearch(db="nucleotide", rettype="text", term=accessions)) > > ? ? ? ? ? ? ? ? ? ? ? ?handle = Entrez.esummary(db="nucleotide", id=gids) > ? ? ? ? ? ? ? ? ? ? ? ?record = Entrez.read(handle) > > > The only solution I can come up with is searching one at a time, but > this is very slow. ?(I have about 300,000 accession numbers) > > Does anyone know perhaps a patch or a solution for this? ?Or maybe an > easier way to get a TaxID from an accession number? > > Thanks, > Austin Davis-Richardson > > > ------------------------------ > > Message: 2 > Date: Tue, 6 Oct 2009 19:11:36 -0700 (PDT) > From: Michiel de Hoon > Subject: Re: [Biopython] Skipping over blank/erroneous > ? ? ? ?Entrez.esummary() ? ? ? results > To: biopython at lists.open-bio.org, ? ? ? Austin Davis-Richardson > ? ? ? ? > Message-ID: <362834.37683.qm at web62401.mail.re1.yahoo.com> > Content-Type: text/plain; charset=iso-8859-1 > > You could try the following (with biopython 1.52): > > handle = Entrez.esummary(db="nucleotide", id=gids) > records = Entrez.parse(handle) > while True: > ? ?try: > ? ? ? ?record = records.next() > ? ?except StopIteration: > ? ? ? ?break > ? ?except: > ? ? ? ?print "Skipping record" > > > We should probably modify Bio.Entrez so that empty "integer" values are treated correctly. 
> > > --Michiel. > > --- On Tue, 10/6/09, Austin Davis-Richardson wrote: > >> From: Austin Davis-Richardson >> Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary() results >> To: biopython at lists.open-bio.org >> Date: Tuesday, October 6, 2009, 5:07 PM >> Howdy, >> >> I'm using BioPython to generate a table of accession >> numbers and their >> corresponding TaxIDs.? The fastest way I can do this >> is 20 at a time >> (20 per 3 seconds rather than 1 per 3 seconds). >> >> However, this results in a problem. >> >> whenever my script receives a result from NCBI that is >> blank such as >> there being no value for TaxID, BioPython crashes with the >> error: >> >> ? File "taxcollector3.py", line 39, in getTaxID >> ? ? record = Entrez.read(handle) >> ? File >> "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py", >> line 259, in read >> ? ? record = handler.run(handle) >> ? File >> "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", >> line 90, in run >> ? ? self.parser.ParseFile(handle) >> ? File >> "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", >> line 191, in endElement >> ? ? value = IntegerElement(value) >> ValueError: invalid literal for int() with base 10: '' >> >> >> my code looks like this:? Where gids is a string of >> comma-separated GIDs >> (I get the GIDs from the accession numbers using >> eEntrez.esearch(db="nucleotide", rettype="text", >> term=accessions)) >> >> ??? ??? ??? >> handle = Entrez.esummary(db="nucleotide", id=gids) >> ??? ??? ??? >> record = Entrez.read(handle) >> >> >> The only solution I can come up with is searching one at a >> time, but >> this is very slow.? (I have about 300,000 accession >> numbers) >> >> Does anyone know perhaps a patch or a solution for >> this?? Or maybe an >> easier way to get a TaxID from an accession number? 
>> >> Thanks, >> Austin Davis-Richardson >> _______________________________________________ >> Biopython mailing list? -? Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > > > > > ------------------------------ > > Message: 3 > Date: Wed, 7 Oct 2009 10:29:36 +0100 > From: Peter > Subject: Re: [Biopython] Combine nexus files but not concatenating > ? ? ? ?them > To: Denzel Li > Cc: Biopython Mailing List > Message-ID: > ? ? ? ?<320fb6e00910070229n1b78542dj82998de13cf7eed7 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > On Wed, Oct 7, 2009 at 4:22 AM, Denzel Li wrote: >> Hi Peter: >> Thank you for the help. Both functions work well. By the way, will >> "standard" datatype or "mixed" datatype be supported in Bio:Nexus:Nexus? >> >> Best, >> Denzel > > Hi Denzel, > > I CC'd the list - please try and keep replies send there. > > I'm glad Bio.Nexus is working well for you. > > Regarding the finer details of the NEXUS file format and the Biopython > code, I am not an expert - we need Frank or Cymon to comment. If > you could give us a couple of examples of what you are asking for it > would probably be much clearer (to me at least). > > Regards, > > Peter > > > ------------------------------ > > Message: 4 > Date: Wed, 7 Oct 2009 12:17:30 +0100 > From: Peter > Subject: Re: [Biopython] Skipping over blank/erroneous > ? ? ? ?Entrez.esummary() ? ? ? results > To: Michiel de Hoon > Cc: biopython at lists.open-bio.org, ? ? ? Austin Davis-Richardson > ? ? ? ? > Message-ID: > ? ? ? ?<320fb6e00910070417w26236a62ifece2e2610256609 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > On Wed, Oct 7, 2009 at 3:11 AM, Michiel de Hoon wrote: >> >> We should probably modify Bio.Entrez so that empty "integer" values are treated correctly. >> > > Does "correctly" mean a default value? I see Brad has just commited a change to > use -1 in this case, but perhaps None is also a good choice? 
Can we > alternatively > leave this bit of the data structure empty? > > Peter > > > ------------------------------ > > Message: 5 > Date: Wed, 7 Oct 2009 07:17:37 -0400 > From: Brad Chapman > Subject: Re: [Biopython] Skipping over blank/erroneous > ? ? ? ?Entrez.esummary() ? ? ? results > To: Austin Davis-Richardson > Cc: biopython at lists.open-bio.org > Message-ID: <20091007111737.GC84267 at sobchak.mgh.harvard.edu> > Content-Type: text/plain; charset=us-ascii > > Hi Austin; > >> I'm using BioPython to generate a table of accession numbers and their >> corresponding TaxIDs. ?The fastest way I can do this is 20 at a time >> (20 per 3 seconds rather than 1 per 3 seconds). >> >> However, this results in a problem. >> >> whenever my script receives a result from NCBI that is blank such as >> there being no value for TaxID, BioPython crashes with the error: >> >> ? File "taxcollector3.py", line 39, in getTaxID >> ? ? record = Entrez.read(handle) >> ? File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py", >> line 259, in read >> ? ? record = handler.run(handle) >> ? File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", >> line 90, in run >> ? ? self.parser.ParseFile(handle) >> ? File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", >> line 191, in endElement >> ? ? 
value = IntegerElement(value) >> ValueError: invalid literal for int() with base 10: '' > > In addition to Michiel's workaround, I checked in a small change > which could at least circumvent the error you are reporting: > > http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279 > > It affects only one file, so if you don't want to pull the latest > from GitHub, you can download just that file and replace it in your > Biopython library: > > http://github.com/biopython/biopython/blob/master/Bio/Entrez/Parser.py > > Ideally, we should have a test case to cover this. Could you let us > know specific GIs that are causing the problem? The group of 20 is > fine if you haven't narrowed it further than that. This'll also help > us check if there are any other problems with these records. > > Thanks for reporting this, > Brad > > > ------------------------------ > > Message: 6 > Date: Wed, 7 Oct 2009 05:19:01 -0700 (PDT) > From: Michiel de Hoon > Subject: Re: [Biopython] Skipping over blank/erroneous > ? ? ? ?Entrez.esummary() ? ? ? results > To: Austin Davis-Richardson , ? ?Brad Chapman > ? ? ? ? > Cc: biopython at lists.open-bio.org > Message-ID: <826538.32828.qm at web62406.mail.re1.yahoo.com> > Content-Type: text/plain; charset=iso-8859-1 > >> In addition to Michiel's workaround, I checked in a small >> change >> which could at least circumvent the error you are >> reporting: >> >> http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279 > > Sorry, but that change introduces two bugs. First, we should be able to distinguish between -1 and missing values. More importantly, we want to be able to add attributes to value. Since -1 is an integer instead of an object, it won't allow that. > > Can you revert this change? 
> > --Michiel > > --- On Wed, 10/7/09, Brad Chapman wrote: > >> From: Brad Chapman >> Subject: Re: [Biopython] Skipping over blank/erroneous Entrez.esummary() results >> To: "Austin Davis-Richardson" >> Cc: biopython at lists.open-bio.org >> Date: Wednesday, October 7, 2009, 7:17 AM >> Hi Austin; >> >> > I'm using BioPython to generate a table of accession >> numbers and their >> > corresponding TaxIDs.? The fastest way I can do >> this is 20 at a time >> > (20 per 3 seconds rather than 1 per 3 seconds). >> > >> > However, this results in a problem. >> > >> > whenever my script receives a result from NCBI that is >> blank such as >> > there being no value for TaxID, BioPython crashes with >> the error: >> > >> >???File "taxcollector3.py", line 39, in >> getTaxID >> >? ???record = Entrez.read(handle) >> >???File >> "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py", >> > line 259, in read >> >? ???record = handler.run(handle) >> >???File >> "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", >> > line 90, in run >> >? ???self.parser.ParseFile(handle) >> >???File >> "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", >> > line 191, in endElement >> >? ???value = IntegerElement(value) >> > ValueError: invalid literal for int() with base 10: >> '' >> >> In addition to Michiel's workaround, I checked in a small >> change >> which could at least circumvent the error you are >> reporting: >> >> http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279 >> >> It affects only one file, so if you don't want to pull the >> latest >> from GitHub, you can download just that file and replace it >> in your >> Biopython library: >> >> http://github.com/biopython/biopython/blob/master/Bio/Entrez/Parser.py >> >> Ideally, we should have a test case to cover this. 
Could >> you let us >> know specific GIs that are causing the problem? The group >> of 20 is >> fine if you haven't narrowed it further than that. This'll >> also help >> us check if there are any other problems with these >> records. >> >> Thanks for reporting this, >> Brad >> _______________________________________________ >> Biopython mailing list? -? Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > > > > > ------------------------------ > > Message: 7 > Date: Wed, 7 Oct 2009 08:32:27 -0400 > From: Brad Chapman > Subject: Re: [Biopython] Skipping over blank/erroneous > ? ? ? ?Entrez.esummary() ? ? ? results > To: Michiel de Hoon > Cc: Austin Davis-Richardson , > ? ? ? ?biopython at lists.open-bio.org > Message-ID: <20091007123227.GD84267 at sobchak.mgh.harvard.edu> > Content-Type: text/plain; charset=us-ascii > > Peter and Michiel; > >> > In addition to Michiel's workaround, I checked in a small >> > change which could at least circumvent the error you are >> > reporting: >> > >> > http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279 > > Peter: >> Does "correctly" mean a default value? I see Brad has just commited a change to >> use -1 in this case, but perhaps None is also a good choice? Can we >> alternatively >> leave this bit of the data structure empty? > > Michiel: >> Sorry, but that change introduces two bugs. First, we should be able >> to distinguish between -1 and missing values. More importantly, we >> want to be able to add attributes to value. Since -1 is an integer >> instead of an object, it won't allow that. >> >> Can you revert this change? > > Thanks guys -- not the best choice. How do you feel about just passing > it along as an empty string and only doing the integer conversion if we > actually have data to convert? 
>
> http://github.com/biopython/biopython/commit/1fff8038e4fa9e2643851a70118e3227ccbea44e
>
> So now missing values are empty strings, as passed, instead of any
> sort of integer interpretation of them.
>
> Brad
>
>
> ------------------------------
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>
> End of Biopython Digest, Vol 82, Issue 3
> ****************************************
>

--
AGDR

From chapmanb at 50mail.com Wed Oct 7 16:29:11 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 7 Oct 2009 16:29:11 -0400
Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary()
In-Reply-To:
References:
Message-ID: <20091007202911.GI92415@sobchak.mgh.harvard.edu>

Hi Austin;
That is strange. That change may have unintended consequences
downstream. Could you send along a GI number that is causing
problems? If you revert that change and run the code printing out GI
numbers at each step, let me know the specific ones that are leading
to the initial error. Once we have something reproducible to work
with, we should be able to track it down and provide a fix.

Thanks,
Brad

> I'm confused now. In the latest version
>
> http://github.com/biopython/biopython/commit/1fff8038e4fa9e2643851a70118e3227ccbea44e
>
> missing values are empty strings, so if I did something like
>
> record = Entrez.read(handle)
>
> for item in record:
>     myList.append(item['TaxId'])
>
> myList should be something like:
> [ '1234', '2434', '', '9970' ]
> where myList[2] is the result of a missing value.
>
> However, when I run my script, I find no blank entries despite knowing
> that some records should have missing values. This screws things up
> later when I zip the tax IDs with their corresponding accession numbers:
>
> zip(accessions, taxids)
>
> I'm all for using '1' (root) or '-1' for missing values.
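For the zip problem described above, here is a minimal sketch of one way to keep accessions and tax IDs aligned even when a TaxId comes back empty or a record is skipped: key the results by GI instead of relying on list position. It assumes each esummary record behaves like a dict with 'Id' and 'TaxId' entries. The helper name taxids_by_gi and the sample records are hypothetical stand-ins for the output of Entrez.read(handle), not part of Biopython.

```python
def taxids_by_gi(records, requested_gis):
    """Pair every requested GI with its TaxId, or None when it is missing.

    `records` is assumed to be an iterable of dict-like esummary results
    carrying 'Id' and 'TaxId' keys (an assumption, not a Biopython guarantee).
    """
    found = {}
    for record in records:
        raw = str(record.get("TaxId", "")).strip()
        # Only convert when there is data; an empty string means "missing".
        found[str(record["Id"])] = int(raw) if raw else None
    # Walk the GIs we asked for, so skipped records cannot shift positions.
    return [(gi, found.get(gi)) for gi in requested_gis]


# Hypothetical sample data standing in for Entrez.read(handle).
records = [
    {"Id": "111", "TaxId": "1234"},
    {"Id": "222", "TaxId": ""},      # record returned with a blank TaxId
    {"Id": "444", "TaxId": "9970"},  # GI 333 was skipped entirely
]
pairs = taxids_by_gi(records, ["111", "222", "333", "444"])
for gi, taxid in pairs:
    print(gi, taxid)
```

Using None rather than -1 or 1 keeps "missing" distinguishable from any real tax ID. Whether that suits a given downstream table is a design choice for the script, not something the thread settles.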
>
> --
> AGDR

From denzel.dz.li at gmail.com Wed Oct 7 19:23:17 2009
From: denzel.dz.li at gmail.com (Denzel Li)
Date: Wed, 7 Oct 2009 19:23:17 -0400
Subject: [Biopython] Combine nexus files but not concatenating them
In-Reply-To: <320fb6e00910070229n1b78542dj82998de13cf7eed7@mail.gmail.com>
References: <320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com> <320fb6e00910051331u2f27c7ecmde744d29fb3ba2ab@mail.gmail.com> <320fb6e00910070229n1b78542dj82998de13cf7eed7@mail.gmail.com>
Message-ID:

Hi Peter:
Regarding the Nexus datatype supported in Bio:Nexus:Nexus, I mean nexus like
the following, where the datatype is a mix of "standard" and "DNA".
According to the function Bio:Nexus:Nexus._format (line 696), these
datatypes are not supported yet. I am just wondering whether the team
plans to support these data types.

------------
#NEXUS
Begin data;
Dimensions ntax=4 nchar=1000;
Format datatype=mixed(Standard:1-5,DNA:6-1000) interleave=yes gap=- missing=?;
Matrix
[morphology]
s1 10010
s2 20011
s3 20010
s4 10020

[Gene 1]
s1 ACGT
s2 AAGT
s3 ACGA
s4 ACGT
...
;
end;
---------------

Best,
Denzel

On Wed, Oct 7, 2009 at 5:29 AM, Peter wrote:
> On Wed, Oct 7, 2009 at 4:22 AM, Denzel Li wrote:
> > Hi Peter:
> > Thank you for the help. Both functions work well. By the way, will
> > "standard" datatype or "mixed" datatype be supported in Bio:Nexus:Nexus?
> >
> > Best,
> > Denzel
>
> Hi Denzel,
>
> I CC'd the list - please try to keep replies going there.
>
> I'm glad Bio.Nexus is working well for you.
>
> Regarding the finer details of the NEXUS file format and the Biopython
> code, I am not an expert - we need Frank or Cymon to comment. If
> you could give us a couple of examples of what you are asking for it
> would probably be much clearer (to me at least).
>
> Regards,
>
> Peter
>

From biopython at maubp.freeserve.co.uk Thu Oct 8 04:54:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 8 Oct 2009 09:54:39 +0100
Subject: [Biopython] Combine nexus files but not concatenating them
In-Reply-To:
References: <320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com> <320fb6e00910051331u2f27c7ecmde744d29fb3ba2ab@mail.gmail.com> <320fb6e00910070229n1b78542dj82998de13cf7eed7@mail.gmail.com>
Message-ID: <320fb6e00910080154m2e75a38eh7e120e807103773b@mail.gmail.com>

On Thu, Oct 8, 2009 at 12:23 AM, Denzel Li wrote:
> Hi Peter:
> Regarding the Nexus datatype supported in Bio:Nexus:Nexus, I mean nexus like
> the following, where the datatype is a mix of "standard" and "DNA".
> According to the function Bio:Nexus:Nexus._format (line 696), these
> datatypes are not supported yet. I am just wondering whether the team
> plans to support these data types.

Oh right - in your example, the digits encode morphology, but they could
also be phenotypes, or some other characteristic like gene copy number.

As to Bio.Nexus supporting this, hopefully Frank or Cymon can comment.

If Bio.Nexus did support this, then from the Bio.AlignIO point of view,
with the current object structure we'd have to use a sequence object
(holding both the digits, and the DNA) for the sequence strings (e.g. for
s1 in your example, Seq("10010ACGT")) with a generic single letter
alphabet. This would lose the fact that the first five characters are
digits, but the rest are DNA. This isn't ideal, and would probably cause
trouble for Nexus output (writing such alignments).

Would you want to try and deal with such "mixed" alignments via the
Bio.AlignIO interface?

Peter

From ibdeno at gmail.com Mon Oct 12 04:11:38 2009
From: ibdeno at gmail.com (Miguel Ortiz Lombardia)
Date: Mon, 12 Oct 2009 10:11:38 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
Message-ID: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>

Dear list members,

I have a problem with NCBIStandalone.PSIBlastParser, which I need to use
instead of NCBIXML since the latter one lacks some record properties that I
need.

My code used to work until recently (say three months) and now it seems
something has changed in the latest biopython (1.52-1, I install it on an
intel OSX 10.5.8 via fink). I get the same problem irrespective of whether
I use python 2.5 or 2.6.

Here follows the relevant part of the code:

####
blast_out, error_info = NCBIStandalone.blastpgp(
    blastcmd='/usr/local/blast-2.2.18/bin/blastpgp',
    database='/opt/BlastDBs/' + db,
    infile=file,
    npasses=passes,
    program='blastpgp',
    descriptions='500',
    alignments='1000',
    align_view='0',
    matrix_outfile=outbase + '.' + db + '.' + str(passes) + '.pssm')
b_parser = NCBIStandalone.PSIBlastParser()
b_record = b_parser.parse(blast_out)
####

And this is the error that I now get:

####
  File "/Users/mol/bin/lpbl.py", line 64, in doblast
    b_record = b_parser.parse(blast_out)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 777, in parse
    self._scanner.feed(handle, self._consumer)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 97, in feed
    self._scan_rounds(uhandle, consumer)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 234, in _scan_rounds
    self._scan_alignments(uhandle, consumer)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 376, in _scan_alignments
    self._scan_pairwise_alignments(uhandle, consumer)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 386, in _scan_pairwise_alignments
    self._scan_one_pairwise_alignment(uhandle, consumer)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 398, in _scan_one_pairwise_alignment
    self._scan_hsp(uhandle, consumer)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 433, in _scan_hsp
    self._scan_hsp_alignment(uhandle, consumer)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 464, in _scan_hsp_alignment
    read_and_call(uhandle, consumer.query, start='Query')
  File "/sw/lib/python2.6/site-packages/Bio/ParserSupport.py", line 303, in read_and_call
    method(line)
  File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 1138, in query
    raise ValueError("I could not find the query in line\n%s" % line)
ValueError: I could not find the query in line
Query: 0 -
####

Now, the interesting thing is that if I run blastpgp directly and capture
the output in a file, this file never includes such a line as:

Query: 0 -

Actually, if I modify my code so it reads this output file, the
PSIBlastParser processes it without error.
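A minimal sketch of why a line such as "Query: 0 -" is rejected, assuming the pattern in use is the one quoted from NCBIStandalone just below in this message. The pattern requires a start coordinate, some sequence text and an end coordinate, so a line with no trailing number cannot match. The function name parse_query_line and the sample lines are hypothetical and only there for illustration.

```python
import re

# Pattern as quoted in this thread from Bio.Blast.NCBIStandalone (Biopython 1.52).
query_re = re.compile(r"Query(:?) \s*(\d+)\s*(.+) (\d+)")


def parse_query_line(line):
    """Return (start, sequence, end) for a query line, or None if no match."""
    m = query_re.search(line)
    if m is None:
        return None
    _colon, start, seq, end = m.groups()
    return int(start), seq, int(end)


# A well-formed alignment line (made-up sequence) is accepted ...
good = parse_query_line("Query: 1    MKVLAAGIVG 10")
# ... but the line from the traceback has no end coordinate, so it is not.
bad = parse_query_line("Query: 0 -")
print(good, bad)
```

This only shows that the reported line cannot satisfy the pattern. It does not explain where that line comes from when blastpgp is run through the wrapper.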
I have found that something may have changed in NCBIStandalone recently,
namely, this bit:

    _query_re = re.compile(r"Query(:?) \s*(\d+)\s*(.+) (\d+)")
    def query(self, line):
        m = self._query_re.search(line)
        if m is None:
            raise ValueError("I could not find the query in line\n%s" % line)

Does anyone have a clue?

Thank you!

--
Miguel

From biopython at maubp.freeserve.co.uk Mon Oct 12 05:19:33 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 12 Oct 2009 10:19:33 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
Message-ID: <320fb6e00910120219g46a85467ia9fe30131380d932@mail.gmail.com>

On Mon, Oct 12, 2009 at 9:11 AM, Miguel Ortiz Lombardia wrote:
> Dear list members,
>
> I have a problem with NCBIStandalone.PSIBlastParser, which I need to use
> instead of NCBIXML since the latter one lacks some record properties that I
> need.
>
> My code used to work until recently (say three months) and now it seems
> something has changed in the latest biopython (1.52-1, I install it on an
> intel OSX 10.5.8 via fink). I get the same problem irrespective of whether
> I use python 2.5 or 2.6.

You definitely didn't upgrade your copy of BLAST at the same time?

Could you file a bug please? Then run PSI-BLAST "by hand" and record
the plain text output to a file, and upload the file to Bugzilla. Note
you have to file the bug before it will let you upload a file. Having the
XML output could be helpful too.

http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython

Also, can we use your BLAST file as a unit test?

Thanks

Peter

From krother at rubor.de Mon Oct 12 07:44:10 2009
From: krother at rubor.de (Kristian Rother)
Date: Mon, 12 Oct 2009 13:44:10 +0200
Subject: [Biopython] RuPy 2009 Bioinformatics Satellite 6.11.
 in Poznan, Poland
Message-ID: <1c64f5fbb09ada1aae8207d5c7d737a8-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWl9aRF9dXQg=-webmailer2@server01.webmailer.hosteurope.de>

Hi,

As some of you may know, the RuPy (Ruby/Python) conference takes place
this year on November 7th-8th in Poznan, Poland.
--> see: http://rupy.eu

I am happy to announce that we will have a small satellite meeting to the
RuPy conference dedicated to structural bioinformatics. Please feel free
to join - everybody is welcome.

Date: November 6th
Time: 13:00
Place: Collegium Biologicum - right next to the main conference
Room: 1.126 (1st floor at the very end of the building)

Tentative programme:
- Lightning talks (enrolment on-site)
- Code gallery
- Space for hands-on work on modules of interest, e.g.:
  * Bio.PDB
  * Bio.RNA
  * django.*
  * moderna.*
  * ...

Total duration: 3-4 hours.

Best regards,
Kristian Rother

Laboratory of structural bioinformatics, UAM
http://bioinformatics.amu.edu.pl/index_.html

From biopython at maubp.freeserve.co.uk Tue Oct 13 07:10:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 13 Oct 2009 12:10:06 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
Message-ID: <320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com>

On Mon, Oct 12, 2009 at 9:11 AM, Miguel Ortiz Lombardia wrote:
> Dear list members,
>
> I have a problem with NCBIStandalone.PSIBlastParser, which I need to use
> instead of NCBIXML since the latter one lacks some record properties that I
> need.
>
> My code used to work until recently (say three months) and now it seems
> something has changed in the latest biopython (1.52-1, I install it on an
> intel OSX 10.5.8 via fink). I get the same problem irrespective of whether
> I use python 2.5 or 2.6.

Thanks for filing the bug, and supplying the example output files.
http://bugzilla.open-bio.org/show_bug.cgi?id=2927 Do you remember what version of Biopython you used to be running before updating to 1.52? This would help to narrow down the change triggering this problem. In the mean time, I have tried parsing your sample output, and it seems fine: from Bio.Blast.NCBIStandalone import PSIBlastParser b_parser = PSIBlastParser() handle = open("Q3V4Q0.psiblast.txt") b_record = b_parser.parse(handle) handle.close() for b_round in b_record.rounds : print "Round %i has %i alignments" \ % (b_round.number, len(b_round.alignments)) Gives: Round 1 has 385 alignments Round 2 has 1000 alignments Round 3 has 1000 alignments Round 4 has 1000 alignments Round 5 has 1000 alignments So, if the file parser is fine, then my guess is this is something to do with how we are running PSI-BLAST via NCBIStandalone.blastpgp - and this code has changed in recent releases. It used to use the python function os.popen3 but this was deprecated in Python 2.6 and we now use the subprocess library. It is also possible that the command line options you used when running BLAST by hand to supply me the example output differed from what was used in your Python script. What exactly did you type at the command line to make the example output you sent me? I'd like to double check the Python code is using the same thing... Peter From biopython at maubp.freeserve.co.uk Tue Oct 13 07:41:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 13 Oct 2009 12:41:58 +0100 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com> <320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com> Message-ID: <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com> Don't forget to CC the mailing list ;) On Tue, Oct 13, 2009 at 12:22 PM, Miguel Ortiz Lombardia wrote: > > > Le 13 oct. 09 ? 
13:10, Peter a écrit :
>
>> On Mon, Oct 12, 2009 at 9:11 AM, Miguel Ortiz Lombardia
>> wrote:
>>>
>>> Dear list members,
>>>
>>> I have a problem with NCBIStandalone.PSIBlastParser, which I need to use
>>> instead of NCBIXML since the latter one lacks some record properties that
>>> I need.
>>>
>>> My code used to work until recently (say three months) and now it seems
>>> something has changed in the latest biopython (1.52-1, I install it on an
>>> intel OSX 10.5.8 via fink). I get the same problem irrespectively of
>>> whether I use python 2.5 or 2.6.
>>
>> Thanks for filing the bug, and supplying the example output files.
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>>
>> Do you remember what version of Biopython you used to be running
>> before updating to 1.52? This would help to narrow down the change
>> triggering this problem.
>>
>
> Sorry, can't tell it for sure, but it was whatever version was current in
> March 2009.

Probably Biopython 1.49 then. That may help.

>> In the mean time, I have tried parsing your sample output, and it seems
>> fine:
>>
>> from Bio.Blast.NCBIStandalone import PSIBlastParser
>> b_parser = PSIBlastParser()
>> handle = open("Q3V4Q0.psiblast.txt")
>> b_record = b_parser.parse(handle)
>> handle.close()
>> for b_round in b_record.rounds :
>>     print "Round %i has %i alignments" \
>>           % (b_round.number, len(b_round.alignments))
>>
>>
>> Gives:
>>
>> Round 1 has 385 alignments
>> Round 2 has 1000 alignments
>> Round 3 has 1000 alignments
>> Round 4 has 1000 alignments
>> Round 5 has 1000 alignments
>>
>
> Yes, that's also what I see with my code: text files can be parsed.

OK - good. So it doesn't look like a parser bug.

>> So, if the file parser is fine, then my guess is this is something to do
>> with
>> how we are running PSI-BLAST via NCBIStandalone.blastpgp - and this
>> code has changed in recent releases.
It used to use the python function
>> os.popen3 but this was deprecated in Python 2.6 and we now use the
>> subprocess library.
>
> I think this is the most likely explanation.
>
>> It is also possible that the command line options you used when running
>> BLAST by hand to supply me the example output differed from what
>> was used in your Python script.
>
> I don't think so, I just used the same command line that was launched from
> the python script (got it from a 'ps' command)

Great :)

>> What exactly did you type at the command line to make the example
>> output you sent me? I'd like to double check the Python code is using
>> the same thing...
>
> For plain text output:
>
> /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/uniref100 -i
> Q3V4Q0.fasta -m 0 -v 500 -b 1000 -Q Q3V4Q0.uniref100.5.pssm -j 5 -p blastpgp
>> Q3V4Q0.psiblast.log
>
> For XML:
>
> /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/uniref100 -i
> Q3V4Q0.fasta -m 7 -v 500 -b 1000 -Q Q3V4Q0.uniref100.5.pssm -j 5 -p blastpgp
>> Q3V4Q0.psiblast.xml.log

Because you capture the stdout to a file (rather than using the -o option),
the output files should be identical to those obtained by the python script.

I would need to install the same BLAST database etc in order to try and
debug this on my own machine, which is a hassle. So I'll try and ask you
to test a few things instead.

Could you try changing this line:

    blast_out, error_info = NCBIStandalone.blastpgp(...)

to this:

    temp_handle, error_info = NCBIStandalone.blastpgp(...)
    from StringIO import StringIO
    blast_out = StringIO(temp_handle.read())
    temp_handle.close()

This will try to read in all the BLAST output (all 5MB of it) into memory
as a string, and turn it into a StringIO handle which the parser should
accept.

You could also try explicitly saving to a file:

    temp_handle, error_info = NCBIStandalone.blastpgp(...)
    temp_file = open("temp.txt", "w")
    temp_file.write(temp_handle.read())
    temp_file.close()
    temp_handle.close()
    blast_out = open("temp.txt")

or, perhaps:

    temp_handle, error_info = NCBIStandalone.blastpgp(...)
    temp_file = open("temp.txt", "w")
    for line in temp_handle :
        temp_file.write(line)
    temp_file.close()
    temp_handle.close()
    blast_out = open("temp.txt")

It would not surprise me to see these fail as before, but having a look at
the temp.txt file could be very instructive (especially if it contains that
odd query line you mentioned earlier).

I know that the Python subprocess module can have problems with
deadlocks when dealing with large amounts of piped data. There are
ways to cope, but the simplest option is to tell BLAST to save the data
to a file (instead of stdout) with the -o command line option. This avoids
sending large amounts of data via the stdout pipe. I can explain how to
do this within Biopython if you like (this email is already very long).

Peter

From biopython at maubp.freeserve.co.uk Tue Oct 13 07:46:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 13 Oct 2009 12:46:27 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com>
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
Message-ID: <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>

On Tue, Oct 13, 2009 at 12:41 PM, Peter wrote:
>>>
>>> Do you remember what version of Biopython you used to be running
>>> before updating to 1.52? This would help to narrow down the change
>>> triggering this problem.
>>>
>>
>> Sorry, can't tell it for sure, but it was whatever version was current in
>> March 2009.
>
> Probably Biopython 1.49 then. That may help.
> Hmm - the switch to using subprocess (on Python 2.4+ or later) was made in October 2008, and would have first appeared in Biopython 1.49. Maybe you were using Biopython 1.48 before - or the issue is something else. Peter From ibdeno at gmail.com Tue Oct 13 07:58:23 2009 From: ibdeno at gmail.com (Miguel Ortiz Lombardia) Date: Tue, 13 Oct 2009 13:58:23 +0200 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com> References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com> <320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com> <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com> <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com> Message-ID: >>>> >>>> Do you remember what version of Biopython you used to be running >>>> before updating to 1.52? This would help to narrow down the change >>>> triggering this problem. >>>> >>> >>> Sorry, can't tell it for sure, but it was whatever version was >>> current in >>> March 2009. >> >> Probably Biopython 1.49 then. That may help. >> > > Hmm - the switch to using subprocess (on Python 2.4+ or later) was > made > in October 2008, and would have first appeared in Biopython 1.49. > Maybe > you were using Biopython 1.48 before - or the issue is something else. > > Peter It may well have been 1.48... Having a closer look at the files from my last successful runs I discover the actually come from November 2008... I'm now running the tests you suggested. Sorry not to have copied the list in the previous post! 
Best, -- Miguel From biopython at maubp.freeserve.co.uk Tue Oct 13 09:36:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 13 Oct 2009 14:36:44 +0100 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com> <320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com> <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com> <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com> Message-ID: <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com> On Tue, Oct 13, 2009 at 12:58 PM, Miguel Ortiz Lombardia wrote: >> >> Hmm - the switch to using subprocess (on Python 2.4+ or later) was made >> in October 2008, and would have first appeared in Biopython 1.49. Maybe >> you were using Biopython 1.48 before - or the issue is something else. >> >> Peter > > > It may well have been 1.48... Having a closer look at the files from my last > successful runs I discover the actually come from November 2008... > > I'm now running the tests you suggested. Let me know what they show. How long do these BLAST runs take? Perhaps I was ambitious with the number of suggestions to try ;) Assuming the problem is with how we are calling the BLAST tool via the subprocess module, I have two suggested fixes in mind. The first is a change to the _invoke_blast() function in Bio/Blast/NCBIStandalone.py, essentially replace these lines: blast_process.stdin.close() return blast_process.stdout, blast_process.stderr With this: stdout, stderr = blast_process.communicate() from StringIO import StringIO return StringIO(stdout), StringIO(stderr) We had to make a similar change to Bio.Clustalw for Bug 2804. This uses subprocess to buffer the data in order to avoid any deadlock reading from the handles. I hadn't made this change before as it imposes a memory overhead (and BLAST output is often *very* large, especially as XML), and until now there hadn't been any problems reported. 
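
The communicate() pattern just described can be tried on its own, outside Biopython. What follows is only a sketch under stated assumptions: it launches a small stand-in Python child process rather than blastpgp, the helper name run_and_buffer is hypothetical, and it is not the code in Bio/Blast/NCBIStandalone.py.

```python
import subprocess
import sys

try:
    from StringIO import StringIO  # Python 2, as used in this thread
except ImportError:
    from io import StringIO  # Python 3


def run_and_buffer(cmd_args):
    """Run a command to completion and return (stdout, stderr) as handles.

    Hypothetical helper for illustration; the whole output is held in memory.
    """
    child = subprocess.Popen(
        cmd_args,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,  # return text rather than bytes
    )
    # communicate() closes stdin and reads both output pipes until the child
    # exits, so the child cannot stall on a full pipe that nobody is reading.
    out_text, err_text = child.communicate()
    return StringIO(out_text), StringIO(err_text)


# Stand-in child (NOT BLAST): many lines on stdout, one line on stderr.
demo_cmd = [
    sys.executable,
    "-c",
    "import sys\n"
    "for i in range(50000): sys.stdout.write('line %i\\n' % i)\n"
    "sys.stderr.write('done\\n')",
]
out_handle, err_handle = run_and_buffer(demo_cmd)
line_count = sum(1 for _ in out_handle)
err_text = err_handle.read()
```

The returned objects behave like the file handles a parser expects, at the cost of buffering everything first, which is the memory overhead mentioned above.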
That change would be worth trying in your situation (even just to confirm
the source of the error), but I don't think we should make it for the
official distribution.

The second option (which I mentioned before) is to tell blastpgp to write
its output directly to a file, and then parse the file. This is how I normally
run large BLAST jobs. This is possible but not elegant via the function
Bio.Blast.NCBIStandalone.blastpgp (which always returns stdout/stderr
handles). Bug 2654 has an example,
http://bugzilla.open-bio.org/show_bug.cgi?id=2654

However, what I want to recommend instead is the more flexible
Bio.Blast.Applications objects (in this case, the class
BlastpgpCommandline). I had planned to update the BLAST chapter
of the Biopython Tutorial to cover this, but it didn't happen in time for
the Biopython 1.52 release. However, the alignment chapter goes
through several examples of this style of command line tool wrapper,
and the BLAST wrappers work in exactly the same way.

Using these "lower level" application wrappers, it is up to you to invoke
subprocess (or another system call) as you see fit (e.g. with pipes).
This is more flexible than the old Bio.Blast.NCBIStandalone.blastpgp
function (and others like it) where the behaviour could not be set.

Feel free to ask for clarification on this - questions now will help for
rewriting the BLAST chapter later on ;)

Regards,

Peter

P.S.
See also http://docs.python.org/library/subprocess.html From ibdeno at gmail.com Tue Oct 13 09:57:13 2009 From: ibdeno at gmail.com (Miguel Ortiz Lombardia) Date: Tue, 13 Oct 2009 15:57:13 +0200 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com> References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com> <320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com> <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com> <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com> <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com> Message-ID: <7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com> Le 13 oct. 09 ? 15:36, Peter a ?crit : > On Tue, Oct 13, 2009 at 12:58 PM, Miguel Ortiz Lombardia > wrote: >>> >>> Hmm - the switch to using subprocess (on Python 2.4+ or later) was >>> made >>> in October 2008, and would have first appeared in Biopython 1.49. >>> Maybe >>> you were using Biopython 1.48 before - or the issue is something >>> else. >>> >>> Peter >> >> >> It may well have been 1.48... Having a closer look at the files >> from my last >> successful runs I discover the actually come from November 2008... >> >> I'm now running the tests you suggested. > > Let me know what they show. How long do these BLAST runs take? > Perhaps I was ambitious with the number of suggestions to try ;) It took long, because I wanted to reproduce the same situation. All the three suggestions you made worked! I have at least a work-around now. > > Assuming the problem is with how we are calling the BLAST tool via the > subprocess module, I have two suggested fixes in mind. 
The first is > a change > to the _invoke_blast() function in Bio/Blast/NCBIStandalone.py, > essentially > replace these lines: > > blast_process.stdin.close() > return blast_process.stdout, blast_process.stderr > > With this: > > stdout, stderr = blast_process.communicate() > from StringIO import StringIO > return StringIO(stdout), StringIO(stderr) > > We had to make a similar change to Bio.Clustalw for Bug 2804. This > uses > subprocess to buffer the data in order to avoid any deadlock reading > from > the handles. I hadn't made this change before as it imposes a memory > overhead (and BLAST output is often *very* large, especially as XML), > and until now there hadn't been any problems reported. It would be > worth > trying in your situation (even just to confirm the source of the > error), but > I don't think we should make this change for the official > distribution. > You're right, probably not justified if I'm the only one with this problem. > The second option (which I mentioned before) is to tell blastpgp to > write > its output directly to a file, and then parse the file. This is how > I normally > run large BLAST jobs. This is possible but not elegant via the > function > Bio.Blast.NCBIStandalone.blastpgp (which always returns stdout/stderr > handles). Bug 2654 has an example, > http://bugzilla.open-bio.org/show_bug.cgi?id=2654 > > However, what I want to recommend instead is to use the more flexible > Bio.Blast.Applications objects instead (in this case, the class > BlastpgpCommandline). I had planed to update the BLAST chapter > of the Biopython Tutorial to cover this, but it didn't happen in > time for > the Biopython 1.52 release. However, the alignment chapter goes > through several examples of this style of command line tool wrapper, > and the BLAST wrappers work in exactly the same way. > > Using these "lower level" application wrappers, it is up to you to > invoke > subprocess (or another system call) as you see fit (e.g. with pipes). 
> This is more flexible than the old Bio.Blast.NCBIStandalone.blastpgp > function (and others like it) where the behaviour could not be set. I will explore this possibility, it seems definitely more elegant than the other one (as in Bug 2654). > > Feel free to ask for clarification on this - questions now will help > for > rewriting the BLAST chapter later on ;) I may come back with questions :-) Thank you very much for your help! Best, -- Miguel From carlos.borroto at gmail.com Tue Oct 13 18:45:13 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Tue, 13 Oct 2009 18:45:13 -0400 Subject: [Biopython] Is there any Entrez Gene parser out there? Message-ID: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com> Do biopython have a parser for Entrez Gene?, Does someone know if there is any python parser for this database at all? I see there is one on Bioperl, but I'll be happy if I can stick to python. regards, -- Carlos Javier Borroto Baltimore, MD Phone: (410) 929 4020 From biopython at maubp.freeserve.co.uk Tue Oct 13 19:18:58 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Oct 2009 00:18:58 +0100 Subject: [Biopython] Is there any Entrez Gene parser out there? In-Reply-To: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com> References: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com> Message-ID: <320fb6e00910131618t2e880c95n7a7e0df6acc31176@mail.gmail.com> On Tue, Oct 13, 2009 at 11:45 PM, Carlos Javier Borroto wrote: > Do biopython have a parser for Entrez Gene?, Does someone know if > there is any python parser for this database at all? The Bio.Entrez.read() should be fine with the XML Entrez Gene data, or try the recently added Bio.Entrez.parse() for large datasets (incremental parsing). 
Peter

From winda002 at student.otago.ac.nz Tue Oct 13 19:37:52 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 14 Oct 2009 12:37:52 +1300
Subject: [Biopython] Is there any Entrez Gene parser out there?
In-Reply-To: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com>
References: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com>
Message-ID: <200910141237.52810.winda002@student.otago.ac.nz>

On Wed, 14 Oct 2009 11:45:13 Carlos Javier Borroto wrote:
> Do biopython have a parser for Entrez Gene?, Does someone know if
> there is any python parser for this database at all?
>
> I see there is one on Bioperl, but I'll be happy if I can stick to python.

Hi Carlos,

I don't have much experience with the Entrez module, so this might not be the
best way (I thought I should reply before you were forced to resort to Perl
;)

If you use Bio.Entrez.esummary() you can get a list of python dictionaries for
a given record. Something like this:

>>> from Bio import Entrez
>>> Entrez.email = "you at someplace"
>>> query = Entrez.esummary(db="gene", id="641535")
>>> record = Entrez.read(query)
>>> record
[{'Mim': [], 'Orgname': 'Tribolium castaneum', 'TaxID': 7070 ...
>>> for field in record:
...     print field["Chromosome"]
LG2

There's also documentation in the tutorial and a related cookbook example on
the wiki:

http://www.biopython.org/wiki/Annotate_Entrez_Gene_IDs

Cheers,
David

From sdavis2 at mail.nih.gov Tue Oct 13 21:04:20 2009
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Tue, 13 Oct 2009 21:04:20 -0400
Subject: [Biopython] Is there any Entrez Gene parser out there?
In-Reply-To: <200910141237.52810.winda002@student.otago.ac.nz> References: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com> <200910141237.52810.winda002@student.otago.ac.nz> Message-ID: <264855a00910131804k28f08c8nca3cd82e1ab8280e@mail.gmail.com> On Tue, Oct 13, 2009 at 7:37 PM, David Winter wrote: > On Wed, 14 Oct 2009 11:45:13 Carlos Javier Borroto wrote: > > Do biopython have a parser for Entrez Gene?, Does someone know if > > there is any python parser for this database at all? > > > > I see there is one on Bioperl, but I'll be happy if I can stick to > python. > > If you like, there are simple tab-delimited files that contain much of the information that you might want: ftp://ftp.ncbi.nih.gov/gene/DATA/ You can push these into sqlite or another RDBMS or just read them into python directly. Sean > Hi Carlos, > > I don't have much experience with the Entrez module, so this might not be > the > best way (I thought I should reply before you where forced to resort to > Perl > ;) > > If you use Bio.Entrez.esummary() you can get a list of python dictionaries > for > a given record. Something like this: > > >>> Entrez.email = "you at someplace" > >>> query = Entrez.esummary(db="gene", id="641535") > >>> record = Entrez.read(query) > >>> record > [{'Mim': [], 'Orgname': 'Tribolium castaneum', 'TaxID': 7070 ... > >>>for field in record: > ... print field["Chromosome"] > LG2 > > There's also documentation in the tutorial and a related cookbook example > on > the wiki: > > http://www.biopython.org/wiki/Annotate_Entrez_Gene_IDs > > Cheers, > David > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From cjfields at illinois.edu Tue Oct 13 20:54:26 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 13 Oct 2009 19:54:26 -0500 Subject: [Biopython] Is there any Entrez Gene parser out there? 
In-Reply-To: <200910141237.52810.winda002@student.otago.ac.nz> References: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com> <200910141237.52810.winda002@student.otago.ac.nz> Message-ID: On Oct 13, 2009, at 6:37 PM, David Winter wrote: > On Wed, 14 Oct 2009 11:45:13 Carlos Javier Borroto wrote: >> Do biopython have a parser for Entrez Gene?, Does someone know if >> there is any python parser for this database at all? >> >> I see there is one on Bioperl, but I'll be happy if I can stick to >> python. > > Hi Carlos, > > I don't have much experience with the Entrez module, so this might > not be the > best way (I thought I should reply before you where forced to resort > to Perl > ;) Alright now, let's not start cross-lang flame wars, there are cross- lang users out there (like me!). chris From winda002 at student.otago.ac.nz Tue Oct 13 22:46:03 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Wed, 14 Oct 2009 15:46:03 +1300 Subject: [Biopython] Is there any Entrez Gene parser out there? In-Reply-To: References: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com> <200910141237.52810.winda002@student.otago.ac.nz> Message-ID: <200910141546.03884.winda002@student.otago.ac.nz> > > I don't have much experience with the Entrez module, so this might > > not be the > > best way (I thought I should reply before you where forced to resort > > to Perl > > ;) > > Alright now, let's not start cross-lang flame wars, there are cross- > lang users out there (like me!). > > chris Sorry Chris, tongue was firmly in cheek there. 
david

From biopython at maubp.freeserve.co.uk Wed Oct 14 08:37:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Oct 2009 13:37:45 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To:
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com>
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
	<320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>
	<320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>
	<7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>
Message-ID: <320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>

On Wed, Oct 14, 2009 at 12:30 PM, Miguel Ortiz Lombardia wrote:
>
> Hi again, Peter.
>
> Well, it turned out that I don't have such work-around... When I launched
> the script as:
>
> nohup lpbl.py ... &
>
> against all my sequences it choked at the first one (quite longer than the
> one I was using as an example) with the very same error.

It would take longer as it would wait for BLAST to finish before starting
to parse it.

> However, this time I have the "temp.txt" file and indeed there lines such as:
>
> Query: 0    -
>
> Sbjct: 445  G                                                            445
>
> Query: 0
>
> Sbjct: 445  G                                                            445
>
> Query: 0    ------
>
> Sbjct: 1316 ETNAPV
> 1321
>
> present for some alignments and it cannot be parsed by my code.

Those do look strange.

> When I run blastpgp myself on the command line, same arguments, and catch
> the standard output to a temp2.txt file, the latter file does not contain
> those lines and can be parsed correctly.

This is odd, and I am not sure what would cause this.

> So, in the end I went back to my code and modified according to your
> recommendation of using the commandline applications. The relevant part of
> code now looks like this:
> ...
> And it works!
Great - I'm glad my vague instructions made sense :) > Thanks again for your help, At least we have solution, even if we didn't get to the bottom of the strange BLAST output. I'll close the bug... Peter From ibdeno at gmail.com Wed Oct 14 08:49:30 2009 From: ibdeno at gmail.com (Miguel Ortiz Lombardia) Date: Wed, 14 Oct 2009 14:49:30 +0200 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com> References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com> <320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com> <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com> <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com> <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com> <7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com> <320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com> Message-ID: <408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com> Le 14 oct. 09 ? 14:37, Peter a ?crit : > On Wed, Oct 14, 2009 at 12:30 PM, Miguel Ortiz Lombardia > wrote: >> >> Hi again, Peter. >> >> Well, it turned out that I don't have such work-around... When I >> launched >> the script as: >> >> nohup lpbl.py ... & >> >> against all my sequences it choked at the first one (quite longer >> than the >> one I was using as an example) with the very same error. > > It would take longer as it would wait for BLAST to finish before > starting > to parse it. > >> However, this time I have the "temp.txt" file and indeed there >> lines such as: >> >> Query: 0 - >> >> Sbjct: 445 >> G 445 >> >> Query: 0 >> >> Sbjct: 445 >> G 445 >> >> Query: 0 ------ >> >> Sbjct: 1316 ETNAPV >> 1321 >> >> present for some alignments and it cannot be parsed by my code. > > Those do look strange. > >> When I run blastpgp myself on the command line, same arguments, and >> catch >> the standard output to a temp2.txt file, the latter file does not >> contain >> those lines and can be parsed correctly. 
> > This is odd, and I am not sure what would cause this. > >> So, in the end I went back to my code and modified according to your >> recommendation of using the commandline applications. The relevant >> part of >> code now looks like this: >> ... >> And it works! > > Great - I'm glad my vague instructions made sense :) > They were quite clear :-) and the pointer to the alignment tutorial helped a lot. >> Thanks again for your help, > > At least we have solution, even if we didn't get to the bottom of > the strange BLAST output. I'll close the bug... > That's fine. Thanks! -- Miguel From andrea at biodec.com Wed Oct 14 10:28:17 2009 From: andrea at biodec.com (Andrea) Date: Wed, 14 Oct 2009 16:28:17 +0200 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com> References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com> <320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com> <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com> <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com> <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com> <7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com> <320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com> <408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com> Message-ID: <4AD5E001.6070506@biodec.com> Hi to everybody, I work with blast quite often and i could say i run hundreds of thousand of blastpgp. The "Query 0" outpt of blastpgp, is quite common for me, and i wrote a patch to my code, to remove these "nasty" lines, before passing the output to the parser. I found these type of lines in at least 1-2% of my runs. And i'm fully sure that i found them either in the output of blast via shell and in the output of blast via Biopython. 
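
A pre-filter of the kind mentioned above could look roughly like the following. This is a minimal sketch and not the actual patch, which is not shown in this thread. It assumes that each suspect block runs from a "Query: 0" line up to and including the next "Sbjct:" line; the function name and the sample alignment lines are made up for illustration.

```python
import re

# A query line whose start coordinate is 0, e.g. "Query: 0    ------"
QUERY_ZERO = re.compile(r"^Query:\s+0(\s|$)")


def drop_query_zero_blocks(lines):
    """Yield lines, leaving out every 'Query: 0' block.

    Assumption: a block is the 'Query: 0' line, anything after it, and the
    following 'Sbjct:' line. Hypothetical helper, not part of Biopython.
    """
    skipping = False
    for line in lines:
        if QUERY_ZERO.match(line):
            skipping = True
            continue
        if skipping:
            if line.startswith("Sbjct:"):
                # End of the suspect block; drop its subject line too
                skipping = False
                continue
            if not line.startswith("Query:"):
                # Match line or blank line inside the suspect block
                continue
            # A normal query line turned up first: stop skipping, keep it
            skipping = False
        yield line


# Made-up sample in the style of blastpgp plain text output
sample = [
    "Query: 1    MKV 3\n",
    "            MKV\n",
    "Sbjct: 10   MKV 12\n",
    "\n",
    "Query: 0    ------\n",
    "\n",
    "Sbjct: 1316 ETNAPV 1321\n",
    "\n",
    "Query: 4    LLA 6\n",
    "            LLA\n",
    "Sbjct: 13   LLA 15\n",
]
cleaned = list(drop_query_zero_blocks(sample))
```

The cleaned lines could then be wrapped in a StringIO handle and passed to the parser. Dropping the subject line as well keeps query and subject lines paired, but whether that is the right treatment for real blastpgp output is an open question here.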
The problem, according to me, is in the blastpgp algorithm and maybe could be managed in biopython (as i did in my code), cutting out these "Query 0" lines, because from the point of view of the alignments, they don't have any sense. It seems that blastpgp, wants to show wich is the part of the target sequence align to the query before the starting point of the query itself (something like opening a gap, at the beginning of the query). And this happens "sometimes", and without any apparent reason. What i think, is that there aren't any problem with biopython in wrapping the blastpgp process and maybe, but i'm not sure, the difference in the output could be related to small differences in the parameter of the process (or in the environment... or in the .ncbirc file). I always was able to observe the identity between the blastpgp output via shell (bash) and the output of the popen wrapper. Miguel, could you check if really everything is identical? Because i'm really surprised of such a strange behaviour.... Despite, according to me there aren't any problem in biopython, and maybe, Miguel will be able to discover some differences in the way blastpgp is launched, i would suggest to develop a patch (i could submit mine), that could remove "Query 0" lines. I aplogize if i understanded the problem wrongly and for the fact that i'm entering in the discussion in this moment (maybe when the discussion is finished)... Thanks Andrea Miguel Ortiz Lombardia ha scritto: > Le 14 oct. 09 ? 14:37, Peter a ?crit : > >> On Wed, Oct 14, 2009 at 12:30 PM, Miguel Ortiz Lombardia >> wrote: >>> >>> Hi again, Peter. >>> >>> Well, it turned out that I don't have such work-around... When I >>> launched >>> the script as: >>> >>> nohup lpbl.py ... & >>> >>> against all my sequences it choked at the first one (quite longer >>> than the >>> one I was using as an example) with the very same error. >> >> It would take longer as it would wait for BLAST to finish before >> starting >> to parse it. 
>> >>> However, this time I have the "temp.txt" file and indeed there lines >>> such as: >>> >>> Query: 0 - >>> >>> Sbjct: 445 >>> G 445 >>> >>> Query: 0 >>> >>> Sbjct: 445 >>> G 445 >>> >>> Query: 0 ------ >>> >>> Sbjct: 1316 ETNAPV >>> 1321 >>> >>> present for some alignments and it cannot be parsed by my code. >> >> Those do look strange. >> >>> When I run blastpgp myself on the command line, same arguments, and >>> catch >>> the standard output to a temp2.txt file, the latter file does not >>> contain >>> those lines and can be parsed correctly. >> >> This is odd, and I am not sure what would cause this. >> >>> So, in the end I went back to my code and modified according to your >>> recommendation of using the commandline applications. The relevant >>> part of >>> code now looks like this: >>> ... >>> And it works! >> >> Great - I'm glad my vague instructions made sense :) >> > > They were quite clear :-) and the pointer to the alignment tutorial > helped a lot. > >>> Thanks again for your help, >> >> At least we have solution, even if we didn't get to the bottom of >> the strange BLAST output. I'll close the bug... >> > > That's fine. > > Thanks! 
> > > > -- Miguel > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Wed Oct 14 10:46:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Oct 2009 15:46:48 +0100 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <4AD5E001.6070506@biodec.com> References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com> <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com> <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com> <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com> <7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com> <320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com> <408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com> <4AD5E001.6070506@biodec.com> Message-ID: <320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com> On Wed, Oct 14, 2009 at 3:28 PM, Andrea wrote: > > Hi to everybody, > I work with blast quite often and i could say i run hundreds of thousand > of blastpgp. The "Query 0" outpt of blastpgp, is quite common for me, and > i wrote a patch to my code, to remove these "nasty" lines, before passing > the output to the parser. > > I found these type of lines in at least 1-2% of my runs. And i'm fully sure > that i found them either in the output of blast via shell and in the output > of blast via Biopython. > > The problem, according to me, is in the blastpgp algorithm and maybe > could be managed in biopython (as i did in my code), cutting out these > "Query 0" lines, because from the point of view of the alignments, > they don't have any sense. It seems that blastpgp, wants to show > which is the part of the target sequence align to the query before the > starting point of the query itself (something like opening a gap, at the > beginning of the query). > And this happens "sometimes", and without any apparent reason. 
Andrea - do you have any small example output files with this
problem? If it does occur fairly often (1 to 2% of the time), then
we should try and update the parser to cope. Miguel's example
is useful for testing while working on a bug fix, but too big to
include as part of the unit tests.

> What i think, is that there aren't any problem with biopython in wrapping
> the blastpgp process and maybe, but i'm not sure, the difference in the
> output could be related to small differences in the parameter of the process
> (or in the environment... or in the .ncbirc file).
>
> I always was able to observe the identity between the blastpgp output
> via shell (bash) and the output of the popen wrapper.

If you saw "Query 0" output at the command line (shell), then that is
worth knowing.

> Miguel, could you check if really everything is identical? Because i'm
> really surprised of such a strange behaviour....

Maybe the environment variables are different or something?

> Despite, according to me there aren't any problem in biopython, and maybe,
> Miguel will be able to discover some differences in the way blastpgp is
> launched, i would suggest to develop a patch (i could submit mine), that
> could remove "Query 0" lines.

Could you upload your "Query 0" patch to Bug 2927?
http://bugzilla.open-bio.org/show_bug.cgi?id=2927

> I apologize if i understood the problem wrongly and for the fact that
> i'm entering in the discussion in this moment (maybe when the
> discussion is finished)...
Well I don't (yet) understand what the problem is either ;)

Peter

From andrea at biodec.com Wed Oct 14 11:02:40 2009
From: andrea at biodec.com (Andrea)
Date: Wed, 14 Oct 2009 17:02:40 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
 <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
 <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>
 <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>
 <7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>
 <320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>
 <408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>
 <4AD5E001.6070506@biodec.com>
 <320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
Message-ID: <4AD5E810.5090607@biodec.com>

Peter wrote:
> On Wed, Oct 14, 2009 at 3:28 PM, Andrea wrote:
>
>> Hi to everybody,
>> I work with blast quite often and i could say i run hundreds of thousand
>> of blastpgp. The "Query 0" output of blastpgp, is quite common for me, and
>> i wrote a patch to my code, to remove these "nasty" lines, before passing
>> the output to the parser.
>>
>> I found these type of lines in at least 1-2% of my runs. And i'm fully sure
>> that i found them either in the output of blast via shell and in the output
>> of blast via Biopython.
>>
>> The problem, according to me, is in the blastpgp algorithm and maybe
>> could be managed in biopython (as i did in my code), cutting out these
>> "Query 0" lines, because from the point of view of the alignments,
>> they don't have any sense. It seems that blastpgp, wants to show
>> which is the part of the target sequence align to the query before the
>> starting point of the query itself (something like opening a gap, at the
>> beginning of the query).
>> And this happens "sometimes", and without any apparent reason.
>>
>
> Andrea - do you have any small example output files with this
> problem? If it does occur fairly often (1 to 2% of the time), then
> we should try and update the parser to cope. Miguel's example
> is useful for testing while working on a bug fix, but too big to
> include as part the unit tests.
>
>
mmm... I have to search. I have a "cache" of gzipped blastpgp outputs,
but I'm not sure I have the originals (they may already be patched)....
What I'm sure of is that in the next month I'm going to run almost
100,000 blastpgp searches, so I'll surely find something small. ;-)
>> What i think, is that there aren't any problem with biopython in wrapping
>> the blastpgp process and maybe, but i'm not sure, the difference in the
>> output could be related to small differences in the parameter of the process
>> (or in the environment... or in the .ncbirc file).
>>
>> I always was able to observe the identity between the blastpgp output
>> via shell (bash) and the output of the popen wrapper.
>>
>
> If you saw "Query 0" output at the command line (shell), then that is
> worth knowing.
>
>
I think so.
>> Miguel, could you check if really everything is identical? Because i'm
>> really surprised of such a strange behaviour....
>>
>
> Maybe the environment variables are different or something?
>
>
>> Despite, according to me there aren't any problem in biopython, and maybe,
>> Miguel will be able to discover some differences in the way blastpgp is
>> launched, i would suggest to develop a patch (i could submit mine), that
>> could remove "Query 0" lines.
>>
>
> Could you upload your "Query 0" patch to Bug 2927?
> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>
I'm quite busy right now, because I'm working on a different project and
have deliveries to manage... but I will certainly upload my patch ASAP.
>
>> I apologize if i understood the problem wrongly and for the fact that
>> i'm entering in the discussion in this moment (maybe when the
>> discussion is finished)...
>> > > Well I don't (yet) understand what the problem is either ;) > > Peter > Ciao andrea From biopython at maubp.freeserve.co.uk Wed Oct 14 11:10:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 14 Oct 2009 16:10:54 +0100 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <4AD5E810.5090607@biodec.com> References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com> <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com> <7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com> <320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com> <408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com> <4AD5E001.6070506@biodec.com> <320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com> <4AD5E810.5090607@biodec.com> Message-ID: <320fb6e00910140810y296d19beo9190022b9eede94f@mail.gmail.com> On Wed, Oct 14, 2009 at 4:02 PM, Andrea wrote: >> >> Andrea - do you have any small example output files with this >> problem? If it does occur fairly often (1 to 2% of the time), then >> we should try and update the parser to cope. Miguel's example >> is useful for testing while working on a bug fix, but too big to >> include as part the unit tests. > > mmm... i've to search. I've some "cache" of gzipped blastpgp outputs. > But I'm not sure i've the original (maybe already patched).... waht I'm > sure, is that in the next month I'm going to run almost 100.000 > blasptpg so I'll for sure find something small. ;-) Great. >> If you saw "Query 0" output at the command line (shell), then that is >> worth knowing. > > i think so. OK. >> Could you upload your "Query 0" patch to Bug 2927? >> http://bugzilla.open-bio.org/show_bug.cgi?id=2927 > > Now i'm quite busy, because i'm working on a different project and i've > to manage deliveries... but i will for sure upload my patch ASAP. Thanks. 
Peter

From ibdeno at gmail.com Wed Oct 14 16:15:07 2009
From: ibdeno at gmail.com (Miguel Ortiz Lombardia)
Date: Wed, 14 Oct 2009 22:15:07 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <4AD5E810.5090607@biodec.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
 <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
 <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>
 <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>
 <7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>
 <320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>
 <408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>
 <4AD5E001.6070506@biodec.com>
 <320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
 <4AD5E810.5090607@biodec.com>
Message-ID: <4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>

On 14 Oct 09, at 17:02, Andrea wrote:

> Peter wrote:
>> On Wed, Oct 14, 2009 at 3:28 PM, Andrea wrote:
>>
>>> Hi to everybody,
>>> I work with blast quite often and i could say i run hundreds of thousand
>>> of blastpgp. The "Query 0" output of blastpgp, is quite common for me, and
>>> i wrote a patch to my code, to remove these "nasty" lines, before passing
>>> the output to the parser.
>>>
>>> I found these type of lines in at least 1-2% of my runs. And i'm fully sure
>>> that i found them either in the output of blast via shell and in the output
>>> of blast via Biopython.
>>>
>>> The problem, according to me, is in the blastpgp algorithm and maybe
>>> could be managed in biopython (as i did in my code), cutting out these
>>> "Query 0" lines, because from the point of view of the alignments,
>>> they don't have any sense. It seems that blastpgp, wants to show
>>> which is the part of the target sequence align to the query before the
>>> starting point of the query itself (something like opening a gap, at the
>>> beginning of the query).
>>> And this happens "sometimes", and without any apparent reason.
>>>
>>
>> Andrea - do you have any small example output files with this
>> problem? If it does occur fairly often (1 to 2% of the time), then
>> we should try and update the parser to cope. Miguel's example
>> is useful for testing while working on a bug fix, but too big to
>> include as part the unit tests.
>>
>>
> mmm... i've to search. I've some "cache" of gzipped blastpgp outputs.
> But I'm not sure i've the original (maybe already patched).... what I'm
> sure, is that in the next month I'm going to run almost 100.000
> blastpgp so I'll for sure find something small. ;-)
>>> What i think, is that there aren't any problem with biopython in wrapping
>>> the blastpgp process and maybe, but i'm not sure, the difference in the
>>> output could be related to small differences in the parameter of the process
>>> (or in the environment... or in the .ncbirc file).
>>>
>>> I always was able to observe the identity between the blastpgp output
>>> via shell (bash) and the output of the popen wrapper.
>>>
>>
>> If you saw "Query 0" output at the command line (shell), then that is
>> worth knowing.

All I can say is that this is not what I observe.

1. When I run exactly the same blastpgp search directly from the shell
   (I capture the full command line issued in the background by the
   python script with a 'ps -a | grep blastpgp'), I have never found
   the 'Query: 0' lines.
2. When I run the search from within the python script and use 'nohup',
   the problem is reproducible, not random.
3. If the script is run without 'nohup', that is, if the shell keeps
   full control of both standard error and output, then again the
   problem seems to disappear. I say 'seems' because I haven't tried
   with my longest (more than 1300 aa) sequences.
4. When, from within the python script, I use the BlastpgpCommandline
   class, as Peter suggested, to ask blastpgp to send the output to a
   file (the -o option), the problem disappears irrespective of whether
   or not I use 'nohup'.
Therefore, in my opinion, the problem is not with blastpgp but with the handling of its output by python or biopython. >> > i think so. >>> Miguel, could you check if really everything is identical? Because >>> i'm >>> really surprised of such a strange behaviour.... >> >> Maybe the environment variables are different or something? Command options are absolutely the same, see above. I am surprised too, but I don't think blastpgp is sensitive to any environment variable and I don't see how they could change from an in-script to a standalone run. >> >>> Despite, according to me there aren't any problem in biopython, >>> and maybe, >>> Miguel will be able to discover some differences in the way >>> blastpgp is >>> launched, i would suggest to develop a patch (i could submit >>> mine), that >>> could remove "Query 0" lines. I couldn't find any differences, so I'm afraid I can't help... I'm still testing the script, I will let you know if I find again this problem. >>> >> Could you upload your "Query 0" patch to Bug 2927? >> http://bugzilla.open-bio.org/show_bug.cgi?id=2927 >> > Now i'm wuite busy, because i'm working on a different project and > i've > to manage deliveries... > but i will for sure upload my patch ASAP. >> >>> I aplogize if i understanded the problem wrongly and for the fact >>> that >>> i'm entering in the discussion in this moment (maybe when the >>> discussion is finished)... 
>>>
>>
>> Well I don't (yet) understand what the problem is either ;)
>>
>> Peter
>>
> Ciao
> andrea

Best,

-- Miguel

From ibdeno at gmail.com Thu Oct 15 09:04:33 2009
From: ibdeno at gmail.com (Miguel Ortiz Lombardia)
Date: Thu, 15 Oct 2009 15:04:33 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <4AD64602.9060603@biodec.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
 <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
 <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>
 <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>
 <7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>
 <320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>
 <408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>
 <4AD5E001.6070506@biodec.com>
 <320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
 <4AD5E810.5090607@biodec.com>
 <4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>
 <4AD64602.9060603@biodec.com>
Message-ID: <13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>

On 14 Oct 09, at 23:43, Andrea wrote:

> Miguel Ortiz Lombardia wrote:
>> On 14 Oct 09, at 17:02, Andrea wrote:
>>> Peter wrote:
>>>> On Wed, Oct 14, 2009 at 3:28 PM, Andrea wrote:
>>>>
>>>>> Hi to everybody,
>>>>> I work with blast quite often and i could say i run hundreds of thousand
>>>>> of blastpgp. The "Query 0" output of blastpgp, is quite common for me, and
>>>>> i wrote a patch to my code, to remove these "nasty" lines, before passing
>>>>> the output to the parser.
>>>>>
>>>>> I found these type of lines in at least 1-2% of my runs. And i'm fully sure
>>>>> that i found them either in the output of blast via shell and in the output
>>>>> of blast via Biopython.
>>>>> >>>>> The problem, according to me, is in the blastpgp algorithm and >>>>> maybe >>>>> could be managed in biopython (as i did in my code), cutting out >>>>> these >>>>> "Query 0" lines, because from the point of view of the alignments, >>>>> they don't have any sense. It seems that blastpgp, wants to show >>>>> which is the part of the target sequence align to the query >>>>> before the >>>>> starting point of the query itself (something like opening a gap, >>>>> at the >>>>> beginning of the query). >>>>> And this happens "sometimes", and without any apparent reason. >>>>> >>>> >>>> Andrea - do you have any small example output files with this >>>> problem? If it does occur fairly often (1 to 2% of the time), then >>>> we should try and update the parser to cope. Miguel's example >>>> is useful for testing while working on a bug fix, but too big to >>>> include as part the unit tests. >>>> >>>> >>> mmm... i've to search. I've some "cache" of gzipped blastpgp >>> outputs. >>> But I'm not >>> sure i've the original (maybe already patched).... waht I'm sure, is >>> that in the >>> next month I'm going to run almost 100.000 blasptpg so I'll for sure >>> find >>> something small. ;-) >>>>> What i think, is that there aren't any problem with biopython in >>>>> wrapping >>>>> the blastpgp process and maybe, but i'm not sure, the difference >>>>> in >>>>> the >>>>> output could be related to small differences in the parameter of >>>>> the process >>>>> (or in the environment... or in the .ncbirc file). >>>>> >>>>> I always was able to observe the identity between the blastpgp >>>>> output >>>>> via shell (bash) and the output of the popen wrapper. >>>>> >>>> >>>> If you saw "Query 0" output at the command line (shell), then >>>> that is >>>> worth knowing. >> >> All I can say is that this is not what I observe. >> 1. 
When I send directly from the shell exactly the same blastpgp >> search ( I capture the full command line issued in the background by >> the python script with a 'ps -a | grep blastpgp' ) I have never find >> the 'Query: 0' lines. >> 2. When I send the search from within the python script and use >> 'nohup', the problem is reproducible, not random. > yes, i'm sure is reproducible. I mean that what I've observed wasn't > random on one sequence, but maybe along > many sequences... >> 3. If the script is sent without 'nohup', that is, if the shell keeps >> full control of both standard error and output, then again, the >> problem seems to disappear. I say 'seems' because I haven't tried >> with >> my longest ( more than 1300 aa ) sequences. >> 4. When, from within the python script I use, as Peter suggested, the >> BlastpgpCommandline class to ask blastpgp to send the output to a >> file >> ( the -o option ) the problem disappears irrespectively whether I use >> or not 'nohup'. >> >> Therefore, in my opinion, the problem is not with blastpgp but with >> the handling of its output by python or biopython. >> > I'm really curious. What you have is very strange, but i believe you > fully. > > Is there the possibility to have: > your database, > your .bashrc > the sequence > the exact command line. > the versione of blastpgp > the versione of blastpgp (2.2.18 ?) > the other things you use (matrix.... ) > the different possibilities you try....( nohup/python/shell ) > I should be reprodcible. > > Have you tried to observe the behaviour of the blastpgp process with a > "strace" expecially at the > beginning? > > >>>> >>> i think so. >>>>> Miguel, could you check if really everything is identical? >>>>> Because i'm >>>>> really surprised of such a strange behaviour.... >>>> >>>> Maybe the environment variables are different or something? >> >> Command options are absolutely the same, see above. 
I am surprised >> too, but I don't think blastpgp is sensitive to any environment >> variable and I don't see how they could change from an in-script to a >> standalone run. > I think only to .bashrc. >> >>>> >>>>> Despite, according to me there aren't any problem in biopython, >>>>> and >>>>> maybe, >>>>> Miguel will be able to discover some differences in the way >>>>> blastpgp is >>>>> launched, i would suggest to develop a patch (i could submit >>>>> mine), >>>>> that >>>>> could remove "Query 0" lines. >> >> I couldn't find any differences, so I'm afraid I can't help... I'm >> still testing the script, I will let you know if I find again this >> problem. > I will try to find the problem in my sequences (but i could say that > is > quite common)... and if i will > find i will try with the same parameters and the shell... >> >>>>> >>>> Could you upload your "Query 0" patch to Bug 2927? >>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927 >>>> >>> Now i'm wuite busy, because i'm working on a different project and >>> i've >>> to manage deliveries... >>> but i will for sure upload my patch ASAP. >>>> >>>>> I aplogize if i understanded the problem wrongly and for the >>>>> fact that >>>>> i'm entering in the discussion in this moment (maybe when the >>>>> discussion is finished)... >>>>> >>>> >>>> Well I don't (yet) understand what the problem is either ;) >>>> >>>> Peter >>>> >>> Ciao >>> andrea >> >> >> Best, >> >> >> >> -- Miguel >> >> > thanks. > Ciao > Andrea Hi! Some new findings that contradict my previous perception of the problem. Tonight my script failed again after stumbling upon the same problem for a different sequence. I have now investigated more carefully and found: 1. 
The problem (a line with 'Query: 0 ---' that impaired parsing of
the blastpgp output) was encountered in all these cases:

a) nohup myscript.py [some script options] sequences.fasta >& myscript.log &
b) myscript.py [some script options] sequences.fasta
c) /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/nr -i U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5 -h 0.001 -p blastpgp

That is, for the first time I was able to reproduce the problem from a
standalone run of blastpgp.

2. The problem disappears with a previous version of blastpgp (2.2.18).
Using this version, all these cases work:

a) nohup myscript.py [some script options] sequences.fasta >& myscript.log &
b) myscript.py [some script options] sequences.fasta
c) /usr/local/blast-2.2.18/bin/blastpgp -d /opt/BlastDBs/nr -i U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5 -h 0.001 -p blastpgp

So, it would seem that, as Andrea suggested, this is a bug in blastpgp,
to be more precise, after blastpgp-2.2.18.

3. In this particular case, I notice that the problem happens with a
sequence containing low complexity region(s). Now, I had thought that
the default in blastpgp was to filter those sequences out! I'm running
the original script again with blastpgp-2.2.22 with the filter on (-F T)
to see if the problem persists.

I will write to the blast-help address at the ncbi to let them know
about the problem.
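For anyone wanting to repeat this kind of comparison between blastpgp versions, here is a minimal sketch (not code from this thread) that reports which lines of a saved text report start with 'Query: 0'. The function name find_query0_lines is hypothetical; the only thing taken from the messages above is the shape of the offending line.

```python
def find_query0_lines(text):
    """Return (line_number, line) pairs for 'Query: 0' lines in a report.

    Assumption: the offending lines look like 'Query: 0    ------',
    i.e. the second whitespace-separated field is exactly '0'.
    """
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        # split() tolerates any amount of spacing after 'Query:'
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "Query:" and fields[1] == "0":
            hits.append((number, line))
    return hits
```

Reading each saved report (for instance the tmp.bl.txt files written with -o by the two blastpgp versions) and passing its contents to this function should show quickly which runs are affected, without involving the parser at all.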
Best,

-- Miguel

From andrea at biodec.com Thu Oct 15 11:03:38 2009
From: andrea at biodec.com (Andrea)
Date: Thu, 15 Oct 2009 17:03:38 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
 <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
 <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>
 <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>
 <7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>
 <320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>
 <408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>
 <4AD5E001.6070506@biodec.com>
 <320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
 <4AD5E810.5090607@biodec.com>
 <4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>
 <4AD64602.9060603@biodec.com>
 <13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>
 <4AD72978.4030900@biodec.com>
 <05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
Message-ID: <4AD739CA.6090403@biodec.com>

Miguel Ortiz Lombardia wrote:
>
> On 15 Oct 09, at 15:54, Andrea wrote:
>
>> Miguel Ortiz Lombardia wrote:
>>>
>>> On 14 Oct 09, at 23:43, Andrea wrote:
>>>
>>>> Miguel Ortiz Lombardia wrote:
>>>>> On 14 Oct 09, at 17:02, Andrea wrote:
>>>>>> Peter wrote:
>>>>>>> On Wed, Oct 14, 2009 at 3:28 PM, Andrea wrote:
>>>>>>>
>>>>>>>> Hi to everybody,
>>>>>>>> I work with blast quite often and i could say i run hundreds of thousand
>>>>>>>> of blastpgp. The "Query 0" output of blastpgp, is quite common for me, and
>>>>>>>> i wrote a patch to my code, to remove these "nasty" lines, before passing
>>>>>>>> the output to the parser.
>>>>>>>>
>>>>>>>> I found these type of lines in at least 1-2% of my runs. And i'm fully sure
>>>>>>>> that i found them either in the output of blast via shell and in the output
>>>>>>>> of blast via Biopython.
>>>>>>>> >>>>>>>> The problem, according to me, is in the blastpgp algorithm and >>>>>>>> maybe >>>>>>>> could be managed in biopython (as i did in my code), cutting out >>>>>>>> these >>>>>>>> "Query 0" lines, because from the point of view of the alignments, >>>>>>>> they don't have any sense. It seems that blastpgp, wants to show >>>>>>>> which is the part of the target sequence align to the query >>>>>>>> before the >>>>>>>> starting point of the query itself (something like opening a gap, >>>>>>>> at the >>>>>>>> beginning of the query). >>>>>>>> And this happens "sometimes", and without any apparent reason. >>>>>>>> >>>>>>> >>>>>>> Andrea - do you have any small example output files with this >>>>>>> problem? If it does occur fairly often (1 to 2% of the time), then >>>>>>> we should try and update the parser to cope. Miguel's example >>>>>>> is useful for testing while working on a bug fix, but too big to >>>>>>> include as part the unit tests. >>>>>>> >>>>>>> >>>>>> mmm... i've to search. I've some "cache" of gzipped blastpgp >>>>>> outputs. >>>>>> But I'm not >>>>>> sure i've the original (maybe already patched).... waht I'm sure, is >>>>>> that in the >>>>>> next month I'm going to run almost 100.000 blasptpg so I'll for sure >>>>>> find >>>>>> something small. ;-) >>>>>>>> What i think, is that there aren't any problem with biopython in >>>>>>>> wrapping >>>>>>>> the blastpgp process and maybe, but i'm not sure, the >>>>>>>> difference in >>>>>>>> the >>>>>>>> output could be related to small differences in the parameter of >>>>>>>> the process >>>>>>>> (or in the environment... or in the .ncbirc file). >>>>>>>> >>>>>>>> I always was able to observe the identity between the blastpgp >>>>>>>> output >>>>>>>> via shell (bash) and the output of the popen wrapper. >>>>>>>> >>>>>>> >>>>>>> If you saw "Query 0" output at the command line (shell), then >>>>>>> that is >>>>>>> worth knowing. >>>>> >>>>> All I can say is that this is not what I observe. >>>>> 1. 
When I send directly from the shell exactly the same blastpgp >>>>> search ( I capture the full command line issued in the background by >>>>> the python script with a 'ps -a | grep blastpgp' ) I have never find >>>>> the 'Query: 0' lines. >>>>> 2. When I send the search from within the python script and use >>>>> 'nohup', the problem is reproducible, not random. >>>> yes, i'm sure is reproducible. I mean that what I've observed wasn't >>>> random on one sequence, but maybe along >>>> many sequences... >>>>> 3. If the script is sent without 'nohup', that is, if the shell keeps >>>>> full control of both standard error and output, then again, the >>>>> problem seems to disappear. I say 'seems' because I haven't tried >>>>> with >>>>> my longest ( more than 1300 aa ) sequences. >>>>> 4. When, from within the python script I use, as Peter suggested, the >>>>> BlastpgpCommandline class to ask blastpgp to send the output to a >>>>> file >>>>> ( the -o option ) the problem disappears irrespectively whether I use >>>>> or not 'nohup'. >>>>> >>>>> Therefore, in my opinion, the problem is not with blastpgp but with >>>>> the handling of its output by python or biopython. >>>>> >>>> I'm really curious. What you have is very strange, but i believe you >>>> fully. >>>> >>>> Is there the possibility to have: >>>> your database, >>>> your .bashrc >>>> the sequence >>>> the exact command line. >>>> the versione of blastpgp >>>> the versione of blastpgp (2.2.18 ?) >>>> the other things you use (matrix.... ) >>>> the different possibilities you try....( nohup/python/shell ) >>>> I should be reprodcible. >>>> >>>> Have you tried to observe the behaviour of the blastpgp process with a >>>> "strace" expecially at the >>>> beginning? >>>> >>>> >>>>>>> >>>>>> i think so. >>>>>>>> Miguel, could you check if really everything is identical? >>>>>>>> Because i'm >>>>>>>> really surprised of such a strange behaviour.... 
>>>>>>> >>>>>>> Maybe the environment variables are different or something? >>>>> >>>>> Command options are absolutely the same, see above. I am surprised >>>>> too, but I don't think blastpgp is sensitive to any environment >>>>> variable and I don't see how they could change from an in-script to a >>>>> standalone run. >>>> I think only to .bashrc. >>>>> >>>>>>> >>>>>>>> Despite, according to me there aren't any problem in biopython, >>>>>>>> and >>>>>>>> maybe, >>>>>>>> Miguel will be able to discover some differences in the way >>>>>>>> blastpgp is >>>>>>>> launched, i would suggest to develop a patch (i could submit >>>>>>>> mine), >>>>>>>> that >>>>>>>> could remove "Query 0" lines. >>>>> >>>>> I couldn't find any differences, so I'm afraid I can't help... I'm >>>>> still testing the script, I will let you know if I find again this >>>>> problem. >>>> I will try to find the problem in my sequences (but i could say >>>> that is >>>> quite common)... and if i will >>>> find i will try with the same parameters and the shell... >>>>> >>>>>>>> >>>>>>> Could you upload your "Query 0" patch to Bug 2927? >>>>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927 >>>>>>> >>>>>> Now i'm wuite busy, because i'm working on a different project and >>>>>> i've >>>>>> to manage deliveries... >>>>>> but i will for sure upload my patch ASAP. >>>>>>> >>>>>>>> I aplogize if i understanded the problem wrongly and for the fact >>>>>>>> that >>>>>>>> i'm entering in the discussion in this moment (maybe when the >>>>>>>> discussion is finished)... >>>>>>>> >>>>>>> >>>>>>> Well I don't (yet) understand what the problem is either ;) >>>>>>> >>>>>>> Peter >>>>>>> >>>>>> Ciao >>>>>> andrea >>>>> >>>>> >>>>> Best, >>>>> >>>>> >>>>> >>>>> -- Miguel >>>>> >>>>> >>>> thanks. >>>> Ciao >>>> Andrea >>> >>> Hi! >>> >>> Some new findings that contradict my previous perception of the >>> problem. 
>>> Tonight my script failed again after stumbling upon the same problem >>> for a different sequence. I have now investigated more carefully and >>> found: >>> >>> 1. The problem (a line with 'Query: 0 ---' that impaired parsing of >>> the blastpgp output) was encountered in all these cases: >>> >>> a) nohup myscript.py [some script options] sequences.fasta >& >>> myscript.log & >>> b) myscript.py [some script options] sequences.fasta >>> c) /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/nr -i >>> U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5 >>> -h 0.001 -p blastpgp >>> >>> That is, for the first time I was able to reproduce the problem from a >>> standalone run of blastpgp. >>> >>> 2. The problem disappears with a previous version of blastpgp >>> (2.2.18). Using this version, all these cases work: >>> >>> a) nohup myscript.py [some script options] sequences.fasta >& >>> myscript.log & >>> b) myscript.py [some script options] sequences.fasta >>> c) /usr/local/blast-2.2.18/bin/blastpgp -d /opt/BlastDBs/nr -i >>> U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5 >>> -h 0.001 -p blastpgp >>> >>> So, it would seem that, as Andrea suggested, this is a bug in >>> blastpgp, to be more precise, after blastpgp-2.2.18. >>> >>> 3. In this particular case, I notice that the problem happens with a >>> sequence containing low complexity region(s). Now, I had thought that >>> the default in blastpgp was to filter those sequences out! I'm running >>> the original script again with blastpgp-2.2.22 with the filter on (-F >>> T) to see if the problem persists. >>> >>> I will write to the blast-help address at the ncbi to let them know >>> about the problem. >>> >>> Best, >>> >>> >>> -- Miguel >>> >>> >> Hi, >> Thanks for your updates!!!. 
I can say one thing:
>> I've used in the past these three versions of blastpgp:
>> - 2.2.15
>> - 2.2.18
>> - 2.2.19
>> and i found the "Query 0" problem in all of them, but, if one
>> of them fails (i mean, gives "Query 0" output) the other may not fail
>> at all (they most probably not give the "Query 0" output).
>>
>> Another interesting things is that, with the three version, the same
>> database, and the same parameters, the output is quite different...
>> ...sorry.. very different...
>>
>> I'm also sure that it could happens also with the complexity region(s)
>> filter "True".
>> What i observe, is that there aren't parameters that make it
>> disappear. It just disappear from a sequence, and it will appear in
>> another.... in other word, changing parameters, make it "moving"
>> between sequences.
>>
>> I've never used blastpgp 2.2.22. So i cannot say anything about it.
>>
>> Thanks
>> Andrea
>
>
> Then it looks like something more weird than what I thought...
> Andrea, would you mind if I send your e-mail to the blast people? Or
> perhaps you can do it yourself... I wrote to blast-help at ncbi.nlm.nih.gov

If you can, that would be a help to me. I hope they will reply.
I can also send an email, but if you have....

>
> I suspect they will tell us to use the XML output, but then, not all
> info I need seems to go there...

I think the same, and I suspect the XML output doesn't suffer from the
same problem.

>
> Thanks a lot!
>
>

To you!!

> -- Miguel
>
>

As for my patch, it is not really a patch; I've checked now.
To stay fully independent from NCBIStandalone.py, I didn't write a patch
for it. It is a "patch" only in the sense that I remove four lines from
the blastpgp output, starting from the "Query 0" one, and then submit the
"new output" to the parser. This way I'm reading the file twice (so it's
not a good idea), but I don't mind if NCBIStandalone.py changes, because
I'm fully independent from it.

This is my "simple code":

## THIS IS NOT A PATCH. BUT IT WORKS.
## THIS MEANS THAT IF WE FIND THE WAY
## TO REMOVE FOUR LINES STARTING
## FROM "Query 0" THE PROBLEM IS REALLY
## SOLVED (NOW I DON'T HAVE PARSER
## PROBLEMS AT ALL).
## lines is a list derived from a readlines() call of the
## output of blastpgp.
## newlines has to be reconverted into a handle
## object.
def removeQuery0lines(lines):
    newlines = []
    # count == 0: keep lines; count 1..4: inside the "Query: 0" block
    count = 0
    for l in lines:
        if count == 4:
            count = 0
        if count != 0:
            count += 1
        if l.startswith('Query: 0'):
            count = 1
        if count == 0:
            newlines.append(l)
    return newlines

It would be interesting to develop a patch that works inside the parser.
I will try to work on it in November, because I cannot right now.
The right function to modify should be (inside NCBIStandalone.py):

def _scan_hsp_alignment(self, uhandle, consumer):
    # Query: 11 GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
    # GRGVS+ TC Y + + V GGG+ + EE L + I R+
    # Sbjct: 12 GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
    #
    # Query: 64 AEKILIKR 71
    # I +K
    # Sbjct: 70 PNIIQLKD 77
    #
    while 1:
        # Blastn adds an extra line filled with spaces before Query
        attempt_read_and_call(uhandle, consumer.noevent, start=' ')
        read_and_call(uhandle, consumer.query, start='Query')
        read_and_call(uhandle, consumer.align, start=' ')
        read_and_call(uhandle, consumer.sbjct, start='Sbjct')
        read_and_call_while(uhandle, consumer.noevent, blank=1)
        line = safe_peekline(uhandle)
        # Alignment continues if I see a 'Query' or the spaces for Blastn.
        if not (line.startswith('Query') or line.startswith(' ')):
            break

changing it to:

def _scan_hsp_alignment(self, uhandle, consumer):
    # Query: 11 GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
    # GRGVS+ TC Y + + V GGG+ + EE L + I R+
    # Sbjct: 12 GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
    #
    # Query: 64 AEKILIKR 71
    # I +K
    # Sbjct: 70 PNIIQLKD 77
    #
    while 1:
        # Blastn adds an extra line filled with spaces before Query
        attempt_read_and_call(uhandle, consumer.noevent, start=' ')
        # Remove Query 0 start (It is only at the beginning...)
        q0_count = attempt_read_and_call(uhandle, consumer.noevent,
                                         start='Query: 0')
        if q0_count:
            # if "Query 0" remove its alignment
            read_and_call(uhandle, consumer.noevent, start=' ')
            read_and_call(uhandle, consumer.noevent, start='Sbjct')
            read_and_call_while(uhandle, consumer.noevent, blank=1)
        # Remove Query 0 end
        read_and_call(uhandle, consumer.query, start='Query')
        read_and_call(uhandle, consumer.align, start=' ')
        read_and_call(uhandle, consumer.sbjct, start='Sbjct')
        read_and_call_while(uhandle, consumer.noevent, blank=1)
        line = safe_peekline(uhandle)
        # Alignment continues if I see a 'Query' or the spaces for Blastn.
        if not (line.startswith('Query') or line.startswith(' ')):
            break

BUT, I'm not sure about the patch and I haven't tried it at all... so I
cannot submit it... It needs to be tried and tested!!!!
And I'm also not sure whether this is the right place to patch....!!!!

I hope this helps....
Miguel, do you have time to try and test it?
Thanks a lot.
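
The four-line filter described above could also be written as a generator,
so that the blastpgp output is walked only once before it goes to the
parser. This is only a sketch under the same assumption as the code above
(the "Query: 0" line plus the three lines after it are dropped). The names
strip_query0_blocks, filtered_handle and block_length are made up for this
example and are not part of Biopython.

```python
import io


def strip_query0_blocks(lines, block_length=4):
    """Yield lines, skipping each 'Query: 0' line and the lines after it.

    block_length=4 mirrors the four-line assumption made in this thread;
    it does not come from any BLAST documentation.
    """
    to_skip = 0
    for line in lines:
        if line.startswith('Query: 0'):
            # Start (or restart) skipping a whole block.
            to_skip = block_length
        if to_skip:
            to_skip -= 1
            continue
        yield line


def filtered_handle(handle):
    """Return a new in-memory handle without the 'Query: 0' blocks."""
    return io.StringIO(''.join(strip_query0_blocks(handle)))
```

The handle returned by filtered_handle() can then be given to the parser in
place of the original one. The whole output is still held in memory, so for
very large results a temporary file may be a better choice.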
Andrea

From biopython at maubp.freeserve.co.uk  Thu Oct 15 11:15:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Oct 2009 16:15:30 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <4AD739CA.6090403@biodec.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
 <4AD5E001.6070506@biodec.com>
 <320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
 <4AD5E810.5090607@biodec.com>
 <4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>
 <4AD64602.9060603@biodec.com>
 <13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>
 <4AD72978.4030900@biodec.com>
 <05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
 <4AD739CA.6090403@biodec.com>
Message-ID: <320fb6e00910150815h268b588cx696915143da3f097@mail.gmail.com>

Hi guys,

So we still don't understand exactly what triggers this, but it
affects multiple versions of BLAST, and multiple ways of calling blastpgp.

I think we should update the Biopython PSI parser to tolerate (i.e.
ignore) these "Query: 0" lines.

It would be very useful to have a few more examples (ideally small
files so we can include them with the test suite), covering a few
recent versions of BLAST.
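
One way to look for small examples might be to scan a directory of cached
blastpgp outputs (plain text or gzipped) and list the files that contain a
line starting with "Query: 0", smallest first. This is a rough sketch only:
the function name find_query0_files and the flat directory layout are
assumptions made for this example, not anything in Biopython.

```python
import glob
import gzip
import os


def find_query0_files(directory):
    """Return sorted (size_in_bytes, path) pairs for files with 'Query: 0'."""
    hits = []
    for path in sorted(glob.glob(os.path.join(directory, '*'))):
        if not os.path.isfile(path):
            continue
        # Assumption: gzipped outputs carry a .gz suffix.
        opener = gzip.open if path.endswith('.gz') else open
        with opener(path, 'rt') as handle:
            if any(line.startswith('Query: 0') for line in handle):
                hits.append((os.path.getsize(path), path))
    # Smallest files first, since those suit a test suite best.
    return sorted(hits)
```

The sizes reported are the on-disk sizes, so a gzipped file will look
smaller than its uncompressed output really is.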
You can email medium sized files to me personally (NOT to the mailing
list), and smaller files can be uploaded to Bug 2927 (which I will
reopen):
http://bugzilla.open-bio.org/show_bug.cgi?id=2927

Peter

From ibdeno at gmail.com  Thu Oct 15 11:33:59 2009
From: ibdeno at gmail.com (Miguel Ortiz Lombardia)
Date: Thu, 15 Oct 2009 17:33:59 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <4AD739CA.6090403@biodec.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
 <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
 <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>
 <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>
 <7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>
 <320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>
 <408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>
 <4AD5E001.6070506@biodec.com>
 <320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
 <4AD5E810.5090607@biodec.com>
 <4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>
 <4AD64602.9060603@biodec.com>
 <13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>
 <4AD72978.4030900@biodec.com>
 <05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
 <4AD739CA.6090403@biodec.com>
Message-ID:

On 15 Oct 09, at 17:03, Andrea wrote:

> Miguel Ortiz Lombardia wrote:
>>
>> On 15 Oct 09, at 15:54, Andrea wrote:
>>
>>> Miguel Ortiz Lombardia wrote:
>>>>
>>>> On 14 Oct 09, at 23:43, Andrea wrote:
>>>>
>>>>> Miguel Ortiz Lombardia wrote:
>>>>>> On 14 Oct 09, at 17:02, Andrea wrote:
>>>>>>> Peter wrote:
>>>>>>>> On Wed, Oct 14, 2009 at 3:28 PM, Andrea wrote:
>>>>>>>>
>>>>>>>>> Hi to everybody,
>>>>>>>>> I work with blast quite often and i could say i run hundreds
>>>>>>>>> of thousand of blastpgp.
>>>
>>> Thanks
>>> Andrea
>>
>>
>> Then it looks like something more weird than what I thought...
>> Andrea, would you mind if I send your e-mail to the blast people? Or
>> perhaps you can do it yourself... I wrote to blast-help at ncbi.nlm.nih.gov
> If you can, for me is an help. I hope they will reply.
> I can also send and email, buti f you have....

I will do that, no problem.

>>
>> I suspect they will tell us to use the XML output, but then, not all
>> info I need seems to go there...
> i think the same, and i suspect the XML output doesn't suffer of the
> same problem.

For me the XML output is not an option, since the NCBIXML parser does not
really support PSI-BLAST searches: it can't get information on the rounds,
convergence... If you have a look at NCBIXML.py you will see a lot of
XXX TODO PSI...

>>
>> Thanks a lot!
>>
>>
> To you!!
>> -- Miguel
>>
>>
> And for my patch, is not a patch.I've checked now. To be fully
> independent from NcbiStandalone.py i didn't write a patch for it. I
> wrote a patch in the sense that actually i remove from the blastpgp
> output, four lines, starting from the "Query 0" one, and then i submit
> the "new output" to the parser.
> In this way i'm reading the file twice (so it's not a good idea),
> but i don't mind if the NcbiStandalone.py change, because I'm fully
> independent from it.
>
> This is my "simple code":
>
> ## THIS IS NOT A PATCH. BUT IT WORKS.
> ## THIS MEANS THAT IF WE FIND THE WAY
> ## TO REMOVE FOUR LINES STARTING
> ## FROM "Query 0" THE PROBLEM IS REALLY
> ## SOLVED (NOW I DON'T HAVE PARSER
> ## PROBLEMS AT ALL).
> ## lines is a list derived from a readlines() call of the
> ## output of blastpgp.
> ## newlines has to be reconverted into an handle
> ## object.
> def removeQuery0lines(lines):
>     newlines = []
>     count = 0
>     for l in lines:
>         if count == 4: count = 0
>         if count != 0: count+=1
>         if l.startswith('Query: 0'): count = 1
>         if count == 0: newlines.append(l)
>     return newlines
>

Thanks!
> > It should be interesting to develope a patch that works inside the > parser. > I will try to work on it, in November, becaue now i cannot. > The right function to manipulate it should be (inside > NCBIStandalone.py): > > def _scan_hsp_alignment(self, uhandle, consumer): > # Query: 11 > GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF > # GRGVS+ TC Y + + V GGG+ + EE L > + I R+ > # Sbjct: 12 > GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG > # > # Query: 64 AEKILIKR 71 > # I +K > # Sbjct: 70 PNIIQLKD 77 > # > > while 1: > # Blastn adds an extra line filled with spaces before Query > attempt_read_and_call(uhandle, consumer.noevent, > start=' ') > read_and_call(uhandle, consumer.query, start='Query') > read_and_call(uhandle, consumer.align, start=' ') > read_and_call(uhandle, consumer.sbjct, start='Sbjct') > read_and_call_while(uhandle, consumer.noevent, blank=1) > line = safe_peekline(uhandle) > # Alignment continues if I see a 'Query' or the spaces for > Blastn. > if not (line.startswith('Query') or line.startswith(' > ')): > break > > changing it in: > > def _scan_hsp_alignment(self, uhandle, consumer): > # Query: 11 > GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF > # GRGVS+ TC Y + + V GGG+ + EE L > + I R+ > # Sbjct: 12 > GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG > # > # Query: 64 AEKILIKR 71 > # I +K > # Sbjct: 70 PNIIQLKD 77 > # > while 1: > # Blastn adds an extra line filled with spaces before Query > attempt_read_and_call(uhandle, consumer.noevent, > start=' ') > # Remove Query 0 start (It is only at the beginning...) 
> q0_count = attempt_read_and_call(uhandle, consumer.noevent, > start='Query: 0') > if q0_count: > # if "Query 0" remove its alignment > read_and_call(uhandle, consumer.noevent, > start=' ') > read_and_call(uhandle, consumer.noevent, > start='Sbjct') > read_and_call_while(uhandle, consumer.noevent, > blank=1) > # Remove Query 0 end > read_and_call(uhandle, consumer.query, start='Query') > read_and_call(uhandle, consumer.align, start=' ') > read_and_call(uhandle, consumer.sbjct, start='Sbjct') > read_and_call_while(uhandle, consumer.noevent, blank=1) > line = safe_peekline(uhandle) > # Alignment continues if I see a 'Query' or the spaces for > Blastn. > if not (line.startswith('Query') or line.startswith(' > ')): > break > > BUT, i'm not sure of the patch and i didn't try at all... so i cannot > submit... It needs to be tryed and tested!!!! > And i'm also not sure if it is the right place to patch....!!!! > > > > > I hope this could help.... > Miguel, have you time to try and test? > I'm afraid not in the next 6 weeks... Best, -- Miguel From biopython at maubp.freeserve.co.uk Thu Oct 15 11:39:15 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Oct 2009 16:39:15 +0100 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com> <320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com> <4AD5E810.5090607@biodec.com> <4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com> <4AD64602.9060603@biodec.com> <13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com> <4AD72978.4030900@biodec.com> <05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com> <4AD739CA.6090403@biodec.com> Message-ID: <320fb6e00910150839s445f70fbx1998560c46389a94@mail.gmail.com> You don't have to include *all* the previous email in the quote ;) On Thu, Oct 15, 2009 at 4:33 PM, Miguel Ortiz Lombardia wrote: >>> >>> I suspect they will tell us to use the XML output, but then, not all >>> info I need seems to go there... 
>>
>> i think the same, and i suspect the XML output doesn't suffer of the
>> same problem.
>
> For me the XML is a no issue, since the NCBIXML parser does not really
> support PSI-BLAST searches:
> it can't get information on the rounds, convergence... If you have a look to
> NCBIXML.py you see a lot of XXX TODO PSI...

There may well be some things missing in our parser, but last time
I checked, the XML file itself was missing lots of information found
in the plain text output.

Peter

From andrea at biodec.com  Thu Oct 15 11:39:48 2009
From: andrea at biodec.com (Andrea)
Date: Thu, 15 Oct 2009 17:39:48 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To:
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
 <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
 <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>
 <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>
 <7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>
 <320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>
 <408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>
 <4AD5E001.6070506@biodec.com>
 <320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
 <4AD5E810.5090607@biodec.com>
 <4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>
 <4AD64602.9060603@biodec.com>
 <13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>
 <4AD72978.4030900@biodec.com>
 <05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
 <4AD739CA.6090403@biodec.com>
Message-ID: <4AD74244.2070603@biodec.com>

Miguel Ortiz Lombardia wrote:
>
> On 15 Oct 09, at 17:03, Andrea wrote:
>
>> Miguel Ortiz Lombardia wrote:
>>>
>>> On 15 Oct 09, at 15:54, Andrea wrote:
>>>
>>>> Miguel Ortiz Lombardia wrote:
>>>>>
>>>>> On 14 Oct 09, at 23:43, Andrea wrote:
>>>>>
>>>>>> Miguel Ortiz Lombardia wrote:
>>>>>>> Le 14 oct. 09 à
17:02, Andrea a ?crit : >>>>>>>> Peter ha scritto: >>>>>>>>> On Wed, Oct 14, 2009 at 3:28 PM, Andrea >>>>>>>>> wrote: >>>>>>>>> >>>>>>>>>> Hi to everybody, >>>>>>>>>> I work with blast quite often and i could say i run hundreds of >>>>>>>>>> thousand >>>>>>>>>> of blastpgp. The "Query 0" outpt of blastpgp, is quite common >>>>>>>>>> for >>>>>>>>>> me, and >>>>>>>>>> i wrote a patch to my code, to remove these "nasty" lines, >>>>>>>>>> before >>>>>>>>>> passing >>>>>>>>>> the output to the parser. >>>>>>>>>> >>>>>>>>>> I found these type of lines in at least 1-2% of my runs. And i'm >>>>>>>>>> fully sure >>>>>>>>>> that i found them either in the output of blast via shell and in >>>>>>>>>> the output >>>>>>>>>> of blast via Biopython. >>>>>>>>>> >>>>>>>>>> The problem, according to me, is in the blastpgp algorithm and >>>>>>>>>> maybe >>>>>>>>>> could be managed in biopython (as i did in my code), cutting out >>>>>>>>>> these >>>>>>>>>> "Query 0" lines, because from the point of view of the >>>>>>>>>> alignments, >>>>>>>>>> they don't have any sense. It seems that blastpgp, wants to show >>>>>>>>>> which is the part of the target sequence align to the query >>>>>>>>>> before the >>>>>>>>>> starting point of the query itself (something like opening a >>>>>>>>>> gap, >>>>>>>>>> at the >>>>>>>>>> beginning of the query). >>>>>>>>>> And this happens "sometimes", and without any apparent reason. >>>>>>>>>> >>>>>>>>> >>>>>>>>> Andrea - do you have any small example output files with this >>>>>>>>> problem? If it does occur fairly often (1 to 2% of the time), >>>>>>>>> then >>>>>>>>> we should try and update the parser to cope. Miguel's example >>>>>>>>> is useful for testing while working on a bug fix, but too big to >>>>>>>>> include as part the unit tests. >>>>>>>>> >>>>>>>>> >>>>>>>> mmm... i've to search. I've some "cache" of gzipped blastpgp >>>>>>>> outputs. >>>>>>>> But I'm not >>>>>>>> sure i've the original (maybe already patched).... 
I can say one thing: >>>> I've used in the past these three versione of blastpgp: >>>> - 2.2.15 >>>> - 2.2.18 >>>> - 2.2.19 >>>> and i found the "Query 0" problem in all of them, but, if one >>>> of them fails (i mean, gives "Query 0" output) the other may not fail >>>> at all (they most probably not give the "Query 0" output). >>>> >>>> Another interesting things is that, with the three version, the same >>>> database, and the same parameters, the output is quite different... >>>> ...sorry.. very different... >>>> >>>> I'm also sure that it could happens also with the complexity region(s) >>>> filter "True". >>>> What i observe, is that there aren't parameters that make it >>>> disappear. It >>>> just disappear from a sequence, and it will appear in another.... in >>>> other >>>> word, changing parameters, make it "moving" between sequences. >>>> >>>> I've never used blastpgp 2.2.22. So i cannot say anything about it. >>>> >>>> Thanks >>>> Andrea >>> >>> >>> Then it looks like something more weird than what I thought... >>> Andrea, would you mind if I send your e-mail to the blast people? Or >>> perhaps you can do it yourself... I wrote to >>> blast-help at ncbi.nlm.nih.gov >> If you can, for me is an help. I hope they will reply. >> I can also send and email, buti f you have.... > > I will do that, no problem > >>> >>> I suspect they will tell us to use the XML output, but then, not all >>> info I need seems to go there... >> i think the same, and i suspect the XML output doesn't suffer of the >> same problem. > > For me the XML is a no issue, since the NCBIXML parser does not really > support PSI-BLAST searches: > it can't get information on the rounds, convergence... If you have a > look to NCBIXML.py you see a lot of XXX TODO PSI... > >>> >>> Thanks a lot! >>> >>> >> To you!! >>> -- Miguel >>> >>> >> And for my patch, is not a patch.I've checked now. To be fully >> independent >> from NcbiStandalone.py i didn't write a patch for it. 
>> I wrote a patch in the sense that I actually remove four lines from the
>> blastpgp output, starting from the "Query 0" one, and then I submit the
>> "new output" to the parser.
>> This way I'm reading the file twice (so it's not a good idea), but I don't
>> mind if NcbiStandalone.py changes, because I'm fully independent from it.
>>
>> This is my "simple code":
>>
>> ## THIS IS NOT A PATCH. BUT IT WORKS.
>> ## THIS MEANS THAT IF WE FIND THE WAY
>> ## TO REMOVE FOUR LINES STARTING
>> ## FROM "Query 0" THE PROBLEM IS REALLY
>> ## SOLVED (NOW I DON'T HAVE PARSER
>> ## PROBLEMS AT ALL).
>> ## lines is a list derived from a readlines() call of the
>> ## output of blastpgp.
>> ## newlines has to be reconverted into a handle
>> ## object.
>> def removeQuery0lines(lines):
>>     newlines = []
>>     count = 0
>>     for l in lines:
>>         if count == 4: count = 0
>>         if count != 0: count+=1
>>         if l.startswith('Query: 0'): count = 1
>>         if count == 0: newlines.append(l)
>>     return newlines
>>
>
> Thanks!
>
>>
>> It would be interesting to develop a patch that works inside the
>> parser.
>> I will try to work on it in November, because now I cannot.
>> The right function to manipulate it should be (inside >> NCBIStandalone.py): >> >> def _scan_hsp_alignment(self, uhandle, consumer): >> # Query: 11 >> GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF >> # GRGVS+ TC Y + + V GGG+ + EE L >> + I R+ >> # Sbjct: 12 >> GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG >> # >> # Query: 64 AEKILIKR 71 >> # I +K >> # Sbjct: 70 PNIIQLKD 77 >> # >> >> while 1: >> # Blastn adds an extra line filled with spaces before Query >> attempt_read_and_call(uhandle, consumer.noevent, >> start=' ') >> read_and_call(uhandle, consumer.query, start='Query') >> read_and_call(uhandle, consumer.align, start=' ') >> read_and_call(uhandle, consumer.sbjct, start='Sbjct') >> read_and_call_while(uhandle, consumer.noevent, blank=1) >> line = safe_peekline(uhandle) >> # Alignment continues if I see a 'Query' or the spaces for >> Blastn. >> if not (line.startswith('Query') or line.startswith(' >> ')): >> break >> >> changing it in: >> >> def _scan_hsp_alignment(self, uhandle, consumer): >> # Query: 11 >> GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF >> # GRGVS+ TC Y + + V GGG+ + EE L >> + I R+ >> # Sbjct: 12 >> GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG >> # >> # Query: 64 AEKILIKR 71 >> # I +K >> # Sbjct: 70 PNIIQLKD 77 >> # >> while 1: >> # Blastn adds an extra line filled with spaces before Query >> attempt_read_and_call(uhandle, consumer.noevent, >> start=' ') >> # Remove Query 0 start (It is only at the beginning...) 
>> q0_count = attempt_read_and_call(uhandle, consumer.noevent, >> start='Query: 0') >> if q0_count: >> # if "Query 0" remove its alignment >> read_and_call(uhandle, consumer.noevent, start=' ') >> read_and_call(uhandle, consumer.noevent, start='Sbjct') >> read_and_call_while(uhandle, consumer.noevent, blank=1) >> # Remove Query 0 end >> read_and_call(uhandle, consumer.query, start='Query') >> read_and_call(uhandle, consumer.align, start=' ') >> read_and_call(uhandle, consumer.sbjct, start='Sbjct') >> read_and_call_while(uhandle, consumer.noevent, blank=1) >> line = safe_peekline(uhandle) >> # Alignment continues if I see a 'Query' or the spaces for >> Blastn. >> if not (line.startswith('Query') or line.startswith(' >> ')): >> break >> >> BUT, i'm not sure of the patch and i didn't try at all... so i cannot >> submit... It needs to be tryed and tested!!!! >> And i'm also not sure if it is the right place to patch....!!!! >> >> >> >> >> I hope this could help.... >> Miguel, have you time to try and test? >> > > I'm afraid not in the next 6 weeks... > > Best, > > > > -- Miguel > > So i will try in 3 weeks.. ;-) And, as suggested from Peter, we will move the discussion to http://bugzilla.open-bio.org/show_bug.cgi?id=2927 with some examples.... Ciao Andrea From natassa_g_2000 at yahoo.com Thu Oct 15 12:00:28 2009 From: natassa_g_2000 at yahoo.com (natassa) Date: Thu, 15 Oct 2009 09:00:28 -0700 (PDT) Subject: [Biopython] Adaptor trimmer and dimers Message-ID: <355533.31188.qm@web52001.mail.re2.yahoo.com> Hallo Biopythoners, I followed a recent thread conversation about adaptor trimming, which I intend to do on Illumina runs, and I am not sure I know where exactly in github I could find Brad Chapman's code for trimming AFTER modifications that he has done based on the thread conversation. I d like to test that code, which looks very appealing to me if it computes a global alignment and allows for a certain simplicity, ex number of mismatches. 
The link in Brad Chapman's original post on the trimmer points to a non-Biopython GitHub repository (sorry if I misunderstand these things!) and I have the impression it has not been updated for the above (and other) features discussed in the thread.

On the same topic, I would like to ask about people's experience with the detection of adaptor dimers. I have just started considering the issue, and my understanding is that Illumina technology, at least, is mostly biased towards the presence of adapter dimers, rather than adapter fragments within the reads. This was confirmed by the company that did the sequencing for my samples. So I was surprised to find no discussion of dimers and no obvious adaptation of scripts for their detection. Maybe I am wrong? I tested a Perl script that detects 'adapter-only sequences', but when I tried to visually inspect those to see if they represent dimers, I realized the importance of doing a global alignment ;-), the script doing a local one. The fraction of adapter-only sequences, if those represent the dimers I am looking for, is small, so I'd be happy to filter them out. But I am not sure about this, and lacking a way to detect such dimers, I would happily give a trimmer a go, though not a very aggressive one!

Do you think adapter trimming is critical? What fraction of your Illumina reads contained adapters?

Sorry for the overflow of questions!
Many thanks,
Anastasia

Anastasia Gioti
Post-Doc, Evolutionary Biology Department
Uppsala University
Norbyvägen 18D
SE-752 36
UPPSALA anastasia.gioti at ebc.uu.se Tel: +46-18-471 2837 Fax: +46-18-471 6310 From biopython at maubp.freeserve.co.uk Thu Oct 15 12:09:33 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Oct 2009 17:09:33 +0100 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <025E6C33-9CCE-4319-907B-6C39289C014C@gmail.com> References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com> <4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com> <4AD64602.9060603@biodec.com> <13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com> <4AD72978.4030900@biodec.com> <05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com> <4AD739CA.6090403@biodec.com> <320fb6e00910150839s445f70fbx1998560c46389a94@mail.gmail.com> <025E6C33-9CCE-4319-907B-6C39289C014C@gmail.com> Message-ID: <320fb6e00910150909i6cb01f59le2021300f70b6458@mail.gmail.com> CC'd back to mailing list On Thu, Oct 15, 2009 at 4:51 PM, Miguel Ortiz Lombardia wrote: > Le 15 oct. 09 ? 17:39, Peter a ?crit : > >>> For me the XML is a no issue, since the NCBIXML parser does not really >>> support PSI-BLAST searches: >>> it can't get information on the rounds, convergence... If you have a look >>> to NCBIXML.py you see a lot of XXX TODO PSI... >> >> There may well be some things missing in our parser, but last time I >> checked, the XML file itself was missing lots of information found in >> the plain text output. >> >> Peter > > I am sending to you an xml file from a PSI-Blast run that converged. You see > there for example info about iteration number and convergence, for example. > It's just 59 Kb, I can upload it to the bug 2927, but I suspect you prefer > not, since this is a new issue (XML parser). > > IMHO, the NCBIXML parser should behave (concerning PSI-BLAST) just like > NCBIStandalone.PSIBlastparser. Perhaps a NCBIXML.PSIBlasparser should be > created ? Just an idea... Michiel also thinks the PSI BLAST XML parser could be better, see: http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006780.html ... 
http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006791.html http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006802.html Can you file a new bug about PSI-BLAST XML parsing (and attach that example) please? I'd have to look over the new PSI-BLAST XML files before having an informed opinion. Peter From biopython at maubp.freeserve.co.uk Thu Oct 15 12:20:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Oct 2009 17:20:47 +0100 Subject: [Biopython] Adaptor trimmer and dimers In-Reply-To: <355533.31188.qm@web52001.mail.re2.yahoo.com> References: <355533.31188.qm@web52001.mail.re2.yahoo.com> Message-ID: <320fb6e00910150920y6ada0463s60ce0b4e5f788449@mail.gmail.com> On Thu, Oct 15, 2009 at 5:00 PM, natassa wrote: > Hallo Biopythoners, > I followed a recent thread conversation about adaptor trimming, > which I intend to do on Illumina runs, and I am not sure I know > where exactly in github I could find Brad Chapman's code for > trimming AFTER modifications that he has done based on the > thread conversation. ... 
I guess you mean Brad's August Blog Post: http://bcbio.wordpress.com/2009/08/09/trimming-adaptors-from-short-read-sequences/ and the following mailing list thread which included some tips on speeding up the Biopython side of things: http://lists.open-bio.org/pipermail/biopython/2009-August/005417.html For anyone else interested, there are some simple examples in the tutorial (using SeqRecord slicing - elegant and simple, but a bit slow): http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-primer http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-adaptor And I did a blog post about low level FASTQ handling for speed at the cost of flexibility and simplicity (using some of the same ideas from the August mailing list discussion): http://news.open-bio.org/news/2009/09/biopython-fast-fastq/ Peter From ibdeno at gmail.com Thu Oct 15 12:24:24 2009 From: ibdeno at gmail.com (Miguel Ortiz Lombardia) Date: Thu, 15 Oct 2009 18:24:24 +0200 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <320fb6e00910150909i6cb01f59le2021300f70b6458@mail.gmail.com> References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com> <4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com> <4AD64602.9060603@biodec.com> <13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com> <4AD72978.4030900@biodec.com> <05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com> <4AD739CA.6090403@biodec.com> <320fb6e00910150839s445f70fbx1998560c46389a94@mail.gmail.com> <025E6C33-9CCE-4319-907B-6C39289C014C@gmail.com> <320fb6e00910150909i6cb01f59le2021300f70b6458@mail.gmail.com> Message-ID: Le 15 oct. 09 ? 18:09, Peter a ?crit : >> >> IMHO, the NCBIXML parser should behave (concerning PSI-BLAST) just >> like >> NCBIStandalone.PSIBlastparser. Perhaps a NCBIXML.PSIBlasparser >> should be >> created ? Just an idea... > > Michiel also thinks the PSI BLAST XML parser could be better, see: > > http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006780.html > ... 
> http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006791.html > http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006802.html > Sorry to have missed them. I still believe that the logic behind the NCBIStandalone.PSIBlastparser is correct or, at least, useful. But I could change my mind if you think otherwise. The XML file that I sent to Peter came from blastpgp 2.2.22. It seems to me that it is a proper XML file, not a concatenation. > Can you file a new bug about PSI-BLAST XML parsing (and attach that > example) > please? I'd have to look over the new PSI-BLAST XML files before > having an > informed opinion. I have filed the bug: http://bugzilla.open-bio.org/show_bug.cgi?id=2929 and have upload the XML from blastpgp v. 2.2.22 mentioned above. Best, -- Miguel From biopython at maubp.freeserve.co.uk Thu Oct 15 12:32:06 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Oct 2009 17:32:06 +0100 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com> <13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com> <4AD72978.4030900@biodec.com> <05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com> <4AD739CA.6090403@biodec.com> <320fb6e00910150839s445f70fbx1998560c46389a94@mail.gmail.com> <025E6C33-9CCE-4319-907B-6C39289C014C@gmail.com> <320fb6e00910150909i6cb01f59le2021300f70b6458@mail.gmail.com> Message-ID: <320fb6e00910150932x5cdef3c0k5ad9b28eea51930f@mail.gmail.com> On Thu, Oct 15, 2009 at 5:24 PM, Miguel Ortiz Lombardia wrote: > > Le 15 oct. 09 ? 18:09, Peter a ?crit : >>> >>> IMHO, the NCBIXML parser should behave (concerning PSI-BLAST) just like >>> NCBIStandalone.PSIBlastparser. Perhaps a NCBIXML.PSIBlasparser should be >>> created ? Just an idea... >> >> Michiel also thinks the PSI BLAST XML parser could be better, see: >> >> http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006780.html >> ... 
>> http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006791.html >> http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006802.html >> > > Sorry to have missed them. They were on the dev list, so that makes sense. > I still believe that the logic behind the NCBIStandalone.PSIBlastparser is > correct or, at least, useful. But I could change my mind if you think > otherwise. The idea of the NCBIStandalone.PSIBlastparser plain text parser, and its object structure makes sense. > The XML file that I sent to Peter came from blastpgp 2.2.22. It seems to me > that it is a proper XML file, not a concatenation. > >> Can you file a new bug about PSI-BLAST XML parsing (and attach that >> example) please? I'd have to look over the new PSI-BLAST XML files >> before having an informed opinion. > > I have filed the bug: > http://bugzilla.open-bio.org/show_bug.cgi?id=2929 > and have upload the XML from blastpgp v. 2.2.22 mentioned above. Lovely - thank you. So that is a single query, with 3 iterations. What would be *really* nice, is a multiple query file (say three queries, each needing just a few iterations to keep the file small). Peter From ibdeno at gmail.com Thu Oct 15 12:52:45 2009 From: ibdeno at gmail.com (Miguel Ortiz Lombardia) Date: Thu, 15 Oct 2009 18:52:45 +0200 Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <320fb6e00910150932x5cdef3c0k5ad9b28eea51930f@mail.gmail.com> References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com> <13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com> <4AD72978.4030900@biodec.com> <05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com> <4AD739CA.6090403@biodec.com> <320fb6e00910150839s445f70fbx1998560c46389a94@mail.gmail.com> <025E6C33-9CCE-4319-907B-6C39289C014C@gmail.com> <320fb6e00910150909i6cb01f59le2021300f70b6458@mail.gmail.com> <320fb6e00910150932x5cdef3c0k5ad9b28eea51930f@mail.gmail.com> Message-ID: <1FA32B70-84CC-4012-97F5-9B67D56BADC6@gmail.com> Le 15 oct. 09 ? 
18:32, Peter a ?crit : > >> I still believe that the logic behind the >> NCBIStandalone.PSIBlastparser is >> correct or, at least, useful. But I could change my mind if you think >> otherwise. > > The idea of the NCBIStandalone.PSIBlastparser plain text parser, and > its object structure makes sense. > Good! >> I have filed the bug: >> http://bugzilla.open-bio.org/show_bug.cgi?id=2929 >> and have upload the XML from blastpgp v. 2.2.22 mentioned above. > > Lovely - thank you. So that is a single query, with 3 iterations. > What would be *really* nice, is a multiple query file (say three > queries, each needing just a few iterations to keep the file small). Never used multiple query file... Do you mean starting from a multiple- alignment file with the -B option? -- Miguel From pengyu.ut at gmail.com Thu Oct 15 17:17:26 2009 From: pengyu.ut at gmail.com (Peng Yu) Date: Thu, 15 Oct 2009 16:17:26 -0500 Subject: [Biopython] How to get sequences upstream of TSS of genes? Message-ID: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> I have a set of genes. I want to get the 5kb sequence that is upstream of the TSS's of each gene. I have the following specific questions. Could somebody help me? Thank you! Which database I can access to get mouse genome? Give a gene name what function I should call to get the gene's location? From carlos.borroto at gmail.com Thu Oct 15 17:18:17 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Thu, 15 Oct 2009 17:18:17 -0400 Subject: [Biopython] How to construct a SeqRecord with the info in the SeqFeatures type mRNA or CDS? Message-ID: <65d4b7fc0910151418q30347842yb48e6508a2cab544@mail.gmail.com> Hi, I want to construct a SeqRecord with the sequence make from the sum of the Locations of the SubFeatures I get from a SeqFeature type mRNA or CDS. Does biopython has something already to do this? 
It looks like something many people may want, but it is proving to be kind of difficult to implement manually, so I'm wondering if it is already there?

I read in the tutorial that you can slice a SeqRecord, but I can't find a reference to how to form a SeqRecord from several different slices, something like:

new_record = record[1:200] + record[400:600]

thanks in advance,
--
Carlos Javier Borroto
Baltimore, MD
Google Voice: (410) 929 4020

From biopython at maubp.freeserve.co.uk Thu Oct 15 17:35:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Oct 2009 22:35:52 +0100
Subject: [Biopython] How to construct a SeqRecord with the info in the SeqFeatures type mRNA or CDS?
In-Reply-To: <65d4b7fc0910151418q30347842yb48e6508a2cab544@mail.gmail.com>
References: <65d4b7fc0910151418q30347842yb48e6508a2cab544@mail.gmail.com>
Message-ID: <320fb6e00910151435k26f48292jb0eaab5661d83759@mail.gmail.com>

On Thu, Oct 15, 2009 at 10:18 PM, Carlos Javier Borroto wrote:
> Hi,
>
> I want to construct a SeqRecord with the sequence made from the sum of
> the Locations of the SubFeatures I get from a SeqFeature type mRNA or
> CDS. Does Biopython have something already to do this? It looks like
> something many people may want, but it is proving to be kind of difficult
> to implement manually, so I'm wondering if it is already there?

There isn't anything built in now, partly because to do it properly means coping with a lot of possible fuzzy locations and joins. I can go into more detail, but it would help to know what kind of organisms you are working with. For prokaryotes and viruses, CDS locations are (usually) trivial so you just need the start, end and strand.

> I read in the tutorial that you can slice a SeqRecord, but I can't
> find a reference to how to form a SeqRecord from several different
> slices, something like:
>
> new_record = record[1:200] + record[400:600]

That isn't built in, but is something I've been working on that might be in Biopython in future.
Do you fancy trying some experimental code? http://github.com/peterjc/biopython/tree/seqrecords http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html Peter From biopython at maubp.freeserve.co.uk Thu Oct 15 17:42:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Oct 2009 22:42:41 +0100 Subject: [Biopython] How to get sequences upstream of TSS of genes? In-Reply-To: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> Message-ID: <320fb6e00910151442v4a96cbd6j7a03d3f397b9c264@mail.gmail.com> On Thu, Oct 15, 2009 at 10:17 PM, Peng Yu wrote: > I have a set of genes. I want to get the 5kb sequence that is upstream > of the TSS's of each gene. > > I have the following specific questions. Could somebody help me? Thank you! > > Which database I can access to get mouse genome? > Give a gene name what function I should call to get the gene's location? I am not familiar with mouse specific databases. My first instinct would be to download the GenBank files for all the mouse chromosomes via FTP from the NCBI. You can parse these with Biopython, and pull out the gene of interest. Then using the gene's strand and the start/end location, you can deduce the coordinates to the upstream region, and take this section from the chromosome sequence (and reverse complement if on the reverse strand). Peter From biopython at maubp.freeserve.co.uk Thu Oct 15 17:48:20 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 15 Oct 2009 22:48:20 +0100 Subject: [Biopython] How to construct a SeqRecord with the info in the SeqFeatures type mRNA or CDS? 
In-Reply-To: <320fb6e00910151435k26f48292jb0eaab5661d83759@mail.gmail.com> References: <65d4b7fc0910151418q30347842yb48e6508a2cab544@mail.gmail.com> <320fb6e00910151435k26f48292jb0eaab5661d83759@mail.gmail.com> Message-ID: <320fb6e00910151448i125bf77emb8dafcf30d9fdd1a@mail.gmail.com> On Thu, Oct 15, 2009 at 10:35 PM, Peter wrote: > On Thu, Oct 15, 2009 at 10:18 PM, Carlos Javier Borroto wrote: >> Hi, >> >> I want to construct a SeqRecord with the sequence make from the sum of >> the Locations of the SubFeatures I get from a SeqFeature type mRNA or >> CDS. Does biopython has something already to do this? It looks like >> something many people may want, but is proving to be king of difficult >> to implement manually, so I'm wondering if is already there? > > There isn't anything built in now, partly because to do it properly > means coping with a lot of possible fuzzy locations and joins. > I can go into more detail, but it would help to know what kind > of organisms are you working with? For prokaryotes and viruses, > CDS locations are (usually) trivial so you just need the start, end > and strand. There is a partly tested function called get_feature_nuc in the unit test file test_SeqIO_features.py, which takes a SeqFeature and the parent Seq object. In fact looking at it now, some of the comments look out of date (I think I fixed the GenBank parser to cope with mixed strand features ...). This might do what you want - but as I said, it needs more testing. It had crossed my mind (as you can tell from the comments) that this could be added to Biopython proper at some point. One idea was as a method of the SeqRecord object, which would take a SeqFeature (or just the integer index of the desired feature in the SeqRecord's list of features). 
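To make the "start, end and strand" idea concrete, here is a minimal, library-free sketch of joining sub-locations from a parent sequence. It is not the get_feature_nuc function mentioned above and does not use Biopython; the names extract_joined and reverse_complement are hypothetical, coordinates are assumed to be exact (no fuzzy positions), 0-based and half-open, and the parts are assumed to be listed in biological (reading) order.

```python
# Complement table for unambiguous DNA only (an assumption of this sketch);
# any other character is passed through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a plain DNA string."""
    return dna.translate(_COMPLEMENT)[::-1]


def extract_joined(parent, parts):
    """Hypothetical helper: splice the given parts out of the parent string.

    parts is a list of (start, end, strand) tuples with 0-based, half-open
    forward-strand coordinates and strand +1 or -1, given in reading order.
    """
    pieces = []
    for start, end, strand in parts:
        piece = parent[start:end]
        if strand == -1:
            # Reverse-strand parts are read from the opposite strand.
            piece = reverse_complement(piece)
        pieces.append(piece)
    return "".join(pieces)


parent = "AAACCCGGGTTT"
forward = extract_joined(parent, [(0, 3, +1), (6, 9, +1)])
# The same two intervals read from the reverse strand, in reading order.
reverse = extract_joined(parent, [(6, 9, -1), (0, 3, -1)])
print(forward, reverse)
```

For a feature entirely on the reverse strand, this gives the same result as joining on the forward strand and reverse complementing the whole thing, which is a useful sanity check. Fuzzy locations would need extra handling that this sketch does not attempt.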
Peter From mjldehoon at yahoo.com Thu Oct 15 21:04:20 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 15 Oct 2009 18:04:20 -0700 (PDT) Subject: [Biopython] Problems parsing with PSIBlastParser In-Reply-To: <1FA32B70-84CC-4012-97F5-9B67D56BADC6@gmail.com> Message-ID: <737542.47267.qm@web62401.mail.re1.yahoo.com> Last time I checked (which was a few weeks ago), a multiple-query PSIBlast search gives a file consisting of concatenated XML files. The problem is in the design of Blast XML output. For a single-query PSIBlast, the fields under are used to store the output of the PSIBlast iterations. For multiple-query regular Blast, the same fields are used to store the search results of each query. With multiple-query PSIBlast, there is then no way to store the output in the current XML format. I've been meaning to write to NCBI about this, but I haven't gotten round to it yet. Will do so this weekend. --Michiel. --- On Thu, 10/15/09, Miguel Ortiz Lombardia wrote: > From: Miguel Ortiz Lombardia > Subject: Re: [Biopython] Problems parsing with PSIBlastParser > To: "Peter" > Cc: "Biopython Mailing List" > Date: Thursday, October 15, 2009, 12:52 PM > > Le 15 oct. 09 ? 18:32, Peter a ?crit : > > > >> I still believe that the logic behind the > NCBIStandalone.PSIBlastparser is > >> correct or, at least, useful. But I could change > my mind if you think > >> otherwise. > > > > The idea of the NCBIStandalone.PSIBlastparser plain > text parser, and > > its object structure makes sense. > > > > Good! > > >> I have filed the bug: > >> http://bugzilla.open-bio.org/show_bug.cgi?id=2929 > >> and have upload the XML from blastpgp v. 2.2.22 > mentioned above. > > > > Lovely - thank you. So that is a single query, with 3 > iterations. > > What would be *really* nice, is a multiple query file > (say three > > queries, each needing just a few iterations to keep > the file small). > > > Never used multiple query file... 
> Do you mean starting from a multiple-alignment file with the -B option?
>
> -- Miguel
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From biopython at maubp.freeserve.co.uk Fri Oct 16 04:11:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Oct 2009 09:11:45 +0100
Subject: [Biopython] How to construct a SeqRecord with the info in the SeqFeatures type mRNA or CDS?
In-Reply-To: <320fb6e00910151435k26f48292jb0eaab5661d83759@mail.gmail.com>
References: <65d4b7fc0910151418q30347842yb48e6508a2cab544@mail.gmail.com> <320fb6e00910151435k26f48292jb0eaab5661d83759@mail.gmail.com>
Message-ID: <320fb6e00910160111t4f999350we0ef349dc454902a@mail.gmail.com>

On Thu, Oct 15, 2009 at 10:35 PM, Peter wrote:
> On Thu, Oct 15, 2009 at 10:18 PM, Carlos Javier Borroto wrote:
>> I read in the tutorial that you can slice a SeqRecord, but I can't
>> find a reference to how to form a SeqRecord from several different
>> slices, something like:
>>
>> new_record = record[1:200] + record[400:600]
>
> That isn't built in, but is something I've been working on that
> might be in Biopython in future. Do you fancy trying some
> experimental code?
>
> http://github.com/peterjc/biopython/tree/seqrecords
> http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html

What I should have added yesterday was how you would solve this with Biopython as it is now (e.g. Biopython 1.52):

    new_record = SeqRecord(record.seq[1:200]+record.seq[400:600])
    new_record.id = record.id #if this makes sense
    new_record.name = record.name #if this makes sense
    ...

Dealing with complex annotation however is (currently) more complicated - hence the code I was working on.
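As a rough illustration of why the annotation is the complicated part, the sketch below joins two slices of a toy record and shows what has to happen to feature coordinates. The Record and Feature classes and the slice_record and join_records functions are hypothetical stand-ins written for this example, not Biopython's SeqRecord or SeqFeature, and the policy of dropping features that are not fully inside a slice is an assumption of the sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for a feature: a typed, 0-based half-open interval."""
    type: str
    start: int
    end: int


@dataclass
class Record:
    """Hypothetical stand-in for a record: a sequence string plus features."""
    seq: str
    id: str = "<unknown id>"
    features: list = field(default_factory=list)


def slice_record(rec, start, end):
    """Return rec[start:end], keeping only features fully inside the slice."""
    kept = [
        Feature(f.type, f.start - start, f.end - start)
        for f in rec.features
        if start <= f.start and f.end <= end
    ]
    return Record(rec.seq[start:end], rec.id, kept)


def join_records(left, right):
    """Concatenate two records, shifting the right-hand feature coordinates."""
    offset = len(left.seq)
    shifted = [
        Feature(f.type, f.start + offset, f.end + offset) for f in right.features
    ]
    return Record(left.seq + right.seq, left.id, left.features + shifted)


record = Record(
    "AAACCCGGGTTTAAACCC",
    id="demo",
    features=[Feature("exon", 0, 6), Feature("exon", 12, 18), Feature("gene", 0, 18)],
)
new_record = join_records(slice_record(record, 0, 6), slice_record(record, 12, 18))
print(new_record.seq)
for f in new_record.features:
    print(f.type, f.start, f.end)
```

Note that the "gene" feature spanning both slices is silently lost here. Deciding what to do with such features (drop, truncate, or split them) is one of the choices a real implementation has to make.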
Peter From dalloliogm at gmail.com Fri Oct 16 04:29:46 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Fri, 16 Oct 2009 10:29:46 +0200 Subject: [Biopython] How to get sequences upstream of TSS of genes? In-Reply-To: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> Message-ID: <5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com> On Thu, Oct 15, 2009 at 11:17 PM, Peng Yu wrote: > I have a set of genes. I want to get the 5kb sequence that is upstream > of the TSS's of each gene. You can do that with biomart: - http://www.ensembl.org/biomart/martview/a90f00892a48e04d438f762f551bf48a/a90f00892a48e04d438f762f551bf48a select Ensembl56 as database, Mus Musculus as species, go to Filters and fill the 'Id list limit' form to add the required geneIds, then go to Attributes, select Sequences and then check 'Upstream Flank - 5000'. As for doing that in python, I am not sure there are python interfaces to BioMart. Galaxy (http://main.g2.bx.psu.edu/) is written in python, so they must have written a library for that somewhere, but I don't know their code. If you use R (remember that you can mix python and R with rpy2) there is a nice module in bioconductor called BioMart. > I have the following specific questions. Could somebody help me? Thank you! > > Which database I can access to get mouse genome? > Give a gene name what function I should call to get the gene's location? 
> _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From pengyu.ut at gmail.com Fri Oct 16 10:52:00 2009 From: pengyu.ut at gmail.com (Peng Yu) Date: Fri, 16 Oct 2009 09:52:00 -0500 Subject: [Biopython] How to get sequences upstream of TSS of genes? In-Reply-To: <5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com> References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> <5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com> Message-ID: <366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com> On Fri, Oct 16, 2009 at 3:29 AM, Giovanni Marco Dall'Olio wrote: > On Thu, Oct 15, 2009 at 11:17 PM, Peng Yu wrote: >> I have a set of genes. I want to get the 5kb sequence that is upstream >> of the TSS's of each gene. > > You can do that with biomart: > - http://www.ensembl.org/biomart/martview/a90f00892a48e04d438f762f551bf48a/a90f00892a48e04d438f762f551bf48a > > select Ensembl56 as database, Mus Musculus as species, go to Filters > and fill the 'Id list limit' form to add the required geneIds, then go > to Attributes, select Sequences and then check 'Upstream Flank - > 5000'. I have gene names (for example, Krt83) what geneIDs shall I choose? > As for doing that in python, I am not sure there are python interfaces > to BioMart. Galaxy (http://main.g2.bx.psu.edu/) is written in python, > so they must have written a library for that somewhere, but I don't > know their code. > > If you use R (remember that you can mix python and R with rpy2) there > is a nice module in bioconductor called BioMart. > > >> I have the following specific questions. Could somebody help me? Thank you! >> >> Which database I can access to get mouse genome? 
>> Give a gene name what function I should call to get the gene's location? >> _______________________________________________ >> Biopython mailing list ?- ?Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > > -- > Giovanni Dall'Olio, phd student > Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) > > My blog on bioinformatics: http://bioinfoblog.it > From mailinglist.honeypot at gmail.com Fri Oct 16 10:55:19 2009 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Fri, 16 Oct 2009 10:55:19 -0400 Subject: [Biopython] How to get sequences upstream of TSS of genes? In-Reply-To: <366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com> References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> <5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com> <366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com> Message-ID: <40B70639-3317-4289-97C1-940EDFD16BEF@gmail.com> Hi, On Oct 16, 2009, at 10:52 AM, Peng Yu wrote: > On Fri, Oct 16, 2009 at 3:29 AM, Giovanni Marco Dall'Olio > wrote: >> On Thu, Oct 15, 2009 at 11:17 PM, Peng Yu >> wrote: >>> I have a set of genes. I want to get the 5kb sequence that is >>> upstream >>> of the TSS's of each gene. >> >> You can do that with biomart: >> - http://www.ensembl.org/biomart/martview/a90f00892a48e04d438f762f551bf48a/a90f00892a48e04d438f762f551bf48a >> >> select Ensembl56 as database, Mus Musculus as species, go to Filters >> and fill the 'Id list limit' form to add the required geneIds, then >> go >> to Attributes, select Sequences and then check 'Upstream Flank - >> 5000'. > > I have gene names (for example, Krt83) what geneIDs shall I choose? Since your on ensembl's web site, I'd imagine ensembl gene id's might be a good bet, no? 
:-) -steve -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact From dalloliogm at gmail.com Fri Oct 16 11:24:55 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Fri, 16 Oct 2009 17:24:55 +0200 Subject: [Biopython] How to get sequences upstream of TSS of genes? In-Reply-To: <40B70639-3317-4289-97C1-940EDFD16BEF@gmail.com> References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> <5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com> <366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com> <40B70639-3317-4289-97C1-940EDFD16BEF@gmail.com> Message-ID: <5aa3b3570910160824h7a569570x60a113909b8ea93f@mail.gmail.com> On Fri, Oct 16, 2009 at 4:55 PM, Steve Lianoglou wrote: > Hi, > > On Oct 16, 2009, at 10:52 AM, Peng Yu wrote: > >> On Fri, Oct 16, 2009 at 3:29 AM, Giovanni Marco Dall'Olio >> wrote: >>> >> >> I have gene names (for example, Krt83) what geneIDs shall I choose? > > Since your on ensembl's web site, I'd imagine ensembl gene id's might be a > good bet, no? :-) exactly, but if you look at the form more carefully you will see that there is a menu from which you can choose the type of geneId, for example: ensembl, kegg, ncbi, etc... note: I didn't send you the ufficial biomart's link. 
The right one is: - http://www.ensembl.org/biomart/martview > -steve > > -- > Steve Lianoglou > Graduate Student: Computational Systems Biology > ?| ?Memorial Sloan-Kettering Cancer Center > ?| ?Weill Medical College of Cornell University > Contact Info: http://cbio.mskcc.org/~lianos/contact > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From pengyu.ut at gmail.com Fri Oct 16 11:44:55 2009 From: pengyu.ut at gmail.com (Peng Yu) Date: Fri, 16 Oct 2009 10:44:55 -0500 Subject: [Biopython] How to get sequences upstream of TSS of genes? In-Reply-To: <5aa3b3570910160824h7a569570x60a113909b8ea93f@mail.gmail.com> References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> <5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com> <366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com> <40B70639-3317-4289-97C1-940EDFD16BEF@gmail.com> <5aa3b3570910160824h7a569570x60a113909b8ea93f@mail.gmail.com> Message-ID: <366c6f340910160844x14e9d532o1b236d28b22cea4@mail.gmail.com> On Fri, Oct 16, 2009 at 10:24 AM, Giovanni Marco Dall'Olio wrote: > On Fri, Oct 16, 2009 at 4:55 PM, Steve Lianoglou > wrote: >> Hi, >> >> On Oct 16, 2009, at 10:52 AM, Peng Yu wrote: >> >>> On Fri, Oct 16, 2009 at 3:29 AM, Giovanni Marco Dall'Olio >>> wrote: >>>> >>> >>> I have gene names (for example, Krt83) what geneIDs shall I choose? >> >> Since your on ensembl's web site, I'd imagine ensembl gene id's might be a >> good bet, no? :-) > > exactly, but if you look at the form more carefully you will see that > there is a menu from which you can choose the type of geneId, for > example: ensembl, kegg, ncbi, etc... > > note: I didn't send you the ufficial biomart's link. 
The right one is:
> - http://www.ensembl.org/biomart/martview

My question was how to figure out what type of geneID it was for 'Krt83'?
I tried 'Ensembl Gene ID(s)' in the menu and put 'Krt83' in the box
below it. But I get an empty mart_export.txt file.

From mailinglist.honeypot at gmail.com  Fri Oct 16 11:56:03 2009
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Fri, 16 Oct 2009 11:56:03 -0400
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <366c6f340910160844x14e9d532o1b236d28b22cea4@mail.gmail.com>
References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> <5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com> <366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com> <40B70639-3317-4289-97C1-940EDFD16BEF@gmail.com> <5aa3b3570910160824h7a569570x60a113909b8ea93f@mail.gmail.com> <366c6f340910160844x14e9d532o1b236d28b22cea4@mail.gmail.com>
Message-ID: <3A3213F6-0D6D-413D-9487-D7F0EF24BB83@gmail.com>

Hi,

On Oct 16, 2009, at 11:44 AM, Peng Yu wrote:

> My question was how to figure what type of geneID it was for 'Krt83'?
> I tried 'Ensembl Gene ID(s)' in the menu and put 'Krt83' in the box
> below it. But I get an empty mart_export.txt file.

I'm guessing your filters are set wrong.

Try with:
  * FILTER set to: MGI symbol
  * ATTRIBUTES set to: Ensembl Gene ID, Ensembl Transcript ID, MGI Symbol

You'd get:

Ensembl Gene ID      Ensembl Transcript ID   MGI symbol
ENSMUSG00000047641   ENSMUST00000108897      Krt83
ENSMUSG00000047641   ENSMUST00000081945      Krt83

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
  |  Memorial Sloan-Kettering Cancer Center
  |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

From dalloliogm at gmail.com  Fri Oct 16 11:57:05 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 16 Oct 2009 17:57:05 +0200
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <366c6f340910160844x14e9d532o1b236d28b22cea4@mail.gmail.com> References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> <5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com> <366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com> <40B70639-3317-4289-97C1-940EDFD16BEF@gmail.com> <5aa3b3570910160824h7a569570x60a113909b8ea93f@mail.gmail.com> <366c6f340910160844x14e9d532o1b236d28b22cea4@mail.gmail.com> Message-ID: <5aa3b3570910160857x7ee6a137u327d3da0adad15fa@mail.gmail.com> On Fri, Oct 16, 2009 at 5:44 PM, Peng Yu wrote: > On Fri, Oct 16, 2009 at 10:24 AM, Giovanni Marco Dall'Olio > wrote: >> On Fri, Oct 16, 2009 at 4:55 PM, Steve Lianoglou >> wrote: >>> Hi, >>> >>> On Oct 16, 2009, at 10:52 AM, Peng Yu wrote: >>> >>>> On Fri, Oct 16, 2009 at 3:29 AM, Giovanni Marco Dall'Olio >>>> wrote: >>>>> >>>> >>>> I have gene names (for example, Krt83) what geneIDs shall I choose? >>> >>> Since your on ensembl's web site, I'd imagine ensembl gene id's might be a >>> good bet, no? :-) >> >> exactly, but if you look at the form more carefully you will see that >> there is a menu from which you can choose the type of geneId, for >> example: ensembl, kegg, ncbi, etc... >> >> note: I didn't send you the ufficial biomart's link. The right one is: >> - http://www.ensembl.org/biomart/martview > > My question was how to figure what type of geneID it was for 'Krt83'? > I tried 'Ensembl Gene ID(s)' in the menu and put 'Krt83' in the box > below it. But I get an empty mart_export.txt file. All ensembl Ids starts with 'ENSG0....'. 
Your Krt83 should be an EntrezGene id: - http://www.ensembl.org/Homo_sapiens/Search/Details?_C=eJwFwdEJgDAMBcA3inSBKqKIA7iA*gepEcXQ1JA6v3ck4Az6Mg4!9yoOevGYT32T1Ira7hzdmOewaYmrVksceRgD6Lp9qSLoWvxUeBcJ&_c=%2b15428165997832314387&_c=%2b18088233473301975577 > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From pengyu.ut at gmail.com Sun Oct 18 11:44:58 2009 From: pengyu.ut at gmail.com (Peng Yu) Date: Sun, 18 Oct 2009 10:44:58 -0500 Subject: [Biopython] How to get sequences upstream of TSS of genes? In-Reply-To: <3A3213F6-0D6D-413D-9487-D7F0EF24BB83@gmail.com> References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> <5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com> <366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com> <40B70639-3317-4289-97C1-940EDFD16BEF@gmail.com> <5aa3b3570910160824h7a569570x60a113909b8ea93f@mail.gmail.com> <366c6f340910160844x14e9d532o1b236d28b22cea4@mail.gmail.com> <3A3213F6-0D6D-413D-9487-D7F0EF24BB83@gmail.com> Message-ID: <366c6f340910180844o5924ea98v1e840a6e19150c17@mail.gmail.com> On Fri, Oct 16, 2009 at 10:56 AM, Steve Lianoglou wrote: > Hi, > > On Oct 16, 2009, at 11:44 AM, Peng Yu wrote: > >> My question was how to figure what type of geneID it was for 'Krt83'? >> I tried 'Ensembl Gene ID(s)' in the menu and put 'Krt83' in the box >> below it. But I get an empty mart_export.txt file. > > > I'm guessing you're filters are set wrong. > > Try with: > ?* FILTER set to: MGI symbol > ?* ATTRIBUTES set to: Ensembl Gene ID, Ensembl Transcript ID, MGI Symbol > > You'd get: > > Ensembl Gene ID Ensembl Transcript ID ? MGI symbol > ENSMUSG00000047641 ? ? ?ENSMUST00000108897 ? ? ?Krt83 > ENSMUSG00000047641 ? ? ?ENSMUST00000081945 ? ? 
?Krt83 It seems that it can not report both MGI symbol and the 5kb upstream sequences simultaneously from Ensembl website. Is it true? If so, probably I will have to make a short program to combine the results. From natassa_g_2000 at yahoo.com Mon Oct 19 06:03:18 2009 From: natassa_g_2000 at yahoo.com (natassa) Date: Mon, 19 Oct 2009 03:03:18 -0700 (PDT) Subject: [Biopython] Adaptor trimmer and dimers In-Reply-To: <320fb6e00910150920y6ada0463s60ce0b4e5f788449@mail.gmail.com> Message-ID: <693756.78143.qm@web52011.mail.re2.yahoo.com> Thanks Peter, I ve gone through these posts already, so my question was whether a global alignment script exists-Brad Chapman's script does a local alignment. Also, I would be mostly interested in discarding adapter-dimer reads and I do not find any adaptation on his code to detect those, unless I am wrong.. I would also like to discard their pairs, as I am inputting those to velvet assembler which takes into account the pair-read information for scaffolding. I can try to? write up something integrating the above features, I was just wondering if there is anything out there already and whether people find this a sensible approach. Kind regards, Anastasia --- On Thu, 10/15/09, Peter wrote: From: Peter Subject: Re: [Biopython] Adaptor trimmer and dimers To: "natassa" Cc: biopython at lists.open-bio.org Date: Thursday, October 15, 2009, 12:20 PM On Thu, Oct 15, 2009 at 5:00 PM, natassa wrote: > Hallo Biopythoners, > I followed a recent thread conversation about adaptor trimming, > which I intend to do on Illumina runs, and I am not sure I know > where exactly in github I could find Brad Chapman's code for > trimming AFTER modifications that he has done based on the > thread conversation. ... 
I guess you mean Brad's August Blog Post: http://bcbio.wordpress.com/2009/08/09/trimming-adaptors-from-short-read-sequences/ and the following mailing list thread which included some tips on speeding up the Biopython side of things: http://lists.open-bio.org/pipermail/biopython/2009-August/005417.html For anyone else interested, there are some simple examples in the tutorial (using SeqRecord slicing - elegant and simple, but a bit slow): http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-primer http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-adaptor And I did a blog post about low level FASTQ handling for speed at the cost of flexibility and simplicity (using some of the same ideas from the August mailing list discussion): http://news.open-bio.org/news/2009/09/biopython-fast-fastq/ Peter From chapmanb at 50mail.com Mon Oct 19 07:24:41 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 19 Oct 2009 07:24:41 -0400 Subject: [Biopython] Adaptor trimmer and dimers In-Reply-To: <693756.78143.qm@web52011.mail.re2.yahoo.com> References: <320fb6e00910150920y6ada0463s60ce0b4e5f788449@mail.gmail.com> <693756.78143.qm@web52011.mail.re2.yahoo.com> Message-ID: <20091019112441.GA72523@sobchak.mgh.harvard.edu> Hi Anastasia; > I ve gone through these posts already, so my question was whether > a global alignment script exists-Brad Chapman's script does a > local alignment. I found that local alignments behaved better in terms of trimming, but if you want global alignments it's easy to change. Edit line 42 of the script from: pairwise2.align.localms to: pairwise2.align.globalms > Also, I would be mostly interested in discarding > adapter-dimer reads and I do not find any adaptation on his code to > detect those, unless I am wrong.. You should get back an empty or very short read, which you can then discard in your script. 
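The length check described above can be sketched in a few lines. This is only an illustration, written for Python 3: the function names are hypothetical, trimmed reads are assumed to be plain strings, and the default of 17 simply mirrors the size threshold used in the pseudo-code later in this message.

```python
# Sketch only: keep_read() and filter_trimmed() are illustrative names,
# not part of the trimming script discussed in this thread.

def keep_read(trimmed_seq, min_length=17):
    """Return True if a trimmed read is still long enough to keep."""
    return len(trimmed_seq) >= min_length


def filter_trimmed(trimmed_reads, min_length=17):
    """Drop empty or very short reads, e.g. what remains of an adaptor dimer."""
    return [seq for seq in trimmed_reads if keep_read(seq, min_length)]


# An adaptor dimer trims down to (almost) nothing and is discarded.
reads = ["", "ACG", "ACGTACGTACGTACGTACGT"]
print(filter_trimmed(reads))
```

A read that consisted only of adaptor would come back empty from the trimmer, so it fails the length test without needing any separate dimer detection.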
> I would also like to discard their
> pairs, as I am inputting those to velvet assembler which takes into
> account the pair-read information for scaffolding.

This is also something you can do after calling the trimmer. Read
each end of the pair, trim both sequences and then check that they
pass your size threshold. If both pass, then write them to the file
you'll be using for assembly:

    adaptor = "GATC"
    num_errors = 2
    size_thresh = 17
    pair1 = read_seq()
    pair2 = read_seq()
    trim1 = trim_adaptor(pair1, adaptor, num_errors)
    trim2 = trim_adaptor(pair2, adaptor, num_errors)
    if len(trim1) >= size_thresh and len(trim2) >= size_thresh:
        write_pair(trim1, trim2)

Hope this helps,
Brad

> I can try to write
> up something integrating the above features, I was just wondering if
> there is anything out there already and whether people find this a
> sensible approach. Kind regards,
> Anastasia
> --- On Thu, 10/15/09, Peter wrote:
>
> From: Peter
> Subject: Re: [Biopython] Adaptor trimmer and dimers
> To: "natassa"
> Cc: biopython at lists.open-bio.org
> Date: Thursday, October 15, 2009, 12:20 PM
>
> On Thu, Oct 15, 2009 at 5:00 PM, natassa wrote:
> > Hallo Biopythoners,
> > I followed a recent thread conversation about adaptor trimming,
> > which I intend to do on Illumina runs, and I am not sure I know
> > where exactly in github I could find Brad Chapman's code for
> > trimming AFTER modifications that he has done based on the
> > thread conversation. ...
> > I guess you mean Brad's August Blog Post: > http://bcbio.wordpress.com/2009/08/09/trimming-adaptors-from-short-read-sequences/ > and the following mailing list thread which included some tips on > speeding up the Biopython side of things: > http://lists.open-bio.org/pipermail/biopython/2009-August/005417.html > > For anyone else interested, there are some simple examples in the > tutorial (using SeqRecord slicing - elegant and simple, but a bit slow): > http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-primer > http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-adaptor > > And I did a blog post about low level FASTQ handling for speed > at the cost of flexibility and simplicity (using some of the same > ideas from the August mailing list discussion): > http://news.open-bio.org/news/2009/09/biopython-fast-fastq/ > > Peter > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From fkauff at biologie.uni-kl.de Mon Oct 19 09:44:39 2009 From: fkauff at biologie.uni-kl.de (Frank Kauff) Date: Mon, 19 Oct 2009 15:44:39 +0200 Subject: [Biopython] Combine nexus files but not concatenating them In-Reply-To: <320fb6e00910080154m2e75a38eh7e120e807103773b@mail.gmail.com> References: <320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com> <320fb6e00910051331u2f27c7ecmde744d29fb3ba2ab@mail.gmail.com> <320fb6e00910070229n1b78542dj82998de13cf7eed7@mail.gmail.com> <320fb6e00910080154m2e75a38eh7e120e807103773b@mail.gmail.com> Message-ID: <4ADC6D47.2010409@biologie.uni-kl.de> Hi all, unfortunately, morphological data types and mixed data types are curretnly unsupported. For no special reason - I just never bothered to implement them... 
I think it's not trivial, though, because one would have to store the
data type for each individual character in some way, which would
probably mean significantly changing the data structure that is
currently used to hold the alignment data...

With regard to splitting up a nexus file - yes, if there is a data
partition defined, the individual subdivisions can be saved as
individual nexus files with

mynexusinstance.write_nexus_data_partitions(charpartition='name_of_partition')

Please see the method for further details of customization.

Otherwise, one could save the characters defined in a character set as
nexus using

mynexusinstance.write_nexus_data(filename='charsetxy.nex',exclude=[c for c in range(mynexusinstance.nchar) if c not in mynexusinstance.charsets['name_of_charset_i_want_to_save']])

Cheers,
Frank

On 10/08/2009 10:54 AM, Peter wrote:
> On Thu, Oct 8, 2009 at 12:23 AM, Denzel Li wrote:
>
>> Hi Peter:
>> Regarding the Nexus datatype supported in Bio:Nexus:Nexus, I mean nexus like
>> the following, where the datatype is a "mixing" of "standard" and "DNA".
>> According to the function Bio:Nexus:Nexus._format (line 696), these
>> datatypes are not supported yet. I am just wondering does the team has the
>> plan to support these data types.
>>
> Oh right - in your example, the digits encode morphology, but they could
> also be phenotypes, or some other characteristic like gene copy number.
>
> As to Bio.Nexus supporting this, hopefully Frank or Cymon can comment.
>
> If Bio.Nexus did support this, then from the Bio.AlignIO point of view, with
> the current object structure we'd have to use a sequence object (holding
> both the digits, and the DNA) for the sequence strings (e.g. for s1 in your
> example, Seq("10010ACGT")) with a generic single letter alphabet. This
> would lose the fact that the first five characters are digits, but the rest are
> DNA. This isn't ideal, and would probably cause trouble for Nexus output
> (writing such alignments).
> > Would you want to try and deal with such "mixed" alignments via the > Bio.AlignIO interface? > > Peter > > -- J-Prof. Dr. Frank Kauff Molecular Phylogenetics FB Biologie, 13/276 TU Kaiserslautern Postfach 3049 67653 Kaiserslautern Tel. +49 (0)631 205-2562 Fax. +49 (0)631 205-2998 email: fkauff at biologie.uni-kl.de skype: frank.kauff From mike.thon at gmail.com Mon Oct 19 13:35:49 2009 From: mike.thon at gmail.com (Michael Thon) Date: Mon, 19 Oct 2009 19:35:49 +0200 Subject: [Biopython] parsing an in memory sequence string with SeqIO Message-ID: <945EB0E1-4D74-4DF0-8B17-CD9A23CA6C56@gmail.com> I have been looking at the documentation but I can't figure out how to parse a string of text in a python variable into a SeqRecord object. the main function (SeqIO.parse) requires a file handle. I'm getting the text from a web server POST request and it seems a little inefficient to write it to a file before I do parsing with biopython. Maybe there is some way in python to create a handle to a variable? Thanks Mike From kellrott at gmail.com Mon Oct 19 13:48:17 2009 From: kellrott at gmail.com (Kyle Ellrott) Date: Mon, 19 Oct 2009 10:48:17 -0700 Subject: [Biopython] parsing an in memory sequence string with SeqIO In-Reply-To: <945EB0E1-4D74-4DF0-8B17-CD9A23CA6C56@gmail.com> References: <945EB0E1-4D74-4DF0-8B17-CD9A23CA6C56@gmail.com> Message-ID: Try the StringIO interface. http://docs.python.org/library/stringio.html Kyle On Mon, Oct 19, 2009 at 10:35 AM, Michael Thon wrote: > I have been looking at the documentation but I can't figure out how to parse > a string of text in a python variable into a SeqRecord object. ?the main > function (SeqIO.parse) requires a file handle. ?I'm getting the text from a > web server POST request and it seems a little inefficient to write it to a > file before I do parsing with biopython. ?Maybe there is some way in python > to create a handle to a variable? 
> Thanks > Mike > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From mikelisanke at gmail.com Mon Oct 19 15:37:10 2009 From: mikelisanke at gmail.com (Mike Lisanke) Date: Mon, 19 Oct 2009 15:37:10 -0400 Subject: [Biopython] Windows installer does not find Python 2.63 with multiple pythons Message-ID: <8c5c6e580910191237w54e857bfs1107ba86df05db7c@mail.gmail.com> I had Python 3.0 installed prior to attempting a bio-python install. I installed Python 2.6 to its own directory, and a proper registry entry was made in HKEY_LOCAL_MACHINE\SOFTWARE\Python, however; the bio-python can not find the Python 2.6 install. Is there a problem having multiple python installs? Thanks. -- Best regards, Mike From biopython at maubp.freeserve.co.uk Mon Oct 19 17:29:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 19 Oct 2009 22:29:12 +0100 Subject: [Biopython] Windows installer does not find Python 2.63 with multiple pythons In-Reply-To: <8c5c6e580910191237w54e857bfs1107ba86df05db7c@mail.gmail.com> References: <8c5c6e580910191237w54e857bfs1107ba86df05db7c@mail.gmail.com> Message-ID: <320fb6e00910191429l481223d8iae9fbe78702fff70@mail.gmail.com> On Mon, Oct 19, 2009 at 8:37 PM, Mike Lisanke wrote: > I had Python 3.0 installed prior to attempting a bio-python install. I > installed Python 2.6 to its own directory, and a proper registry entry was > made in HKEY_LOCAL_MACHINE\SOFTWARE\Python, however; > the bio-python can not find the Python 2.6 install. Is there a problem > having multiple python installs? Thanks. On my Windows machine I have Python 2.4, 2.5 and 2.6 all co-existing fine (and I used to have 2.3 as well). These were all default installs to C:\Python26 etc, and I didn't have to do anything funny to the registry. I can try and remember to check the registry settings on my machine if you like... 
but for now I can only suggest you might try uninstalling Python 2.6, perhaps clean the registry, and then reinstall Python 2.6. Peter P.S. I haven't tried putting Python 3.0 on my Windows machine (not that I would bother, I would go straight to Python 3.1 now that it is out). From tevang3 at gmail.com Tue Oct 20 06:44:45 2009 From: tevang3 at gmail.com (Thomas Evangelidis) Date: Tue, 20 Oct 2009 13:44:45 +0300 Subject: [Biopython] search Entrez with boolean operators Message-ID: <833e1faf0910200344x29e1d636ra35325433b6263b7@mail.gmail.com> Dear all, is it possible to set the term parameter in Bio.Entrez.esearch() accordingly so that it will search Entrez using boolean operators? I tried myself several combinations with no luck. For instance lets say I want to query All Fields of PubMed using this whole phrase (not intividual words): "ABC efflux transporter", how should I write it? thanks in advance. From biopython at maubp.freeserve.co.uk Tue Oct 20 06:53:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 20 Oct 2009 11:53:59 +0100 Subject: [Biopython] search Entrez with boolean operators In-Reply-To: <833e1faf0910200344x29e1d636ra35325433b6263b7@mail.gmail.com> References: <833e1faf0910200344x29e1d636ra35325433b6263b7@mail.gmail.com> Message-ID: <320fb6e00910200353x2ce754edkdc197f8cfc6ece21@mail.gmail.com> On Tue, Oct 20, 2009 at 11:44 AM, Thomas Evangelidis wrote: > Dear all, > > is it possible to set the term parameter in Bio.Entrez.esearch() accordingly > so that it will search Entrez using boolean operators? I tried myself > several combinations with no luck. You can use AND in upper case, e.g. abc[title] AND efflux[title] AND transporter[title] abc[all] AND efflux[all] AND transporter[all] abc AND efflux AND transporter > For instance lets say I want to query All > Fields of PubMed using this whole phrase (not intividual words): "ABC efflux > transporter", how should I write it? 
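The AND-joined terms shown earlier in this reply can be built programmatically and passed to Bio.Entrez.esearch() as the term argument. A minimal sketch for Python 3, under stated assumptions: build_and_term() and search_pubmed() are hypothetical helper names, the email address is a placeholder you must supply, and the search itself needs Biopython and network access.

```python
def build_and_term(words, field=None):
    """Join words with upper-case AND, optionally tagging each with a field.

    field is an Entrez field tag such as "title" or "all"; None leaves
    the words untagged.
    """
    tag = "[%s]" % field if field else ""
    return " AND ".join(word + tag for word in words)


def search_pubmed(term, email):
    """Run the search and return the list of matching PubMed IDs."""
    # Imported here so build_and_term() can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.esearch(db="pubmed", term=term)
    result = Entrez.read(handle)
    handle.close()
    return result["IdList"]


print(build_and_term(["abc", "efflux", "transporter"], field="title"))
print(build_and_term(["abc", "efflux", "transporter"]))
```

Calling search_pubmed(build_and_term([...]), "you@example.org") would then submit the combined term; only the string-building part is exercised here.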
For phrases, you need quote characters - you can try this on the NCBI
Entrez webpage, e.g.

"ABC efflux transporter"
"ABC efflux transporter"[all]

Note that these give no hits!

Remember in Python there are at least two ways to build a string with
quotes in it, for example single-quote double-quote text double-quote
single-quote:

>>> search = '"ABC efflux transporter"'
>>> print search
"ABC efflux transporter"

Or, sticking with all double quotes you must escape some:

>>> search = "\"ABC efflux transporter\""
>>> print search
"ABC efflux transporter"

Peter

From pengyu.ut at gmail.com  Tue Oct 20 11:33:08 2009
From: pengyu.ut at gmail.com (Peng Yu)
Date: Tue, 20 Oct 2009 10:33:08 -0500
Subject: [Biopython] Making the tutorial more concise
Message-ID: <366c6f340910200833k5447fca3yda5a804c889a3733@mail.gmail.com>

I feel that the document can be made more concise.

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc73

For example, on the above link, it says
"Hey, everybody loves BLAST right? I mean, geez, how can get it get
any easier to do comparisons between one of your sequences and every
other sequence in the known world?"

I think this can be deleted. Or it can be simply stated what Chapter 7
is about at the beginning.

From biopython at maubp.freeserve.co.uk  Tue Oct 20 11:41:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Oct 2009 16:41:21 +0100
Subject: [Biopython] Making the tutorial more concise
In-Reply-To: <366c6f340910200833k5447fca3yda5a804c889a3733@mail.gmail.com>
References: <366c6f340910200833k5447fca3yda5a804c889a3733@mail.gmail.com>
Message-ID: <320fb6e00910200841r6081dcd4ga0c661a14fc7aa6f@mail.gmail.com>

On Tue, Oct 20, 2009 at 4:33 PM, Peng Yu wrote:
> I feel that the document can be made more concise.
>
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc73
>
> For example, on the above link, it says
> "Hey, everybody loves BLAST right?
I mean, geez, how can get it get > any easier to do comparisons between one of your sequences and every > other sequence in the known world?" > > I think this can be delete. Or it can be simply stated what Chapter 7 > is about at the beginning. I agree - I think that might have been Brad's casual writing style ;) I am planning to re-write the BLAST chapter soon, partly due to Biopython switching to using command line wrappers in module Bio.Blast.Applications with subprocess, but also we will want to support the new BLAST+ tools from the NCBI (different command line argument names etc). Peter From lueck at ipk-gatersleben.de Tue Oct 20 12:01:41 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Tue, 20 Oct 2009 18:01:41 +0200 Subject: [Biopython] Making the tutorial more concise References: <366c6f340910200833k5447fca3yda5a804c889a3733@mail.gmail.com> Message-ID: <007a01ca519e$9cd68eb0$1022a8c0@ipkgatersleben.de> >From my point of view, I like such comments. They make a tutorial not so dry ;-) Anyway, I only can thank all the people, which wrote this nice tutorial. It helped me already a lot and I don't mine small jokes ;-) Nice evening! Stefanie ----- Original Message ----- From: "Peng Yu" To: Sent: Tuesday, October 20, 2009 5:33 PM Subject: [Biopython] Making the tutorial more concise >I feel that the document can be made more concise. > > http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc73 > > For example, on the above link, it says > "Hey, everybody loves BLAST right? I mean, geez, how can get it get > any easier to do comparisons between one of your sequences and every > other sequence in the known world?" > > I think this can be delete. Or it can be simply stated what Chapter 7 > is about at the beginning. 
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From fufezan at uni-muenster.de  Wed Oct 21 03:25:01 2009
From: fufezan at uni-muenster.de (Christian Fufezan)
Date: Wed, 21 Oct 2009 09:25:01 +0200
Subject: [Biopython] Biopython & p3d
Message-ID: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de>

Hello Biopython,

we (Michael Specht & I) recently published p3d, a python module for
structural bioinformatics and were wondering if it wouldn't be a good
thing if it could join the Biopython project. We understand that
Biopython already has a PDB parser but we programmed an alternative
version since we found the Biopython.pdb syntax to be too
non-pythonian. One example why is shown below:

Biopython:

def test6(structure):
    '''get protein surrounding (5) of NAG'''
    bucket = set()
    atom_list=Selection.unfold_entities(structure,'A')
    ns = NeighborSearch(atom_list)
    for model in structure.get_list():
        for chain in model.get_list():
            for residue in chain.get_list():
                if residue.get_resname() == 'NAG':
                    for atom in residue.get_list():
                        centre = atom.get_coord()
                        R = 5.0
                        neighbor_list = ns.search(centre,R)
                        neighbors = Selection.unfold_entities(neighbor_list,'A')
                        for atom2 in neighbors:
                            if 'O' in atom2.get_name():
                                bucket.add(atom2)
    print ' found',len(bucket),' oxygens around NAG'
    return

p3d:

def test6(pdb):
    ''' protein surrounding (5) of resname NAG'''
    bgl = pdb.query('resname NAG')
    bucket = pdb.query('protein and oxygen and within 5 of ',bgl)
    print ' found',len(bucket),' oxygens around NAG'
    return

Certainly, Biopython's PDB module has its advantages and there is no
way p3d could replace it, but both modules have their advantages :)
The fact that Biopython's PDB parser uses a KTree written in C and we
wrote one in python makes certain queries to the protein structure
faster in Biopython; however if the query involves more complex
demands, multiple loops are inevitable in
biopython, whereas p3d offers a human readable query function that
combines all aspects. The link to our publication is:
http://www.biomedcentral.com/1471-2105/10/258

Looking forward to hearing from you, maybe one can also envision a
new combined module with all advantages together.

Kind regards
Christian Fufezan

From biopython at maubp.freeserve.co.uk  Wed Oct 21 05:18:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Oct 2009 10:18:17 +0100
Subject: [Biopython] Biopython & p3d
In-Reply-To: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de>
References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de>
Message-ID: <320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com>

On Wed, Oct 21, 2009 at 8:25 AM, Christian Fufezan wrote:
> Hello Biopython,
>
> we (Michael Specht & I) recently published p3d, a python module for
> structural bioinformatics and were wondering if it wouldn't be a good
> thing if it could join the Biopython project. We understand that Biopython
> already has a PDB parser but we programmed an alternative version since we found
> the Biopython.pdb syntax to be too non-pythonian. One example why is shown
> below:
>
> Biopython:
>
> def test6(structure):
>     '''get protein surrounding (5) of NAG'''
>     bucket = set()
>     atom_list=Selection.unfold_entities(structure,'A')
>     ns = NeighborSearch(atom_list)
>     for model in structure.get_list():
>         for chain in model.get_list():
>             for residue in chain.get_list():

I'm not very familiar with the NeighborSearch code, but I'm pretty sure
the above for loops can be just:

for model in structure:
    for chain in model:
        for residue in chain:
            ...

And regarding detecting oxygen atoms, I think there is a patch on
bugzilla to record the (relatively) new atom column from the PDB file
(which will help with Hg and mercury versus hydrogen).
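The shorter loops suggested above work when container objects define `__iter__`. A toy sketch for Python 3, using a hypothetical `Entity` stand-in class rather than Bio.PDB's own implementation, showing that direct iteration and an explicit get_list() call visit the same children:

```python
class Entity:
    """Hypothetical stand-in for a structure/model/chain container."""

    def __init__(self, name, children=()):
        self.name = name
        self._children = list(children)

    def get_list(self):
        """Explicit accessor, in the style of the longer loops."""
        return list(self._children)

    def __iter__(self):
        """Allow `for child in entity:` directly."""
        return iter(self._children)


def residue_names(structure):
    """Walk structure -> model -> chain -> residue with plain iteration."""
    return [residue.name
            for model in structure
            for chain in model
            for residue in chain]


chain = Entity("A", [Entity("NAG"), Entity("GLY")])
structure = Entity("demo", [Entity("model0", [chain])])
print(residue_names(structure))
```

The nesting depth is unchanged; only the get_list() calls drop out, which is the readability gain being discussed.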
Still, I would agree with you that some parts of Bio.PDB are not very pythonic - too many functions names get_*() which could be replaced with properties. This is something we could evolve gradually (add new properties, keep the old methods in place but gradually deprecate them). Specific suggestions would be welcome. > def test6(pdb): > ? ? ? ?''' protein surrounding (5) of resname NAG''' > ? ? ? ?bgl = pdb.query('resname NAG') > ? ? ? ?bucket = pdb.query('protein and oxygen and within 5 of ',bgl) > ? ? ? ?print ' ? ? found',len(bucket),' oxygens around NAG' > ? ? ? ?return > > Certainly, Biopythons PDB module has its advantages and the is no way p3d > could replace it, but both modules have their advantages :) The fact that > biopythons.pdb parser uses a KTree written in C and we wrote one in python > makes certain queries to the protein structure faster in Biopyhton; however > if the query involves more complex demands, multiple loops are inevitable in > biopython, whereas p3d offers a human readable query function that combines > all aspects. The link to our publication is: > http://www.biomedcentral.com/1471-2105/10/258 I remember skim reading it a month ago or so. I remember the final line of the abstract was a very strong opinion ("a perfect tool"), and I was rather surprised the reviewers and editor let you keep it - regardless of any bias I might feel to Biopython ;) > Looking forward to hear from you, maybe one can also envision a > combined module with a new all advantages together. That would be a good outcome. >From the snippet of code and the examples in the paper, the big feature you have that Bio.PDB lacks is "fancy selections", and that is certainly something which could be improved in Biopython. It is interesting you have implemented (invented?) a string based language with logical and, within etc. In some ways it reminds me of the selection formulae in VMD - have you used that 3D visualisation tool? 
This also reminds me of the SQL language for database selections, and how classical SQL code with Python just used SQL statements within Python strings. Have you ever used SQLAlchemy, and looked at how they handle SQL statements like filters, ands, ors, etc with a clever object based interface? Perhaps something like that could work for a 3D structure query API. Regards, Peter From natassa_g_2000 at yahoo.com Wed Oct 21 05:54:26 2009 From: natassa_g_2000 at yahoo.com (natassa) Date: Wed, 21 Oct 2009 02:54:26 -0700 (PDT) Subject: [Biopython] Adaptor trimmer and dimers In-Reply-To: <20091019112441.GA72523@sobchak.mgh.harvard.edu> Message-ID: <843737.47817.qm@web52003.mail.re2.yahoo.com> Brad, Thank you for the tips. I adapted your code a bit to handle pairs (that is, I have both read1 and 2 of a pair in the same file and if I find the adaptor in any read of the pair, I discard the pair.) I also had to add an additional test for the length of the alignment output, as I got an index Error for the cases the adapter does not align at all. I am not sure i got this part right, I looked a bit at the related Biopython alignment code, and that is what? I concluded. My main problem now is performance of this script: On a file of 19 million reads of 76 bp it is running for more than 12 hours! So I copy here my code and would be very grateful if someone could indicate parts where it could be sped up. Also, Brad, could you check this extra test line in the handle_adaptor function? I am not very good in python for sure, but I am also pretty sure this is not an endless loop problem and I have run out of ideas how to make it faster (unless I abandon working with Seq Records). I am seriously thinking of inputting Fastas instead of Fastq-illumina files, but for a whole bunch of tests I am running now, being able to work with Fastq would be ideal... Hope this is just a silly mistake of mine.. 
Here is the code:

from Bio import SeqIO
import os
from Bio import pairwise2
from Bio.Seq import Seq

def handle_adaptor(record, adaptor, num_errors):
    '''returns 1 if no adaptor found as exact match or as a a pairwise alignment allowing two errors. Otherwise: none'''
    gap_char = '-'
    exact_pos = str(record.seq).find(adaptor)
    #exact match
    if exact_pos >= 0:
        seq_region = str(record.seq[exact_pos:exact_pos+len(adaptor)])
        adapt_region = adaptor
    else:
        if len(pairwise2.align.localms(str(record.seq),str(adaptor), 5.0, -4.0, -9.0, -0.5, one_alignment_only=True, gap_char=gap_char)) ==0: #no alignment at all
            return 1
        else:
            if len(pairwise2.align.localms(str(record.seq),str(adaptor), 5.0, -4.0, -9.0, -0.5, one_alignment_only=True, gap_char=gap_char)) >=1:
                seq_a, adaptor_a, score, start, end = pairwise2.align.localms(str(record.seq),str(adaptor), 5.0, -4.0, -9.0, -0.5, one_alignment_only=True,
                                                                              gap_char=gap_char)[0]
                adapt_region = adaptor_a[start:end]
                seq_region = seq_a[start:end]

    matches = sum((1 if s == adapt_region[i] else 0) for i, s in
                  enumerate(seq_region))
    # too many errors --
    if (len(adaptor) - matches) > num_errors:
        return 1

def Handle_shuffledFiles (path, number_of_adaptor, num_errors):
    all_files=os.listdir(path)
    for file in all_files:
        if not file.endswith('fastq'):
            continue
        else:
            if '_afr_' in file :
                print "working on : "+file + "..."
                if number_of_adaptor==1:
                    adaptor='ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
                    output=path+'Adaptor1'+'_removedNat/'+file+'_Clean.txt'
                elif number_of_adaptor==2:
                    adaptor= 'TCTAGCCTTCTCGCCAAGTCGTCCTTACGGCTCTGGC'
                    output=path+'Adaptor2'+'_removedNat/'+file+'_Clean.txt'
                out_handle=open(output, "w")

                iter = SeqIO.parse(open(path+file), "fastq-illumina")
                j=0
                k=0
                try:
                    while 1:
                        rec1 = iter.next()
                        rec2 = iter.next()
                        k=k+1

                        Ad_inR1 = handle_adaptor(rec1, adaptor, num_errors ) #returns 1 if no adaptor found or if found with >2 mismatches
                        Ad_inR2 = handle_adaptor(rec2, adaptor, num_errors )

                        if Ad_inR1 and Ad_inR2:
                            j=j+1
                            print 'Counting the %i th pair that has no adaptor ...' %j

                            SeqIO.write([rec1, rec2], out_handle, "fastq-illumina")

                except StopIteration, e:
                    pass

                out_handle.close()
                print '..out of %i pairs total' %k

if __name__ == "__main__":
    path2Fastq="/Users/nat/Data/Illumina/Restricted_forTests/Fastq-Illumina/shuffled/"
    Handle_shuffledFiles(path2Fastq, 1, 2)

Thanks!

Anastasia
Post-Doc, Evolutionary Biology Department
Upssala University
Norbyvägen 18D
SE-752 36
UPPSALA
anastasia.gioti at ebc.uu.se
Tel: +46-18-471 2837
Fax: +46-18-471 6310

From biopython at maubp.freeserve.co.uk Wed Oct 21 06:18:09 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Oct 2009 11:18:09 +0100
Subject: [Biopython] Adaptor trimmer and dimers
In-Reply-To: <843737.47817.qm@web52003.mail.re2.yahoo.com>
References: <20091019112441.GA72523@sobchak.mgh.harvard.edu> <843737.47817.qm@web52003.mail.re2.yahoo.com>
Message-ID: <320fb6e00910210318v622658daw3133f90761a7ab7d@mail.gmail.com>

On Wed, Oct 21, 2009 at 10:54 AM, natassa wrote:
>
> My main problem now is performance of this script: On a file of
> 19 million reads of 76 bp it is running for more than 12 hours!
> So I copy here my code and would be very grateful if someone
> could indicate parts where it could be sped up.

The best way to answer that is to run some profiling yourself. I would just make a small test file, and profile that.

> I am not very good in python for sure, but I am also pretty sure
> this is not an endless loop problem and I have run out of ideas
> how to make it faster (unless I abandon working with Seq Records).
> I am seriously thinking of inputting Fastas instead of Fastq-illumina
> files, but for a whole bunch of tests I am running now, being
> able to work with Fastq would be ideal...

You are using Bio.SeqIO to parse the FASTQ files, but you don't use the quality scores at all.
Therefore it would be faster to use FASTA files, or keep working with FASTQ files but switch from using SeqRecords to simple strings as described here: http://lists.open-bio.org/pipermail/biopython/2009-August/005430.html http://news.open-bio.org/news/2009/09/biopython-fast-fastq/ Peter From mjldehoon at yahoo.com Wed Oct 21 06:15:35 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 21 Oct 2009 03:15:35 -0700 (PDT) Subject: [Biopython] Biopython & p3d In-Reply-To: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de> Message-ID: <416618.94041.qm@web62407.mail.re1.yahoo.com> I think that we should avoid the situation that there are two PDB modules in Biopython. Can we somehow merge Bio.PDB and p3d? Take the best features of p3d and add them to Bio.PDB, or vice versa. If that is not possible, I think we should make a choice between Bio.PDB and p3d. --Michiel. --- On Wed, 10/21/09, Christian Fufezan wrote: > From: Christian Fufezan > Subject: [Biopython] Biopython & p3d > To: biopython at biopython.org > Cc: "Michael Specht" > Date: Wednesday, October 21, 2009, 3:25 AM > Hello Biopython, > > we ( Michael Specht & I ) published recently p3d, a > python module for structural bioinformatics and were > wondering if it wouldn't be a good good thing if could join > the Biopython project. We understand that Biopython has > already a PDB parser but we programmed an alternative > version since we found the Biopython.pdb syntax to be too > non-pythonian. One example why is shown below: > > Biopython: > > def test6(structure): > ??? '''get protein surrounding (5) of > NAG''' > ??? bucket = set() > ??? > atom_list=Selection.unfold_entities(structure,'A') > ??? ns = NeighborSearch(atom_list) > ??? for model in structure.get_list(): > ??? ??? for chain in > model.get_list(): > ??? ??? ??? > for residue in chain.get_list(): > ??? ??? ??? > ??? if residue.get_resname() == 'NAG': > ??? ??? ??? > ??? ??? for atom in > residue.get_list(): > ??? ??? ??? > ??? ??? ??? 
> centre = atom.get_coord() > ??? ??? ??? > ??? ??? ??? R = > 5.0 > ??? ??? ??? > ??? ??? ??? > neighbor_list = ns.search(centre,R) > ??? ??? ??? > ??? ??? ??? > neighbors = Selection.unfold_entities(neighbor_list,'A') > ??? ??? ??? > ??? ??? ??? for > atom2 in neighbors: > ??? ??? ??? > ??? ??? ??? > ??? if 'O' in atom2.get_name(): > ??? ??? ??? > ??? ??? ??? > ??? ??? bucket.add(atom2) > ??? print '? > ???found',len(bucket),' oxygens around NAG' > ??? return > > p3d: > > def test6(pdb): > ??? ''' protein surrounding (5) of resname > NAG''' > ??? bgl = pdb.query('resname NAG') > ??? bucket = pdb.query('protein and oxygen > and within 5 of ',bgl) > ??? print '? > ???found',len(bucket),' oxygens around NAG' > ??? return > > Certainly, Biopythons PDB module has its advantages and the > is no way p3d could replace it, but both modules have their > advantages :) The fact that biopythons.pdb parser uses a > KTree written in C and we wrote one in python makes certain > queries to the protein structure faster in Biopyhton; > however if the query involves more complex demands, multiple > loops are inevitable in biopython, whereas p3d offers a > human readable query function that combines all aspects. The > link to our publication is: > http://www.biomedcentral.com/1471-2105/10/258 > > Looking forward to hear from you, maybe one can also > envision a combined module with a new all advantages > together. > > Kind regards > > Christian Fufezan > > > _______________________________________________ > Biopython mailing list? -? 
Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Wed Oct 21 06:28:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Oct 2009 11:28:56 +0100 Subject: [Biopython] Biopython & p3d In-Reply-To: <416618.94041.qm@web62407.mail.re1.yahoo.com> References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de> <416618.94041.qm@web62407.mail.re1.yahoo.com> Message-ID: <320fb6e00910210328p538ef75lb52e53203ec42df9@mail.gmail.com> On Wed, Oct 21, 2009 at 11:15 AM, Michiel de Hoon wrote: > I think that we should avoid the situation that there are two PDB modules > in Biopython. Agreed. > Can we somehow merge Bio.PDB and p3d? Take the best features of p3d > and add them to Bio.PDB, or vice versa. That's what I was thinking. Note that Christian and Michael will have to re-license any such contributions (p3d uses the GNU GPL V2 which is not compatible). > If that is not possible, I think we should make a choice between Bio.PDB > and p3d. As Christian pointed out, the two have some non-overlapping functionality, so replacing Bio.PDB with pd3 isn't really an option (even if it was re-licensed). Peter From fufezan at uni-muenster.de Wed Oct 21 06:31:38 2009 From: fufezan at uni-muenster.de (Christian Fufezan) Date: Wed, 21 Oct 2009 12:31:38 +0200 Subject: [Biopython] Biopython & p3d In-Reply-To: <320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com> References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de> <320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com> Message-ID: <905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de> On 21 Oct 2009, at 11:18, Peter wrote: > On Wed, Oct 21, 2009 at 8:25 AM, Christian Fufezan > wrote: >> Hello Biopython, >> >> we ( Michael Specht & I ) published recently p3d, a python module for >> structural bioinformatics and were wondering if it wouldn't be a >> good good >> thing if could join the Biopython project. 
We understand that >> Biopython has >> already a PDB parser but we programmed an alternative version since >> we found >> the Biopython.pdb syntax to be too non-pythonian. One example why >> is shown >> below: >> >> Biopython: >> >> def test6(structure): >> '''get protein surrounding (5) of NAG''' >> bucket = set() >> atom_list=Selection.unfold_entities(structure,'A') >> ns = NeighborSearch(atom_list) >> for model in structure.get_list(): >> for chain in model.get_list(): >> for residue in chain.get_list(): > > I'm not very familiar with the NeighborSearch code, but > I'm pretty sure the above for loops can be just: > > for model in structure: > for chain in model: > for residue in chain: > ... > > And regarding detecting oxygen atoms, I think there is > a patch on bugzilla to record the (relatively) new atom > column from the PDB file (which will help with Hg and > mercury versus hydrogen). > > Still, I would agree with you that some parts of Bio.PDB > are not very pythonic - too many functions names get_*() > which could be replaced with properties. This is something > we could evolve gradually (add new properties, keep the > old methods in place but gradually deprecate them). > > Specific suggestions would be welcome. That's maybe the biggest difference between biopython and p3d, which will make it difficult to merge the two modules. A data structure that is build like that of Biopython.pdb imposes multiple nested loops and condition queries. p3ds data structure is not nested and gains strength through combination of sets and BSPTree This allows faster and more flexible looping. Looping over all alpha and beta-carbons for example and printing x-coordinates p3d: for atom in pdb.query('protein and atom type CB or atom type CA'): print atom.x Still I think both methods could exists side by side. If it is efficient - I don't know. Replacing biopythons.pdb parser was never the intention and I think it has features that are really good and fast! 
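The flat, set-based layout described in the paragraph above can be pictured with a small sketch. This is not p3d's actual code: the `Atom` tuple, the `FlatStructure` class and its `select` method are hypothetical names made up for illustration, and the spatial tree is left out entirely. The point is only that each property value maps to a set of atom indices, so a combined selection becomes a set operation rather than a nested loop.

```python
from collections import defaultdict, namedtuple

# Hypothetical minimal atom record; real PDB atoms carry many more fields.
Atom = namedtuple("Atom", "name resname chain x y z")


class FlatStructure:
    """Toy flat atom store: one list of atoms plus sets of indices per property."""

    def __init__(self, atoms):
        self.atoms = list(atoms)
        # index[property][value] -> set of positions in self.atoms
        self.index = defaultdict(lambda: defaultdict(set))
        for i, atom in enumerate(self.atoms):
            self.index["name"][atom.name].add(i)
            self.index["resname"][atom.resname].add(i)
            self.index["chain"][atom.chain].add(i)

    def select(self, key, *values):
        """Union of the index sets for the given values of one property."""
        hits = set()
        for value in values:
            hits |= self.index[key].get(value, set())
        return hits

    def atoms_for(self, indices):
        """Atoms for a set of indices, in their original order."""
        return [self.atoms[i] for i in sorted(indices)]


atoms = [
    Atom("CA", "ALA", "A", 1.0, 0.0, 0.0),
    Atom("CB", "ALA", "A", 2.0, 0.0, 0.0),
    Atom("O", "ALA", "A", 3.0, 0.0, 0.0),
    Atom("CA", "GLY", "B", 4.0, 0.0, 0.0),
]
flat = FlatStructure(atoms)

# "CA or CB, restricted to chain A" as an intersection of two index sets.
carbons = flat.select("name", "CA", "CB") & flat.select("chain", "A")
xs = [atom.x for atom in flat.atoms_for(carbons)]
```

Adding a further restriction (a residue name, say) is one more `&` on the index sets, which is the flexibility being argued for here.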
> >> def test6(pdb): >> ''' protein surrounding (5) of resname NAG''' >> bgl = pdb.query('resname NAG') >> bucket = pdb.query('protein and oxygen and within 5 of ',bgl) >> print ' found',len(bucket),' oxygens around NAG' >> return >> >> Certainly, Biopythons PDB module has its advantages and the is no >> way p3d >> could replace it, but both modules have their advantages :) The >> fact that >> biopythons.pdb parser uses a KTree written in C and we wrote one in >> python >> makes certain queries to the protein structure faster in Biopyhton; >> however >> if the query involves more complex demands, multiple loops are >> inevitable in >> biopython, whereas p3d offers a human readable query function that >> combines >> all aspects. The link to our publication is: >> http://www.biomedcentral.com/1471-2105/10/258 > > I remember skim reading it a month ago or so. I remember the final > line of > the abstract was a very strong opinion ("a perfect tool"), and I was > rather > surprised the reviewers and editor let you keep it - regardless of > any bias > I might feel to Biopython ;) > I guess it was a selling point ;) >> Looking forward to hear from you, maybe one can also envision a >> combined module with a new all advantages together. > > That would be a good outcome. > > From the snippet of code and the examples in the paper, the big > feature > you have that Bio.PDB lacks is "fancy selections", and that is > certainly > something which could be improved in Biopython. > Yes that was one thing that we were really missing. Also the fact that biopython requires the unfolded entity to be converted to vectors and so forth was a bit complex and we needed fast and direct access to the coordinates, the very essence of pdb files. > It is interesting you have implemented (invented?) a string based > language > with logical and, within etc. In some ways it reminds me of the > selection > formulae in VMD - have you used that 3D visualisation tool? 
> Yes I use VMD a lot and the inspiration came certainly from there. A few things are however unique in p3d, e.g. first residue of chain A and p3d supports residue 15 .. 20 to select a range of residues. Michael has coded the parser that translates the human readable query into set operations and functions and he even implemented a strategy in which new functions or query types can be build in in no time. E.g. "ligand containing sulfur" could be implemented in 5 min. He has done truly a great job on this. > This also reminds me of the SQL language for database selections, and > how classical SQL code with Python just used SQL statements within > Python strings. Have you ever used SQLAlchemy, and looked at how > they handle SQL statements like filters, ands, ors, etc with a clever > object based interface? Perhaps something like that could work for > a 3D structure query API. That certainly sounds very interesting. It would also allow to incorporate the actual pdb files into the database which would reduce loading and tree building times. Surveys, pattern screening could be done very fast. One could also imagine connecting other pdb databases, such as SCOP, Pfam or web services, e.g. PISCES. Regards, Christian From biopython at maubp.freeserve.co.uk Wed Oct 21 06:37:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Oct 2009 11:37:30 +0100 Subject: [Biopython] Biopython & p3d In-Reply-To: <905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de> References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de> <320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com> <905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de> Message-ID: <320fb6e00910210337r2a3b2eb9n36fef3a16ec02037@mail.gmail.com> On Wed, Oct 21, 2009 at 11:31 AM, Christian Fufezan wrote: >> This also reminds me of the SQL language for database selections, and >> how classical SQL code with Python just used SQL statements within >> Python strings. 
Have you ever used SQLAlchemy, and looked at how >> they handle SQL statements like filters, ands, ors, etc with a clever >> object based interface? Perhaps something like that could work for >> a 3D structure query API. > > That certainly sounds very interesting. It would also allow to incorporate > the actual pdb files into the database which would reduce loading and > tree building times. Surveys, pattern screening could be done very fast. > One could also imagine connecting other pdb databases, such as SCOP, > Pfam or web services, e.g. PISCES. I was actually suggesting having a object based API for building search terms instead of parsing a human friendly string. But yes, loading a PDB file into a database does have some advantages. Peter From biopython at maubp.freeserve.co.uk Wed Oct 21 07:01:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Oct 2009 12:01:35 +0100 Subject: [Biopython] Biopython & p3d In-Reply-To: <905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de> References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de> <320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com> <905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de> Message-ID: <320fb6e00910210401l737252deg78de117143395279@mail.gmail.com> On Wed, Oct 21, 2009 at 11:31 AM, Christian Fufezan wrote: > > A data structure that is build like that of Biopython.pdb imposes > multiple nested loops and condition queries. Not really - see below. > p3ds data structure is not nested and gains strength through combination > of sets and BSPTree > This allows faster and more flexible looping. Looping over all alpha and > beta-carbons for example and printing x-coordinates > > p3d: > for atom in pdb.query('protein and atom type CB or atom type CA'): > ? ? ? ?print atom.x The Bio.PDB structure, model or chain object do offer direct access to a "flat" list of atoms via the get_atoms() method. e.g. 

from Bio import PDB
structure = PDB.PDBParser().get_structure("Test", "XXXX.pdb")
for atom in structure.get_atoms():
    if atom.name in ["CA", "CB"]:
        print(atom.coord)

(I'd have to think a bit longer about how in general to restrict this to proteins, here that is implicit since CA and CB are protein specific)

You can also of course use a list comprehension, e.g. to get all the x-coordinates (which I guess is what your example does),

from Bio import PDB
structure = PDB.PDBParser().get_structure("Test", "XXXX.pdb")
x_list = [atom.coord[0] for atom in structure.get_atoms()
          if atom.name in ["CA", "CB"]]

You can also drill down through the nested structure of models, chains and residues to get to the atoms that way.

To me these are more Pythonic than the clever natural language parsing in p3d (which seems ideal for a user interface, rather than a programming API). Biopython might be improved by defining an atoms property (list or iterator?) instead of the get_atoms() method.

One might also ask for x, y and z properties on the atom object to provide direct access to the three coordinates as floats. Do you think this sort of little thing would help improve Bio.PDB?

> Still I think both methods could exists side by side. If it is efficient - I
> don't know. Replacing biopythons.pdb parser was never the intention
> and I think it has features that are really good and fast!

Yes, it should be possible to offer nice nested access and nice flat access from the same objects. Internally the current Biopython PDB structure could perhaps be handled as filtered views of a complete list of all the atoms (using sets and trees or a database or whatever). That might make some things faster too.

> Yes that was one thing that we were really missing. Also the fact that
> biopython requires the unfolded entity to be converted to vectors and so
> forth was a bit complex and we needed fast and direct access to the
> coordinates, the very essence of pdb files.
I'm not quite sure what you mean here by "vectors". Could you be a little more specific? Do you want NumPy style objects or something else? Peter From chapmanb at 50mail.com Wed Oct 21 08:34:22 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 21 Oct 2009 08:34:22 -0400 Subject: [Biopython] Adaptor trimmer and dimers In-Reply-To: <843737.47817.qm@web52003.mail.re2.yahoo.com> References: <20091019112441.GA72523@sobchak.mgh.harvard.edu> <843737.47817.qm@web52003.mail.re2.yahoo.com> Message-ID: <20091021123422.GD72523@sobchak.mgh.harvard.edu> Hi Anastasia; Thanks for the additional info. > I also had to add an additional test for the length of the alignment output, > as I got an index Error for the cases the adapter does not align at > all. Good catch on this. I updated the trimming code to handle that case: http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py > My main problem now is performance of this script: On a file of 19 > million reads of 76 bp it is running for more than 12 hours! So I copy > here my code and would be very grateful if someone could indicate parts > where it could be sped up. Peter had a good suggestion on profiling. The Python profile module is quick to learn and can quickly point you in the direction of the most used functions: http://docs.python.org/library/profile.html Based on reading your code there are a couple of things that stick out to me: - You are calling the pairwise2 alignment 3 times. You should call this once, assign the alignment information to a variable, and then perform your if/else tests on that. The updated trimming code above is a good example of doing this. - You are slicing SeqRecord objects, and then never using the sliced records. Your code doesn't look like adaptor trimming, but rather filtering out reads without a sequence. 
If you don't need the trimmed record, pass a string (str(rec1.seq) and str(rec2.seq)) to the handle_adaptor function instead of the record; the slicing is then done on a much simpler object and you avoid the substantial overhead of slicing up quality scores that are never used.

If you end up needing trimmed fastq sequences, here is how I would reimplement your basic logic with the trimmer and Peter's suggestion:

from Bio.SeqIO.QualityIO import FastqGeneralIterator
from adaptor_trim import trim_adaptor_w_qual

in_file = "test.fastq"
out_file = "trimmed.fastq"
in_handle = open(in_file)
out_handle = open(out_file, "w")
iterator = FastqGeneralIterator(in_handle)
adaptor = "AAAAAAAAAAAAAAAAAAAA"
num_errors = 2
while 1:
    try:
        # read the two mates of a pair as plain (title, sequence, quality) strings
        title1, seq1, qual1 = next(iterator)
        title2, seq2, qual2 = next(iterator)
    except StopIteration:
        break
    tseq1, tqual1 = trim_adaptor_w_qual(seq1, qual1, adaptor, num_errors)
    tseq2, tqual2 = trim_adaptor_w_qual(seq2, qual2, adaptor, num_errors)
    # if neither has the adaptor
    if len(tseq1) == len(seq1) and len(tseq2) == len(seq2):
        out_handle.write("@%s\n%s\n+\n%s\n" % (title1, tseq1, tqual1))
        out_handle.write("@%s\n%s\n+\n%s\n" % (title2, tseq2, tqual2))
out_handle.close()
in_handle.close()

Hope this helps,
Brad

From biopython at maubp.freeserve.co.uk Wed Oct 21 12:16:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Oct 2009 17:16:22 +0100
Subject: [Biopython] Deprecating Bio.Clustalw?
Message-ID: <320fb6e00910210916l5d39aa2eje322f2a01e9ac020@mail.gmail.com>

Dear all,

In our most recent release, Biopython 1.52, Bio.Clustalw was declared obsolete. This is just a label to indicate that it will at some point be deprecated (issue a warning when used) and later it will be removed completely.

The module provides two features - parsing Clustal alignments, and calling the clustalw command line tool. Bio.AlignIO took over the role for parsing alignments a year and a half ago with Biopython 1.46 (June 2008).
More recently, Bio.Align.Applications took over the role for calling ClustalW in Biopython 1.51 (August 17, 2009) as part of an ongoing standardisation of our command line wrappers using the built-in Python module subprocess.

I recognise that Bio.Clustalw has been widely used, and there are likely to be many existing scripts out there using it. Does leaving this module as "obsolete" for Biopython 1.53, and deprecating it in Biopython 1.54 sound like a good plan?

If anyone is using it heavily, please say so - especially if you try and update your code to use Bio.AlignIO or subprocess and Bio.Align.Applications.

Peter

From fufezan at uni-muenster.de Wed Oct 21 14:22:48 2009
From: fufezan at uni-muenster.de (Christian Fufezan)
Date: Wed, 21 Oct 2009 20:22:48 +0200
Subject: [Biopython] Biopython & p3d
In-Reply-To: <320fb6e00910210401l737252deg78de117143395279@mail.gmail.com>
References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de> <320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com> <905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de> <320fb6e00910210401l737252deg78de117143395279@mail.gmail.com>
Message-ID:

>> A data structure that is build like that of Biopython.pdb imposes
>> multiple nested loops and condition queries.
>
> Not really - see below.

if things get more complicated, there might be a need ....

>> p3ds data structure is not nested and gains strength through
>> combination
>> of sets and BSPTree
>> This allows faster and more flexible looping. Looping over all
>> alpha and
>> beta-carbons for example and printing x-coordinates
>>
>> p3d:
>> for atom in pdb.query('protein and atom type CB or atom type CA'):
>>     print atom.x
>
> The Bio.PDB structure, model or chain object do offer direct access
> to a "flat" list of atoms via the get_atoms() method. e.g.
> > from Bio import PDB > structure = Bio.PDB.PDBParser().get_structure("Test", "XXXX.pdb") > for atom in structure.get_atoms() : > if atom.name in ["CA", "CB"] : print atom.coord > > (I'd have to think a bit longer about how in general to restrict > this to > proteins, here that is implicit since CA and CB are protein specific) > That would be the second condition to check ... if the search should be limited to certain atoms of chain A and C then one would require another check. Personally, I can not see the advantages of a nested structure, but then I am not an expert. > You can also of course use a list comprehension, e.g. to get all > the x-coordinates (which I guess is what your example does), > > from Bio import PDB > structure = Bio.PDB.PDBParser().get_structure("Test", "XXXX.pdb") > x_list = [atom.coord[0] for atom in structure.get_atoms() \ > if atom.name in ["CA", "CB"]] > > You can also drill down through the nested structure of models, > chains and residues to get to the atoms that way. > > To me these are more Pythonic than the clever natural language > parsing in p3d (which seems ideal for a user interface, rather than > a programming API). That is, I guess, a matter of taste. I am happy if an API helps me to reach my goal fast. x_list = [atom.x for atom in pdb.query('protein and atom type CB or atom type CA')] seems more intuitive and clearer than atom.coord[0] for atom in structure.get_atoms() if atom.name in ["CA", "CB"]. But I guess that's a matter of taste. Pythonian for me is readable source code. But again, that's a matter of taste. If things get more complex than the power of a human readable interface becomes clearer. For example consider you want to get all ALAs that are within a distance range of a point in space. in p3d, one can define the point in space by a p3d.vector.Vector, lets say V1 and then form a query similar to "within 20 of V1 and not within 10 of V1". 
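As an illustration only, the "within 20 of V1 and not within 10 of V1" query above can be read as a difference of two sets. The sketch below is not how p3d resolves the query: `distance` and `within` are hypothetical helpers, points are plain coordinate tuples, and the search is brute force, where a real implementation would use a spatial tree.

```python
import math


def distance(p, q):
    """Euclidean distance between two (x, y, z) tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def within(points, centre, radius):
    """Indices of all points no farther than radius from centre (brute force)."""
    return {i for i, p in enumerate(points) if distance(p, centre) <= radius}


points = [
    (0.0, 0.0, 5.0),   # 5 away from the origin: inside the inner sphere
    (0.0, 12.0, 0.0),  # 12 away: in the shell
    (15.0, 0.0, 0.0),  # 15 away: in the shell
    (0.0, 0.0, 25.0),  # 25 away: outside the outer sphere
]
v1 = (0.0, 0.0, 0.0)

# "within 20 of V1 and not within 10 of V1" as a set difference
shell = within(points, v1, 20.0) - within(points, v1, 10.0)
```

Further conditions (a residue name, an element) would then be intersected with `shell` in the same way.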
Or all proteinogenic oxygens that are not part of the backbone and within 4 ? of a ligand, e.g. ATP. without knowing what kind of oxygens these could be (i.e. OG1, OG, OE1, OD1, OD2, OE2) one can easily formulate a query in the form of "protein and oxygen and not backbone and within 4 of resname ATP" The query can actually also be resolved to a set of set operations e.g. for atom in pdb.hash["resid"][20] & pdb.hash["oxygen"][""]: but the query function is simply to convenient ;) > Biopython might be improved by defining an > atoms property (list or iterator?) instead of the get_atoms() method. > agree. I would argue that p3d's atom/vector class seems the way to go. > One might also ask for x, y and z properties on the atom object > to provide direct access to the three coordinates as floats. Do > you think this sort of little thing would help improve Bio.PDB? > yes indeed, that is _the_ information a pdb module should offer without any addition. Better would be even if the atoms are treatable as vectors (see below). p3d has a series of atom object attributes that are convenient. >> Still I think both methods could exists side by side. If it is >> efficient - I >> don't know. Replacing biopythons.pdb parser was never the intention >> and I think it has features that are really good and fast! > > Yes, it should be possible to offer nice nested access and nice flat > access from the same objects. Internally the current Biopython PDB > structure could perhaps be handled as filtered views of a complete > list of all the atoms (using sets and trees or a database or > whatever). > That might make some things faster too. I agree to some extent. As above, I can only say that I cannot see the advantage of a nested data structure. Maybe you can explain with an example where drilling through the nested structure could come in handy. >> Yes that was one thing that we were really missing. 
Also the fact >> that >> biopython requires the unfolded entity to be converted to vectors >> and so >> forth was a bit complex and we needed fast and direct access to the >> coordinates, the very essence of pdb files. > > I'm not quite sure what you mean here by "vectors". Could you > be a little more specific? Do you want NumPy style objects or > something else? In p3d the atom objects are vectors, so writing an structural alignment script is straight forward (see e.g. http://p3d.fufezan.net/index.php?title=alignByATP ). Or to find the geometric centre of the protein/a residue/ a chain or a custom set is simply centre = p3d.vector.Vector() for atom in atoms: centre += atom centre = centre/len(atoms) So distances between two atoms are the length of their subtraction, e.g atomA.distanceTo(atomB) will yield the same as abs(atomA-atomB) Yes similar to a NumPy object, but without the big NumPy overhead and more specific to atoms, e.g. atom.resid, atom.chain, atom.beta, atom.x. From biopython at maubp.freeserve.co.uk Wed Oct 21 18:14:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 21 Oct 2009 23:14:10 +0100 Subject: [Biopython] Biopython & p3d In-Reply-To: References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de> <320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com> <905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de> <320fb6e00910210401l737252deg78de117143395279@mail.gmail.com> Message-ID: <320fb6e00910211514r952732fw6c6d48f9be0e46c9@mail.gmail.com> On Wed, Oct 21, 2009 at 7:22 PM, Christian Fufezan wrote: >> Biopython might be improved by defining an atom >> property (list or iterator?) instead of the get_atoms() method. > > agree. ?I would argue that p3d's atom/vector class seems the way to go. We can probably have similar things for chains etc. Any other views on this? 
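The gradual migration discussed in this thread (add a property, keep the old get_* method, deprecate it later) might look roughly like the sketch below. `ToyResidue` is a made-up stand-in and not Bio.PDB's Residue class; whether the property should return a list or an iterator is left open in the thread, and an iterator is simply assumed here.

```python
import warnings


class ToyResidue:
    """Hypothetical stand-in for a residue-like container of atoms."""

    def __init__(self, atom_names):
        self._atom_names = list(atom_names)

    @property
    def atoms(self):
        """New-style access: iterate over the atoms via a property."""
        return iter(self._atom_names)

    def get_atoms(self):
        """Old-style access, kept working but flagged as deprecated."""
        warnings.warn(
            "get_atoms() is deprecated; use the atoms property instead",
            DeprecationWarning,
            stacklevel=2,
        )
        return self.atoms


residue = ToyResidue(["N", "CA", "C", "O"])
names = list(residue.atoms)
```

Existing scripts calling `residue.get_atoms()` keep working and only see a warning, which is the point of keeping both for a few releases.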
I never liked the get_* and set_* methods in Bio.PDB myself, and using Python properties seems more natural here (they may not have existed when Bio.PDB was first started - I'd have to check).

[We should probably break out specific suggestions like this into new mailing list threads, and CC people like Thomas H.]

>> One might also ask for x, y and z properties on the atom object
>> to provide direct access to the three coordinates as floats. Do
>> you think this sort of little thing would help improve Bio.PDB?
>>
> yes indeed, that is _the_ information a pdb module should offer
> without any addition. Better would be even if the atoms are
> treatable as vectors (see below). p3d has a series of atom
> object attributes that are convenient.

I would argue that the x-y-z triple (which Biopython has) is more important than separate x, y, and z floats. We seem to agree here.

The Biopython atom's coord property is an x-y-z triple (as a one dimensional numpy array). The Bio.PDB code also defines its own vector objects on top of this, but my memory of the details is hazy here. As I recall, I personally stuck with the numpy objects in my scripts using Bio.PDB.

>> Yes, it should be possible to offer nice nested access and nice flat
>> access from the same objects. Internally the current Biopython PDB
>> structure could perhaps be handled as filtered views of a complete
>> list of all the atoms (using sets and trees or a database or whatever).
>> That might make some things faster too.
>
> I agree to some extent. As above, I can only say that I
> cannot see the advantage of a nested data structure.
> Maybe you can explain with an example where drilling
> through the nested structure could come in handy.

The drill down is great for selecting a particular residue or chain (or for NMR, a particular model). It is also good for looping over these structures - e.g. to process psi/phi angles along a protein backbone.

>>> Yes that was one thing that we were really missing.
Also the fact that >>> biopython requires the unfolded entity to be converted to vectors and so >>> forth was a bit complex and we needed fast and direct access to the >>> coordinates, the very essence of pdb files. >> >> I'm not quite sure what you mean here by "vectors". Could you >> be a little more specific? Do you want NumPy style objects or >> something else? > > In p3d the atom objects are vectors, I don't immediately see what the intention is here. What does "adding" or "subtracting" two atom/vector objects give you? A new non-atom vector would be my guess? What about multiplying by a scaler? Again, getting a non-atom vector object back makes most sense. > so writing an structural alignment script is straight forward > (see e.g. http://p3d.fufezan.net/index.php?title=alignByATP). Structural alignment is not so different in Biopython - just the details. e.g. http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > Or to find the geometric centre of the protein/a residue/ a chain > or a custom set is simply > centre = p3d.vector.Vector() > for atom in atoms: > ? ? ? ?centre += atom > centre = centre/len(atoms) And you can do all of that with the NumPy array of three coordinates accessed via atom.coord - in many respects it is a "vector". For example, with a typical Bio.PDB Residue object, the geometric center/centre is just one line: >>> centre = numpy.sum(atom.coord for atom in residue) / len(residue) >>> centre array([ -0.21274999, 2.609375 , 13.95149994], dtype=float32) The centre of mass would be more interesting to calculate, but for that we need the atomic masses. > So distances between two atoms are the length of their subtraction, e.g > atomA.distanceTo(atomB) will yield the same as abs(atomA-atomB) I guess your atomA-atomB returns a vector, and abs() gives its length. 
You can get the distance between two Bio.PDB atoms with atomA-atomB
(and you don't need to stick an abs on it either, because our atoms
are not trying to act like vectors - we can just return a float).

> Yes similar to a NumPy object, but without the big NumPy overhead
> and more specific to atoms, e.g. atom.resid, atom.chain, atom.beta,
> atom.x.

Well, yes, NumPy is a big project, and Bio.PDB is one of the main
bits of Biopython that uses it. But it is very useful for numerical
work, and a good choice here I think. And assuming you *like* numpy,
having the Bio.PDB atom objects expose the x-y-z coordinates as a
simple one dimensional numpy array of floats is very natural.

You said earlier:
>>> Also the fact that biopython requires the unfolded entity
>>> to be converted to vectors and so forth was a bit complex
>>> and we needed fast and direct access to the coordinates,
>>> the very essence of pdb files.

I disagree. The Biopython atom objects give "fast and direct
access to the coordinates" via the coord property, which is
a one-dimensional numpy array (aka, a vector). For fast and
efficient numerical operations there is no need to convert this
into anything else (although a bespoke vector object may make
things more elegant).

Peter

P.S. This thread is proving quite interesting :)

From biopython at maubp.freeserve.co.uk Wed Oct 21 18:55:11 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Oct 2009 23:55:11 +0100
Subject: [Biopython] Biopython on Jython
Message-ID: <320fb6e00910211555m4379dd34wfecdd9373cfd0498@mail.gmail.com>

Hi Kyle,

You probably noticed I merged some of your fixes to get (the non C and
non NumPy bits of) Biopython to work on Jython, but not all. Could you
update your github branch to the trunk at some point? That would help
in picking up more of your fixes.
Many of the issues related to large python methods exceeding JVM size
restrictions, something which Jython was going to try and fix in 2.5.1
(but didn't seem to be solved in the release candidate I was trying),
see e.g. http://bugs.jython.org/issue527524

Do you (Kyle) know more about the Jython plans and if/when they
might resolve this? I would prefer to avoid any ugly Jython specific
fixes in Biopython - especially if the next release of Jython may
resolve many of these points.

Thanks,

Peter

From biopython at maubp.freeserve.co.uk Thu Oct 22 05:15:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 22 Oct 2009 10:15:27 +0100
Subject: [Biopython] Biopython on Jython
In-Reply-To:
References: <320fb6e00910211555m4379dd34wfecdd9373cfd0498@mail.gmail.com>
Message-ID: <320fb6e00910220215u323dd237r28e2b8a15b651e43@mail.gmail.com>

Hi all,

I probably should have started this thread with a more general
question: is anyone other than Kyle interested in running Biopython
under Jython?
http://lists.open-bio.org/pipermail/biopython/2009-October/005734.html

Some of the fixes this required are minor things that will also help
with other Python variants like IronPython (e.g. unit tests shouldn't
make any assumptions about the order of dictionary keys), and are
worthwhile in their own right. Others (as discussed below) are less
general...

On Thu, Oct 22, 2009 at 5:47 AM, Kyle Ellrott wrote:
>
>> You probably noticed I merged some of your fixes to get (the non C and
>> non NumPy bits of) Biopython to work on Jython, but not all. Could you
>> update your github branch to the trunk at some point? That would help
>> in picking up more of your fixes.
>
> I've tried to keep my branch up to speed with the mainline. But I didn't
> branch my work from master, so it may be harder to extract...

True, but I can probably manage.
>> Many of the issues related to large python methods exceeding JVM size >> restrictions, something which Jython was going to try and fix in 2.5.1 >> (but didn't seem to be solved in the release candidate I was trying), >> see e.g. http://bugs.jython.org/issue527524 >> Do you (Kyle) know about more about the Jython plans and if/when they >> might resolve this? I would prefer to avoid any ugly Jython specific >> fixes in Biopython - especially if the next release of Jython may >> resolve many of these points. > > One of the main Jython developers pointed this possible solution out to me. > From his email: > >> You may be interested to know that one of the things on my development >> backlog is to complete a Python bytecode compiler so that we can run >> arbitrarily long methods. This works because Jython 2.5.0 includes a VM to >> run Python bytecode (org.python.core.PyBytecode). That sounds like what I have seen references to online, originally targeted for Jython 2.5.1 but which seems to have slipped. >> In a pinch, you could do >> the same thing too now by creating a .pyc file with CPython instead of the >> $py.class file, then using "import pycimport" in a startup script to install >> that as a custom inporter. It's not terribly convenient however for >> distribution, unfortunately. > > It sounds like it would make the Jython BioPython code more 'hacky'. The pycimport thing does sound messy, I agree. > I managed to isolate all of the 'large method code' that was in BioPython. > The easiest way to fix those problems was to take large functions and split > them into 'a', 'b', 'c', etc,? functions. Yes, and for things like the unit tests I don't mind this. For some of the main code, the fix really didn't help with the readability of the code - which is why I am hoping the Python bytecode compiler in Jython happens soon. > One other side project to watch out for is ctypes for Jython.? I've heard > several of the Jython developers talking about it.? 
And if they get it to > work, C modules written for python, wrapped with the ctypes module, > may be able to work in Jython. That would be good. Another issue is NumPy on Jython, where even a slow compatibility library would be useful to us for getting Bio.PDB to work on Jython. Things like Bio.Cluster interface with the NumPy C code are of course not so feasible. I noticed you added something to the Biopython setup.py on your branch to assume NumPy will not be available under Jython (and not prompt the user about it being missing). I should merge that into the trunk... Peter From natassa_g_2000 at yahoo.com Thu Oct 22 05:29:35 2009 From: natassa_g_2000 at yahoo.com (natassa) Date: Thu, 22 Oct 2009 02:29:35 -0700 (PDT) Subject: [Biopython] Adaptor trimmer and dimers In-Reply-To: <20091021123422.GD72523@sobchak.mgh.harvard.edu> Message-ID: <258333.91161.qm@web52007.mail.re2.yahoo.com> Hi Brad, Thank you very much for your comments! Peter had a good suggestion on profiling. The Python profile module is quick to learn and can quickly point you in the direction of the most used functions: http://docs.python.org/library/profile.html I looked at the profile module, I am still not sure about the input type I may give to cProfile (my module name?) - it is syntax comprehension problem now, but i am sure i ll solve it ;-) - You are calling the pairwise2 alignment 3 times. You should call ? this once, assign the alignment information to a variable, and then ? perform your if/else tests on that. The updated trimming code above ? is a good example of doing this. Thanks! I forgot to clean up the code after I solved out this index error-this was my 'dirty' version when I was trying to understand this issue. - You are slicing SeqRecord objects, and then never using the sliced ? records. Your code doesn't look like adaptor trimming, but rather ? filtering out reads without a sequence. If you don't need the ? trimmed record, pass a string (str(rec1.seq) and str(rec2.seq)) to ? 
the handle_adaptor function instead of the record; the slicing is ? then done on a much simpler object and you avoid the substantial ? overhead of slicing up quality scores that are never used. Again, not very clean code as I have been oscillating between trimming/removing? for some days now. I finally decided that if I don't have a big proportion of nearly exact (max 2 errors) matches to the adaptor in my reads, I may just discard them, as trimming a 33/37 bp adaptor from a 55-bp read does not leave much anyway. You were right about passing a string to the function, I had not thought that passing the whole record would be more heavy. The revised script (for removing, but taking into account all your suggestions, so using the general iterator) is still running for very long, unfortunately without a profiler-I need to understand this module more.. Thanks for all suggestions! Anastasia Anastasia Gioti Post-Doc, Evolutionary Biology Department Upssala University Norbyv?gen 18D SE-752 36? UPPSALA anastasia.gioti at ebc.uu.se Tel: +46-18-471 2837 Fax: +46-18-471 6310 From mavata at gmail.com Thu Oct 22 05:45:13 2009 From: mavata at gmail.com (Manu Tamminen) Date: Thu, 22 Oct 2009 12:45:13 +0300 Subject: [Biopython] About BLAST parser Message-ID: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com> I have a problem with the Biopython BLAST parser. I'm using the parser to extract relevant information from an XML result file into a tab- separated table. It seems the XML file occasionally contains errors that cause the script to abort. This is especially common and annoying with sequence alignments that contain thousands of sequences. Is it possible to write the script so that when an error occurs, the script would jump into the next sequence rather than abort completely? I will include below an example of such error. This error is about a mismatched tag - sometimes the error has also been about a missing tag. 
for blast_record in blast_records: File "/Library/Frameworks/Python.framework/Versions/2.6/lib/ python2.6/site-packages/Bio/Blast/NCBIXML.py", line 660, in parse expat_parser.Parse(text, True) # End of XML record xml.parsers.expat.ExpatError: mismatched tag: line 82921, column 4 Any help appreciated! Thanks! Manu From biopython at maubp.freeserve.co.uk Thu Oct 22 05:56:32 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 22 Oct 2009 10:56:32 +0100 Subject: [Biopython] About BLAST parser In-Reply-To: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com> References: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com> Message-ID: <320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com> On Thu, Oct 22, 2009 at 10:45 AM, Manu Tamminen wrote: > I have a problem with the Biopython BLAST parser. I'm using the parser to > extract relevant information from an XML result file into a tab-separated > table. It seems the XML file occasionally contains errors that cause the > script to abort. This is especially common and annoying with sequence > alignments that contain thousands of sequences. > > Is it possible to write the script so that when an error occurs, the script > would jump into the next sequence rather than abort completely? I will > include below an example of such error. This error is about a mismatched tag > - sometimes the error has also been about a missing tag. > > ? ?for blast_record in blast_records: > ?File > "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/Blast/NCBIXML.py", > line 660, in parse > ? ?expat_parser.Parse(text, True) # End of XML record > xml.parsers.expat.ExpatError: mismatched tag: line 82921, column 4 XML is a strict file format with tags like having a closing tag . If the XML file is truncated or something, you can have mismatched tags (e.g. an without an ) which means the XML file is invalid. This is basically what that error message is about. 
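
To illustrate the kind of failure described above, here is a minimal
sketch using the same expat library that the traceback shows underneath
NCBIXML. The helper name check_well_formed and the three tiny XML
strings are made up for this example (Hit and Hit_num are borrowed from
the BLAST XML vocabulary only for flavour); it is not how Biopython
itself reports errors.

```python
import xml.parsers.expat


def check_well_formed(xml_text):
    """Return None if xml_text parses cleanly, else (message, line, column)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        # The second argument tells expat this is the final chunk of input.
        parser.Parse(xml_text, True)
    except xml.parsers.expat.ExpatError as err:
        message = xml.parsers.expat.ErrorString(err.code)
        return (message, err.lineno, err.offset)
    return None


# Hypothetical fragments: one valid, one with tags closed in the wrong
# order, and one cut off before the outer closing tag.
good = "<Hit><Hit_num>1</Hit_num></Hit>"
crossed = "<Hit><Hit_num>1</Hit></Hit_num>"
truncated = "<Hit><Hit_num>1</Hit_num>"

for label, text in [("good", good), ("crossed", crossed), ("truncated", truncated)]:
    print("%s: %r" % (label, check_well_formed(text)))
```

Running something like this over a suspect file (read in full, or fed
to the parser in chunks) would at least confirm whether the problem is
in the file itself rather than in the script that loops over the
records.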
I can make some suggestions that may help, but first, are you
running BLAST locally or online? Are you saving the results to
a file, or parsing directly from the handle? How many query
sequences do you have?

Peter

From mavata at gmail.com Thu Oct 22 06:06:47 2009
From: mavata at gmail.com (Manu Tamminen)
Date: Thu, 22 Oct 2009 13:06:47 +0300
Subject: [Biopython] About BLAST parser
In-Reply-To: <320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com>
References: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com> <320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com>
Message-ID:

Hi Peter! Thanks for your prompt reply! I've run the BLAST analysis on
a supercomputer cluster, saved the results into an XML file and then
transferred the output file to my computer. I then run the script on
my computer to parse the results into a tab separated file. With the
current dataset I have 1115 sequences of around 500 bp each.

Manu

On Oct 22, 2009, at 12:56 PM, Peter wrote:

> On Thu, Oct 22, 2009 at 10:45 AM, Manu Tamminen
> wrote:
>> I have a problem with the Biopython BLAST parser. I'm using the
>> parser to extract relevant information from an XML result file into
>> a tab-separated table. It seems the XML file occasionally contains
>> errors that cause the script to abort. This is especially common and
>> annoying with sequence alignments that contain thousands of sequences.
>>
>> Is it possible to write the script so that when an error occurs,
>> the script would jump into the next sequence rather than abort
>> completely? I will include below an example of such error. This
>> error is about a mismatched tag
>> - sometimes the error has also been about a missing tag.
>> >> for blast_record in blast_records: >> File >> "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/ >> site-packages/Bio/Blast/NCBIXML.py", >> line 660, in parse >> expat_parser.Parse(text, True) # End of XML record >> xml.parsers.expat.ExpatError: mismatched tag: line 82921, column 4 > > XML is a strict file format with tags like having a closing > tag . If the XML file is truncated or something, you can > have mismatched tags (e.g. an without an ) which > means the XML file is invalid. This is basically what that error > message is about. > > I can make some suggestions that may help, but it first are you > running BLAST locally or online? Are you saving the results to > a file, or parsing directly from the handle? How many query > sequences do you have? > > Peter --- Manu Tamminen, M.Sc. University of Helsinki Department of Applied Chemistry and Microbiology, Division of Microbiology P.O. Box 56 00014 HELSINKI FINLAND tel: +358 (0)9191 57585 fax: +358 (0)9191 59322 e-mail: manu.tamminen at helsinki.fi home: http://www.mm.helsinki.fi/~mvtammin/ From biopython at maubp.freeserve.co.uk Thu Oct 22 06:19:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 22 Oct 2009 11:19:02 +0100 Subject: [Biopython] About BLAST parser In-Reply-To: References: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com> <320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com> Message-ID: <320fb6e00910220319t55c7401bgc31f1502db1a639b@mail.gmail.com> On Thu, Oct 22, 2009 at 11:06 AM, Manu Tamminen wrote: > > Hi Peter! Thanks for your prompt reply! I've run the BLAST analysis on a > supercomputer cluster, saved the results into a XML file and then > transferred the output file to my computer. I then run the script on my > computer to parse the results into a tab separated file. With the current > dataset I have 1115 sequences of around 500 bp each. > Manu Based on the Biopython error message, I suspect your XML file is broken. How big is the XML file (MB). 
There are online tools for this, but uploading a large file is out of the question. You could also open the file in a suitable editor, go to the line number given in the Biopython error message, and look at the file by eye to see if there is anything obvious. It is possible that the XML file was corrupted when you copied it to your local machine (e.g. a network error). You could try zipping it up, and then copying it again. It is also possible that the XML file was corrupted on the disk on the cluster (rare, but this can happen). In this case you might be able to fix the XML by hand, or re-run it. Alternatively, it is possible that the file is valid, and the Biopython parser (or the Python library we use internally) has a bug. As long as the XML file isn't too big (say 10MB), you could email it to me personally (NOT the mailing list) and I can try and have a look at it. Personally, I would break up the task into jobs (maybe six jobs of up to 200 sequences each - or even one sequence per job). On most clusters this is a good idea anyway, as they can then be handled by different cluster nodes. For the analysis, you just have to parse the separate XML files. Any corrupted XML file will then only affect a few sequences, and checking it or re-running it is going to be much quicker and easier. Peter From mavata at gmail.com Thu Oct 22 06:34:55 2009 From: mavata at gmail.com (Manu Tamminen) Date: Thu, 22 Oct 2009 13:34:55 +0300 Subject: [Biopython] About BLAST parser In-Reply-To: <320fb6e00910220319t55c7401bgc31f1502db1a639b@mail.gmail.com> References: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com> <320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com> <320fb6e00910220319t55c7401bgc31f1502db1a639b@mail.gmail.com> Message-ID: <69C1EBC7-8A0E-47EE-910C-478BE263125C@gmail.com> With all blast hits included, the output file is around 1 gigabyte. Therefore just opening and searching for the broken parts is challenging with regular text editors. 
Furthermore, I'm not very familiar with XML syntax and therefore would probably not recognize the broken parts. Breaking down the search into smaller parts sounds like a good idea. However, I'm also considering writing a more robust script. Would it be possible to make the script ignore the broken entries in the XML file and skip into next correct one? On Oct 22, 2009, at 1:19 PM, Peter wrote: > On Thu, Oct 22, 2009 at 11:06 AM, Manu Tamminen > wrote: >> >> Hi Peter! Thanks for your prompt reply! I've run the BLAST analysis >> on a >> supercomputer cluster, saved the results into a XML file and then >> transferred the output file to my computer. I then run the script >> on my >> computer to parse the results into a tab separated file. With the >> current >> dataset I have 1115 sequences of around 500 bp each. >> Manu > > Based on the Biopython error message, I suspect your XML file is > broken. How big is the XML file (MB). There are online tools for this, > but uploading a large file is out of the question. You could also open > the file in a suitable editor, go to the line number given in the > Biopython > error message, and look at the file by eye to see if there is anything > obvious. > > It is possible that the XML file was corrupted when you copied it to > your local machine (e.g. a network error). You could try zipping it > up, and then copying it again. It is also possible that the XML file > was corrupted on the disk on the cluster (rare, but this can happen). > In this case you might be able to fix the XML by hand, or re-run it. > > Alternatively, it is possible that the file is valid, and the > Biopython parser > (or the Python library we use internally) has a bug. As long as the > XML file isn't too big (say 10MB), you could email it to me personally > (NOT the mailing list) and I can try and have a look at it. > > Personally, I would break up the task into jobs (maybe six jobs of > up to 200 sequences each - or even one sequence per job). 
On > most clusters this is a good idea anyway, as they can then be > handled by different cluster nodes. For the analysis, you just have > to parse the separate XML files. Any corrupted XML file will then > only affect a few sequences, and checking it or re-running it is > going to be much quicker and easier. > > Peter --- Manu Tamminen, M.Sc. University of Helsinki Department of Applied Chemistry and Microbiology, Division of Microbiology P.O. Box 56 00014 HELSINKI FINLAND tel: +358 (0)9191 57585 fax: +358 (0)9191 59322 e-mail: manu.tamminen at helsinki.fi home: http://www.mm.helsinki.fi/~mvtammin/ From biopython at maubp.freeserve.co.uk Thu Oct 22 06:51:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 22 Oct 2009 11:51:45 +0100 Subject: [Biopython] About BLAST parser In-Reply-To: <69C1EBC7-8A0E-47EE-910C-478BE263125C@gmail.com> References: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com> <320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com> <320fb6e00910220319t55c7401bgc31f1502db1a639b@mail.gmail.com> <69C1EBC7-8A0E-47EE-910C-478BE263125C@gmail.com> Message-ID: <320fb6e00910220351g3a10ce91mef097ceaf0ab5d0@mail.gmail.com> On Thu, Oct 22, 2009 at 11:34 AM, Manu Tamminen wrote: > > With all blast hits included, the output file is around 1 gigabyte. > Therefore just opening and searching for the broken parts is challenging > with regular text editors. Furthermore, I'm not very familiar with XML > syntax and therefore would probably not recognize the broken parts. There is probably a neat way to extract a chunk using Unix command line tools. 
Or just try something like this in Python:

    error_line = 82921
    input_handle = open("really_big.xml")
    output_handle = open("fragment.txt", "w")
    # keep only the lines within 1000 lines of the reported error
    for line_number, line in enumerate(input_handle) :
        if error_line - 1000 < line_number and line_number < error_line + 1000 :
            output_handle.write(line)
    input_handle.close()
    output_handle.close()

I would still suggest you re-try copying it from the cluster to your
machine, in case it was just a network error corrupting the file.

> Breaking down the search into smaller parts sounds like a good idea.
> However, I'm also considering writing a more robust script. Would it be
> possible to make the script ignore the broken entries in the XML file and
> skip to the next correct one?

I think that will be tricky. Part of the idea of XML is that it is a
strictly defined file format where there are standards about how to
interpret and abort with bad data. Tolerant XML parsers are considered
to be a bad thing.

What should be possible is a simple script that removes the broken
section of the file, giving a (partial) but valid XML file covering most
of the sequences. It might be more effort than just re-doing the search
(in parts this time).

Peter

From mavata at gmail.com Thu Oct 22 07:10:11 2009
From: mavata at gmail.com (Manu Tamminen)
Date: Thu, 22 Oct 2009 14:10:11 +0300
Subject: [Biopython] About BLAST parser
In-Reply-To: <320fb6e00910220351g3a10ce91mef097ceaf0ab5d0@mail.gmail.com>
References: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com> <320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com> <320fb6e00910220319t55c7401bgc31f1502db1a639b@mail.gmail.com> <69C1EBC7-8A0E-47EE-910C-478BE263125C@gmail.com> <320fb6e00910220351g3a10ce91mef097ceaf0ab5d0@mail.gmail.com>
Message-ID: <1BC8860E-E03C-4654-A012-CBA9CD889370@gmail.com>

Thanks very much for your help and suggestions! I think I'll manage
from here on!
Manu

On Oct 22, 2009, at 1:51 PM, Peter wrote:

> On Thu, Oct 22, 2009 at 11:34 AM, Manu Tamminen
> wrote:
>>
>> With all blast hits included, the output file is around 1 gigabyte.
>> Therefore just opening and searching for the broken parts is
>> challenging with regular text editors. Furthermore, I'm not very
>> familiar with XML syntax and therefore would probably not recognize
>> the broken parts.
>
> There is probably a neat way to extract a chunk using Unix command
> line tools. Or just try something like this in Python:
>
> error_line = 82921
> input_handle = open("really_big.xml")
> output_handle = open("fragment.txt", "w")
> for line_number, line in enumerate(input_handle) :
>     if error_line - 1000 < line_number and line_number < error_line + 1000 :
>         output_handle.write(line)
> input_handle.close()
> output_handle.close()
>
> I would still suggest you re-try copying it from the cluster to your
> machine, in case it was just a network error corrupting the file.
>
>> Breaking down the search into smaller parts sounds like a good idea.
>> However, I'm also considering writing a more robust script. Would
>> it be possible to make the script ignore the broken entries in the
>> XML file and skip to the next correct one?
>
> I think that will be tricky. Part of the idea of XML is that it is a
> strictly defined file format where there are standards about how to
> interpret and abort with bad data. Tolerant XML parsers are
> considered to be a bad thing.
>
> What should be possible is a simple script that removes the broken
> section of the file, giving a (partial) but valid XML file covering
> most of the sequences. It might be more effort than just re-doing
> the search
> (in parts this time).
> > Peter From biopython at maubp.freeserve.co.uk Thu Oct 22 07:13:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 22 Oct 2009 12:13:22 +0100 Subject: [Biopython] About BLAST parser In-Reply-To: <1BC8860E-E03C-4654-A012-CBA9CD889370@gmail.com> References: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com> <320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com> <320fb6e00910220319t55c7401bgc31f1502db1a639b@mail.gmail.com> <69C1EBC7-8A0E-47EE-910C-478BE263125C@gmail.com> <320fb6e00910220351g3a10ce91mef097ceaf0ab5d0@mail.gmail.com> <1BC8860E-E03C-4654-A012-CBA9CD889370@gmail.com> Message-ID: <320fb6e00910220413h107142fdn992cc149e9afc099@mail.gmail.com> On Thu, Oct 22, 2009 at 12:10 PM, Manu Tamminen wrote: > > Thanks very much for your help and suggestions! I think I'll manage from > here on! > Manu Good luck, Peter From biopython at maubp.freeserve.co.uk Thu Oct 22 07:38:46 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 22 Oct 2009 12:38:46 +0100 Subject: [Biopython] Biopython on Jython In-Reply-To: <320fb6e00910220215u323dd237r28e2b8a15b651e43@mail.gmail.com> References: <320fb6e00910211555m4379dd34wfecdd9373cfd0498@mail.gmail.com> <320fb6e00910220215u323dd237r28e2b8a15b651e43@mail.gmail.com> Message-ID: <320fb6e00910220438o1f6363a5mb82b00d967491617@mail.gmail.com> On Thu, Oct 22, 2009 at 10:15 AM, Peter wrote: > Hi all, > > On Thu, Oct 22, 2009 at 5:47 AM, Kyle Ellrott wrote: >>> You probably noticed I merged some of your fixes to get (the non C and >>> non NumPy bits of) Biopython to work on Jython, but not all. Could you >>> update your github branch to the trunk at some point? That would help >>> in picking up more of your fixes. >> >> I've tried to keep my branch up to speed with the mainline. ?But I didn't >> branch my work from master, so it may harder to extract... > > True, but I can probably manage. Thanks for updating your branch to the trunk. I've grabbed the BLAST XML fix (and tweaked it) - thanks. 
I also made test_Entrez.py get skipped on Jython (although I just
reused the missing dependency trick). See:
http://bugzilla.open-bio.org/show_bug.cgi?id=2918
http://bugs.jython.org/issue1447

>>> Many of the issues related to large python methods exceeding JVM size
>>> restrictions, something which Jython was going to try and fix in 2.5.1
>>> (but didn't seem to be solved in the release candidate I was trying),
>>> see e.g. http://bugs.jython.org/issue527524
>>> ...

This single issue covers the remaining test failures, and persists on
Jython 2.5.1 (final). They may solve it in the next release, or I can
look again at the workarounds on your branch.

We must of course skip anything requiring C code, or NumPy, but
most of Biopython is looking pretty good on Jython now. Good work
Kyle :)

Peter

From mikelisanke at gmail.com Thu Oct 22 14:19:42 2009
From: mikelisanke at gmail.com (Mike Lisanke)
Date: Thu, 22 Oct 2009 14:19:42 -0400
Subject: [Biopython] Windows installer does not find Python 2.63 with multiple pythons
In-Reply-To: <320fb6e00910191429l481223d8iae9fbe78702fff70@mail.gmail.com>
References: <8c5c6e580910191237w54e857bfs1107ba86df05db7c@mail.gmail.com> <320fb6e00910191429l481223d8iae9fbe78702fff70@mail.gmail.com>
Message-ID: <8c5c6e580910221119o1d36426dkfdb2d96d681828c9@mail.gmail.com>

Peter,

The problem was python-2.6.3-amd64 for which there is a
numpy-1.3.0-win-amd64-py2.6 installer. Is there a reason NumPy and
BioPython have a specific dependency to work with the AMD64 build of
python? I had assumed python would be considered the runtime
environment for numpy and biopython and the dependency would only be
language level. It's disappointing to think these problems are only
caused by registry check dependencies in the windows installers of
these applications. Thanks.

On Mon, Oct 19, 2009 at 5:29 PM, Peter wrote:

> On Mon, Oct 19, 2009 at 8:37 PM, Mike Lisanke
> wrote:
> > I had Python 3.0 installed prior to attempting a bio-python install.
I > > installed Python 2.6 to its own directory, and a proper registry entry > was > > made in HKEY_LOCAL_MACHINE\SOFTWARE\Python, however; > > the bio-python can not find the Python 2.6 install. Is there a problem > > having multiple python installs? Thanks. > > On my Windows machine I have Python 2.4, 2.5 and 2.6 all co-existing > fine (and I used to have 2.3 as well). These were all default installs to > C:\Python26 etc, and I didn't have to do anything funny to the registry. > I can try and remember to check the registry settings on my machine > if you like... but for now I can only suggest you might try uninstalling > Python 2.6, perhaps clean the registry, and then reinstall Python 2.6. > > Peter > > P.S. > > I haven't tried putting Python 3.0 on my Windows machine (not that > I would bother, I would go straight to Python 3.1 now that it is out). > -- Best regards, Mike From biopython at maubp.freeserve.co.uk Thu Oct 22 15:45:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 22 Oct 2009 20:45:04 +0100 Subject: [Biopython] Windows installer does not find Python 2.63 with multiple pythons In-Reply-To: <8c5c6e580910221119o1d36426dkfdb2d96d681828c9@mail.gmail.com> References: <8c5c6e580910191237w54e857bfs1107ba86df05db7c@mail.gmail.com> <320fb6e00910191429l481223d8iae9fbe78702fff70@mail.gmail.com> <8c5c6e580910221119o1d36426dkfdb2d96d681828c9@mail.gmail.com> Message-ID: <320fb6e00910221245w3eff78dcj54aa3eb57990300a@mail.gmail.com> On Thu, Oct 22, 2009 at 7:19 PM, Mike Lisanke wrote: > Peter, > > The problem was python-2.6.3-amd64 for which their is a > numpy-1.3.0-win-amd64-py2.6 installer. Is there a reason > NumPy and BioPython have a specific dependency to > work with the AMD64 build of python? Are you running on 64 bit Windows then? XP or Vista? It sounds like you are trying to mix 32 and 64 bit versions of Python. 
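
A quick way to confirm which kind of Python build an installer will be
looking for is to ask the interpreter itself. This is only a small
sketch using the standard library; the function name python_bits is
made up for this example.

```python
import platform
import struct
import sys


def python_bits():
    """Return the pointer size of the running interpreter in bits (32 or 64)."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8


print("Python %s, %d bit build" % (sys.version.split()[0], python_bits()))
# platform.architecture() gives the same information as a string such as '64bit'.
print(platform.architecture()[0])
```

Run this with each installed copy of Python in turn (e.g. by giving the
full path to its python.exe); the win32 installers mentioned in this
thread would only be expected to match a copy that reports 32 bit.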
If you installed the 64 bit version of Python and Numpy, then you will need a 64 bit compiled version of Biopython too - but we don't have one of those yet. We'd need a developer or a volunteer with a 64bit Windows machine to do this. You should be to install a 32 bit version of Python, http://python.org/ftp/python/2.6.3/python-2.6.3.msi plus the 32 bit Windows installer for Numpy: http://sourceforge.net/projects/numpy/files/NumPy/1.3.0/numpy-1.3.0-win32-superpack-python2.6.exe/download and the 32 bit Windows installer for Biopython: http://biopython.org/DIST/biopython-1.52.win32-py2.6.exe (i.e. look for win32 in the filenames, not amd64). Peter From michael.koeris at gmail.com Thu Oct 22 20:56:16 2009 From: michael.koeris at gmail.com (Michael S. Koeris) Date: Thu, 22 Oct 2009 20:56:16 -0400 Subject: [Biopython] Querying NCBI Message-ID: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com> I don't know if it's the servers today but when I ran this query as a regular efetch with 80+ gi numbers it ran for 30+min before i stopped it handle = Entrez.efetch(db='nucleotide',id=AccNo,rettype='gb') anyone else experiencing problems? I also noted that my outbound packet rate dropped to about 4kbp From mhdhussain at gmail.com Thu Oct 22 22:16:03 2009 From: mhdhussain at gmail.com (M. Hussain) Date: Fri, 23 Oct 2009 13:16:03 +1100 Subject: [Biopython] Python Codes for 3rd codon position Message-ID: Hi, I wonder if anybody could help to write a program to read a file in and print out the third codon position of two aligned sequences Thanks From biopython at maubp.freeserve.co.uk Fri Oct 23 05:04:56 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 23 Oct 2009 10:04:56 +0100 Subject: [Biopython] Python Codes for 3rd codon position In-Reply-To: References: Message-ID: <320fb6e00910230204x3f82a950ieeea1fe4a2b14bad@mail.gmail.com> On Fri, Oct 23, 2009 at 3:16 AM, M. 
Hussain wrote:
> Hi,
>
> I wonder if anybody could help to write a program to read a file in and
> print out the third codon position of two aligned sequences
>
> Thanks

Could you explain in a little more detail what you want to do?
Are your two sequences already aligned? Are there gaps in
the alignment? Showing an example alignment and the data
you want would help greatly.

Regards,

Peter

From biopython at maubp.freeserve.co.uk Fri Oct 23 05:08:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 23 Oct 2009 10:08:06 +0100
Subject: [Biopython] Querying NCBI
In-Reply-To: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com>
References: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com>
Message-ID: <320fb6e00910230208l6fdbb7e3i44083a9d7db14ecf@mail.gmail.com>

On Fri, Oct 23, 2009 at 1:56 AM, Michael S. Koeris wrote:
> I don't know if it's the servers today but when I ran this query as a
> regular efetch with 80+ gi numbers it ran for 30+min before i stopped it
>
> handle = Entrez.efetch(db='nucleotide',id=AccNo,rettype='gb')
>
> anyone else experiencing problems?

I was asleep, so no ;)

Are you sending one single efetch call with 80+ GI numbers, or
are you sending 80+ individual efetch calls, or something in
between? That may make a difference.

> I also noted that my outbound packet rate dropped to about 4kbp

That suggests a local network issue.

Did you include your email address, as the NCBI requests?
If they have blocked or throttled your access (if they felt it
was excessive), I would expect them to email you about it.
Peter From chapmanb at 50mail.com Fri Oct 23 08:28:43 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 23 Oct 2009 08:28:43 -0400 Subject: [Biopython] Adaptor trimmer and dimers In-Reply-To: <258333.91161.qm@web52007.mail.re2.yahoo.com> References: <20091021123422.GD72523@sobchak.mgh.harvard.edu> <258333.91161.qm@web52007.mail.re2.yahoo.com> Message-ID: <20091023122843.GJ72523@sobchak.mgh.harvard.edu> Hi Anastasia; > Again, not very clean code as I have been oscillating between > trimming/removing? for some days now. I finally decided that if I > don't have a big proportion of nearly exact (max 2 errors) matches > to the adaptor in my reads, I may just discard them, as trimming a > 33/37 bp adaptor from a 55-bp read does not leave much anyway. > > The revised script > (for removing, but taking into account all your suggestions, so using > the general iterator) is still running for very long, This was written with the idea that the adaptor would be present in most of the sequences. This was the case with the data I was using it on -- expression profiling with short tags -- but does not sound like what you are tackling here. My approach speeds up the trimming by avoiding doing local alignments for many reads since an exact match is often found. Only in cases where the adaptor is missing or has one or more sequencing errors does the expensive local alignment need to be done. If most reads do not have adaptors, then this approach is algorithmically slow. Doing a local alignment for nearly every read is going to take time, independent of the implementation. Profiling this should reveal most of the time is spent in pairwise alignment. 
My suggestion would be to use a heuristic seed-based approach
similar to what short query aligners do:

- Break your adaptor into three smaller seed regions of 12bp
- For each read:
  - Do a fast string find() with the seed regions to the read
  - If two or more of the seed regions match exactly, discard the read

This will run much quicker and should catch a majority of the cases
where you have reads. Regions with lots of errors, or errors spaced
evenly through the adaptor, will be missed. Making the code
tractable is probably worth the few that you'll let through.

Hope this helps,
Brad

From michael.koeris at gmail.com Fri Oct 23 09:11:45 2009
From: michael.koeris at gmail.com (Michael S. Koeris)
Date: Fri, 23 Oct 2009 09:11:45 -0400
Subject: [Biopython] Querying NCBI
In-Reply-To: <320fb6e00910230208l6fdbb7e3i44083a9d7db14ecf@mail.gmail.com>
References: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com> <320fb6e00910230208l6fdbb7e3i44083a9d7db14ecf@mail.gmail.com>
Message-ID: <0D6CA88F-775A-4061-8EA2-7A56FCAE6461@gmail.com>

I am submitting 80 single queries - alternatively I can batch them but
then when I try to parse them out from the records object I get:

>>> records
>

I don't know if this is a different object because it's batched

>>> parser = GenBank.RecordParser()
>>> recordGenBank = parser.parse(records)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/__init__.py", line 172, in parse
    self._scanner.feed(handle, self._consumer)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 380, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 762, in parse_footer
    raise ValueError("Premature end of file in sequence data")
ValueError: Premature end of file in sequence data

--
Michael S.
Koeris michael.koeris at gmail.com On Oct 23, 2009, at 5:08 AM, Peter wrote: > On Fri, Oct 23, 2009 at 1:56 AM, Michael S. Koeris > wrote: >> I don't know if it's the servers today but when I ran this query as a >> regular efetch with 80+ gi numbers it ran for 30+min before i >> stopped it >> >> handle = Entrez.efetch(db='nucleotide',id=AccNo,rettype='gb') >> >> anyone else experiencing problems? > > I was asleep, so no ;) > > Are you sending one single efetch call with 80+ GI numbers, or > are your sending 80+ individual efetch calls, or something in > between? That may make a difference. > >> I also noted that my outbound packet rate dropped to about 4kbp > > That suggests a local network issue. > > Did you include your email address as the NCBI request? > If they have blocked or throttled your access (if they felt it > was excessive), I would expect them to email you about it. > > Peter From biopython at maubp.freeserve.co.uk Fri Oct 23 10:33:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 23 Oct 2009 15:33:35 +0100 Subject: [Biopython] Querying NCBI In-Reply-To: <0D6CA88F-775A-4061-8EA2-7A56FCAE6461@gmail.com> References: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com> <320fb6e00910230208l6fdbb7e3i44083a9d7db14ecf@mail.gmail.com> <0D6CA88F-775A-4061-8EA2-7A56FCAE6461@gmail.com> Message-ID: <320fb6e00910230733m726d573bhc0d052ff143e6c7e@mail.gmail.com> On Fri, Oct 23, 2009 at 2:11 PM, Michael S. Koeris wrote: > > I am submitting 80 single queries - alternatively i can batch them but then > when I try to parse them out from the records object I get: > >>>> records > > That looks like records is a URL handle object - probably you've mixed up your variable names. > I don't know if this is a different object because it's batched > >>>> parser = GenBank.RecordParser() >>>> recordGenBank = parser.parse(records) > Traceback (most recent call last): > ... > line 762, in parse_footer > ? 
raise ValueError("Premature end of file in sequence data")
> ValueError: Premature end of file in sequence data

That suggests either a parser bug, or simply a network error
meaning the file was truncated.

As you are trying to download 80 queries, I would strongly recommend
you download them directly to files, and then parse the files. This
also means you'll only need to do the downloading once as you work
on the rest of the script (whatever you are trying to do with the data).

Peter

From biopython at maubp.freeserve.co.uk Fri Oct 23 10:43:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 23 Oct 2009 15:43:40 +0100
Subject: [Biopython] Querying NCBI
In-Reply-To: <8F8DA214-8AC3-41B0-9D38-2AB7864205DC@gmail.com>
References: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com> <320fb6e00910230208l6fdbb7e3i44083a9d7db14ecf@mail.gmail.com> <0D6CA88F-775A-4061-8EA2-7A56FCAE6461@gmail.com> <320fb6e00910230733m726d573bhc0d052ff143e6c7e@mail.gmail.com> <8F8DA214-8AC3-41B0-9D38-2AB7864205DC@gmail.com>
Message-ID: <320fb6e00910230743k46b301ebl85189c5434a65b82@mail.gmail.com>

On Fri, Oct 23, 2009 at 3:35 PM, Michael S. Koeris wrote:
>
> That's a good idea how do I do that though?

Something like this:

    from Bio import Entrez
    Entrez.email = "michael.koeris at gmail.com"
    gi = "12345678"
    # Save the raw GenBank text to <gi>.gbk so it can be parsed later
    out_handle = open("%s.gbk" % gi, "w")
    network_handle = Entrez.efetch(db="nucleotide", id=gi, rettype="gb")
    for line in network_handle :
        out_handle.write(line)
    out_handle.close()
    network_handle.close()

Stick that in a for loop if you want a separate file for each record.

Is the Biopython tutorial not clear enough on this?
Peter From biopython at maubp.freeserve.co.uk Fri Oct 23 10:56:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 23 Oct 2009 15:56:18 +0100 Subject: [Biopython] Querying NCBI In-Reply-To: References: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com> <320fb6e00910230208l6fdbb7e3i44083a9d7db14ecf@mail.gmail.com> <0D6CA88F-775A-4061-8EA2-7A56FCAE6461@gmail.com> <320fb6e00910230733m726d573bhc0d052ff143e6c7e@mail.gmail.com> <8F8DA214-8AC3-41B0-9D38-2AB7864205DC@gmail.com> <320fb6e00910230743k46b301ebl85189c5434a65b82@mail.gmail.com> Message-ID: <320fb6e00910230756t22d54402p5dba99a2c689e521@mail.gmail.com> On Fri, Oct 23, 2009 at 3:48 PM, Michael S. Koeris wrote: > > Thanks much! > > The tutorial actually just mentions parsing out from direct queries on page > 91. Could be useful to mention this approach to speed up queries. > Which version of the tutorial do you have? I'm looking at page 91 in the current PDF (included with Biopython 1.52) and that is the start of the section on EFetch. At the end of that section (bottom of page 93, start of page 94) is an example checking if a GenBank file exists locally, and if not, downloading it. http://biopython.org/DIST/docs/tutorial/Tutorial.pdf http://biopython.org/DIST/docs/tutorial/Tutorial.html I'm hoping you are looking at an older version, but if not, maybe we can re-order that section or something to make it clearer. Feedback on documentation is very useful. Peter P.S. Please CC the mailing list. 
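
For anyone without the tutorial to hand, here is a rough sketch of the
"check if the file exists locally, and if not, download it" idea. It is
not the tutorial's own code. The helper names (fetch_genbank,
download_if_missing), the placeholder email address and the retmode="text"
argument are assumptions made for illustration; only the Entrez.efetch
call itself comes from this thread.

```python
import os


def fetch_genbank(gi):
    """Return a network handle for one GenBank record (needs Biopython)."""
    # Imported lazily so the caching helper below can be tried without Biopython.
    from Bio import Entrez
    Entrez.email = "you@example.org"  # placeholder: NCBI asks for a real address
    return Entrez.efetch(db="nucleotide", id=gi, rettype="gb", retmode="text")


def download_if_missing(gi, directory=".", fetch=fetch_genbank):
    """Save the record for `gi` as <directory>/<gi>.gbk unless it is already there.

    Returns (path, downloaded); downloaded is False when the local copy was reused.
    """
    path = os.path.join(directory, "%s.gbk" % gi)
    if os.path.isfile(path):
        return path, False
    network_handle = fetch(gi)
    try:
        # Read everything first so a failed download does not leave a partial file.
        data = network_handle.read()
    finally:
        network_handle.close()
    if isinstance(data, bytes):
        data = data.decode()
    with open(path, "w") as out_handle:
        out_handle.write(data)
    return path, True
```

Calling download_if_missing(gi) for each GI number in a list gives one
.gbk file per record, and rerunning the script only contacts NCBI for the
files that are still missing. The saved files can then be parsed offline.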
From biopython at maubp.freeserve.co.uk Fri Oct 23 11:08:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 23 Oct 2009 16:08:29 +0100 Subject: [Biopython] Querying NCBI In-Reply-To: References: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com> <320fb6e00910230208l6fdbb7e3i44083a9d7db14ecf@mail.gmail.com> <0D6CA88F-775A-4061-8EA2-7A56FCAE6461@gmail.com> <320fb6e00910230733m726d573bhc0d052ff143e6c7e@mail.gmail.com> <8F8DA214-8AC3-41B0-9D38-2AB7864205DC@gmail.com> <320fb6e00910230743k46b301ebl85189c5434a65b82@mail.gmail.com> <320fb6e00910230756t22d54402p5dba99a2c689e521@mail.gmail.com> Message-ID: <320fb6e00910230808x3d5cc7cepe68f0f5c233e9132@mail.gmail.com> On Fri, Oct 23, 2009 at 4:06 PM, Michael S. Koeris wrote: > > On Oct 23, 2009, at 10:56 AM, Peter wrote: >> I'm hoping you are looking at an older version, but if not, maybe we >> can re-order that section or something to make it clearer. Feedback >> on documentation is very useful. >> >> Peter > > Yeah i must be looking at an older one - that example in the new version > is pretty clear! > > thanks again OK - great. 
Peter From mikelisanke at gmail.com Fri Oct 23 11:27:55 2009 From: mikelisanke at gmail.com (Mike Lisanke) Date: Fri, 23 Oct 2009 11:27:55 -0400 Subject: [Biopython] Fwd: Windows installer does not find Python 2.63 with multiple pythons In-Reply-To: <8c5c6e580910230826o44861568ld70c135a094b0694@mail.gmail.com> References: <8c5c6e580910191237w54e857bfs1107ba86df05db7c@mail.gmail.com> <320fb6e00910191429l481223d8iae9fbe78702fff70@mail.gmail.com> <8c5c6e580910221119o1d36426dkfdb2d96d681828c9@mail.gmail.com> <320fb6e00910221245w3eff78dcj54aa3eb57990300a@mail.gmail.com> <8c5c6e580910230826o44861568ld70c135a094b0694@mail.gmail.com> Message-ID: <8c5c6e580910230827t1711f8dcub1b8b2d05c552c4f@mail.gmail.com> ---------- Forwarded message ---------- From: Mike Lisanke Date: Fri, Oct 23, 2009 at 11:26 AM Subject: Re: [Biopython] Windows installer does not find Python 2.63 with multiple pythons To: Peter Peter, Yes. I got a clue when I saw Numpy (which worked (has a AMD64 build)). and failed when switched to and earlier python level (2.6 -> 2.5). Numpy only has a Win32 installer, and it reported the same failure with the python-2.5-amd64 registry values. If I can, I will prepare (the libraries?) for a Biopython-2.x-AMD64 package. I haven't installed a C/C++ build environment on my windows machine (yet), but; I'm adept at Linux and Windows C/C++ development. And, I'd like to have a 64bit Biopython . From your email, I now assume Biopython is not strictly python code (which should run on whatever python is installed). I'll dig into the source + documentation, but you probably can give me the short answer. Does this build from a GCC on windows (e.g. Cygwin or GnuWin32), or a Microsoft build environment (e.g. Visual C++)? And, I assume it is not cross-platform prepared from Linux (e.g. fake-root)? Thanks. 
On Thu, Oct 22, 2009 at 3:45 PM, Peter wrote: > On Thu, Oct 22, 2009 at 7:19 PM, Mike Lisanke > wrote: > > Peter, > > > > The problem was python-2.6.3-amd64 for which their is a > > numpy-1.3.0-win-amd64-py2.6 installer. Is there a reason > > NumPy and BioPython have a specific dependency to > > work with the AMD64 build of python? > > Are you running on 64 bit Windows then? XP or Vista? > > It sounds like you are trying to mix 32 and 64 bit versions > of Python. > > If you installed the 64 bit version of Python and Numpy, > then you will need a 64 bit compiled version of Biopython > too - but we don't have one of those yet. We'd need a > developer or a volunteer with a 64bit Windows machine > to do this. > > You should be to install a 32 bit version of Python, > > http://python.org/ftp/python/2.6.3/python-2.6.3.msi > > plus the 32 bit Windows installer for Numpy: > > > http://sourceforge.net/projects/numpy/files/NumPy/1.3.0/numpy-1.3.0-win32-superpack-python2.6.exe/download > > and the 32 bit Windows installer for Biopython: > > http://biopython.org/DIST/biopython-1.52.win32-py2.6.exe > > (i.e. look for win32 in the filenames, not amd64). 
> > Peter > -- Best regards, Mike -- Best regards, Mike From peter at maubp.freeserve.co.uk Fri Oct 23 11:47:55 2009 From: peter at maubp.freeserve.co.uk (Peter) Date: Fri, 23 Oct 2009 16:47:55 +0100 Subject: [Biopython] Fwd: Windows installer does not find Python 2.63 with multiple pythons In-Reply-To: <8c5c6e580910230827t1711f8dcub1b8b2d05c552c4f@mail.gmail.com> References: <8c5c6e580910191237w54e857bfs1107ba86df05db7c@mail.gmail.com> <320fb6e00910191429l481223d8iae9fbe78702fff70@mail.gmail.com> <8c5c6e580910221119o1d36426dkfdb2d96d681828c9@mail.gmail.com> <320fb6e00910221245w3eff78dcj54aa3eb57990300a@mail.gmail.com> <8c5c6e580910230826o44861568ld70c135a094b0694@mail.gmail.com> <8c5c6e580910230827t1711f8dcub1b8b2d05c552c4f@mail.gmail.com> Message-ID: <320fb6e00910230847m163960ceneeea268880c88bf2@mail.gmail.com> On Fri, Oct 23, 2009 at 4:27 PM, Mike Lisanke wrote: > > Peter, > > Yes. I got a clue when I saw Numpy (which worked (has a AMD64 build)). and > failed when switched to and earlier python level (2.6 -> 2.5). Numpy only > has a Win32 installer, and it reported the same failure with the > python-2.5-amd64 registry values. That makes sense. > If I can, I will prepare (the libraries?) for a Biopython-2.x-AMD64 package. > I haven't installed a C/C++ build environment on my windows machine (yet), > but; I'm adept at Linux and Windows C/C++ development. And, I'd like to have > a 64bit Biopython . From your email, I now assume Biopython is not > strictly python code (which should run on whatever python is installed). That is correct - Biopython includes some C code (like NumPy). > I'll dig into the source + documentation, but you probably can give me the > short answer. Does this build from a GCC on windows (e.g. Cygwin or > GnuWin32), or a Microsoft build environment (e.g. Visual C++)? And, I assume > it is not cross-platform prepared from Linux (e.g. fake-root)? Thanks. We compile the Biopython Windows 32 bit Installers on a 32 bit Windows XP machine. 
The compiler depends on which version of Python you want to use.
See the "Installing from source on Windows" section of this document:

http://biopython.org/DIST/docs/install/Installation.html
http://biopython.org/DIST/docs/install/Installation.pdf

You may be the first person to try this on 64 bit Windows. At least,
no-one has responded to my email to the dev list yesterday:
http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006901.html

Peter

From ap12 at sanger.ac.uk Fri Oct 23 11:57:53 2009
From: ap12 at sanger.ac.uk (Anne Pajon)
Date: Fri, 23 Oct 2009 16:57:53 +0100
Subject: [Biopython] fasta-m10 al_start and al_end?
Message-ID: <82A8DBBC-6964-4433-AF9C-EF049160A2DF@sanger.ac.uk>

Dear,

I am using Biopython to parse a fasta alignment file:

    from Bio import AlignIO

    alignments = AlignIO.parse(open("fastaresults/78_Spneumoniae_ATCC700669/all_bases_435_1055_cds.fres"), "fasta-m10", seq_count=2)
    for alignment in alignments:

        record_query = alignment[0]
        record_match = alignment[1]

        print alignment._annotations["sw_score"], alignment._annotations["sw_ident"]
        print record_query.annotations["original_length"]
        # print record_query.annotations["al_start"], record_query.annotations["al_end"]

I would like to print the start/end of each aligned sequence.

I can see in Bio.AlignIO.FastaIO.next() that sq_len is stored in
annotations:

    record.annotations["original_length"] = int(query_annotation["sq_len"])

but I cannot find a way of accessing al_start and al_end.

Thanks in advance for your help.
Kind regards,
Anne.
--
Dr Anne Pajon - Pathogen Genomics
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)

--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
From biopython at maubp.freeserve.co.uk Fri Oct 23 14:40:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 23 Oct 2009 19:40:12 +0100
Subject: [Biopython] fasta-m10 al_start and al_end?
In-Reply-To: <82A8DBBC-6964-4433-AF9C-EF049160A2DF@sanger.ac.uk>
References: <82A8DBBC-6964-4433-AF9C-EF049160A2DF@sanger.ac.uk>
Message-ID: <320fb6e00910231140w46243f9bp9751b6476e3e55b0@mail.gmail.com>

On Fri, Oct 23, 2009 at 4:57 PM, Anne Pajon wrote:
> Dear,
>
> I am using Biopython to parse a fasta alignment file:
>
>    alignments =
> AlignIO.parse(open("fastaresults/78_Spneumoniae_ATCC700669/all_bases_435_1055_cds.fres"),
> "fasta-m10", seq_count=2)
>    for alignment in alignments:
>
>        record_query = alignment[0]
>        record_match = alignment[1]
>
>        print alignment._annotations["sw_score"],
> alignment._annotations["sw_ident"]
>        print record_query.annotations["original_length"]
>        # print record_query.annotations["al_start"],
> record_query.annotations["al_end"]
>
> I would like to print the start/end of each aligned sequences.
>
> I can see in Bio.AlignIO.FastaIO.next() that sq_len is stored in
> annotations:
>        record.annotations["original_length"] =
> int(query_annotation["sq_len"])
> but I cannot find a way of accessing al_start and al_end.
>
> Thanks in advance for your help.
> Kind regards,
> Anne.

Hi Anne,

That's a good question, but the answer may be a little
disappointing.

That information isn't currently recorded in the SeqRecord,
partly because at the time I didn't need it, but mainly I was
undecided about whether the start location should be converted
into python counting or not (zero based versus one based).
What would you prefer? My inclination is python counting.

Peter

P.S. Most of the alignment level annotation is recorded,
but is currently hidden in a "private" property (leading
underscore).
You can access this, but be warned that this will change in future - Improving the alignment object is something I am working on for a future release. From biopython at maubp.freeserve.co.uk Mon Oct 26 06:04:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Oct 2009 10:04:21 +0000 Subject: [Biopython] fasta-m10 al_start and al_end? In-Reply-To: <1E613DD7-3CF3-4F12-8C61-D12440F5AE1D@sanger.ac.uk> References: <82A8DBBC-6964-4433-AF9C-EF049160A2DF@sanger.ac.uk> <320fb6e00910231140w46243f9bp9751b6476e3e55b0@mail.gmail.com> <1E613DD7-3CF3-4F12-8C61-D12440F5AE1D@sanger.ac.uk> Message-ID: <320fb6e00910260304j417fa20ep2b353b6e74ca055d@mail.gmail.com> On Fri, Oct 23, 2009 at 11:00 PM, Anne Pajon wrote: > > Hi Peter, > > Thanks for your fast answer. > > I've already discovered the _annotations and I am prepared to update my > code as soon as a better solution is provided. Good. > Concerning the al_start and al_end, I am looking for a solution very soon, > as I am working on an annotation pipeline prototype in python. What would be > your recommendation? Writing a parser myself, using another tool (but which > one?), or helping storing this information in SeqRecord in biopython as it > is almost there. Thanks to let me know. I would rather not add them directly to the SeqRecord annotations dictionary because that will make doing something meaningful with slicing (the SeqRecord, or in future the Alignment) much harder. I think the best way to handle these is in the Alignment object, but this isn't really supported at the moment. Are you happy to run a development version of Biopython, or at least to update the file Bio/AlignIO/FastaIO.py? I'm thinking in the short term we can record these bits of information as private properties of the SeqRecord, i.e. _al_start and _al_end Would that suit you for now? 
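
On the counting question raised earlier in the thread, a minimal sketch
of what the conversion would involve. It assumes the al_start and al_end
values in the FASTA output are one-based and inclusive, and it only
handles forward-strand coordinates (start <= end). The function name
to_python_slice is made up for illustration and is not part of Biopython.

```python
def to_python_slice(al_start, al_end):
    """Convert one-based inclusive coordinates to a zero-based half-open pair."""
    if al_start < 1 or al_end < al_start:
        # Reverse-strand style coordinates are deliberately not handled here.
        raise ValueError("expected 1 <= al_start <= al_end")
    # Only the start shifts: the inclusive one-based end equals the
    # exclusive zero-based end.
    return al_start - 1, al_end


sequence = "ABCDEFGHIJ"
start, end = to_python_slice(3, 6)
print(sequence[start:end])  # letters 3 to 6 in one-based counting: CDEF
```

With Python counting the aligned length is simply end - start, which is
one reason it fits well with slicing.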
Peter From biopython at maubp.freeserve.co.uk Mon Oct 26 10:17:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Oct 2009 14:17:50 +0000 Subject: [Biopython] fasta-m10 al_start and al_end? In-Reply-To: <320fb6e00910260304j417fa20ep2b353b6e74ca055d@mail.gmail.com> References: <82A8DBBC-6964-4433-AF9C-EF049160A2DF@sanger.ac.uk> <320fb6e00910231140w46243f9bp9751b6476e3e55b0@mail.gmail.com> <1E613DD7-3CF3-4F12-8C61-D12440F5AE1D@sanger.ac.uk> <320fb6e00910260304j417fa20ep2b353b6e74ca055d@mail.gmail.com> Message-ID: <320fb6e00910260717r50974e1epb7ee94d41ff5aa6c@mail.gmail.com> On Mon, Oct 26, 2009 at 10:04 AM, Peter wrote: > On Fri, Oct 23, 2009 at 11:00 PM, Anne Pajon wrote: >> >> Hi Peter, >> >> Thanks for your fast answer. >> >> I've already discovered the _annotations and I am prepared to update my >> code as soon as a better solution is provided. > > Good. > >> Concerning the al_start and al_end, I am looking for a solution very soon, >> as I am working on an annotation pipeline prototype in python. What would be >> your recommendation? Writing a parser myself, using another tool (but which >> one?), or helping storing this information in SeqRecord in biopython as it >> is almost there. Thanks to let me know. > > I would rather not add them directly to the SeqRecord annotations > dictionary because that will make doing something meaningful with > slicing (the SeqRecord, or in future the Alignment) much harder. I > think the best way to handle these is in the Alignment object, but > this isn't really supported at the moment. > > Are you happy to run a development version of Biopython, or at least > to update the file Bio/AlignIO/FastaIO.py? I'm thinking in the short > term we can record these bits of information as private properties of > the SeqRecord, i.e. _al_start and _al_end Make that _al_start and _al_end (to match the field names used in the FASTA output). This change is in the repository now, which you can grab via github. 
See http://www.biopython.org/wiki/SourceCode As with any "private" variables (leading underscore), they are not really intended for public use, but should at least solve your immediate requirement for now. Peter From eric.talevich at gmail.com Mon Oct 26 11:44:23 2009 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 26 Oct 2009 11:44:23 -0400 Subject: [Biopython] fasta-m10 al_start and al_end? Message-ID: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> > > On Fri, Oct 23, 2009 at 4:57 PM, Anne Pajon wrote: > > Dear, > > > > I am using Biopython to parse a fasta alignment file: > > > ... > > > > I would like to print the start/end of each aligned sequences. > > > > I can see in Bio.AlignIO.FastaIO.next() that sq_len is stored in > > annotations: > > ? ? ? ?record.annotations["original_length"] = > > int(query_annotation["sq_len"]) > > but I cannot find a way of accessing at_start and al_end. > > > > Thanks in advance for your help. > > Kind regards, > > Anne. > > Hi Anne, > > That's a good question, but the answer may be a little > disappointing. > > That information isn't currently recorded in the SeqRecord, > partly because at the time I didn't need it, but mainly I was > undecided about if the start location should be converted > into python counting or not (zero based versus one based). > What would you prefer? My inclination is python counting. > > Peter > > P.S. Most of the alignment level annotation is recorded, > but is currently hidden in a "private" property (leading > underscore). You can access this, but be warned that this > will change in future - Improving the alignment object is > something I am working on for a future release. > > Hi Peter, Here's +1 for Python counting. That would match SeqFeature and the ProteinDomain class in Bio.Tree.PhyloXML. While we're on this topic -- I have some unpublished code for rendering an alignment object in HTML, with plans for colorization, conservation profiles, etc. 
I rolled my own alignment class since the one in Bio.Align.Generic didn't have the attributes (start, end, selected columns) for a particular file format I was parsing. It's not urgent, but at some point could you publish your plans for the Alignment classes so I (and probably others) can stay/become compatible? Thanks, Eric From biopython at maubp.freeserve.co.uk Mon Oct 26 12:07:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 26 Oct 2009 16:07:04 +0000 Subject: [Biopython] fasta-m10 al_start and al_end? In-Reply-To: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com> Message-ID: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com> On Mon, Oct 26, 2009 at 3:44 PM, Eric Talevich wrote: > Hi Peter, > > Here's +1 for Python counting. That would match SeqFeature and the > ProteinDomain class in Bio.Tree.PhyloXML. > > While we're on this topic -- I have some unpublished code for rendering an > alignment object in HTML, with plans for colorization, conservation > profiles, etc. I rolled my own alignment class since the one in > Bio.Align.Generic didn't have the attributes (start, end, selected columns) > for a particular file format I was parsing. It's not urgent, but at some > point could you publish your plans for the Alignment classes so I (and > probably others) can stay/become compatible? My rough work in progress in on github - at the moment I'm still trying things out, and don't assume anything is set in stone. If you want to have a play with this code, feedback is very welcome - probably best on the dev list rather than here. 
See: http://github.com/peterjc/biopython/tree/seqrecords

(a lot of the alignment things I want to support, like slicing and
adding, are very closely linked to doing the same operations to
SeqRecords)

Peter

From yvan.strahm at bccs.uib.no Tue Oct 27 05:41:43 2009
From: yvan.strahm at bccs.uib.no (Yvan Strahm)
Date: Tue, 27 Oct 2009 10:41:43 +0100
Subject: [Biopython] how to validate fasta format
Message-ID: <4AE6C057.9050604@bccs.uib.no>

Hello All,

Is it possible to validate a sequence format, for example while the
sequence is parsed by SeqIO.parse and using IUPAC.py? Or should I try
to search for illegal characters in .seq?

Cheers,
yvan

From biopython at maubp.freeserve.co.uk Tue Oct 27 06:08:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Oct 2009 10:08:41 +0000
Subject: [Biopython] how to validate fasta format
In-Reply-To: <4AE6C057.9050604@bccs.uib.no>
References: <4AE6C057.9050604@bccs.uib.no>
Message-ID: <320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>

On Tue, Oct 27, 2009 at 9:41 AM, Yvan Strahm wrote:
> Hello All,
>
> Is it possible to validate a sequence format, for example while the sequence
> is parsed by SeqIO.parse and using IUPAC.py? Or should I try to search for
> illegal characters in .seq?
>
> Cheers,
> yvan

It depends on what you mean by validate - if you want to check for
specific letters against a whitelist, then currently you would have to
look at the letters in the sequence. I would use sets for this. e.g.

    wanted = set("ACGT")
    for record in SeqIO.parse(handle, "fasta") :
        if not wanted.issuperset(record.seq) :
            print "Bad: %s" % record.id

Making the Seq object validate against explicit alphabets (where
the allowed letters are given) is something I have wondered about
for the future.
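
As a rough sketch of the same whitelist idea in plain Python, which also
reports which letters were unexpected. The function names
(invalid_letters, report_bad_records), the case-insensitive comparison
and the sample data are assumptions made for illustration, not anything
in Biopython.

```python
def invalid_letters(sequence, allowed="ACGT"):
    """Return the sorted letters in `sequence` that are not in `allowed`."""
    # Assumption: lower case and upper case letters are treated as the same.
    return sorted(set(str(sequence).upper()) - set(allowed.upper()))


def report_bad_records(records, allowed="ACGT"):
    """Map each offending identifier to its unexpected letters.

    `records` is any iterable of (identifier, sequence) pairs.
    """
    bad = {}
    for identifier, sequence in records:
        letters = invalid_letters(sequence, allowed)
        if letters:
            bad[identifier] = letters
    return bad


examples = [("seq1", "ACGTACGT"), ("seq2", "ACGTNNRY"), ("seq3", "acgt-")]
print(report_bad_records(examples))
```

With Biopython installed, the pairs could come from something like
((r.id, r.seq) for r in SeqIO.parse(handle, "fasta")), since str() on a
Seq gives the plain letters.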
Peter

From yvan.strahm at bccs.uib.no Tue Oct 27 08:03:11 2009
From: yvan.strahm at bccs.uib.no (Yvan Strahm)
Date: Tue, 27 Oct 2009 13:03:11 +0100
Subject: [Biopython] how to validate fasta format
In-Reply-To: <320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
References: <4AE6C057.9050604@bccs.uib.no> <320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
Message-ID: <4AE6E17F.2030407@bccs.uib.no>

Peter wrote:
> On Tue, Oct 27, 2009 at 9:41 AM, Yvan Strahm wrote:
>> Hello All,
>>
>> Is it possible to validate a sequence format, for example while the sequence
>> is parsed by SeqIO.parse and using IUPAC.py? Or should I try to search for
>> illegal characters in .seq?
>>
>> Cheers,
>> yvan
>
> It depends on what you mean by validate - if you want to check for
> specific letters against a whitelist, then currently you would have to
> look at the letters in the sequence. I would use sets for this. e.g.
>
>     wanted = set("ACGT")
>     for record in SeqIO.parse(handle, "fasta") :
>         if not wanted.issuperset(record.seq) :
>             print "Bad: %s" % record.id
>
> Making the Seq object validate against explicit alphabets (where
> the allowed letters are given) is something I have wondered about
> for the future.
>
> Peter

Thanks for the quick reply.
Yes, by validating I mainly meant checking for the correct alphabet in
the Seq object but also the correct format of the header. So I guess I
have to trust the user....
;-)

thanks again
yvan

From biopython at maubp.freeserve.co.uk  Tue Oct 27 08:36:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Oct 2009 12:36:52 +0000
Subject: [Biopython] how to validate fasta format
In-Reply-To: <4AE6E17F.2030407@bccs.uib.no>
References: <4AE6C057.9050604@bccs.uib.no>
	<320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
	<4AE6E17F.2030407@bccs.uib.no>
Message-ID: <320fb6e00910270536v4a2a4c22xc734085dfc5c6091@mail.gmail.com>

On Tue, Oct 27, 2009 at 12:03 PM, Yvan Strahm wrote:
> Yes by validating I mainly meant check for the correct alphabet in the Seq
> object but also the correct header's format. So I guess, I have to trust the
> user.... ;-)

The FASTA header is basically free format - almost anything is valid,
although some tools object to things like pipes and underscores.
You will need to test the data in terms of your own criteria.

Peter

From biopython at maubp.freeserve.co.uk  Tue Oct 27 09:20:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Oct 2009 13:20:58 +0000
Subject: [Biopython] how to validate fasta format
In-Reply-To: <1256649260.5941.7.camel@Neo>
References: <4AE6C057.9050604@bccs.uib.no>
	<320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
	<1256649260.5941.7.camel@Neo>
Message-ID: <320fb6e00910270620ofd3cca2pc59fd30b86dab7f7@mail.gmail.com>

On Tue, Oct 27, 2009 at 1:14 PM, Steve Darnell wrote:
>
> Greetings,
>
> This particular thread addresses a topic we've revisited lately,
> ambiguity codes (particularly in the amino acid alphabet).  I would like
> to query the group for their opinion of the remaining 6 characters after
> you remove the 20 standard amino acids.  Here's our list:
>
> B - Asn or Asp
> J - Ile or Leu
> O - ???
> U - seleno-Cys
> X - Any
> Z - Gln or Glu

Your list is incomplete.
According to the Biopython ExtendedIUPACProtein alphabet docstring,
which is based on the IUPAC standards or recommendations:

B = "Asx"; Aspartic acid (D) or Asparagine (N)
X = "Xxx"; Unknown or 'other' amino acid
Z = "Glx"; Glutamic acid (E) or Glutamine (Q)
J = "Xle"; Leucine (L) or Isoleucine (I), used in mass-spec (NMR)
U = "Sec"; Selenocysteine
O = "Pyl"; Pyrrolysine

In practice, X is also often used to mean any amino acid or a
stop codon too (although this really would benefit from a more
explicit character in my personal opinion).

Peter

From darnells at dnastar.com  Tue Oct 27 09:14:20 2009
From: darnells at dnastar.com (Steve Darnell)
Date: Tue, 27 Oct 2009 08:14:20 -0500
Subject: [Biopython] how to validate fasta format
In-Reply-To: <320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
References: <4AE6C057.9050604@bccs.uib.no>
	<320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
Message-ID: <1256649260.5941.7.camel@Neo>

Greetings,

This particular thread addresses a topic we've revisited lately,
ambiguity codes (particularly in the amino acid alphabet).  I would like
to query the group for their opinion of the remaining 6 characters after
you remove the 20 standard amino acids.  Here's our list:

B - Asn or Asp
J - Ile or Leu
O - ???
U - seleno-Cys
X - Any
Z - Gln or Glu

~Steve

On Tue, 2009-10-27 at 10:08 +0000, Peter wrote:
> On Tue, Oct 27, 2009 at 9:41 AM, Yvan Strahm wrote:
> > Hello All,
> >
> > Is it possible to validate a sequence format, for example while the sequence
> > is parsed by SeqIO.parse and using IUPAC.py? Or should I try to search for
> > illegal characters in .seq?
> >
> > Cheers,
> > yvan
>
> It depends on what you mean by validate - if you want to check for
> specific letters against a whitelist, then currently you would have to
> look at the letters in the sequence. I would use sets for this. e.g.
>
> wanted = set("ACGT")
> for record in SeqIO.parse(handle, "fasta") :
>     if not wanted.issuperset(record.seq) :
>         print "Bad: %s" % record.id
>
> Making the Seq object validate against explicit alphabets (where
> the allowed letters are given) is something I have wondered about
> for the future.
>
> Peter
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From dalloliogm at gmail.com  Tue Oct 27 09:41:36 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 27 Oct 2009 14:41:36 +0100
Subject: [Biopython] how to validate fasta format
In-Reply-To: <320fb6e00910270536v4a2a4c22xc734085dfc5c6091@mail.gmail.com>
References: <4AE6C057.9050604@bccs.uib.no>
	<320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
	<4AE6E17F.2030407@bccs.uib.no>
	<320fb6e00910270536v4a2a4c22xc734085dfc5c6091@mail.gmail.com>
Message-ID: <5aa3b3570910270641r47615ad7wf51e42cca7d846a3@mail.gmail.com>

On Tue, Oct 27, 2009 at 1:36 PM, Peter wrote:

> On Tue, Oct 27, 2009 at 12:03 PM, Yvan Strahm wrote:
> > Yes by validating I mainly meant check for the correct alphabet in the Seq
> > object but also the correct header's format. So I guess, I have to trust the
> > user.... ;-)
>
> The FASTA header is basically free format - almost anything is valid,
> although some tools object to things like pipes and underscores.
> You will need to test the data in terms of your own criteria.
>
>
In principle it is as you say, but if you want to implement a validator, I
would take into account that:
- many programs fail if the first character after the '>' is a space
- the first word after the '>' is usually considered as being the name of
  the sequence; further descriptions must be separated by spaces or '|'
- the sequence is continuous and it should not be interrupted by blank lines

> Peter
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Tue Oct 27 10:07:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Oct 2009 14:07:05 +0000
Subject: [Biopython] how to validate fasta format
In-Reply-To: <5aa3b3570910270641r47615ad7wf51e42cca7d846a3@mail.gmail.com>
References: <4AE6C057.9050604@bccs.uib.no>
	<320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
	<4AE6E17F.2030407@bccs.uib.no>
	<320fb6e00910270536v4a2a4c22xc734085dfc5c6091@mail.gmail.com>
	<5aa3b3570910270641r47615ad7wf51e42cca7d846a3@mail.gmail.com>
Message-ID: <320fb6e00910270707w7a9ab424m43564e2de1acbe46@mail.gmail.com>

On Tue, Oct 27, 2009 at 1:41 PM, Giovanni Marco Dall'Olio wrote:
>
> In principle it is as you say, but if you want to implement a validator, I
> would take into account that:
> - many programs fail if the first character after the '>' is a space

Good point. I'd interpret that as a record without a name/identifier,
but with a description. We should double check Biopython does handle
this gracefully.
> - the first word after the '>' is usually considered as being the name of
> the sequence; further descriptions must be separated by spaces or '|'

I'm not sure what you mean about the pipe (|) in descriptions - this
is basically a case of anything is allowed, but some tools are fussy.

> - the sequence is continuous and it should not be interrupted by blank lines

I think according to the original FASTA tools, blank lines are fine.
But again, some tools are fussy. Here Biopython should tolerate this
on input, and not do it on output.

i.e. FASTA "validation" always depends on what you are doing it for.
Another example: when preparing data for TMHMM it is sensible to impose
a minimum length on the sequence - but a short or even zero length
sequence is valid in FASTA files in general.

Peter

From bassbabyface at yahoo.com  Tue Oct 27 11:12:13 2009
From: bassbabyface at yahoo.com (Ben O'Loghlin)
Date: Wed, 28 Oct 2009 02:12:13 +1100
Subject: [Biopython] Entrez.read return value is typed as a string??
Message-ID: <01aa01ca5717$dec90220$9c5b0660$@com>

Hi all,

I'm new to BioPython, having spent < 4 hours playing with it, and I'm mighty
impressed with what it can do for me once I get it working. Unfortunately
I've spent about 3.5 of those hours inanely grappling with Entrez.read, so I
turn to more experienced BioPythoneers for assistance.

I'm trying to use Entrez to extract and manipulate records from PubMed, and
I'm stumped. I was expecting the return value of Entrez.read to be a
structured object, and instead it seems to return a string which would
require further parsing to do anything useful with.

I'm not sure if this is the expected output and I have misunderstood, or if
PubMed is just returning results in unexpected formats which break the
parser in Entrez.read, or if Bio just doesn't work after midnight (2:06 am
Australian EST).

Is anyone able/willing to assist? The goal here is to have some way of
extracting individual fields from the returned records, e.g.
print out the Abstract for PMID 17206916.

I'm using BioPython 1.5.2 and Python 2.6.4 on Vista.

Script and output below...

Many thanks in advance,

Ben

#########################################################################
# Biotest.py
#########################################################################
from Bio import Entrez

PMID = "17206916"
database = "pubmed"

# Fetch the full article details
handle1 = Entrez.efetch(db=database, id=PMID)
full = handle1.read()
print "\nProperties of full record object: "
print type(full)
print
print full[0:180]

#Fetch and print the summary details
handle2 = Entrez.esummary(db=database, id=PMID)
summary = handle2.read()
print "\nProperties of summary record object: "
print type(summary)
print
print summary[0:300]
#########################################################################

#########################################################################
# Output from Biotest.py
#########################################################################
C:\Data\Personal\Dev\Python\PubMed>c:\Python26\python.exe biotest.py

Properties of full record object:

PmFetch response
Pubmed-entry ::= {
  pmid 17206916,
  medent {
    em std {
      year 2007,
      month 1,
      day 8
    },
    ci

Properties of summary record object:






        17206916
        2006
        
References: <01aa01ca5717$dec90220$9c5b0660$@com>
Message-ID: <320fb6e00910270842p4ae53933k91f5ce16a7b4d7e2@mail.gmail.com>

On Tue, Oct 27, 2009 at 3:12 PM, Ben O'Loghlin  wrote:
> Hi all,
>
> I'm new to BioPython, having spent < 4 hours playing with it, and I'm mighty
> impressed with what it can do for me once I get it working. Unfortunately
> I've spent about 3.5 of those hours inanely grappling with Entrez.read, so I
> turn to more experienced BioPythoneers for assistance.

Oh dear - were you working through the Entrez chapter in the Tutorial?
If not, what were you looking at?

> I'm trying to use Entrez to extract and manipulate records from PubMed, and
> I'm stumped. I was expecting the return value of Entrez.read to be a
> structured object, and instead it seems to return a string which would
> require further parsing to do anything useful with.

That doesn't sound right. The Bio.Entrez.read() should take a handle,
in XML format, and return a nested collection of python objects.

> I'm not sure if this is the expected output and I have misunderstood, or if
> PubMed is just returning results in unexpected formats which break the
> parser in Entrez.read, or if Bio just doesn't work after midnight (2:06 am
> Australian EST).
>
> Is anyone able/willing to assist? The goal here is to have some way of
> extracting individual fields from the returned records, e.g. print out the
> Abstract for PMID 17206916.

First of all, handles give access to data via the read() and other methods,
like readline()

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="pubmed", id="17206916")
>>> print handle.readline()
PmFetch response

So you see by default, the NCBI is returning HTML. We can ask for XML:

>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
>>> print handle.readline()


You could parse this with Bio.Entrez.read() if you wanted to:

>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
>>> record = Entrez.read(handle)
>>> print record
[{u'MedlineCitation': ... ]

Or, rather than XML designed for a computer to parse, you could ask for
the plain text MEDLINE format,

>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="text", rettype="medline")
>>> print handle.read()
PMID- 17206916
OWN - NLM
STAT- MEDLINE
DA  - 20070108
DCOM- 20070130
...
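
If the plain text MEDLINE output is all that is needed, it is simple enough
to split into fields by hand. The following is only a sketch under stated
assumptions: parse_medline is a made-up helper, it is written for Python 3,
and it assumes the usual MEDLINE layout of a tag padded to four characters
followed by "- ", with continuation lines indented by six spaces. Biopython
also ships a Bio.Medline module for this format, which would normally be the
better choice.

```python
def parse_medline(text):
    """Parse MEDLINE tagged text into a dict mapping tag -> list of values."""
    record = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[:4].strip() and line[4:6] == "- ":
            # New field: columns 0-3 hold the tag, the value starts at column 6
            tag = line[:4].strip()
            record.setdefault(tag, []).append(line[6:].strip())
        elif tag is not None and line.startswith("      "):
            # Continuation line (assumed six-space indent): extend last value
            record[tag][-1] += " " + line.strip()
    return record


# The first few lines of the MEDLINE output shown above
sample = (
    "PMID- 17206916\n"
    "OWN - NLM\n"
    "STAT- MEDLINE\n"
    "DA  - 20070108\n"
    "DCOM- 20070130\n"
)
record = parse_medline(sample)
print(record["PMID"][0])  # prints 17206916
```

Each tag maps to a list because some tags can occur more than once in a
record.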

Does that help?

Peter

From biopython at maubp.freeserve.co.uk  Tue Oct 27 11:51:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Oct 2009 15:51:12 +0000
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <320fb6e00910270842p4ae53933k91f5ce16a7b4d7e2@mail.gmail.com>
References: <01aa01ca5717$dec90220$9c5b0660$@com>
	<320fb6e00910270842p4ae53933k91f5ce16a7b4d7e2@mail.gmail.com>
Message-ID: <320fb6e00910270851n3db7984dv861b9d225ead878e@mail.gmail.com>

On Tue, Oct 27, 2009 at 3:42 PM, Peter  wrote:
> On Tue, Oct 27, 2009 at 3:12 PM, Ben O'Loghlin  wrote:
>> I'm trying to use Entrez to extract and manipulate records from PubMed, and
>> I'm stumped. I was expecting the return value of Entrez.read to be a
>> structured object, and instead it seems to return a string which would
>> require further parsing to do anything useful with.
>
> That doesn't sound right. The Bio.Entrez.read() should take a handle,
> in XML format, and return a nested collection of python objects.

I think I've worked out what you may have been doing wrong - trying
to feed HTML into Bio.Entrez.read(). I would have expected a helpful
error message, but it returns an empty string. I've filed Bug 2938.

http://bugzilla.open-bio.org/show_bug.cgi?id=2938

Michiel - could you take a look at this please?

Thanks,

Peter

From danielchubb at gmail.com  Tue Oct 27 14:55:39 2009
From: danielchubb at gmail.com (Daniel Chubb)
Date: Tue, 27 Oct 2009 18:55:39 +0000
Subject: [Biopython] Bio.PDB.ResidueDepth help
Message-ID: 

Hi, I'm trying to calculate residue depth using this module and I'd really
appreciate it if someone could help me make some sense out of the output.


Here is some code:

>>> from Bio.PDB import *
>>> parser=PDBParser()
>>> structure=parser.get_structure("scr",'/.../d1t3ta3.pdb')
>>> model=structure[0]
>>> rd=ResidueDepth(model, '/.../d1t3ta3.pdb')
>>> for i in rd:
...     print i

I then get this output:

...

 ...
(, (941.50269996685836,
938.52026632473292))
(, (943.30248293205898,
935.73449250166789))
(, (956.22610923774971,
929.58401500468858))
(, (946.39762766474189,
929.1969204628009))
(, (980.35736194344759,
952.50174666095472))
(, (943.33749438200709,
941.41471544399076))
(, (1005.0456481617543,
1021.4687548192563))
(, (998.26228815878574,
1014.7065537464257))
(, (954.34720196525564,
933.69587405187428))
(, (865.68049599904009,
859.80537822913527))
(, (888.74360153732255,
871.36588689619543))
(, (887.82610875300952,
870.97697239966283))
(, (882.65307575266002,
870.71143243803749))
(, (1038.6138896432872,
986.73921610486354))
(, (1036.0337702261368,
984.51578671438835))

....

As I understand it, the two values in each tuple, e.g. (941.50269996685836,
938.52026632473292) for residue 1, are the residue depth and the Ca depth.
But those values don't make sense to me. Are they not supposed to be in
Angstroms? They range in my output from about 865 to 1200, whereas I would
expect some to be 0 (or around that).

Could anyone point out what has gone wrong/what I'm doing wrong?


Thanks a lot for the help

Daniel Chubb




From laszlo at vpac.org  Tue Oct 27 18:23:22 2009
From: laszlo at vpac.org (Laszlo Kun)
Date: Wed, 28 Oct 2009 09:23:22 +1100 (EST)
Subject: [Biopython] KOBAS - KEGG Orthology Based Annotation System XML file
 empty problem
In-Reply-To: <973378923.5269591256682126088.JavaMail.root@mail.vpac.org>
Message-ID: <1915267259.5269661256682202048.JavaMail.root@mail.vpac.org>

Dear All,

I am trying to install the KOBAS software for a user. The installation
appears to be done, but after about 3 hours the run fell over with
the error message:

======================
[rossh at tango Ov_KOBAS]$ cat NY.e789941
Traceback (most recent call last):
File "/usr/local/python/2.6.2-gcc/bin/blast2ko.py", line 90, in <module>
annots = dict([ (i.query, i) for i in annotator.annotate() ])
File
"/usr/local/python/2.6.2-gcc/lib/python2.6/site-packages/kobas/annot.py",
line 151, in annotate
for record in self.reader:
File
"/usr/local/python/2.6.2-gcc/lib/python2.6/site-packages/Bio/Blast/NCBIXML.py",
line 605, in parse
raise ValueError("Your XML file was empty")
ValueError: Your XML file was empty

=============================


The script appears to have completed the blast section against the KOBAS
database, but has fallen over on the annotation pass.

I haven't come across this error before.

Thanks again for your help.

cheers,
Laszlo 

From biopython at maubp.freeserve.co.uk  Wed Oct 28 06:48:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Oct 2009 10:48:24 +0000
Subject: [Biopython] KOBAS - KEGG Orthology Based Annotation System XML
	file empty problem
In-Reply-To: <1915267259.5269661256682202048.JavaMail.root@mail.vpac.org>
References: <973378923.5269591256682126088.JavaMail.root@mail.vpac.org>
	<1915267259.5269661256682202048.JavaMail.root@mail.vpac.org>
Message-ID: <320fb6e00910280348u5bdb7860te59db883ae995362@mail.gmail.com>

On Tue, Oct 27, 2009 at 10:23 PM, Laszlo Kun  wrote:
> Dear All,
>
> I am trying to install the KOBAS software for a user. The installation
> appears to be done, but after about 3 hours the run fell over with
> the error message:
>
> ======================
> [rossh at tango Ov_KOBAS]$ cat NY.e789941
> Traceback (most recent call last):
> File "/usr/local/python/2.6.2-gcc/bin/blast2ko.py", line 90, in <module>
> annots = dict([ (i.query, i) for i in annotator.annotate() ])
> File
> "/usr/local/python/2.6.2-gcc/lib/python2.6/site-packages/kobas/annot.py",
> line 151, in annotate
> for record in self.reader:
> File
> "/usr/local/python/2.6.2-gcc/lib/python2.6/site-packages/Bio/Blast/NCBIXML.py",
> line 605, in parse
> raise ValueError("Your XML file was empty")
> ValueError: Your XML file was empty
>
> =============================
>
> The script appears to have completed the blast section
> against the KOBAS database, but has fallen over on
> the annotation pass.
>
> I haven't come across this error before.
>
> Thanks again for your help.
>
> cheers,
> Laszlo

Hi Laszlo,

Have you previously ever had KOBAS working? I would
guess this is your first attempt...

The error message from Biopython seems quite clear,
KOBAS is trying to parse an empty XML file. This may
have been due to a problem calling BLAST - which
they probably do via Biopython. Have you checked
your installation of standalone NCBI blast (i.e. the
command line tool blastall) is working? I don't know
what NCBI databases are needed, probably nr.

Unfortunately, there is another issue here too...

KOBAS is described here:

Mao et al. (2005) Bioinformatics 21(19) pp. 3787-93
http://dx.doi.org/10.1093/bioinformatics/bti430

Wu et al. (2006) Nucleic Acids Research 34
http://dx.doi.org/10.1093/nar/gkl167

The link given in the original paper seems to be dead now:
http://genome.cbi.pku.edu.cn/download.html

Their second paper gives http://kobas.cbi.pku.edu.cn/
which includes links to download their source code.
I had a quick look at this (KOBAS 1.1.0), and it has
not been updated recently. As you are using Python
2.6, you'll see some harmless deprecation warnings
about the sets module (a trivial issue to fix).

What version of Biopython do you have installed?

Their website says they need Biopython 1.24 or later,
but this isn't true. Their file fasta.py uses Biopython's
Bio.SeqIO module which was added in Biopython 1.43.
Their file annot.py uses Bio.Blast.NCBIXML.parse
function, which was also added in Biopython 1.43.

Also, and perhaps most importantly (as mentioned in
the first paper) they are using Martel for parsing KEGG.
We have dropped Martel, and Biopython 1.50 was
the last release to include it. I'm not sure at what
point in the pipeline they use KEGG, but I guess
this will cause trouble after the BLAST step. We
*could* provide the final version of Martel as a
separate standalone package - I'd need to find
half a day free. Note I would strongly recommend
using mxTextTools version 2 (not version 3), as
something about the unicode related API changes
is known to cause subtle problems with
Martel as used in older versions of Biopython.

I think you (or Biopython) need to get in touch with
the KOBAS authors. They can at least tell us what
version of Biopython they used to develop KOBAS
1.1.0. Also, they may have already updated their
code for the webservice, and just not updated the
download files.

Regards,

Peter

From pengyu.ut at gmail.com  Wed Oct 28 18:20:32 2009
From: pengyu.ut at gmail.com (Peng Yu)
Date: Wed, 28 Oct 2009 17:20:32 -0500
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com>
References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com>
	<5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com>
Message-ID: <366c6f340910281520w227bbe88o914f889f693588f0@mail.gmail.com>

On Fri, Oct 16, 2009 at 3:29 AM, Giovanni Marco Dall'Olio
 wrote:
> On Thu, Oct 15, 2009 at 11:17 PM, Peng Yu  wrote:
>> I have a set of genes. I want to get the 5kb sequence that is upstream
>> of the TSS's of each gene.
>
> You can do that with biomart:
> - http://www.ensembl.org/biomart/martview/a90f00892a48e04d438f762f551bf48a/a90f00892a48e04d438f762f551bf48a
>
> select Ensembl56 as database, Mus Musculus as species, go to Filters
> and fill the 'Id list limit' form to add the required geneIds, then go
> to Attributes, select Sequences and then check 'Upstream Flank -
> 5000'.

If I want both 5kb upstream of TSS and .5kb downstream of TSS, is
there a way to do so?


> As for doing that in python, I am not sure there are python interfaces
> to BioMart. Galaxy (http://main.g2.bx.psu.edu/) is written in python,
> so they must have written a library for that somewhere, but I don't
> know their code.
>
> If you use R (remember that you can mix python and R with rpy2) there
> is a nice module in bioconductor called BioMart.
>
>
>> I have the following specific questions. Could somebody help me? Thank you!
>>
>> Which database I can access to get mouse genome?
>> Give a gene name what function I should call to get the gene's location?
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>
>
>
> --
> Giovanni Dall'Olio, phd student
> Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)
>
> My blog on bioinformatics: http://bioinfoblog.it
>


From bassbabyface at yahoo.com  Wed Oct 28 23:19:09 2009
From: bassbabyface at yahoo.com (Ben O'Loghlin)
Date: Thu, 29 Oct 2009 14:19:09 +1100
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <320fb6e00910270842p4ae53933k91f5ce16a7b4d7e2@mail.gmail.com>
References: <01aa01ca5717$dec90220$9c5b0660$@com>
	<320fb6e00910270842p4ae53933k91f5ce16a7b4d7e2@mail.gmail.com>
Message-ID: <001901ca5846$96f69d60$c4e3d820$@com>

Hi Peter,

Many thanks for your post, you cleared up a world of confusion for me.

A few answers/comments:

>> Oh dear - were you working through the Entrez chapter in the Tutorial?
>> If not, what were you looking at?

No, I didn't find the tutorial until you mentioned it. I came across
BioPython by Googling "python pubmed", the most relevant hit on the first
screenful seemed to be the first one, at
http://baoilleach.blogspot.com/2008/02/searching-pubmed-with-python.html.

This brief blog describes access via the Bio.EUtils package which seems to
have disappeared, and it took me about 45 mins to realise that it was no
longer in the distro and to track down Bio.Entrez.

Then Googling BioPython Entrez, the first hit took me to the documentation
(I missed spotting the tutorial link!) and all subsequent attempts were
based on reading this doco and the source code, and scratching my head and
trying random things.

>So you see by default, the NCBI is returning HTML. We can ask for XML:
>
>>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
>>>> print handle.readline()
>

This all makes sense now, I wasn't aware of the different 'retmode' options.
The Bio.Entrez.efetch() documentation points me to
http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html, which
doesn't mention the 'retmode' or 'rettype' parameters. In fact I couldn't
find any explicit reference to it in the Tutorial either, just the use of
'rettype=text' in one of the example code snippets.

I subsequently tracked down this page
http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html which
does at least indicate the different rettypes and retmodes available.
 
>You could parse this with Bio.Entrez.read() if you wanted to:
>
>>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
>>>> record = Entrez.read(handle)
>>>> print record
>[{u'MedlineCitation': ... ]

I'm interested in using this format, however I don't understand how to
read/write fields and subtrees of the object type
'Bio.Entrez.Parser.ListElement' returned by Entrez.read(handle) with retmode
XML. 

I'm finding it hard to track down references to this [{u'x':['y']}] object
format in Python, possibly due to the fact that I can't get Google to
search for strings like [{u'. I am however appreciative that there appears
to be a u'SpaceFlightMission' tag in Pubmed's default rettype. :)

I'm also a little confused about why handle.read() returns a string in XML
format whereas Entrez.read(handle) returns the
Bio.Entrez.Parser.ListElement. In fact I only knew about this latter method
from your email, since the example in the Bio.Entrez doco only uses the
handle.read() syntax, and doesn't mention that there's any distinction, nor
which might be more appropriate for which task. 

> Does that help?

Immensely.

If you (or any other Bio.Wizards) have the time and the inclination to help
me further, I'd be very grateful for any thoughts relevant to my ponderings
above.

Thanks again,

Ben



From mjldehoon at yahoo.com  Wed Oct 28 23:50:07 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 28 Oct 2009 20:50:07 -0700 (PDT)
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <001901ca5846$96f69d60$c4e3d820$@com>
Message-ID: <109726.94290.qm@web62408.mail.re1.yahoo.com>



--- On Wed, 10/28/09, Ben O'Loghlin  wrote:
> >>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
> >>>> record = Entrez.read(handle)
> >>>> print record
> >[{u'MedlineCitation': ... ]
> 
> I'm interested in using this format, however I don't
> understand how to
> read/write fields and subtrees of the object type
> 'Bio.Entrez.Parser.ListElement' returned by
> Entrez.read(handle) with retmode XML. 
> 
> I'm finding it hard to track down references to this
> [{u'x':['y']}] object format in Python ...

Look at the outermost two brackets [].
You can treat this object as a Python list.

So if record = [{u'x':['y']}],
then record[0] = {u'x':['y']}

Now look at the two outermost braces {}.
You can treat record[0] as a dictionary.
So record[0]['x'] will return ['y'].
Which can then be treated as a Python list.
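
The same recipe can be tried on a plain Python structure without touching
the network. In this Python 3 sketch the nested object is built by hand; the
key names and the title and abstract text are placeholders chosen for
illustration and are not a guaranteed description of what Entrez.read()
returns. The walk helper is likewise made up; it lists every path down to a
leaf value, which is handy for exploring an unfamiliar record. It relies on
isinstance() checks against list and dict, on the assumption that the parsed
objects behave like those built-in types.

```python
# Hand-built stand-in for a parsed record (key names are assumptions)
record = [
    {
        "MedlineCitation": {
            "PMID": "17206916",
            "Article": {
                "ArticleTitle": "An example title",
                "Abstract": {
                    "AbstractText": ["First paragraph.", "Second paragraph."]
                },
            },
        }
    }
]


def walk(obj, path="record"):
    """Yield (path, value) for every leaf in nested lists and dictionaries."""
    if isinstance(obj, dict):
        for key in sorted(obj):
            yield from walk(obj[key], "%s[%r]" % (path, key))
    elif isinstance(obj, list):
        for index, item in enumerate(obj):
            yield from walk(item, "%s[%d]" % (path, index))
    else:
        yield path, obj


# Outer brackets: a list, so index it. Inner braces: a dictionary, so use keys.
citation = record[0]["MedlineCitation"]
abstract = " ".join(citation["Article"]["Abstract"]["AbstractText"])
print(abstract)

# Print every path that leads to a value
for path, value in walk(record):
    print(path, "=", value)
```

In Python 3 the u'' prefix seen in the output above is no longer shown,
since all strings are unicode.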

--Michiel.


      

From dejmail at gmail.com  Thu Oct 29 00:53:32 2009
From: dejmail at gmail.com (Liam Thompson)
Date: Thu, 29 Oct 2009 06:53:32 +0200
Subject: [Biopython] losing information
Message-ID: 

hi everyone

I'm running a simple script to remove genbank records from a GB file
that I have identified as undesirable. The only
problem is that when the script is run, all the annotation info (CDS
etc) for entries is lost, only the sequence and ID is kept.
I was wondering if there is an option I am missing, or if I am using
an incorrect variable type somewhere. I just
can't seem to get all the info written.

from Bio import SeqIO

outhandle = open("HBV_seqs.gb", "w")
inhandle = open("all_hbv_seqs_reannotated.gb", "rU")
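newrecords = []
# Names of the records to drop, one per line
badlist = list(open("deletionrecords.txt", "rU"))
badrecord = []

# strip() removes the newline safely, even if the last line has none
for items in badlist:
    badrecord.append(items.strip())

# Keep every record whose name is not in the unwanted list
for record in SeqIO.parse(inhandle, "genbank"):
    if record.name not in badrecord:
        newrecords.append(record)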

print "writing records..."
SeqIO.write(newrecords, outhandle, "genbank")
print "writing done"
outhandle.close()


I would appreciate any pointers.

Thanks
Liam

From dalloliogm at gmail.com  Thu Oct 29 05:21:15 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 29 Oct 2009 10:21:15 +0100
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <366c6f340910281520w227bbe88o914f889f693588f0@mail.gmail.com>
References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> 
	<5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com> 
	<366c6f340910281520w227bbe88o914f889f693588f0@mail.gmail.com>
Message-ID: <5aa3b3570910290221t289b8e90sa3b722da7e4d5ded@mail.gmail.com>

I suppose it is Flank(Transcript), with upstream=5000 and downstream=5000
-
http://www.ensembl.org/biomart/martview/7675ba9923b086fb5d3a76f753cd5c98/7675ba9923b086fb5d3a76f753cd5c98

it seems you have to execute the query two times, one for upstream and one
for downstream.


On Wed, Oct 28, 2009 at 11:20 PM, Peng Yu  wrote:

> On Fri, Oct 16, 2009 at 3:29 AM, Giovanni Marco Dall'Olio
>  wrote:
> > On Thu, Oct 15, 2009 at 11:17 PM, Peng Yu  wrote:
> >> I have a set of genes. I want to get the 5kb sequence that is upstream
> >> of the TSS's of each gene.
> >
> > You can do that with biomart:
> > -
> http://www.ensembl.org/biomart/martview/a90f00892a48e04d438f762f551bf48a/a90f00892a48e04d438f762f551bf48a
> >
> > select Ensembl56 as database, Mus Musculus as species, go to Filters
> > and fill the 'Id list limit' form to add the required geneIds, then go
> > to Attributes, select Sequences and then check 'Upstream Flank -
> > 5000'.
>
> If I want both 5kb upstream of TSS and .5kb downstream of TSS, is
> there a way to do so?
>
>
> > As for doing that in python, I am not sure there are python interfaces
> > to BioMart. Galaxy (http://main.g2.bx.psu.edu/) is written in python,
> > so they must have written a library for that somewhere, but I don't
> > know their code.
> >
> > If you use R (remember that you can mix python and R with rpy2) there
> > is a nice module in bioconductor called BioMart.
> >
> >
> >> I have the following specific questions. Could somebody help me? Thank
> you!
> >>
> >> Which database I can access to get mouse genome?
> >> Give a gene name what function I should call to get the gene's location?
> >> _______________________________________________
> >> Biopython mailing list  -  Biopython at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biopython
> >>
> >
> >
> >
> > --
> > Giovanni Dall'Olio, phd student
> > Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)
> >
> > My blog on bioinformatics: http://bioinfoblog.it
> >
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>



-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk  Thu Oct 29 06:13:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 10:13:04 +0000
Subject: [Biopython] losing information
In-Reply-To: 
References: 
Message-ID: <320fb6e00910290313x48c8de1ew37efe425b632fa87@mail.gmail.com>

On Thu, Oct 29, 2009 at 4:53 AM, Liam Thompson  wrote:
> hi everyone
>
> I'm running a simple script to remove genbank records from
> a GB file that I have identified as undesirable. The only
> problem is that when the script is run, all the annotation
> info (CDS etc) for entries is lost, only the sequence and ID
> is kept. I was wondering if there is an option I am missing,
> or if I am using an incorrect variable type somewhere. I just
> can't seem to get all the info written.

I guess since you are losing the CDS features you have an
old version of Biopython. From 1.51 onwards we do write
out the feature table, see:
http://www.biopython.org/wiki/SeqIO#File_Formats

However, using Bio.SeqIO to parse and write GenBank files
is still lossy. References are not (yet) written out for example.

There are alternatives: Internally Bio.SeqIO is using
Bio.GenBank to parse the files, and this offers two parsers,
one giving SeqRecord objects (used by SeqIO), and one
giving GenBank specific Records. This latter parser should
do a better job of preserving the data on output.

That said, I would approach your problem in a very different
way. I would NOT parse the file into objects at all - I would
just loop over the lines, toggling between desired or not,
and outputting the lines for desired records as is. This
assumes your criteria for "desired" is simple to define,
e.g. a list of LOCUS identifiers.

Peter

From biopython at maubp.freeserve.co.uk  Thu Oct 29 06:29:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 10:29:43 +0000
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <001901ca5846$96f69d60$c4e3d820$@com>
References: <01aa01ca5717$dec90220$9c5b0660$@com>
	<320fb6e00910270842p4ae53933k91f5ce16a7b4d7e2@mail.gmail.com>
	<001901ca5846$96f69d60$c4e3d820$@com>
Message-ID: <320fb6e00910290329g78330f61qc5b524b4ba4f5f81@mail.gmail.com>

On Thu, Oct 29, 2009 at 3:19 AM, Ben O'Loghlin  wrote:
>
> Hi Peter,
>
> Many thanks for your post, you cleared up a world of confusion for me.
>
> A few answers/comments:
>
>>> Oh dear - were you working through the Entrez chapter in the Tutorial?
>>> If not, what were you looking at?
>
> No, I didn't find the tutorial until you mentioned it.

Did you look at the Biopython website at all? We do try and highlight
the Tutorial as it is the primary documentation, especially for newcomers.
Perhaps you can suggest how to make it more prominent? A fresh set
of eyes can give useful perspective.

> I came across
> BioPython by Googling "python pubmed", the most relevant hit on the first
> screenful seemed to be the first one, at
> http://baoilleach.blogspot.com/2008/02/searching-pubmed-with-python.html.
>
> This brief blog describes access via the Bio.EUtils package which seems to
> have disappeared, and it took me about 45 mins to realise that it was no
> longer in the distro and to track down Bio.Entrez.

Deprecations are recorded in the DEPRECATED file included with the
source code, the latest version can be viewed here:
http://github.com/biopython/biopython/blob/master/DEPRECATED

The removal of Bio.EUtils happened in Biopython 1.52, and was in this
case also noted in the NEWS file, but not the actual release notice:
http://github.com/biopython/biopython/blob/master/NEWS
http://news.open-bio.org/news/2009/09/biopython-release-152/

> Then Googling BioPython Entrez, the first hit took me to the documentation
> (I missed spotting the tutorial link!) and all subsequent attempts were
> based on reading this doco and the source code, and scratching my head and
> trying random things.

Do you mean the API documentation, available via Python through the help
command and viewable online here:

http://www.biopython.org/DIST/docs/api/Bio.Entrez-module.html

You can probably tell we put more effort into the Tutorial as an introduction
document.

>>So you see by default, the NCBI is returning HTML. We can ask for XML:
>>
>>>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
>>>>> print handle.readline()
>>
>
> This all makes sense now, I wasn't aware of the different 'retmode' options.
> The Bio.Entrez.efetch() documentation points me to
> http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html, which
> doesn't mention the 'retmode' or 'rettype' parameters. In fact I couldn't
> find any explicit reference to it in the Tutorial either, just the use of
> 'rettype=text' in one of the example code snippets.
>
> I subsequently tracked down this page
> http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html which
> does at least indicate the different rettypes and retmodes available.

I agree the NCBI Entrez documentation is very unhelpful to beginners.
We do try and make this easier in our tutorial, but perhaps "retmode"
and "rettype" need to be discussed more on the EFetch section (they
are mentioned a little later in the chapter in the context of other formats)
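
As a rough illustration of what "retmode" and "rettype" are: they are ordinary query parameters sent to the NCBI EFetch service, and my understanding is that Bio.Entrez.efetch() passes its keyword arguments through in much the same way. The sketch below only builds the URLs and makes no network request. The helper name build_efetch_url is made up for this example and is not part of Biopython.

```python
from urllib.parse import urlencode

# Base address of the NCBI EFetch service.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(db, record_id, **options):
    """Return an EFetch URL (hypothetical helper; nothing is downloaded)."""
    params = {"db": db, "id": record_id}
    params.update(options)  # e.g. retmode="xml" or rettype="abstract"
    return EFETCH_URL + "?" + urlencode(params)


# The same PubMed record requested in two different return formats.
xml_url = build_efetch_url("pubmed", "17206916", retmode="xml")
text_url = build_efetch_url(
    "pubmed", "17206916", rettype="abstract", retmode="text"
)
print(xml_url)
print(text_url)
```

Which rettype/retmode combinations are valid depends on the database, so the NCBI pages mentioned above remain the place to check.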

>>You could parse this with Bio.Entrez.read() if you wanted to:
>>
>>>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
>>>>> record = Entrez.read(handle)
>>>>> print record
>>[{u'MedlineCitation': ... ]
>
> I'm interested in using this format, however I don't understand how to
> read/write fields and subtrees of the object type
> 'Bio.Entrez.Parser.ListElement' returned by Entrez.read(handle) with retmode
> XML.
>
> I'm finding it hard to track down references to this [{u'x':['y']}] object
> format in Python , possibly due to the fact that I can't get Google to
> search for strings like [{u'. I am however appreciative that there appears
> to be a u'SpaceFlightMission' tag in Pubmed's default rettype. :)

Michiel has tried to answer this. Are you familiar with the basic Python
datatypes?

> I'm also a little confused about why handle.read() returns a string in XML
> format whereas Entrez.read(handle) returns the
> Bio.Entrez.Parser.ListElement. In fact I only knew about this latter method
> from your email, since the example in the Bio.Entrez doco only uses the
> handle.read() syntax, and doesn't mention that there's any distinction, nor
> which might be more appropriate for which task.

In handle.read(), read is a method of an object called handle, in this
case a handle to a network connection.

In Entrez.read(), read is a function of the Entrez module.
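
The same distinction can be seen using only the standard library. In this sketch the json module stands in for Bio.Entrez purely as an analogy (JSON instead of XML, and an in-memory StringIO handle instead of a network connection):

```python
import io
import json

# A file-like "handle" holding some text (stand-in for a network handle).
handle = io.StringIO('[{"x": ["y"]}]')

# handle.read() is a METHOD of the handle object: it returns the raw text.
raw_text = handle.read()
print(type(raw_text).__name__, raw_text)

# json.load() is a FUNCTION of the json module: it takes a handle and
# returns parsed Python objects, much as Entrez.read(handle) does for XML.
handle.seek(0)  # rewind, because read() consumed the handle
parsed = json.load(handle)
print(type(parsed).__name__, parsed)
```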

In Python, xxx.yyy() means either the "yyy" method of object "xxx" (where
"xxx" is a variable), or the "yyy" could be a function or class of the module
"xxx".

>> Does that help?
>
> Immensely.
>
> If you (or any other Bio.Wizards) have the time and the inclination to help
> me further, I'd be very grateful for any thoughts relevant to my ponderings
> above.

I would suggest you read through some Python introductions, and then
go through the Biopython tutorial again. We have to assume our readers
know a bit of Python - and my guess is from your questions that many
of your issues are with Python itself rather than Biopython. But you are
learning :)

Peter

From dejmail at gmail.com  Thu Oct 29 06:52:23 2009
From: dejmail at gmail.com (Liam Thompson)
Date: Thu, 29 Oct 2009 12:52:23 +0200
Subject: [Biopython] losing information
In-Reply-To: <320fb6e00910290313x48c8de1ew37efe425b632fa87@mail.gmail.com>
References: 
	<320fb6e00910290313x48c8de1ew37efe425b632fa87@mail.gmail.com>
Message-ID: 

Hi Peter

Thanks for the helpful reply as always. I upgraded to 1.51 from 1.49,
but it made no difference, the information is still lost. You are right
that it would be better not to write the data to file, and just check
over the file, and I will try to incorporate this into the next few
functions I'm adding.

Let me attempt the Bio.Genbank feature

Regards
Liam

On Thu, Oct 29, 2009 at 12:13 PM, Peter  wrote:
> On Thu, Oct 29, 2009 at 4:53 AM, Liam Thompson  wrote:
>> hi everyone
>>
>> I'm running a simple script to remove genbank records from
>> a GB file that I have identified as undesirable. The only
>> problem is that when the script is run, all the annotation
>> info (CDS etc) for entries is lost, only the sequence and ID
>> is kept. I was wondering if there is an option I am missing,
>> or if I am using an incorrect variable type somewhere. I just
>> can't seem to get all the info written.
>
> I guess since you are losing the CDS features you have an
> old version of Biopython. From 1.51 onwards we do write
> out the feature table, see:
> http://www.biopython.org/wiki/SeqIO#File_Formats
>
> However, using Bio.SeqIO to parse and write GenBank files
> is still lossy. References are not (yet) written out for example.
>
> There are alternatives: Internally Bio.SeqIO is using
> Bio.GenBank to parse the files, and this offers two parsers,
> one giving SeqRecord objects (used by SeqIO), and one
> giving GenBank specific Records. This latter parser should
> do a better job of preserving the data on output.
>
> That said, I would approach your problem in a very different
> way. I would NOT parse the file into objects at all - I would
> just loop over the lines, toggling between desired or not,
> and outputting the lines for desired records as is. This
> assumes your criteria for "desired" is simple to define,
> e.g. a list of LOCUS identifiers.
>
> Peter
>

From biopython at maubp.freeserve.co.uk  Thu Oct 29 07:07:09 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 11:07:09 +0000
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <320fb6e00910290329g78330f61qc5b524b4ba4f5f81@mail.gmail.com>
References: <01aa01ca5717$dec90220$9c5b0660$@com>
	<320fb6e00910270842p4ae53933k91f5ce16a7b4d7e2@mail.gmail.com>
	<001901ca5846$96f69d60$c4e3d820$@com>
	<320fb6e00910290329g78330f61qc5b524b4ba4f5f81@mail.gmail.com>
Message-ID: <320fb6e00910290407r15e1c7d5h246de938a8229ad@mail.gmail.com>

On Thu, Oct 29, 2009 at 10:29 AM, Peter  wrote:
> On Thu, Oct 29, 2009 at 3:19 AM, Ben O'Loghlin  wrote:
>> This all makes sense now, I wasn't aware of the different 'retmode' options.
>> The Bio.Entrez.efetch() documentation points me to
>> http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html, which
>> doesn't mention the 'retmode' or 'rettype' parameters. In fact I couldn't
>> find any explicit reference to it in the Tutorial either, just the use of
>> 'rettype=text' in one of the example code snippets.
>>
>> I subsequently tracked down this page
>> http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html which
>> does at least indicate the different rettypes and retmodes available.
>
> I agree the NCBI Entrez documentation is very unhelpful to beginners.
> We do try and make this easier in our tutorial, but perhaps "retmode"
> and "rettype" need to be discussed more on the EFetch section (they
> are mentioned a little later in the chapter in the context of other formats)

I've tried to make the EFetch section of the Biopython tutorial clearer
for the next release - thanks for the feedback.

Peter

From biopython at maubp.freeserve.co.uk  Thu Oct 29 07:09:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 11:09:39 +0000
Subject: [Biopython] losing information
In-Reply-To: 
References: 
	<320fb6e00910290313x48c8de1ew37efe425b632fa87@mail.gmail.com>
	
Message-ID: <320fb6e00910290409s4470ec7ufb15e0556c6d4d89@mail.gmail.com>

On Thu, Oct 29, 2009 at 10:52 AM, Liam Thompson  wrote:
> Hi Peter
>
> Thanks for the helpful reply as always. I upgraded to 1.51 from 1.49,
> but it made no difference, the information is still lost.

That is curious. Could you tell us a specific GenBank record showing
this problem (e.g. an accession number or a URL)?

By the way - Biopython 1.52 has been out for a month, although I
don't recall any major changes in the GenBank support right now.

> You are right that it would be better not to write the data to file, and just
> check over the file, and I will try to incorporate this into the next few
> functions I'm adding.

That would be best I think.

> Let me attempt the Bio.Genbank feature

If you really want to. The API is a bit different to Bio.SeqIO ;)

Peter

From biopython at maubp.freeserve.co.uk  Thu Oct 29 08:15:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 12:15:36 +0000
Subject: [Biopython] losing information
In-Reply-To: 
References: 
	<320fb6e00910290313x48c8de1ew37efe425b632fa87@mail.gmail.com>
	
	<320fb6e00910290409s4470ec7ufb15e0556c6d4d89@mail.gmail.com>
	
Message-ID: <320fb6e00910290515n72c99ec2ye9f3c6ab61361b1e@mail.gmail.com>

On Thu, Oct 29, 2009 at 11:48 AM, Liam Thompson  wrote:
> Hi Peter
>
> There are 2000 records, but they all behave the same way
>
> I have attached 2 files, to show just 2 of them change.
>
> Thanks
> Liam

The mailing list doesn't like attachments, but I got them and
had a look. This is odd. I just tried a conversion using 1.52+
(i.e. the latest code in the repository) with:

from Bio import SeqIO
count = SeqIO.convert("original.txt", "gb", "new.txt", "gb")
print "Converted %i records" % count

or, equivalently for pre-Biopython 1.52 you can use:

from Bio import SeqIO
records = SeqIO.parse(open("original.txt"), "gb")
handle = open("new.txt", "w")
count = SeqIO.write(records, handle,  "gb")
handle.close()
print "Converted %i records" % count

See this blog post introducing the convert function:
http://news.open-bio.org/news/2009/09/biopython-convert-function/

Either way, I am seeing the features preserved (although
some of the qualifiers are in a different order). As I said
before, I thought this would work on 1.51 too - but maybe
I was wrong. Could you upgrade to 1.52 and retry?

Peter

From biopython at maubp.freeserve.co.uk  Thu Oct 29 10:04:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 14:04:20 +0000
Subject: [Biopython] losing information
In-Reply-To: <320fb6e00910290313x48c8de1ew37efe425b632fa87@mail.gmail.com>
References: 
	<320fb6e00910290313x48c8de1ew37efe425b632fa87@mail.gmail.com>
Message-ID: <320fb6e00910290704n605aaf4fr56af80e5463eb35c@mail.gmail.com>

On Thu, Oct 29, 2009 at 10:13 AM, Peter  wrote:
> On Thu, Oct 29, 2009 at 4:53 AM, Liam Thompson  wrote:
>> hi everyone
>>
>> I'm running a simple script to remove genbank records from
>> a GB file that I have identified as undesirable. ...
>
> ...
>
> That said, I would approach your problem in a very different
> way. I would NOT parse the file into objects at all - I would
> just loop over the lines, toggling between desired or not,
> and outputting the lines for desired records as is. This
> assumes your criteria for "desired" is simple to define,
> e.g. a list of LOCUS identifiers.

If you can just look at the LOCUS line, this is very easy in
Python (you don't need Biopython at all). It will also be very
fast as there is no complicated parsing and object creation.
e.g.

wanted = set(["AB493847", "AB493848"])
inp_handle = open("original.txt")
out_handle = open("new.txt", "w")
save = False
for line in inp_handle :
    if line.startswith("LOCUS") : #start of record
        save = line.split()[1] in wanted
    if save :
        out_handle.write(line)
    if line.strip() == "//" : #end of record
        save = False
inp_handle.close()
out_handle.close()

I've written this using a set of good record identifiers. If you have a
list of bad records, just switch round the "in" check.

If you need to access something like the annotation, or the sequence,
then it does make sense to parse the records - but keep a copy of
the raw GenBank record as a string to use for output. One way to
do this is to use StringIO.
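
A minimal sketch of that idea, written for Python 3 and using only the standard library: each record is collected as one raw string, a test is applied to it, and the untouched text is written out if it passes. The function names and the second accession (AB000000) are invented for this example.

```python
from io import StringIO


def iter_raw_records(handle):
    """Yield each GenBank record as one string, from LOCUS to the // line."""
    lines = []
    for line in handle:
        if line.startswith("LOCUS"):
            lines = []  # start of a new record
        lines.append(line)
        if line.strip() == "//":  # end of record
            yield "".join(lines)
            lines = []


def filter_raw_records(inp_handle, out_handle, keep):
    """Write, unchanged, every raw record for which keep(raw_text) is true."""
    count = 0
    for raw in iter_raw_records(inp_handle):
        if keep(raw):
            out_handle.write(raw)
            count += 1
    return count


# Tiny made-up input, just to exercise the functions.
example = StringIO(
    "LOCUS       AB493847\nORIGIN\n        1 ggg\n//\n"
    "LOCUS       AB000000\nORIGIN\n        1 aaa\n//\n"
)
output = StringIO()
kept = filter_raw_records(example, output, lambda raw: "ggg" in raw)
print("Kept %i record(s)" % kept)
```

If the decision needs the annotation or the sequence, the keep function could parse its argument with SeqIO.read(StringIO(raw), "gb") and inspect the resulting SeqRecord, while the output still comes from the raw string.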

Peter

From bassbabyface at yahoo.com  Thu Oct 29 10:59:45 2009
From: bassbabyface at yahoo.com (Ben O'Loghlin)
Date: Fri, 30 Oct 2009 01:59:45 +1100
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <109726.94290.qm@web62408.mail.re1.yahoo.com>
References: <001901ca5846$96f69d60$c4e3d820$@com>
	<109726.94290.qm@web62408.mail.re1.yahoo.com>
Message-ID: <005001ca58a8$75a41cc0$60ec5640$@com>

Thanks Michiel.

What is the function of the 'u' in the string discussed below? That's the
bit that's got me confused.

Best regards,
Ben

p.s. assistance on this list is fast and useful. Nice!

-----Original Message-----
From: Michiel de Hoon [mailto:mjldehoon at yahoo.com] 
Sent: Thursday, 29 October 2009 2:50 PM
To: 'Peter'; Ben O'Loghlin
Cc: biopython at biopython.org
Subject: Re: [Biopython] Entrez.read return value is typed as a string??



--- On Wed, 10/28/09, Ben O'Loghlin  wrote:
> >>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
> >>>> record = Entrez.read(handle)
> >>>> print record
> >[{u'MedlineCitation': ... ]
> 
> I'm interested in using this format, however I don't
> understand how to
> read/write fields and subtrees of the object type
> 'Bio.Entrez.Parser.ListElement' returned by
> Entrez.read(handle) with retmode XML. 
> 
> I'm finding it hard to track down references to this
> [{u'x':['y']}] object format in Python ...

Look at the outermost two brackets [].
You can treat this object as a Python list.

So if record = [{u'x':['y']}],
then record[0] = {u'x':['y']}

Now look at the two outermost braces {}.
You can treat record[0] as a dictionary.
So record[0]['x'] will return ['y'].
Which can then be treated as a Python list.
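
To make this concrete, here is a small sketch with a hand-written structure of the same shape. It is not real output from Entrez.read(): the title is a placeholder, and only a couple of keys are included.

```python
# Hand-written stand-in for the kind of structure Entrez.read() returns;
# the title is a placeholder, not real PubMed data.
record = [
    {
        "MedlineCitation": {
            "PMID": "17206916",
            "Article": {"ArticleTitle": "(placeholder title)"},
        }
    }
]

first = record[0]                    # outer [] -> index it like a list
citation = first["MedlineCitation"]  # {} -> look up by key like a dictionary
print(sorted(citation.keys()))       # discover which keys are available
print(citation["Article"]["ArticleTitle"])

# The same pattern works when looping over every entry in the list.
for entry in record:
    print(entry["MedlineCitation"]["PMID"])
```

Printing the keys at each level is a handy way to explore a record whose layout you do not know in advance.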

--Michiel.


      



From biopython at maubp.freeserve.co.uk  Thu Oct 29 11:37:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 15:37:21 +0000
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <005001ca58a8$75a41cc0$60ec5640$@com>
References: <001901ca5846$96f69d60$c4e3d820$@com>
	<109726.94290.qm@web62408.mail.re1.yahoo.com>
	<005001ca58a8$75a41cc0$60ec5640$@com>
Message-ID: <320fb6e00910290837w5861226dsb0a9f4a9fb4acd1f@mail.gmail.com>

On Thu, Oct 29, 2009 at 2:59 PM, Ben O'Loghlin  wrote:
> Thanks Michiel.
>
> What is the function of the 'u' in the string discussed below?
> That's the bit that's got me confused.
>
> Best regards,
> Ben
>
> p.s. assistance on this list is fast and useful. Nice!

Again, it's a bit of Python basics rather than anything Biopython
specific. The u is for unicode, thus "fred" gives a normal string
while u"fred" gives a unicode string. Unless you are messing
about with odd foreign characters (e.g. letters with accents) you
won't have to worry about this. Python 3 gets rid of the dichotomy
by using unicode for all strings.
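
A quick way to see this for yourself, assuming Python 3.3 or later (where the u prefix is accepted but has no effect):

```python
# Under Python 3 both literals have the same type, so the distinction
# between "normal" and unicode strings disappears.
plain = "fred"
prefixed = u"fred"
print(type(plain).__name__, type(prefixed).__name__)
print(plain == prefixed)

# Accented letters need no special handling either; \u00e9 is e-acute.
accented = "caf\u00e9"
print(len(accented), accented)
```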

Peter

From bassbabyface at yahoo.com  Sat Oct 31 20:58:10 2009
From: bassbabyface at yahoo.com (Ben O'Loghlin)
Date: Sun, 1 Nov 2009 11:58:10 +1100
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <320fb6e00910290837w5861226dsb0a9f4a9fb4acd1f@mail.gmail.com>
References: <001901ca5846$96f69d60$c4e3d820$@com>	
	<109726.94290.qm@web62408.mail.re1.yahoo.com>	
	<005001ca58a8$75a41cc0$60ec5640$@com>
	<320fb6e00910290837w5861226dsb0a9f4a9fb4acd1f@mail.gmail.com>
Message-ID: <016101ca5a8e$633f0210$29bd0630$@com>

Thanks Peter, another small step up the learning curve!

Ben

-----Original Message-----
From: p.j.a.cock at googlemail.com [mailto:p.j.a.cock at googlemail.com] On Behalf
Of Peter
Sent: Friday, 30 October 2009 2:37 AM
To: Ben O'Loghlin
Cc: Michiel de Hoon; biopython at biopython.org
Subject: Re: [Biopython] Entrez.read return value is typed as a string??

On Thu, Oct 29, 2009 at 2:59 PM, Ben O'Loghlin 
wrote:
> Thanks Michiel.
>
> What is the function of the 'u' in the string discussed below?
> That's the bit that's got me confused.
>
> Best regards,
> Ben
>
> p.s. assistance on this list is fast and useful. Nice!

Again, it's a bit of Python basics rather than anything Biopython
specific. The u is for unicode, thus "fred" gives a normal string
while u"fred" gives a unicode string. Unless you are messing
about with odd foreign characters (e.g. letters with accents) you
won't have to worry about this. Python 3 gets rid of the dichotomy
by using unicode for all strings.

Peter



From biopython at maubp.freeserve.co.uk  Thu Oct  1 08:06:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 1 Oct 2009 09:06:22 +0100
Subject: [Biopython] get back raw records with SeqIO?
In-Reply-To: <4995FD92-374F-4B4F-947B-CC9175B06BCD@u.washington.edu>
References: <44353069-545B-4248-BAD1-958363EE8F29@u.washington.edu>
	<320fb6e00909250250q417516ffu5f75ceea7732ff72@mail.gmail.com>
	<4995FD92-374F-4B4F-947B-CC9175B06BCD@u.washington.edu>
Message-ID: <320fb6e00910010106t4126292bs2c9fac1db85fbd32@mail.gmail.com>

On Thu, Oct 1, 2009 at 12:14 AM, Cedar McKay  wrote:
>
>> Why do you want to do this? I'd like to understand the desired
>> usage.
>
> I didn't have a specific technical reason.

OK - if you come up with a good use case example, please let us know.

> It just seemed like everything was going towards using SeqIO and things
> like Bio.Fasta were being deprecated, so I wanted to get ahead of the
> curve there. But if Bio.Genbank is going to be around for a long time,
> I don't have any problem with doing it that way.

For more complicated file formats (e.g. GenBank, SwissProt, ACE,
PHRED, ...) mapping the data into SeqRecord objects isn't 100%
perfect. Here Bio.SeqIO really is just a unifying API sitting on top
of file format specific parsers (which live in other modules), which
is good enough for most tasks. Unless/until the SeqRecord objects
are a full mapping, any more file format specific data-structure still
has its uses - and thus I see no immediate pressure to remove
Bio.GenBank etc.

Unlike some of the Bio.SeqIO parsers, for "fasta" we don't use
an underlying module (such as Bio.Fasta), and the SeqRecord
can capture all of the annotation in the raw file. One reason
for this is at the time, Bio.Fasta still used Martel and was
noticeably slower than the pure python code I adopted for
FASTA files in SeqIO. Since then Bio.Fasta has lost all the
Martel dependencies (which meant the loss of the old indexing
code, indirectly leading to the Bio.SeqIO.index() function as
per our previous discussions). This means that the remaining
code in Bio.Fasta is now redundant. Maybe we could have
just left Bio.Fasta alone, sitting quietly but tagged obsolete,
but it is clearer to remove redundancy.

Peter

P.S. For the record, Bio.Fasta was declared obsolete in
Biopython 1.48 (Sept 2008), and deprecated in Biopython
1.51 (Aug 2009).


From denzel.dz.li at gmail.com  Mon Oct  5 17:38:38 2009
From: denzel.dz.li at gmail.com (Denzel Li)
Date: Mon, 5 Oct 2009 13:38:38 -0400
Subject: [Biopython] Combine nexus files but not concatenating them
Message-ID: 

Hi all:
I notice there is a solution for combining nexus files as appeared in the
cookbook
(http://biopython.org/wiki/Concatenate_nexus ).  However, in the example the
alignments are concatenated. What if I want is, for example, the following
two files are combined into one file as shown in "combinedFile.nex".

# file1.nex
b1 GGG
b2 GGT

# file2.nex
b1 AAA
b2 AAT


# combinedFile.nex
begin data;
  dimensions ntax=2 nchar=6
[alignment from file1.nex]
b1 GGG
b2 GGT
[alignment from file2.nex]
b1 AAA
b2 AAT
;end;

begin sets;
charset a1=1-3;
charset a2=4-6;
end;

Any suggestion is highly appreciated. Thank you.

Best,
Denzel


From biopython at maubp.freeserve.co.uk  Mon Oct  5 19:42:48 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 5 Oct 2009 20:42:48 +0100
Subject: [Biopython] Combine nexus files but not concatenating them
In-Reply-To: 
References: 
Message-ID: <320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com>

On Mon, Oct 5, 2009 at 6:38 PM, Denzel Li  wrote:
> Hi all:
> I notice there is a solution for combining nexus files as appeared in the
> cookbook
> (http://biopython.org/wiki/Concatenate_nexus ).  However, in the example the
> alignments are concatenated. What if I want is, for example, the following
> two files are combined into one file as shown in "combinedFile.nex".

I was under the impression that NEXUS files should only hold
one alignment matrix. Why do you need it done this way? Isn't
your example basically the same thing but interleaved?

Peter



From denzel.dz.li at gmail.com  Mon Oct  5 20:00:06 2009
From: denzel.dz.li at gmail.com (Denzel Li)
Date: Mon, 5 Oct 2009 16:00:06 -0400
Subject: [Biopython] Combine nexus files but not concatenating them
In-Reply-To: <320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com>
References: 
	<320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com>
Message-ID: 

Hi Peter:
Yes, it is basically the same thing returned by "nexus.combine" but
"interleaved".  A further question is that, is it possible to split one
nexus into several nexus according to the Charset (or partition) defined in
the file. Like in the concatenation example (
http://biopython.org/wiki/Concatenate_nexus ), split the combined file into
btCOI.nex,btCOII.nex and btITS.nex.

Thanks,
Denzel


On Mon, Oct 5, 2009 at 3:42 PM, Peter wrote:

> On Mon, Oct 5, 2009 at 6:38 PM, Denzel Li  wrote:
> > Hi all:
> > I notice there is a solution for combining nexus files as appeared in the
> > cookbook
> > (http://biopython.org/wiki/Concatenate_nexus ).  However, in the example
> the
> > alignments are concatenated. What if I want is, for example, the
> following
> > two files are combined into one file as shown in "combinedFile.nex".
>
> I was under the impression that NEXUS files should only hold
> one alignment matrix. Why do you need it done this way? Isn't
> your example basically the same thing but interleaved?
>
> Peter
>


From biopython at maubp.freeserve.co.uk  Mon Oct  5 20:31:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 5 Oct 2009 21:31:53 +0100
Subject: [Biopython] Combine nexus files but not concatenating them
In-Reply-To: 
References: 
	<320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com>
	
Message-ID: <320fb6e00910051331u2f27c7ecmde744d29fb3ba2ab@mail.gmail.com>

On Mon, Oct 5, 2009 at 9:00 PM, Denzel Li  wrote:
> Hi Peter:
> Yes, it is basically the same thing returned by "nexus.combine" but
> "interleaved".

Surely whether or not the data is interleaved is immaterial to the
meaning. Does the combined version following our wiki not work
for some 3rd party tool?

> A further question is that, is it possible to split one nexus
> into several nexus according to the Charset (or partition)
> defined in the file. Like in the concatenation example
> (http://biopython.org/wiki/Concatenate_nexus ), split the
> combined file into btCOI.nex,btCOII.nex and btITS.nex.

Does the write_nexus_data_partitions() method of the Nexus
object do what you want?
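
For anyone unsure what splitting by charset means, here is a plain Python sketch of the idea. It is not how Bio.Nexus implements it. It uses the concatenation of the two example files from the start of this thread, with the charsets a1=1-3 and a2=4-6, and the function name is made up.

```python
def split_by_charsets(alignment, charsets):
    """Return {charset name: sub-alignment}; ranges are 1-based, inclusive."""
    parts = {}
    for name, (start, end) in charsets.items():
        parts[name] = dict(
            (taxon, seq[start - 1:end]) for taxon, seq in alignment.items()
        )
    return parts


# Concatenation of file1.nex and file2.nex, plus the charsets from the
# "begin sets" block in the original question.
combined = {"b1": "GGGAAA", "b2": "GGTAAT"}
charsets = {"a1": (1, 3), "a2": (4, 6)}

parts = split_by_charsets(combined, charsets)
for name, part in sorted(parts.items()):
    print(name, part)
```

Each sub-alignment matches one of the original input files, which is the round trip being asked about.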

Peter


From harekrishna at gmail.com  Tue Oct  6 21:07:52 2009
From: harekrishna at gmail.com (Austin Davis-Richardson)
Date: Tue, 6 Oct 2009 17:07:52 -0400
Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary() results
Message-ID: 

Howdy,

I'm using BioPython to generate a table of accession numbers and their
corresponding TaxIDs.  The fastest way I can do this is 20 at a time
(20 per 3 seconds rather than 1 per 3 seconds).

However, this results in a problem.

whenever my script receives a result from NCBI that is blank such as
there being no value for TaxID, BioPython crashes with the error:

  File "taxcollector3.py", line 39, in getTaxID
    record = Entrez.read(handle)
  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py",
line 259, in read
    record = handler.run(handle)
  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
line 90, in run
    self.parser.ParseFile(handle)
  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
line 191, in endElement
    value = IntegerElement(value)
ValueError: invalid literal for int() with base 10: ''


my code looks like this:  Where gids is a string of comma-separated GIDs
(I get the GIDs from the accession numbers using
Entrez.esearch(db="nucleotide", rettype="text", term=accessions))

			handle = Entrez.esummary(db="nucleotide", id=gids)
			record = Entrez.read(handle)


The only solution I can come up with is searching one at a time, but
this is very slow.  (I have about 300,000 accession numbers)

Does anyone know perhaps a patch or a solution for this?  Or maybe an
easier way to get a TaxID from an accession number?

Thanks,
Austin Davis-Richardson


From mjldehoon at yahoo.com  Wed Oct  7 02:11:36 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Tue, 6 Oct 2009 19:11:36 -0700 (PDT)
Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary()
	results
In-Reply-To: 
Message-ID: <362834.37683.qm@web62401.mail.re1.yahoo.com>

You could try the following (with biopython 1.52):

handle = Entrez.esummary(db="nucleotide", id=gids)
records = Entrez.parse(handle)
while True:
    try:
        record = records.next()
    except StopIteration:
        break
    except:
        print "Skipping record"


We should probably modify Bio.Entrez so that empty "integer" values are treated correctly.
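
Another possible workaround, sketched here for Python 3, is to keep fetching 20 at a time and fall back to one-at-a-time requests only for a batch that fails to parse. The fetch argument and the fake_fetch stand-in are assumptions made for illustration. No NCBI request is made, and the three second delay between real requests is left out.

```python
def fetch_in_batches(ids, fetch, batch_size=20):
    """Call fetch() on batches of IDs, retrying one by one if a batch fails.

    fetch is any callable taking a list of IDs and returning a list of
    results; it is assumed to raise ValueError on an unparsable record.
    Returns (results, skipped_ids).
    """
    results, skipped = [], []
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        try:
            results.extend(fetch(batch))
        except ValueError:
            # Something in this batch is bad: find it by fetching singly.
            for one_id in batch:
                try:
                    results.extend(fetch([one_id]))
                except ValueError:
                    skipped.append(one_id)
    return results, skipped


def fake_fetch(batch):
    """Stand-in for the esummary + Entrez.read step; ID "13" is 'blank'."""
    if "13" in batch:
        raise ValueError("invalid literal for int() with base 10: ''")
    return [(gid, "taxid-for-" + gid) for gid in batch]


ids = [str(i) for i in range(1, 46)]
results, skipped = fetch_in_batches(ids, fake_fetch)
print(len(results), "fetched; skipped:", skipped)
```

In real use, fetch could wrap Entrez.esummary(db="nucleotide", id=",".join(batch)) followed by Entrez.read(handle), so only the batches containing a blank record pay the one-at-a-time cost.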


--Michiel.

--- On Tue, 10/6/09, Austin Davis-Richardson  wrote:

> From: Austin Davis-Richardson 
> Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary() results
> To: biopython at lists.open-bio.org
> Date: Tuesday, October 6, 2009, 5:07 PM
> Howdy,
> 
> I'm using BioPython to generate a table of accession
> numbers and their
> corresponding TaxIDs.  The fastest way I can do this
> is 20 at a time
> (20 per 3 seconds rather than 1 per 3 seconds).
> 
> However, this results in a problem.
> 
> whenever my script receives a result from NCBI that is
> blank such as
> there being no value for TaxID, BioPython crashes with the
> error:
> 
>   File "taxcollector3.py", line 39, in getTaxID
>     record = Entrez.read(handle)
>   File
> "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py",
> line 259, in read
>     record = handler.run(handle)
>   File
> "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
> line 90, in run
>     self.parser.ParseFile(handle)
>   File
> "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
> line 191, in endElement
>     value = IntegerElement(value)
> ValueError: invalid literal for int() with base 10: ''
> 
> 
> my code looks like this:  Where gids is a string of
> comma-separated GIDs
> (I get the GIDs from the accession numbers using
> Entrez.esearch(db="nucleotide", rettype="text",
> term=accessions))
> 
>     handle = Entrez.esummary(db="nucleotide", id=gids)
>     record = Entrez.read(handle)
> 
> 
> The only solution I can come up with is searching one at a
> time, but
> this is very slow.  (I have about 300,000 accession
> numbers)
> 
> Does anyone know perhaps a patch or a solution for
> this?  Or maybe an
> easier way to get a TaxID from an accession number?
> 
> Thanks,
> Austin Davis-Richardson


      



From biopython at maubp.freeserve.co.uk  Wed Oct  7 09:29:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 7 Oct 2009 10:29:36 +0100
Subject: [Biopython] Combine nexus files but not concatenating them
In-Reply-To: 
References: 
	<320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com>
	
	<320fb6e00910051331u2f27c7ecmde744d29fb3ba2ab@mail.gmail.com>
	
Message-ID: <320fb6e00910070229n1b78542dj82998de13cf7eed7@mail.gmail.com>

On Wed, Oct 7, 2009 at 4:22 AM, Denzel Li  wrote:
> Hi Peter:
> Thank you for the help. Both functions work well. By the way, will
> "standard" datatype or "mixed" datatype be supported in Bio:Nexus:Nexus?
>
> Best,
> Denzel

Hi Denzel,

I CC'd the list - please try and keep replies sent there.

I'm glad Bio.Nexus is working well for you.

Regarding the finer details of the NEXUS file format and the Biopython
code, I am not an expert - we need Frank or Cymon to comment. If
you could give us a couple of examples of what you are asking for it
would probably be much clearer (to me at least).

Regards,

Peter


From biopython at maubp.freeserve.co.uk  Wed Oct  7 11:17:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 7 Oct 2009 12:17:30 +0100
Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary()
	results
In-Reply-To: <362834.37683.qm@web62401.mail.re1.yahoo.com>
References: 
	<362834.37683.qm@web62401.mail.re1.yahoo.com>
Message-ID: <320fb6e00910070417w26236a62ifece2e2610256609@mail.gmail.com>

On Wed, Oct 7, 2009 at 3:11 AM, Michiel de Hoon  wrote:
>
> We should probably modify Bio.Entrez so that empty "integer" values are treated correctly.
>

Does "correctly" mean a default value? I see Brad has just committed a
change to use -1 in this case, but perhaps None is also a good choice?
Alternatively, can we leave this bit of the data structure empty?

Peter


From chapmanb at 50mail.com  Wed Oct  7 11:17:37 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 7 Oct 2009 07:17:37 -0400
Subject: [Biopython] Skipping over blank/erroneous
	Entrez.esummary()	results
In-Reply-To: 
References: 
Message-ID: <20091007111737.GC84267@sobchak.mgh.harvard.edu>

Hi Austin;

> I'm using BioPython to generate a table of accession numbers and their
> corresponding TaxIDs.  The fastest way I can do this is 20 at a time
> (20 per 3 seconds rather than 1 per 3 seconds).
> 
> However, this results in a problem.
> 
> whenever my script receives a result from NCBI that is blank such as
> there being no value for TaxID, BioPython crashes with the error:
> 
>   File "taxcollector3.py", line 39, in getTaxID
>     record = Entrez.read(handle)
>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py",
> line 259, in read
>     record = handler.run(handle)
>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
> line 90, in run
>     self.parser.ParseFile(handle)
>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
> line 191, in endElement
>     value = IntegerElement(value)
> ValueError: invalid literal for int() with base 10: ''

In addition to Michiel's workaround, I checked in a small change
which could at least circumvent the error you are reporting:

http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279

It affects only one file, so if you don't want to pull the latest
from GitHub, you can download just that file and replace it in your
Biopython library:

http://github.com/biopython/biopython/blob/master/Bio/Entrez/Parser.py

Ideally, we should have a test case to cover this. Could you let us
know specific GIs that are causing the problem? The group of 20 is
fine if you haven't narrowed it further than that. This'll also help
us check if there are any other problems with these records.

Thanks for reporting this,
Brad


From mjldehoon at yahoo.com  Wed Oct  7 12:19:01 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 7 Oct 2009 05:19:01 -0700 (PDT)
Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary()
	results
In-Reply-To: <20091007111737.GC84267@sobchak.mgh.harvard.edu>
Message-ID: <826538.32828.qm@web62406.mail.re1.yahoo.com>

> In addition to Michiel's workaround, I checked in a small change
> which could at least circumvent the error you are reporting:
> 
> http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279

Sorry, but that change introduces two bugs. First, we should be able to distinguish between -1 and missing values. More importantly, we want to be able to add attributes to value. Since -1 is an integer instead of an object, it won't allow that.
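
To illustrate the second point, here is a minimal sketch in modern Python 3. It is not taken from the Biopython source: the IntegerElement class below is a hypothetical stand-in that only assumes the parser's element type is a subclass of int, and the "attributes" name and the sample values are made up for the example.

```python
class IntegerElement(int):
    """Hypothetical stand-in: an int subclass whose instances can carry attributes."""


# A plain int has no per-instance attribute storage, so this assignment fails.
plain = -1
try:
    plain.attributes = {"Name": "TaxId"}
    plain_accepts_attributes = True
except AttributeError:
    plain_accepts_attributes = False

# An instance of the subclass still behaves like an int, but accepts attributes.
wrapped = IntegerElement(9606)
wrapped.attributes = {"Name": "TaxId"}

# A -1 sentinel is also indistinguishable from a genuine value of -1.
sentinel_collides = IntegerElement(-1) == -1
```

Under these assumptions, a bare -1 both loses the attributes and cannot be told apart from a real -1 in the data.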

Can you revert this change?

--Michiel


From chapmanb at 50mail.com  Wed Oct  7 12:32:27 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 7 Oct 2009 08:32:27 -0400
Subject: [Biopython] Skipping over blank/erroneous
	Entrez.esummary()	results
In-Reply-To: <826538.32828.qm@web62406.mail.re1.yahoo.com>
References: <20091007111737.GC84267@sobchak.mgh.harvard.edu>
	<826538.32828.qm@web62406.mail.re1.yahoo.com>
Message-ID: <20091007123227.GD84267@sobchak.mgh.harvard.edu>

Peter and Michiel;

> > In addition to Michiel's workaround, I checked in a small
> > change which could at least circumvent the error you are
> > reporting:
> > 
> > http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279

Peter:
> Does "correctly" mean a default value? I see Brad has just committed a
> change to use -1 in this case, but perhaps None is also a good choice?
> Alternatively, can we leave this bit of the data structure empty?

Michiel:
> Sorry, but that change introduces two bugs. First, we should be able
> to distinguish between -1 and missing values. More importantly, we
> want to be able to add attributes to value. Since -1 is an integer
> instead of an object, it won't allow that.
>
> Can you revert this change?

Thanks guys -- not the best choice. How do you feel about just passing
it along as an empty string and only doing the integer conversion if we
actually have data to convert?

http://github.com/biopython/biopython/commit/1fff8038e4fa9e2643851a70118e3227ccbea44e

So now missing values are empty strings, as passed, instead of any
sort of integer interpretation of them.
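
The idea can be sketched roughly as follows, in modern Python 3. This is not the code from the commit: convert_integer is a hypothetical helper name, and the whitespace stripping is an extra assumption of the sketch.

```python
def convert_integer(text):
    """Return int(text) when there is data; otherwise pass the empty string through."""
    text = text.strip()
    if text:
        return int(text)
    return text  # an empty string marks a missing value


# Two populated values around one missing value.
values = [convert_integer(t) for t in ["1234", "", " 9970 "]]
```

Callers then test for the empty string to detect a missing value, rather than comparing against a numeric sentinel.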

Brad


From harekrishna at gmail.com  Wed Oct  7 20:11:03 2009
From: harekrishna at gmail.com (Austin Davis-Richardson)
Date: Wed, 7 Oct 2009 16:11:03 -0400
Subject: [Biopython] Biopython Digest, Vol 82, Issue 3
In-Reply-To: 
References: 
Message-ID: 

I'm confused now.  In the latest version

http://github.com/biopython/biopython/commit/1fff8038e4fa9e2643851a70118e3227ccbea44e

missing values are empty strings, so if I did something like

record = Entrez.read(handle)

myList = []
for item in record:
    myList.append(item['TaxId'])

myList should be something like:
[ '1234', '2434', '', '9970' ]
where myList[2] is the result of a missing value.

However, when I run my script, I find no blank entries despite knowing
that there are some that should have missing values.
This screws things up later when I zip the TaxIDs with their
corresponding accession numbers:

zip(accessions, taxids)

I'm all for using '1' (root) or '-1' for missing values.
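
A minimal sketch of one way to keep the two lists aligned, in modern Python 3. It uses plain dicts as stand-ins for the parsed summaries instead of querying NCBI; collect_taxids, the MISSING placeholder and the accession strings are all hypothetical. It only covers the case where a summary comes back with an empty or absent TaxId field, not the case where a whole summary is missing from the result.

```python
MISSING = None  # hypothetical placeholder for "no TaxId reported"


def collect_taxids(summaries, missing=MISSING):
    """Return one entry per summary, substituting a placeholder for empty TaxIds."""
    taxids = []
    for item in summaries:
        value = item.get("TaxId", "")
        taxids.append(missing if value == "" else value)
    return taxids


# Stand-in data shaped like the example list in the message above.
summaries = [{"TaxId": "1234"}, {"TaxId": "2434"}, {"TaxId": ""}, {"TaxId": "9970"}]
accessions = ["ACC_1", "ACC_2", "ACC_3", "ACC_4"]  # made-up accession labels

taxids = collect_taxids(summaries)
if len(taxids) != len(accessions):
    raise ValueError("summaries and accessions are out of step")
pairs = list(zip(accessions, taxids))
```

If entire summaries can be dropped from a batch, pairing by position is no longer safe, and matching each summary back to its query by an identifier field would be the more robust choice.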





-- 
AGDR



From chapmanb at 50mail.com  Wed Oct  7 20:29:11 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 7 Oct 2009 16:29:11 -0400
Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary()
In-Reply-To: 
References: 
	
Message-ID: <20091007202911.GI92415@sobchak.mgh.harvard.edu>

Hi Austin;
That is strange. That change may have unintended consequences
downstream. Could you send along a GI number that is causing
problems? If you revert that change and run the code printing out GI
numbers at each step, let me know the specific ones that are leading
to the initial error.

Once we have something reproducible to work with, we should be able
to track it down and provide a fix.
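
One way to narrow a failing batch down to specific GI numbers is to bisect it. The following is only a sketch in modern Python 3: find_failing_ids and fake_fetch are hypothetical names, and fake_fetch simulates the ValueError from the traceback instead of contacting NCBI.

```python
def find_failing_ids(ids, fetch):
    """Return the IDs that make fetch() raise ValueError, found by bisecting the batch."""
    if not ids:
        return []
    try:
        fetch(ids)
        return []  # the whole batch parsed cleanly
    except ValueError:
        if len(ids) == 1:
            return list(ids)
        mid = len(ids) // 2
        return find_failing_ids(ids[:mid], fetch) + find_failing_ids(ids[mid:], fetch)


def fake_fetch(batch):
    """Stand-in for Entrez.read(Entrez.esummary(db="nucleotide", id=",".join(batch)))."""
    for gi in batch:
        if gi in {"3", "7"}:  # pretend these two records have an empty integer field
            raise ValueError("invalid literal for int() with base 10: ''")


bad = find_failing_ids([str(n) for n in range(1, 11)], fake_fetch)
```

With a real fetch function every call is a separate request to NCBI, so this is best run on a single known-bad batch of 20 rather than on the full list.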

Thanks,
Brad

> I'm confused now.  In the latest version
> 
> http://github.com/biopython/biopython/commit/1fff8038e4fa9e2643851a70118e3227ccbea44e
> 
> Missing values are empty strings so if I did something like
> 
> record = Entrez.read(handle)
> 
> for item in record:
>     myList.append(item['TaxId'])
> 
> myList should be something like:
> [ '1234', '2434', '', '9970' ]
> where myList[2] is the result of a missing value
> 
> However, when I run my script, I find no blank entries despite knowing
> that there are some that should have missing values.
> This screws things up later when I zip the TaxIDs with their
> corresponding accession numbers:
> 
> zip (accessions, taxids)
> 
> I'm all for using '1' (root) or '-1' for missing values.
> 
> 
> 2009/10/7  :
> > Send Biopython mailing list submissions to
> > ? ? ? ?biopython at lists.open-bio.org
> >
> > To subscribe or unsubscribe via the World Wide Web, visit
> > ? ? ? ?http://lists.open-bio.org/mailman/listinfo/biopython
> > or, via email, send a message with subject or body 'help' to
> > ? ? ? ?biopython-request at lists.open-bio.org
> >
> > You can reach the person managing the list at
> > ? ? ? ?biopython-owner at lists.open-bio.org
> >
> > When replying, please edit your Subject line so it is more specific
> > than "Re: Contents of Biopython digest..."
> >
> >
> > Today's Topics:
> >
> > ? 1. Skipping over blank/erroneous Entrez.esummary() results
> > ? ? ?(Austin Davis-Richardson)
> > ? 2. Re: Skipping over blank/erroneous Entrez.esummary() ? ? ? results
> > ? ? ?(Michiel de Hoon)
> > ? 3. Re: Combine nexus files but not concatenating them (Peter)
> > ? 4. Re: Skipping over blank/erroneous Entrez.esummary() ? ? ? results
> > ? ? ?(Peter)
> > ? 5. Re: Skipping over blank/erroneous Entrez.esummary() ? ? ? results
> > ? ? ?(Brad Chapman)
> > ? 6. Re: Skipping over blank/erroneous Entrez.esummary() ? ? ? results
> > ? ? ?(Michiel de Hoon)
> > ? 7. Re: Skipping over blank/erroneous Entrez.esummary() ? ? ? results
> > ? ? ?(Brad Chapman)
> >
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Tue, 6 Oct 2009 17:07:52 -0400
> > From: Austin Davis-Richardson 
> > Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary()
> > ? ? ? ?results
> > To: biopython at lists.open-bio.org
> > Message-ID:
> > ? ? ? ?
> > Content-Type: text/plain; charset=ISO-8859-1
> >
> > Howdy,
> >
> > I'm using BioPython to generate a table of accession numbers and their
> > corresponding TaxIDs. ?The fastest way I can do this is 20 at a time
> > (20 per 3 seconds rather than 1 per 3 seconds).
> >
> > However, this results in a problem.
> >
> > whenever my script receives a result from NCBI that is blank such as
> > there being no value for TaxID, BioPython crashes with the error:
> >
> > ?File "taxcollector3.py", line 39, in getTaxID
> > ? ?record = Entrez.read(handle)
> > ?File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py",
> > line 259, in read
> > ? ?record = handler.run(handle)
> > ?File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
> > line 90, in run
> > ? ?self.parser.ParseFile(handle)
> > ?File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
> > line 191, in endElement
> > ? ?value = IntegerElement(value)
> > ValueError: invalid literal for int() with base 10: ''
> >
> >
> > my code looks like this: ?Where gids is a string of comma-separated GIDs
> > (I get the GIDs from the accession numbers using
> > eEntrez.esearch(db="nucleotide", rettype="text", term=accessions))
> >
> > ? ? ? ? ? ? ? ? ? ? ? ?handle = Entrez.esummary(db="nucleotide", id=gids)
> > ? ? ? ? ? ? ? ? ? ? ? ?record = Entrez.read(handle)
> >
> >
> > The only solution I can come up with is searching one at a time, but
> > this is very slow. ?(I have about 300,000 accession numbers)
> >
> > Does anyone know perhaps a patch or a solution for this? ?Or maybe an
> > easier way to get a TaxID from an accession number?
> >
> > Thanks,
> > Austin Davis-Richardson
> >
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Tue, 6 Oct 2009 19:11:36 -0700 (PDT)
> > From: Michiel de Hoon 
> > Subject: Re: [Biopython] Skipping over blank/erroneous
> > ? ? ? ?Entrez.esummary() ? ? ? results
> > To: biopython at lists.open-bio.org, ? ? ? Austin Davis-Richardson
> > ? ? ? ?
> > Message-ID: <362834.37683.qm at web62401.mail.re1.yahoo.com>
> > Content-Type: text/plain; charset=iso-8859-1
> >
> > You could try the following (with biopython 1.52):
> >
> > handle = Entrez.esummary(db="nucleotide", id=gids)
> > records = Entrez.parse(handle)
> > while True:
> > ? ?try:
> > ? ? ? ?record = records.next()
> > ? ?except StopIteration:
> > ? ? ? ?break
> > ? ?except:
> > ? ? ? ?print "Skipping record"
> >
> >
> > We should probably modify Bio.Entrez so that empty "integer" values are treated correctly.
> >
> >
> > --Michiel.
> >
> > --- On Tue, 10/6/09, Austin Davis-Richardson  wrote:
> >
> >> From: Austin Davis-Richardson 
> >> Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary() results
> >> To: biopython at lists.open-bio.org
> >> Date: Tuesday, October 6, 2009, 5:07 PM
> >> Howdy,
> >>
> >> I'm using BioPython to generate a table of accession
> >> numbers and their
> >> corresponding TaxIDs.? The fastest way I can do this
> >> is 20 at a time
> >> (20 per 3 seconds rather than 1 per 3 seconds).
> >>
> >> However, this results in a problem.
> >>
> >> whenever my script receives a result from NCBI that is
> >> blank such as
> >> there being no value for TaxID, BioPython crashes with the
> >> error:
> >>
> >> ? File "taxcollector3.py", line 39, in getTaxID
> >> ? ? record = Entrez.read(handle)
> >> ? File
> >> "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py",
> >> line 259, in read
> >> ? ? record = handler.run(handle)
> >> ? File
> >> "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
> >> line 90, in run
> >> ? ? self.parser.ParseFile(handle)
> >> ? File
> >> "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
> >> line 191, in endElement
> >> ? ? value = IntegerElement(value)
> >> ValueError: invalid literal for int() with base 10: ''
> >>
> >>
> >> my code looks like this:? Where gids is a string of
> >> comma-separated GIDs
> >> (I get the GIDs from the accession numbers using
> >> eEntrez.esearch(db="nucleotide", rettype="text",
> >> term=accessions))
> >>
> >> ??? ??? ???
> >> handle = Entrez.esummary(db="nucleotide", id=gids)
> >> ??? ??? ???
> >> record = Entrez.read(handle)
> >>
> >>
> >> The only solution I can come up with is searching one at a
> >> time, but
> >> this is very slow.? (I have about 300,000 accession
> >> numbers)
> >>
> >> Does anyone know perhaps a patch or a solution for
> >> this?? Or maybe an
> >> easier way to get a TaxID from an accession number?
> >>
> >> Thanks,
> >> Austin Davis-Richardson
> >> _______________________________________________
> >> Biopython mailing list  -  Biopython at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biopython
> >>
> >
> >
> >
> >
> >
> >
> > ------------------------------
> >
> > Message: 3
> > Date: Wed, 7 Oct 2009 10:29:36 +0100
> > From: Peter 
> > Subject: Re: [Biopython] Combine nexus files but not concatenating them
> > To: Denzel Li 
> > Cc: Biopython Mailing List 
> > Message-ID:
> >        <320fb6e00910070229n1b78542dj82998de13cf7eed7 at mail.gmail.com>
> > Content-Type: text/plain; charset=ISO-8859-1
> >
> > On Wed, Oct 7, 2009 at 4:22 AM, Denzel Li  wrote:
> >> Hi Peter:
> >> Thank you for the help. Both functions work well. By the way, will
> >> "standard" datatype or "mixed" datatype be supported in Bio:Nexus:Nexus?
> >>
> >> Best,
> >> Denzel
> >
> > Hi Denzel,
> >
> > I CC'd the list - please try to keep replies sent there.
> >
> > I'm glad Bio.Nexus is working well for you.
> >
> > Regarding the finer details of the NEXUS file format and the Biopython
> > code, I am not an expert - we need Frank or Cymon to comment. If
> > you could give us a couple of examples of what you are asking for it
> > would probably be much clearer (to me at least).
> >
> > Regards,
> >
> > Peter
> >
> >
> > ------------------------------
> >
> > Message: 4
> > Date: Wed, 7 Oct 2009 12:17:30 +0100
> > From: Peter 
> > Subject: Re: [Biopython] Skipping over blank/erroneous
> >        Entrez.esummary() results
> > To: Michiel de Hoon 
> > Cc: biopython at lists.open-bio.org, Austin Davis-Richardson
> > Message-ID:
> >        <320fb6e00910070417w26236a62ifece2e2610256609 at mail.gmail.com>
> > Content-Type: text/plain; charset=ISO-8859-1
> >
> > On Wed, Oct 7, 2009 at 3:11 AM, Michiel de Hoon  wrote:
> >>
> >> We should probably modify Bio.Entrez so that empty "integer" values are treated correctly.
> >>
> >
> > Does "correctly" mean a default value? I see Brad has just committed a
> > change to use -1 in this case, but perhaps None is also a good choice?
> > Can we alternatively leave this bit of the data structure empty?
> >
> > Peter
> >
> >
> > ------------------------------
> >
> > Message: 5
> > Date: Wed, 7 Oct 2009 07:17:37 -0400
> > From: Brad Chapman 
> > Subject: Re: [Biopython] Skipping over blank/erroneous
> >        Entrez.esummary() results
> > To: Austin Davis-Richardson 
> > Cc: biopython at lists.open-bio.org
> > Message-ID: <20091007111737.GC84267 at sobchak.mgh.harvard.edu>
> > Content-Type: text/plain; charset=us-ascii
> >
> > Hi Austin;
> >
> >> I'm using BioPython to generate a table of accession numbers and their
> >> corresponding TaxIDs. The fastest way I can do this is 20 at a time
> >> (20 per 3 seconds rather than 1 per 3 seconds).
> >>
> >> However, this results in a problem.
> >>
> >> whenever my script receives a result from NCBI that is blank such as
> >> there being no value for TaxID, BioPython crashes with the error:
> >>
> >>   File "taxcollector3.py", line 39, in getTaxID
> >>     record = Entrez.read(handle)
> >>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py", line 259, in read
> >>     record = handler.run(handle)
> >>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", line 90, in run
> >>     self.parser.ParseFile(handle)
> >>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", line 191, in endElement
> >>     value = IntegerElement(value)
> >> ValueError: invalid literal for int() with base 10: ''
> >
> > In addition to Michiel's workaround, I checked in a small change
> > which could at least circumvent the error you are reporting:
> >
> > http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279
> >
> > It affects only one file, so if you don't want to pull the latest
> > from GitHub, you can download just that file and replace it in your
> > Biopython library:
> >
> > http://github.com/biopython/biopython/blob/master/Bio/Entrez/Parser.py
> >
> > Ideally, we should have a test case to cover this. Could you let us
> > know specific GIs that are causing the problem? The group of 20 is
> > fine if you haven't narrowed it further than that. This'll also help
> > us check if there are any other problems with these records.
> >
> > Thanks for reporting this,
> > Brad
> >
> >
> > ------------------------------
> >
> > Message: 6
> > Date: Wed, 7 Oct 2009 05:19:01 -0700 (PDT)
> > From: Michiel de Hoon 
> > Subject: Re: [Biopython] Skipping over blank/erroneous
> >        Entrez.esummary() results
> > To: Austin Davis-Richardson, Brad Chapman
> > Cc: biopython at lists.open-bio.org
> > Message-ID: <826538.32828.qm at web62406.mail.re1.yahoo.com>
> > Content-Type: text/plain; charset=iso-8859-1
> >
> >> In addition to Michiel's workaround, I checked in a small
> >> change
> >> which could at least circumvent the error you are
> >> reporting:
> >>
> >> http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279
> >
> > Sorry, but that change introduces two bugs. First, we should be able to distinguish between -1 and missing values. More importantly, we want to be able to add attributes to value. Since -1 is an integer instead of an object, it won't allow that.
> >
> > Can you revert this change?
> >
> > --Michiel
> >
> > ------------------------------
> >
> > Message: 7
> > Date: Wed, 7 Oct 2009 08:32:27 -0400
> > From: Brad Chapman 
> > Subject: Re: [Biopython] Skipping over blank/erroneous
> >        Entrez.esummary() results
> > To: Michiel de Hoon 
> > Cc: Austin Davis-Richardson, biopython at lists.open-bio.org
> > Message-ID: <20091007123227.GD84267 at sobchak.mgh.harvard.edu>
> > Content-Type: text/plain; charset=us-ascii
> >
> > Peter and Michiel;
> >
> >> > In addition to Michiel's workaround, I checked in a small
> >> > change which could at least circumvent the error you are
> >> > reporting:
> >> >
> >> > http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279
> >
> > Peter:
> >> Does "correctly" mean a default value? I see Brad has just committed a
> >> change to use -1 in this case, but perhaps None is also a good choice?
> >> Can we alternatively leave this bit of the data structure empty?
> >
> > Michiel:
> >> Sorry, but that change introduces two bugs. First, we should be able
> >> to distinguish between -1 and missing values. More importantly, we
> >> want to be able to add attributes to value. Since -1 is an integer
> >> instead of an object, it won't allow that.
> >>
> >> Can you revert this change?
> >
> > Thanks guys -- not the best choice. How do you feel about just passing
> > it along as an empty string and only doing the integer conversion if we
> > actually have data to convert?
> >
> > http://github.com/biopython/biopython/commit/1fff8038e4fa9e2643851a70118e3227ccbea44e
> >
> > So now missing values are empty strings, as passed, instead of any
> > sort of integer interpretation of them.
> >
> > Brad
> >
> >
> > ------------------------------
> >
> > _______________________________________________
> > Biopython mailing list  -  Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
> >
> >
> > End of Biopython Digest, Vol 82, Issue 3
> > ****************************************
> >
> 
> 
> 
> -- 
> AGDR
> 
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
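
The last fix Brad describes in the quoted digest (only convert to an
integer when there is data, otherwise pass the empty string along) can be
sketched in a few lines. This is an illustration only, not the actual
Bio.Entrez parser code; the function name parse_integer_field is made up
for the example.

```python
def parse_integer_field(text):
    """Convert an integer field, leaving missing values as ''.

    Hypothetical helper: it mimics the idea discussed in the thread,
    not the real Bio.Entrez implementation.
    """
    text = text.strip()
    if not text:
        # No data to convert: return an empty string instead of
        # calling int(''), which would raise ValueError.
        return ""
    return int(text)


# Three made-up field values: a normal ID, a blank, and a padded ID.
fields = ["9606", "", " 562 "]
converted = [parse_integer_field(f) for f in fields]
print(converted)
```

A caller can then tell a missing value ('') apart from any real integer,
which was the objection raised against using -1 as a default.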


From denzel.dz.li at gmail.com  Wed Oct  7 23:23:17 2009
From: denzel.dz.li at gmail.com (Denzel Li)
Date: Wed, 7 Oct 2009 19:23:17 -0400
Subject: [Biopython] Combine nexus files but not concatenating them
In-Reply-To: <320fb6e00910070229n1b78542dj82998de13cf7eed7@mail.gmail.com>
References: 
	<320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com>
	
	<320fb6e00910051331u2f27c7ecmde744d29fb3ba2ab@mail.gmail.com>
	
	<320fb6e00910070229n1b78542dj82998de13cf7eed7@mail.gmail.com>
Message-ID: 

Hi Peter:
Regarding the Nexus datatypes supported in Bio:Nexus:Nexus, I mean a nexus
file like the following, where the datatype is a mix of "standard" and "DNA".
According to the function Bio:Nexus:Nexus._format (line 696), these
datatypes are not supported yet. I am just wondering whether the team plans
to support these data types.
------------
# Nexus
Begin data;
    Dimensions ntax=2 nchar=1000;
    Format datatype=mixed(Standard:1-5,DNA:6-1000) interleave=yes gap=- missing=?;
    Matrix
[morphology]
s1 10010
s2  20011
s3  20010
s4  10020
[Gene 1]
s1 ACGT
s2 AAGT
s3 ACGA
s4 ACGT
...
; end;
---------------

Best,
Denzel

On Wed, Oct 7, 2009 at 5:29 AM, Peter wrote:

> On Wed, Oct 7, 2009 at 4:22 AM, Denzel Li  wrote:
> > Hi Peter:
> > Thank you for the help. Both functions work well. By the way, will
> > "standard" datatype or "mixed" datatype be supported in Bio:Nexus:Nexus?
> >
> > Best,
> > Denzel
>
> Hi Denzel,
>
> I CC'd the list - please try to keep replies sent there.
>
> I'm glad Bio.Nexus is working well for you.
>
> Regarding the finer details of the NEXUS file format and the Biopython
> code, I am not an expert - we need Frank or Cymon to comment. If
> you could give us a couple of examples of what you are asking for it
> would probably be much clearer (to me at least).
>
> Regards,
>
> Peter
>


From biopython at maubp.freeserve.co.uk  Thu Oct  8 08:54:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 8 Oct 2009 09:54:39 +0100
Subject: [Biopython] Combine nexus files but not concatenating them
In-Reply-To: 
References: 
	<320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com>
	
	<320fb6e00910051331u2f27c7ecmde744d29fb3ba2ab@mail.gmail.com>
	
	<320fb6e00910070229n1b78542dj82998de13cf7eed7@mail.gmail.com>
	
Message-ID: <320fb6e00910080154m2e75a38eh7e120e807103773b@mail.gmail.com>

On Thu, Oct 8, 2009 at 12:23 AM, Denzel Li  wrote:
> Hi Peter:
> Regarding the Nexus datatypes supported in Bio:Nexus:Nexus, I mean a nexus
> file like the following, where the datatype is a mix of "standard" and "DNA".
> According to the function Bio:Nexus:Nexus._format (line 696), these
> datatypes are not supported yet. I am just wondering whether the team plans
> to support these data types.

Oh right - in your example, the digits encode morphology, but they could
also be phenotypes, or some other characteristic like gene copy number.

As to Bio.Nexus supporting this, hopefully Frank or Cymon can comment.

If Bio.Nexus did support this, then from the Bio.AlignIO point of view, with
the current object structure we'd have to use a sequence object (holding
both the digits, and the DNA) for the sequence strings (e.g. for s1 in your
example, Seq("10010ACGT")) with a generic single letter alphabet. This
would lose the fact that the first five characters are digits, but the rest are
DNA. This isn't ideal, and would probably cause trouble for Nexus output
(writing such alignments).

Would you want to try and deal with such "mixed" alignments via the
Bio.AlignIO interface?

Peter
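
To make the point about losing the partition boundaries concrete, here is
a rough sketch of how the ranges in a "mixed" format string could be kept
alongside a combined sequence string. It is plain Python and does not use
Bio.Nexus or Bio.AlignIO; the helper names parse_mixed_partitions and
split_row are invented for this example.

```python
import re


def parse_mixed_partitions(datatype):
    """Return [(name, start, end)] from 'mixed(Standard:1-5,DNA:6-1000)'.

    Positions are 1-based and inclusive, as written in the NEXUS file.
    """
    match = re.fullmatch(r"mixed\((.*)\)", datatype.strip(), re.IGNORECASE)
    if match is None:
        raise ValueError("not a mixed datatype: %r" % datatype)
    partitions = []
    for part in match.group(1).split(","):
        name, span = part.split(":")
        start, end = span.split("-")
        partitions.append((name.strip(), int(start), int(end)))
    return partitions


def split_row(sequence, partitions):
    """Slice one combined matrix row into {partition name: characters}."""
    return {name: sequence[start - 1:end] for name, start, end in partitions}


partitions = parse_mixed_partitions("mixed(Standard:1-5,DNA:6-1000)")
# "10010ACGT" is the combined string for s1 from the example above.
row = split_row("10010ACGT", partitions)
print(row)
```

Keeping the (name, start, end) list next to the combined string is one way
to remember which columns are digits and which are DNA; whether that fits
the existing alignment objects is a separate question.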


From ibdeno at gmail.com  Mon Oct 12 08:11:38 2009
From: ibdeno at gmail.com (Miguel Ortiz Lombardia)
Date: Mon, 12 Oct 2009 10:11:38 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
Message-ID: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>

Dear list members,

I have a problem with NCBIStandalone.PSIBlastParser, which I need to
use instead of NCBIXML since the latter lacks some record
properties that I need.

My code worked until recently (about three months ago), and now it
seems something has changed in the latest Biopython (1.52-1, installed
on an Intel OSX 10.5.8 via fink). I get the same problem
irrespective of whether I use Python 2.5 or 2.6.

Here follows the relevant part of the code:

####

     blast_out, error_info = NCBIStandalone.blastpgp(
         blastcmd='/usr/local/blast-2.2.18/bin/blastpgp',
         database='/opt/BlastDBs/' + db,
         infile=file,
         npasses=passes,
         program='blastpgp',
         descriptions='500',
         alignments='1000',
         align_view='0',
         matrix_outfile=outbase + '.' + db + '.' + str(passes) + '.pssm')

     b_parser = NCBIStandalone.PSIBlastParser()

     b_record = b_parser.parse(blast_out)

####

And this is the error that I now get:

####

   File "/Users/mol/bin/lpbl.py", line 64, in doblast
     b_record = b_parser.parse(blast_out)
   File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 777, in parse
     self._scanner.feed(handle, self._consumer)
   File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 97, in feed
     self._scan_rounds(uhandle, consumer)
   File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 234, in _scan_rounds
     self._scan_alignments(uhandle, consumer)
   File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 376, in _scan_alignments
     self._scan_pairwise_alignments(uhandle, consumer)
   File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 386, in _scan_pairwise_alignments
     self._scan_one_pairwise_alignment(uhandle, consumer)
   File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 398, in _scan_one_pairwise_alignment
     self._scan_hsp(uhandle, consumer)
   File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 433, in _scan_hsp
     self._scan_hsp_alignment(uhandle, consumer)
   File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 464, in _scan_hsp_alignment
     read_and_call(uhandle, consumer.query, start='Query')
   File "/sw/lib/python2.6/site-packages/Bio/ParserSupport.py", line 303, in read_and_call
     method(line)
   File "/sw/lib/python2.6/site-packages/Bio/Blast/NCBIStandalone.py", line 1138, in query
     raise ValueError("I could not find the query in line\n%s" % line)
ValueError: I could not find the query in line
Query: 0    -

####

Now, the interesting thing is that if I run blastpgp directly and  
catch the output to a file, this file never includes such a line as:

Query: 0    -

Actually, if I modify my code so it reads this output file, the  
PSIBlastParser processes it without error.

I have found that something may have changed in NCBIStandalone  
recently, namely, this bit:

     _query_re = re.compile(r"Query(:?) \s*(\d+)\s*(.+) (\d+)")
     def query(self, line):
         m = self._query_re.search(line)
         if m is None:
             raise ValueError("I could not find the query in line\n%s" % line)

Does anyone have a clue?

Thank you!


-- Miguel
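
For reference, the pattern quoted above can be tried on its own to see why
the odd line is rejected. The sketch below uses only the re module; the
"good" alignment line is made up for illustration, while the "bad" line is
the one from the traceback.

```python
import re

# Pattern as quoted from NCBIStandalone in the message above.
query_re = re.compile(r"Query(:?) \s*(\d+)\s*(.+) (\d+)")

good_line = "Query: 1    MKVLAAGIVG 10"  # made-up, well-formed line
bad_line = "Query: 0    -"               # line from the traceback

good = query_re.search(good_line)
bad = query_re.search(bad_line)

print(good.groups())  # start, sequence and end are all captured
print(bad)            # no match, so the parser raises ValueError
```

The pattern needs a start number, some sequence text, a space, and an end
number. The line from the traceback stops after the "-", so there is no
end coordinate and search() returns None.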



From biopython at maubp.freeserve.co.uk  Mon Oct 12 09:19:33 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 12 Oct 2009 10:19:33 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
Message-ID: <320fb6e00910120219g46a85467ia9fe30131380d932@mail.gmail.com>

On Mon, Oct 12, 2009 at 9:11 AM, Miguel Ortiz Lombardia
 wrote:
> Dear list members,
>
> I have a problem with NCBIStandalone.PSIBlastParser, which I need to use
> instead of NCBIXML since the latter one lacks some record properties that I
> need.
>
> My code used to work until recently (say three months) and now it seems
> something has changed in the latest biopython (1.52-1, I install it on an
> intel OSX 10.5.8 via fink). I get the same problem irrespectively of whether
> I use python 2.5 or 2.6.

You definitely didn't upgrade your copy of BLAST at the same time?

Could you file a bug please. Then run PSI-BLAST "by hand" and
record the plain text output to a file, and upload the file to Bugzilla.
Note you have to file the bug before it will let you upload a file. Having
the XML output could be helpful too.

http://bugzilla.open-bio.org/enter_bug.cgi?product=Biopython

Also, can we use your BLAST file as a unit test?

Thanks

Peter


From krother at rubor.de  Mon Oct 12 11:44:10 2009
From: krother at rubor.de (Kristian Rother)
Date: Mon, 12 Oct 2009 13:44:10 +0200
Subject: [Biopython] RuPy 2009 Bioinformatics Satellite 6.11. in Poznan,
	Poland
Message-ID: <1c64f5fbb09ada1aae8207d5c7d737a8-EhVcX1xCQgFaRwICBxEAXR0wfgFLV15YQUBGAEFfUC9ZUFgWXVpyH1RXX0FaLkQAWl9aRF9dXQg=-webmailer2@server01.webmailer.hosteurope.de>


Hi,

As some of you may know, on November 7th-8th this year the RuPy
(Ruby/Python) conference is taking place in Poznan, Poland.
--> see: http://rupy.eu

I am happy to announce that we will have a small satellite meeting to the
RuPy conference dedicated to structural bioinformatics.
Please feel invited to join - everybody is welcome.

Date: November 6th
Time: 13:00
Place: Collegium Biologicum - right next to the main conference
Room: 1.126 (1st floor at the very end of the building)

Tentative programme:
- Lightning talks (enrolment on-site)
- Code gallery
- Space for hands-on work on modules of interest, e.g.:
  * Bio.PDB
  * Bio.RNA
  * django.*
  * moderna.*
  * ...

Total duration: 3-4 hours.


Best regards,
    Kristian Rother

    Laboratory of structural bioinformatics, UAM
    http://bioinformatics.amu.edu.pl/index_.html



From biopython at maubp.freeserve.co.uk  Tue Oct 13 11:10:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 13 Oct 2009 12:10:06 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
Message-ID: <320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com>

On Mon, Oct 12, 2009 at 9:11 AM, Miguel Ortiz Lombardia
 wrote:
> Dear list members,
>
> I have a problem with NCBIStandalone.PSIBlastParser, which I need to use
> instead of NCBIXML since the latter one lacks some record properties that I
> need.
>
> My code used to work until recently (say three months) and now it seems
> something has changed in the latest biopython (1.52-1, I install it on an
> intel OSX 10.5.8 via fink). I get the same problem irrespectively of whether
> I use python 2.5 or 2.6.

Thanks for filing the bug, and supplying the example output files.
http://bugzilla.open-bio.org/show_bug.cgi?id=2927

Do you remember what version of Biopython you used to be running
before updating to 1.52? This would help to narrow down the change
triggering this problem.

In the meantime, I have tried parsing your sample output, and it seems fine:

from Bio.Blast.NCBIStandalone import PSIBlastParser
b_parser = PSIBlastParser()
handle = open("Q3V4Q0.psiblast.txt")
b_record = b_parser.parse(handle)
handle.close()
for b_round in b_record.rounds :
    print "Round %i has %i alignments" \
          % (b_round.number, len(b_round.alignments))


Gives:

Round 1 has 385 alignments
Round 2 has 1000 alignments
Round 3 has 1000 alignments
Round 4 has 1000 alignments
Round 5 has 1000 alignments

So, if the file parser is fine, then my guess is this is something to do with
how we are running PSI-BLAST via NCBIStandalone.blastpgp - and this
code has changed in recent releases. It used to use the python function
os.popen3 but this was deprecated in Python 2.6 and we now use the
subprocess library.

It is also possible that the command line options you used when running
BLAST by hand to supply me the example output differed from what
was used in your Python script.

What exactly did you type at the command line to make the example
output you sent me? I'd like to double check the Python code is using
the same thing...

Peter


From biopython at maubp.freeserve.co.uk  Tue Oct 13 11:41:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 13 Oct 2009 12:41:58 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: 
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com>
	
Message-ID: <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>

Don't forget to CC the mailing list ;)

On Tue, Oct 13, 2009 at 12:22 PM, Miguel Ortiz Lombardia
 wrote:
>
>
> On 13 Oct 2009 at 13:10, Peter wrote:
>
>> On Mon, Oct 12, 2009 at 9:11 AM, Miguel Ortiz Lombardia
>>  wrote:
>>>
>>> Dear list members,
>>>
>>> I have a problem with NCBIStandalone.PSIBlastParser, which I need to use
>>> instead of NCBIXML since the latter one lacks some record properties that
>>> I need.
>>>
>>> My code used to work until recently (say three months) and now it seems
>>> something has changed in the latest biopython (1.52-1, I install it on an
>>> intel OSX 10.5.8 via fink). I get the same problem irrespectively of
>>> whether I use python 2.5 or 2.6.
>>
>> Thanks for filing the bug, and supplying the example output files.
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>>
>> Do you remember what version of Biopython you used to be running
>> before updating to 1.52? This would help to narrow down the change
>> triggering this problem.
>>
>
> Sorry, can't tell it for sure, but it was whatever version was current in
> March 2009.

Probably Biopython 1.49 then. That may help.

>> In the mean time, I have tried parsing your sample output, and it seems
>> fine:
>>
>> from Bio.Blast.NCBIStandalone import PSIBlastParser
>> b_parser = PSIBlastParser()
>> handle = open("Q3V4Q0.psiblast.txt")
>> b_record = b_parser.parse(handle)
>> handle.close()
>> for b_round in b_record.rounds :
>>     print "Round %i has %i alignments" \
>>           % (b_round.number, len(b_round.alignments))
>>
>>
>> Gives:
>>
>> Round 1 has 385 alignments
>> Round 2 has 1000 alignments
>> Round 3 has 1000 alignments
>> Round 4 has 1000 alignments
>> Round 5 has 1000 alignments
>>
>
> Yes, that's also what I see with my code: text files can be parsed.

OK - good. So it doesn't look like a parser bug.

>> So, if the file parser is fine, then my guess is this is something to do
>> with
>> how we are running PSI-BLAST via NCBIStandalone.blastpgp - and this
>> code has changed in recent releases. It used to use the python function
>> os.popen3 but this was deprecated in Python 2.6 and we now use the
>> subprocess library.
>
> I think this is the most likely explanation.
>
>> It is also possible that the command line options you used when running
>> BLAST by hand to supply me the example output differed from what
>> was used in your Python script.
>
> I don't think so, I just used the same command line that was launched from
> the python script (got it from a 'ps' command)

Great :)

>> What exactly did you type at the command line to make the example
>> output you sent me? I'd like to double check the Python code is using
>> the same thing...
>
> For plain text output:
>
> /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/uniref100 -i Q3V4Q0.fasta -m 0 -v 500 -b 1000 -Q Q3V4Q0.uniref100.5.pssm -j 5 -p blastpgp > Q3V4Q0.psiblast.log
>
> For XML:
>
> /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/uniref100 -i Q3V4Q0.fasta -m 7 -v 500 -b 1000 -Q Q3V4Q0.uniref100.5.pssm -j 5 -p blastpgp > Q3V4Q0.psiblast.xml.log

Because you capture the stdout to a file (rather than using the -o option),
the output files should be identical to those obtained by the python script.

I would need to install the same BLAST database etc in order to try and
debug this on my own machine, which is a hassle. So I'll try and ask you
to test a few things instead.

Could you try changing this line:

blast_out, error_info = NCBIStandalone.blastpgp(...)

to this:

temp_handle, error_info = NCBIStandalone.blastpgp(...)
from StringIO import StringIO
blast_out = StringIO(temp_handle.read())
temp_handle.close()

This will try to read in all the BLAST output (all 5MB of it) into memory
as a string, and turn it into a StringIO handle which the parser should
accept.

You could also try explicitly saving to a file:

temp_handle, error_info = NCBIStandalone.blastpgp(...)
temp_file = open("temp.txt", "w")
temp_file.write(temp_handle.read())
temp_file.close()
temp_handle.close()
blast_out = open("temp.txt")

or, perhaps:

temp_handle, error_info = NCBIStandalone.blastpgp(...)
temp_file = open("temp.txt", "w")
for line in temp_handle : temp_file.write(line)
temp_file.close()
temp_handle.close()
blast_out = open("temp.txt")

It would not surprise me to see these fail as before, but having a
look at the temp.txt file could be very instructive (especially if it
contains that odd query line you mentioned earlier).

I know that the Python subprocess module can have problems with
deadlocks when dealing with large amounts of piped data. There are
ways to cope, but the simplest option is to tell BLAST to save the
data to a file (instead of stdout) with the -o command line option.
This avoids sending large amounts of data via the stdout pipe. I can
explain how to do this within Biopython if you like (this email is
already very long).

Peter
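
As a rough illustration of keeping large output away from the stdout pipe,
the sketch below sends a child process's stdout straight to a file and
reads the file afterwards. It is written for current Python and uses the
Python interpreter itself as a stand-in child process, which is purely an
assumption to keep the example runnable; a real script would run blastpgp
(or use its -o option) instead.

```python
import os
import subprocess
import sys
import tempfile

# Stand-in child process that prints a lot of text (assumption: a real
# script would launch blastpgp here).
child_cmd = [sys.executable, "-c",
             "print('Query: line\\n' * 200000, end='')"]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = os.path.join(tmp_dir, "temp.txt")
    # stdout goes directly to the file, so nothing large passes
    # through a pipe that the parent has to drain.
    with open(out_path, "w") as out_handle:
        return_code = subprocess.call(child_cmd, stdout=out_handle)
    # The finished file can now be opened and given to a parser.
    with open(out_path) as blast_out:
        line_count = sum(1 for _ in blast_out)

print(return_code, line_count)
```

The parent only opens the file once the child has exited, so there is no
point at which both sides wait on each other.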



From biopython at maubp.freeserve.co.uk  Tue Oct 13 11:46:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 13 Oct 2009 12:46:27 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com>
	
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
Message-ID: <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>

On Tue, Oct 13, 2009 at 12:41 PM, Peter  wrote:
>>>
>>> Do you remember what version of Biopython you used to be running
>>> before updating to 1.52? This would help to narrow down the change
>>> triggering this problem.
>>>
>>
>> Sorry, can't tell it for sure, but it was whatever version was current in
>> March 2009.
>
> Probably Biopython 1.49 then. That may help.
>

Hmm - the switch to using subprocess (on Python 2.4 or later) was made
in October 2008, and would have first appeared in Biopython 1.49. Maybe
you were using Biopython 1.48 before - or the issue is something else.

Peter


From ibdeno at gmail.com  Tue Oct 13 11:58:23 2009
From: ibdeno at gmail.com (Miguel Ortiz Lombardia)
Date: Tue, 13 Oct 2009 13:58:23 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com>
	
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
	<320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>
Message-ID: 

>>>>
>>>> Do you remember what version of Biopython you used to be running
>>>> before updating to 1.52? This would help to narrow down the change
>>>> triggering this problem.
>>>>
>>>
>>> Sorry, can't tell it for sure, but it was whatever version was current
>>> in March 2009.
>>
>> Probably Biopython 1.49 then. That may help.
>>
>
> Hmm - the switch to using subprocess (on Python 2.4 or later) was made
> in October 2008, and would have first appeared in Biopython 1.49. Maybe
> you were using Biopython 1.48 before - or the issue is something else.
>
> Peter


It may well have been 1.48... Having a closer look at the files from
my last successful runs I discover they actually come from November
2008...

I'm now running the tests you suggested.

Sorry not to have copied the list in the previous post!

Best,


-- Miguel





From biopython at maubp.freeserve.co.uk  Tue Oct 13 13:36:44 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 13 Oct 2009 14:36:44 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: 
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com>
	
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
	<320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>
	
Message-ID: <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>

On Tue, Oct 13, 2009 at 12:58 PM, Miguel Ortiz Lombardia
 wrote:
>>
>> Hmm - the switch to using subprocess (on Python 2.4+ or later) was made
>> in October 2008, and would have first appeared in Biopython 1.49. Maybe
>> you were using Biopython 1.48 before - or the issue is something else.
>>
>> Peter
>
>
> It may well have been 1.48... Having a closer look at the files from my last
> successful runs I discover they actually come from November 2008...
>
> I'm now running the tests you suggested.

Let me know what they show. How long do these BLAST runs take?
Perhaps I was ambitious with the number of suggestions to try ;)

Assuming the problem is with how we are calling the BLAST tool via the
subprocess module, I have two suggested fixes in mind. The first is a change
to the _invoke_blast() function in Bio/Blast/NCBIStandalone.py, essentially
replace these lines:

    blast_process.stdin.close()
    return blast_process.stdout, blast_process.stderr

With this:

    stdout, stderr = blast_process.communicate()
    from StringIO import StringIO
    return StringIO(stdout), StringIO(stderr)

We had to make a similar change to Bio.Clustalw for Bug 2804. This uses
subprocess to buffer the data in order to avoid any deadlock reading from
the handles. I hadn't made this change before as it imposes a memory
overhead (and BLAST output is often *very* large, especially as XML),
and until now there hadn't been any problems reported. It would be worth
trying in your situation (even just to confirm the source of the error), but
I don't think we should make this change for the official distribution.
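
As a minimal sketch of that buffering idea outside Biopython (the helper
name run_and_buffer is made up for illustration, and a tiny Python child
process stands in for blastpgp so the example runs anywhere):

```python
import subprocess
import sys
from io import StringIO


def run_and_buffer(cmd):
    """Run cmd, hold all of its output in memory, return file-like handles."""
    child = subprocess.Popen(
        cmd,
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    # communicate() drains stdout and stderr together, so the child can
    # never block on a full pipe while we are reading the other one.
    stdout, stderr = child.communicate()
    return StringIO(stdout), StringIO(stderr)


# Stand-in command (an assumption for the demo, not a real BLAST call).
demo_cmd = [
    sys.executable,
    "-c",
    "import sys; sys.stdout.write('hit\\n' * 3); sys.stderr.write('warning\\n')",
]
out_handle, err_handle = run_and_buffer(demo_cmd)
lines = out_handle.readlines()
errors = err_handle.read()
print(len(lines), errors.strip())
```

The price is exactly the memory overhead mentioned above: nothing is
returned until the child has finished and all of its output is in memory.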

The second option (which I mentioned before) is to tell blastpgp to write
its output directly to a file, and then parse the file. This is how I normally
run large BLAST jobs. This is possible but not elegant via the function
Bio.Blast.NCBIStandalone.blastpgp (which always returns stdout/stderr
handles). Bug 2654 has an example,
http://bugzilla.open-bio.org/show_bug.cgi?id=2654

However, what I want to recommend instead is to use the more flexible
Bio.Blast.Applications objects (in this case, the class
BlastpgpCommandline). I had planned to update the BLAST chapter
of the Biopython Tutorial to cover this, but it didn't happen in time for
the Biopython 1.52 release. However, the alignment chapter goes
through several examples of this style of command line tool wrapper,
and the BLAST wrappers work in exactly the same way.

Using these "lower level" application wrappers, it is up to you to invoke
subprocess (or another system call) as you see fit (e.g. with pipes).
This is more flexible than the old Bio.Blast.NCBIStandalone.blastpgp
function (and others like it) where the behaviour could not be set.
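
To make the "write to a file, then parse the file" idea concrete, here is
a rough sketch that uses only the subprocess module. It builds the command
by hand rather than through BlastpgpCommandline, so the helper names and
the file names are assumptions for illustration; -i, -d, -j and -o are the
usual blastpgp options for query, database, number of passes and output
file.

```python
import subprocess


def build_blastpgp_command(query, database, out_file, passes=3, exe="blastpgp"):
    """Return an argument list asking blastpgp to write its report to out_file."""
    return [exe, "-i", query, "-d", database, "-j", str(passes), "-o", out_file]


def run_blastpgp_to_file(query, database, out_file, passes=3):
    """Run blastpgp (it must be on the PATH) and return the report file name."""
    cmd = build_blastpgp_command(query, database, out_file, passes)
    return_code = subprocess.call(cmd)
    if return_code != 0:
        raise RuntimeError("blastpgp failed with return code %i" % return_code)
    # The caller can now open out_file and hand it to whichever parser
    # it was using before.
    return out_file


# Only build and show the command here; nothing is executed.
cmd = build_blastpgp_command("query.fasta", "nr", "temp.txt")
print(" ".join(cmd))
```

Because no pipes are involved, there is nothing to deadlock on, and the
report stays on disk if the parser needs to be re-run.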

Feel free to ask for clarification on this - questions now will help for
rewriting the BLAST chapter later on ;)

Regards,

Peter

P.S. See also http://docs.python.org/library/subprocess.html


From ibdeno at gmail.com  Tue Oct 13 13:57:13 2009
From: ibdeno at gmail.com (Miguel Ortiz Lombardia)
Date: Tue, 13 Oct 2009 15:57:13 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com>
	
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
	<320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>
	
	<320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>
Message-ID: <7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>

On 13 Oct 2009, at 15:36, Peter wrote:

> On Tue, Oct 13, 2009 at 12:58 PM, Miguel Ortiz Lombardia
>  wrote:
>>>
>>> Hmm - the switch to using subprocess (on Python 2.4+ or later) was  
>>> made
>>> in October 2008, and would have first appeared in Biopython 1.49.  
>>> Maybe
>>> you were using Biopython 1.48 before - or the issue is something  
>>> else.
>>>
>>> Peter
>>
>>
>> It may well have been 1.48... Having a closer look at the files  
>> from my last
>> successful runs I discover the actually come from November 2008...
>>
>> I'm now running the tests you suggested.
>
> Let me know what they show. How long do these BLAST runs take?
> Perhaps I was ambitious with the number of suggestions to try ;)

It took a long time, because I wanted to reproduce the same situation.
All three of the suggestions you made worked!
I have at least a work-around now.

>
> Assuming the problem is with how we are calling the BLAST tool via the
> subprocess module, I have two suggested fixes in mind. The first is  
> a change
> to the _invoke_blast() function in Bio/Blast/NCBIStandalone.py,  
> essentially
> replace these lines:
>
>    blast_process.stdin.close()
>    return blast_process.stdout, blast_process.stderr
>
> With this:
>
>    stdout, stderr = blast_process.communicate()
>    from StringIO import StringIO
>    return StringIO(stdout), StringIO(stderr)
>
> We had to make a similar change to Bio.Clustalw for Bug 2804. This  
> uses
> subprocess to buffer the data in order to avoid any deadlock reading  
> from
> the handles. I hadn't made this change before as it imposes a memory
> overhead (and BLAST output is often *very* large, especially as XML),
> and until now there hadn't been any problems reported. It would be  
> worth
> trying in your situation (even just to confirm the source of the  
> error), but
> I don't think we should make this change for the official  
> distribution.
>

You're right, probably not justified if I'm the only one with this  
problem.

> The second option (which I mentioned before) is to tell blastpgp to  
> write
> its output directly to a file, and then parse the file. This is how  
> I normally
> run large BLAST jobs. This is possible but not elegant via the  
> function
> Bio.Blast.NCBIStandalone.blastpgp (which always returns stdout/stderr
> handles). Bug 2654 has an example,
> http://bugzilla.open-bio.org/show_bug.cgi?id=2654
>
> However, what I want to recommend instead is to use the more flexible
> Bio.Blast.Applications objects instead (in this case, the class
> BlastpgpCommandline). I had planed to update the BLAST chapter
> of the Biopython Tutorial to cover this, but it didn't happen in  
> time for
> the Biopython 1.52 release. However, the alignment chapter goes
> through several examples of this style of command line tool wrapper,
> and the BLAST wrappers work in exactly the same way.
>
> Using these "lower level" application wrappers, it is up to you to  
> invoke
> subprocess (or another system call) as you see fit (e.g. with pipes).
> This is more flexible than the old Bio.Blast.NCBIStandalone.blastpgp
> function (and others like it) where the behaviour could not be set.

I will explore this possibility; it definitely seems more elegant than
the other one (as in Bug 2654).

>
> Feel free to ask for clarification on this - questions now will help  
> for
> rewriting the BLAST chapter later on ;)

I may come back with questions :-)

Thank you very much for your help!

Best,


-- Miguel






From carlos.borroto at gmail.com  Tue Oct 13 22:45:13 2009
From: carlos.borroto at gmail.com (Carlos Javier Borroto)
Date: Tue, 13 Oct 2009 18:45:13 -0400
Subject: [Biopython] Is there any Entrez Gene parser out there?
Message-ID: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com>

Does Biopython have a parser for Entrez Gene? Does anyone know if
there is any Python parser for this database at all?

I see there is one in BioPerl, but I'd be happy if I could stick to Python.

regards,
-- 
Carlos Javier Borroto
Baltimore, MD
Phone: (410) 929 4020


From biopython at maubp.freeserve.co.uk  Tue Oct 13 23:18:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Oct 2009 00:18:58 +0100
Subject: [Biopython] Is there any Entrez Gene parser out there?
In-Reply-To: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com>
References: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com>
Message-ID: <320fb6e00910131618t2e880c95n7a7e0df6acc31176@mail.gmail.com>

On Tue, Oct 13, 2009 at 11:45 PM, Carlos Javier Borroto
 wrote:
> Do biopython have a parser for Entrez Gene?, Does someone know if
> there is any python parser for this database at all?

Bio.Entrez.read() should be fine with the Entrez Gene XML
data, or try the recently added Bio.Entrez.parse() for large
datasets (incremental parsing).

Peter


From winda002 at student.otago.ac.nz  Tue Oct 13 23:37:52 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 14 Oct 2009 12:37:52 +1300
Subject: [Biopython] Is there any Entrez Gene parser out there?
In-Reply-To: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com>
References: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com>
Message-ID: <200910141237.52810.winda002@student.otago.ac.nz>

On Wed, 14 Oct 2009 11:45:13 Carlos Javier Borroto wrote:
> Do biopython have a parser for Entrez Gene?, Does someone know if
> there is any python parser for this database at all?
>
> I see there is one on Bioperl, but I'll be happy if I can stick to python.

Hi Carlos,

I don't have much experience with the Entrez module, so this might not be the 
best way (I thought I should reply before you were forced to resort to Perl 
;)

If you use Bio.Entrez.esummary() you can get a list of python dictionaries for 
a given record. Something like this:

>>> from Bio import Entrez
>>> Entrez.email = "you at someplace"  # use your own email address here
>>> query = Entrez.esummary(db="gene", id="641535")
>>> record = Entrez.read(query)
>>> record
[{'Mim': [], 'Orgname': 'Tribolium castaneum', 'TaxID': 7070 ...
>>> for field in record:
...     print(field["Chromosome"])
...
LG2

There's also documentation in the tutorial and a related cookbook example on 
the wiki:

http://www.biopython.org/wiki/Annotate_Entrez_Gene_IDs

Cheers,
David




From sdavis2 at mail.nih.gov  Wed Oct 14 01:04:20 2009
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Tue, 13 Oct 2009 21:04:20 -0400
Subject: [Biopython] Is there any Entrez Gene parser out there?
In-Reply-To: <200910141237.52810.winda002@student.otago.ac.nz>
References: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com>
	<200910141237.52810.winda002@student.otago.ac.nz>
Message-ID: <264855a00910131804k28f08c8nca3cd82e1ab8280e@mail.gmail.com>

On Tue, Oct 13, 2009 at 7:37 PM, David Winter
wrote:

> On Wed, 14 Oct 2009 11:45:13 Carlos Javier Borroto wrote:
> > Do biopython have a parser for Entrez Gene?, Does someone know if
> > there is any python parser for this database at all?
> >
> > I see there is one on Bioperl, but I'll be happy if I can stick to
> python.
>
>
If you like, there are simple tab-delimited files that contain much of the
information that you might want:

ftp://ftp.ncbi.nih.gov/gene/DATA/

You can push these into sqlite or another RDBMS or just read them into
python directly.
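
As a rough sketch of the sqlite route: the four-column layout and the
sample rows below are assumptions for illustration (the second row and
both symbols are placeholders), and the real files under gene/DATA/ have
more columns, so check their header line first.

```python
import csv
import io
import sqlite3

# Made-up sample in the assumed tab-delimited layout; a real run would
# open the downloaded file instead of this string.
SAMPLE = (
    "#tax_id\tGeneID\tSymbol\tchromosome\n"
    "7070\t641535\tPLACEHOLDER_A\tLG2\n"
    "7070\t999999\tPLACEHOLDER_B\tLG3\n"
)


def load_genes(handle, connection):
    """Copy rows from a tab-delimited handle into a 'gene' table."""
    connection.execute(
        "CREATE TABLE gene (tax_id INTEGER, gene_id INTEGER, "
        "symbol TEXT, chromosome TEXT)"
    )
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = [
        (int(row[0]), int(row[1]), row[2], row[3])
        for row in reader
        if row and not row[0].startswith("#")  # skip header and blank lines
    ]
    connection.executemany("INSERT INTO gene VALUES (?, ?, ?, ?)", rows)
    connection.commit()
    return len(rows)


connection = sqlite3.connect(":memory:")
count = load_genes(io.StringIO(SAMPLE), connection)
chromosome = connection.execute(
    "SELECT chromosome FROM gene WHERE gene_id = ?", (641535,)
).fetchone()[0]
print(count, chromosome)
```

Once the table is loaded, lookups by gene ID are ordinary SQL queries and
need no network access at all.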

Sean



> Hi Carlos,
>
> I don't have much experience with the Entrez module, so this might not be
> the
> best way (I thought I should reply before you where forced to resort to
> Perl
> ;)
>
> If you use Bio.Entrez.esummary() you can get a list of python dictionaries
> for
> a given record. Something like this:
>
> >>> Entrez.email = "you at someplace"
> >>> query = Entrez.esummary(db="gene", id="641535")
> >>> record = Entrez.read(query)
> >>> record
> [{'Mim': [], 'Orgname': 'Tribolium castaneum', 'TaxID': 7070 ...
> >>>for field in record:
> ...     print field["Chromosome"]
> LG2
>
> There's also documentation in the tutorial and a related cookbook example
> on
> the wiki:
>
> http://www.biopython.org/wiki/Annotate_Entrez_Gene_IDs
>
> Cheers,
> David
>
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From cjfields at illinois.edu  Wed Oct 14 00:54:26 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 13 Oct 2009 19:54:26 -0500
Subject: [Biopython] Is there any Entrez Gene parser out there?
In-Reply-To: <200910141237.52810.winda002@student.otago.ac.nz>
References: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com>
	<200910141237.52810.winda002@student.otago.ac.nz>
Message-ID: 


On Oct 13, 2009, at 6:37 PM, David Winter wrote:

> On Wed, 14 Oct 2009 11:45:13 Carlos Javier Borroto wrote:
>> Do biopython have a parser for Entrez Gene?, Does someone know if
>> there is any python parser for this database at all?
>>
>> I see there is one on Bioperl, but I'll be happy if I can stick to  
>> python.
>
> Hi Carlos,
>
> I don't have much experience with the Entrez module, so this might  
> not be the
> best way (I thought I should reply before you where forced to resort  
> to Perl
> ;)

Alright now, let's not start cross-lang flame wars, there are cross- 
lang users out there (like me!).

chris


From winda002 at student.otago.ac.nz  Wed Oct 14 02:46:03 2009
From: winda002 at student.otago.ac.nz (David Winter)
Date: Wed, 14 Oct 2009 15:46:03 +1300
Subject: [Biopython] Is there any Entrez Gene parser out there?
In-Reply-To: 
References: <65d4b7fc0910131545y793696dv6305bf98b2e51e13@mail.gmail.com>
	<200910141237.52810.winda002@student.otago.ac.nz>
	
Message-ID: <200910141546.03884.winda002@student.otago.ac.nz>


> > I don't have much experience with the Entrez module, so this might
> > not be the
> > best way (I thought I should reply before you where forced to resort
> > to Perl
> > ;)
>
> Alright now, let's not start cross-lang flame wars, there are cross-
> lang users out there (like me!).
>
> chris

Sorry Chris, tongue was firmly in cheek there.

david


From biopython at maubp.freeserve.co.uk  Wed Oct 14 12:37:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Oct 2009 13:37:45 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: 
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com>
	
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
	<320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>
	
	<320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>
	<7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>
	
Message-ID: <320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>

On Wed, Oct 14, 2009 at 12:30 PM, Miguel Ortiz Lombardia
 wrote:
>
> Hi again, Peter.
>
> Well, it turned out that I don't have such work-around... When I launched
> the script as:
>
> nohup lpbl.py ... &
>
> against all my sequences it choked at the first one (quite longer than the
> one I was using as an example) with the very same error.

It would take longer as it would wait for BLAST to finish before starting
to parse it.

> However, this time I have the "temp.txt" file and indeed there are lines
> such as:
>
> Query: 0    -
>
> Sbjct: 445  G                                                            445
>
> Query: 0
>
> Sbjct: 445  G                                                            445
>
> Query: 0    ------
>
> Sbjct: 1316 ETNAPV                                                       1321
>
> present for some alignments and it cannot be parsed by my code.

Those do look strange.

> When I run blastpgp myself on the command line, same arguments, and catch
> the standard output to a temp2.txt file, the latter file does not contain
> those lines and can be parsed correctly.

This is odd, and I am not sure what would cause this.

> So, in the end I went back to my code and modified according to your
> recommendation of using the commandline applications. The relevant part of
> code now looks like this:
> ...
> And it works!

Great - I'm glad my vague instructions made sense :)

> Thanks again for your help,

At least we have a solution, even if we didn't get to the bottom of
the strange BLAST output. I'll close the bug...

Peter



From ibdeno at gmail.com  Wed Oct 14 12:49:30 2009
From: ibdeno at gmail.com (Miguel Ortiz Lombardia)
Date: Wed, 14 Oct 2009 14:49:30 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com>
	
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
	<320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>
	
	<320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>
	<7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>
	
	<320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>
Message-ID: <408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>

On 14 Oct 2009, at 14:37, Peter wrote:

> On Wed, Oct 14, 2009 at 12:30 PM, Miguel Ortiz Lombardia
>  wrote:
>>
>> Hi again, Peter.
>>
>> Well, it turned out that I don't have such work-around... When I  
>> launched
>> the script as:
>>
>> nohup lpbl.py ... &
>>
>> against all my sequences it choked at the first one (quite longer  
>> than the
>> one I was using as an example) with the very same error.
>
> It would take longer as it would wait for BLAST to finish before  
> starting
> to parse it.
>
>> However, this time I have the "temp.txt" file and indeed there  
>> lines such as:
>>
>> Query: 0    -
>>
>> Sbjct: 445   
>> G                                                            445
>>
>> Query: 0
>>
>> Sbjct: 445   
>> G                                                            445
>>
>> Query: 0    ------
>>
>> Sbjct: 1316 ETNAPV
>> 1321
>>
>> present for some alignments and it cannot be parsed by my code.
>
> Those do look strange.
>
>> When I run blastpgp myself on the command line, same arguments, and  
>> catch
>> the standard output to a temp2.txt file, the latter file does not  
>> contain
>> those lines and can be parsed correctly.
>
> This is odd, and I am not sure what would cause this.
>
>> So, in the end I went back to my code and modified according to your
>> recommendation of using the commandline applications. The relevant  
>> part of
>> code now looks like this:
>> ...
>> And it works!
>
> Great - I'm glad my vague instructions made sense :)
>

They were quite clear :-) and the pointer to the alignment tutorial  
helped a lot.

>> Thanks again for your help,
>
> At least we have solution, even if we didn't get to the bottom of
> the strange BLAST output. I'll close the bug...
>

That's fine.

Thanks!



-- Miguel






From andrea at biodec.com  Wed Oct 14 14:28:17 2009
From: andrea at biodec.com (Andrea)
Date: Wed, 14 Oct 2009 16:28:17 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>	<320fb6e00910130410h224c0a9ft595befcad4a47cf4@mail.gmail.com>		<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>	<320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>		<320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>	<7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>		<320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>
	<408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>
Message-ID: <4AD5E001.6070506@biodec.com>

Hi everybody,
I work with BLAST quite often, and I would say I have run hundreds of
thousands of blastpgp jobs. The "Query 0" output of blastpgp is quite
common for me, and I wrote a patch for my code to remove these "nasty"
lines before passing the output to the parser.

I found this type of line in at least 1-2% of my runs, and I'm quite sure
that I found them both in the output of BLAST via the shell and in the
output of BLAST via Biopython.

The problem, in my opinion, is in the blastpgp algorithm, and maybe it
could be handled in Biopython (as I did in my code) by cutting out these
"Query 0" lines, because from the point of view of the alignments they
don't make any sense. It seems that blastpgp wants to show which part of
the target sequence aligns to the query before the starting point of the
query itself (something like opening a gap at the beginning of the query).
This happens "sometimes", and without any apparent reason.

I don't think there is any problem with how Biopython wraps the blastpgp
process. Maybe, but I'm not sure, the difference in the output is related
to small differences in the parameters of the process (or in the
environment... or in the .ncbirc file).

I have always observed identical blastpgp output via the shell (bash) and
via the popen wrapper.

Miguel, could you check whether everything really is identical? I'm really
surprised by such strange behaviour....

Although in my opinion there isn't any problem in Biopython, and maybe
Miguel will be able to discover some difference in the way blastpgp is
launched, I would suggest developing a patch (I could submit mine) that
removes the "Query 0" lines.

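To show the kind of filter I mean, here is a rough sketch. It is not my
actual patch: it assumes that each bad block is a line starting with
"Query: 0" followed, a line or two later, by its "Sbjct:" line, and the
function name is made up for illustration.

```python
import re

# A query line whose start coordinate is exactly 0, e.g. "Query: 0    ------"
QUERY_ZERO = re.compile(r"^Query:\s+0(\s|$)")


def drop_query_zero_blocks(lines):
    """Yield lines, skipping each 'Query: 0' line up to its 'Sbjct:' line."""
    skipping = False
    for line in lines:
        if skipping:
            # Drop everything up to and including the matching Sbjct line.
            if line.startswith("Sbjct:"):
                skipping = False
            continue
        if QUERY_ZERO.match(line):
            skipping = True
            continue
        yield line


# Made-up fragment in the style of the report lines quoted in this thread.
report = [
    "Query: 1    MKT 3\n",
    "            MKT\n",
    "Sbjct: 10   MKT 12\n",
    "\n",
    "Query: 0    ------\n",
    "\n",
    "Sbjct: 1316 ETNAPV 1321\n",
    "\n",
    " Score = 50\n",
]
cleaned = list(drop_query_zero_blocks(report))
print("".join(cleaned))
```

The cleaned lines can then be passed to the parser in place of the raw
report. Whether silently dropping these blocks is acceptable is something
to decide per project, since part of the subject sequence is lost.
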
I apologize if I have misunderstood the problem, and for entering
the discussion at this point (maybe when the discussion is finished)...

Thanks
Andrea

Miguel Ortiz Lombardia wrote:
> On 14 Oct 2009, at 14:37, Peter wrote:
>
>> On Wed, Oct 14, 2009 at 12:30 PM, Miguel Ortiz Lombardia
>>  wrote:
>>>
>>> Hi again, Peter.
>>>
>>> Well, it turned out that I don't have such work-around... When I
>>> launched
>>> the script as:
>>>
>>> nohup lpbl.py ... &
>>>
>>> against all my sequences it choked at the first one (quite longer
>>> than the
>>> one I was using as an example) with the very same error.
>>
>> It would take longer as it would wait for BLAST to finish before
>> starting
>> to parse it.
>>
>>> However, this time I have the "temp.txt" file and indeed there lines
>>> such as:
>>>
>>> Query: 0    -
>>>
>>> Sbjct: 445 
>>> G                                                            445
>>>
>>> Query: 0
>>>
>>> Sbjct: 445 
>>> G                                                            445
>>>
>>> Query: 0    ------
>>>
>>> Sbjct: 1316 ETNAPV
>>> 1321
>>>
>>> present for some alignments and it cannot be parsed by my code.
>>
>> Those do look strange.
>>
>>> When I run blastpgp myself on the command line, same arguments, and
>>> catch
>>> the standard output to a temp2.txt file, the latter file does not
>>> contain
>>> those lines and can be parsed correctly.
>>
>> This is odd, and I am not sure what would cause this.
>>
>>> So, in the end I went back to my code and modified according to your
>>> recommendation of using the commandline applications. The relevant
>>> part of
>>> code now looks like this:
>>> ...
>>> And it works!
>>
>> Great - I'm glad my vague instructions made sense :)
>>
>
> They were quite clear :-) and the pointer to the alignment tutorial
> helped a lot.
>
>>> Thanks again for your help,
>>
>> At least we have solution, even if we didn't get to the bottom of
>> the strange BLAST output. I'll close the bug...
>>
>
> That's fine.
>
> Thanks!
>
>
>
> -- Miguel
>
>
>
>



From biopython at maubp.freeserve.co.uk  Wed Oct 14 14:46:48 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Oct 2009 15:46:48 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <4AD5E001.6070506@biodec.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>
	<320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>
	
	<320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>
	<7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>
	
	<320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>
	<408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>
	<4AD5E001.6070506@biodec.com>
Message-ID: <320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>

On Wed, Oct 14, 2009 at 3:28 PM, Andrea  wrote:
>
> Hi to everybody,
> I work with blast quite often and i could say i run hundreds of thousand
> of blastpgp. The "Query 0" outpt of blastpgp, is quite common for me, and
> i wrote a patch to my code, to remove these "nasty" lines, before passing
> the output to the parser.
>
> I found these type of lines in at least 1-2% of my runs. And i'm fully sure
> that i found them either in the output of blast via shell and in the output
> of blast via Biopython.
>
> The problem, according to me, is in the blastpgp algorithm and maybe
> could be managed in biopython (as i did in my code), cutting out these
> "Query 0" lines, because from the point of view of the alignments,
> they don't have any sense. It seems that blastpgp, wants to show
> which is the part of the target sequence align to the query before the
> starting point of the query itself (something like opening a gap, at the
> beginning of the query).
> And this happens "sometimes", and without any apparent reason.

Andrea - do you have any small example output files with this
problem? If it does occur fairly often (1 to 2% of the time), then
we should try and update the parser to cope. Miguel's example
is useful for testing while working on a bug fix, but too big to
include as part of the unit tests.

> What i think, is that there aren't any problem with biopython in wrapping
> the blastpgp process and maybe, but i'm not sure, the difference in the
> output could be related to small differences in the parameter of the process
> (or in the environment... or in the .ncbirc file).
>
> I always was able to observe the identity between the blastpgp output
> via shell (bash) and the output of the popen wrapper.

If you saw "Query 0" output at the command line (shell), then that is
worth knowing.

> Miguel, could you check if really everything is identical? Because i'm
> really surprised of such a strange behaviour....

Maybe the environment variables are different or something?

> Despite, according to me there aren't any problem in biopython, and maybe,
> Miguel will be able to discover some differences in the way blastpgp is
> launched, i would suggest to develop a patch (i could submit mine), that
> could remove "Query 0" lines.

Could you upload your "Query 0" patch to Bug 2927?
http://bugzilla.open-bio.org/show_bug.cgi?id=2927

> I aplogize if i understanded the problem wrongly and for the fact that
> i'm entering in the discussion in this moment (maybe when the
> discussion is finished)...

Well I don't (yet) understand what the problem is either ;)

Peter



From andrea at biodec.com  Wed Oct 14 15:02:40 2009
From: andrea at biodec.com (Andrea)
Date: Wed, 14 Oct 2009 17:02:40 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>	
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>	
	<320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>	
		
	<320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>	
	<7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>	
		
	<320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>	
	<408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>	
	<4AD5E001.6070506@biodec.com>
	<320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
Message-ID: <4AD5E810.5090607@biodec.com>

Peter wrote:
> On Wed, Oct 14, 2009 at 3:28 PM, Andrea  wrote:
>   
>> Hi to everybody,
>> I work with blast quite often and i could say i run hundreds of thousand
>> of blastpgp. The "Query 0" outpt of blastpgp, is quite common for me, and
>> i wrote a patch to my code, to remove these "nasty" lines, before passing
>> the output to the parser.
>>
>> I found these type of lines in at least 1-2% of my runs. And i'm fully sure
>> that i found them either in the output of blast via shell and in the output
>> of blast via Biopython.
>>
>> The problem, according to me, is in the blastpgp algorithm and maybe
>> could be managed in biopython (as i did in my code), cutting out these
>> "Query 0" lines, because from the point of view of the alignments,
>> they don't have any sense. It seems that blastpgp, wants to show
>> which is the part of the target sequence align to the query before the
>> starting point of the query itself (something like opening a gap, at the
>> beginning of the query).
>> And this happens "sometimes", and without any apparent reason.
>>     
>
> Andrea - do you have any small example output files with this
> problem? If it does occur fairly often (1 to 2% of the time), then
> we should try and update the parser to cope. Miguel's example
> is useful for testing while working on a bug fix, but too big to
> include as part the unit tests.
>
>   
Mmm... I'll have to search. I have a "cache" of gzipped blastpgp outputs,
but I'm not sure I have the originals (they may already be patched)....
What I'm sure of is that in the next month I'm going to run almost
100,000 blastpgp jobs, so I'll surely find something small. ;-)
>> What i think, is that there aren't any problem with biopython in wrapping
>> the blastpgp process and maybe, but i'm not sure, the difference in the
>> output could be related to small differences in the parameter of the process
>> (or in the environment... or in the .ncbirc file).
>>
>> I always was able to  observe  the identity  between the blastpgp output
>> via shell (bash) and the output of the popen wrapper.
>>     
>
> If you saw "Query 0" output at the command line (shell), then that is
> worth knowing.
>
>   
I think so.
>> Miguel, could you check if really everything is identical? Because i'm
>> really surprised of such a strange behaviour....
>>     
>
> Maybe the environment variables are different or something?
>
>   
>> Despite, according to me there aren't any problem in biopython, and maybe,
>> Miguel will be able to discover some differences in the way blastpgp is
>> launched, i would suggest to develop a patch (i could submit mine), that
>> could remove "Query 0" lines.
>>     
>
> Could you upload your "Query 0" patch to Bug 2927?
> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>   
Right now I'm quite busy, because I'm working on a different project and
I have deliveries to manage...
but I will certainly upload my patch ASAP.
>   
>> I aplogize if i understanded the problem wrongly and for the fact that
>> i'm entering in the discussion in this moment (maybe when the
>> discussion is finished)...
>>     
>
> Well I don't (yet) understand what the problem is either ;)
>
> Peter
>   
Ciao
andrea


From biopython at maubp.freeserve.co.uk  Wed Oct 14 15:10:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 14 Oct 2009 16:10:54 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <4AD5E810.5090607@biodec.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	
	<320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>
	<7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>
	
	<320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>
	<408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>
	<4AD5E001.6070506@biodec.com>
	<320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
	<4AD5E810.5090607@biodec.com>
Message-ID: <320fb6e00910140810y296d19beo9190022b9eede94f@mail.gmail.com>

On Wed, Oct 14, 2009 at 4:02 PM, Andrea  wrote:
>>
>> Andrea - do you have any small example output files with this
>> problem? If it does occur fairly often (1 to 2% of the time), then
>> we should try and update the parser to cope. Miguel's example
>> is useful for testing while working on a bug fix, but too big to
>> include as part the unit tests.
>
> mmm... i've to search. I've some "cache" of gzipped blastpgp outputs.
> But I'm not sure i've the original (maybe already patched).... waht I'm
> sure, is that in the next month I'm going to run almost 100.000
> blasptpg so I'll for sure find something small. ;-)

Great.

>> If you saw "Query 0" output at the command line (shell), then that is
>> worth knowing.
>
> i think so.

OK.

>> Could you upload your "Query 0" patch to Bug 2927?
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>
> Now i'm quite busy, because i'm working on a different project and i've
> to manage deliveries... but i will for sure upload my patch ASAP.

Thanks.

Peter


From ibdeno at gmail.com  Wed Oct 14 20:15:07 2009
From: ibdeno at gmail.com (Miguel Ortiz Lombardia)
Date: Wed, 14 Oct 2009 22:15:07 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <4AD5E810.5090607@biodec.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>	
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>	
	<320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>	
		
	<320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>	
	<7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>	
		
	<320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>	
	<408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>	
	<4AD5E001.6070506@biodec.com>
	<320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
	<4AD5E810.5090607@biodec.com>
Message-ID: <4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>

On 14 Oct 2009, at 17:02, Andrea wrote:
> Peter wrote:
>> On Wed, Oct 14, 2009 at 3:28 PM, Andrea  wrote:
>>
>>> Hi to everybody,
>>> I work with blast quite often and i could say i run hundreds of  
>>> thousand
>>> of blastpgp. The "Query 0" outpt of blastpgp, is quite common for  
>>> me, and
>>> i wrote a patch to my code, to remove these "nasty" lines, before  
>>> passing
>>> the output to the parser.
>>>
>>> I found these type of lines in at least 1-2% of my runs. And i'm  
>>> fully sure
>>> that i found them either in the output of blast via shell and in  
>>> the output
>>> of blast via Biopython.
>>>
>>> The problem, according to me, is in the blastpgp algorithm and maybe
>>> could be managed in biopython (as i did in my code), cutting out  
>>> these
>>> "Query 0" lines, because from the point of view of the alignments,
>>> they don't have any sense. It seems that blastpgp, wants to show
>>> which is the part of the target sequence align to the query before  
>>> the
>>> starting point of the query itself (something like opening a gap,  
>>> at the
>>> beginning of the query).
>>> And this happens "sometimes", and without any apparent reason.
>>>
>>
>> Andrea - do you have any small example output files with this
>> problem? If it does occur fairly often (1 to 2% of the time), then
>> we should try and update the parser to cope. Miguel's example
>> is useful for testing while working on a bug fix, but too big to
>> include as part the unit tests.
>>
>>
> mmm... i've to search. I've some "cache" of gzipped blastpgp outputs.
> But I'm not
> sure i've the original (maybe already patched).... waht I'm sure, is
> that in the
> next month I'm going to run almost 100.000 blasptpg so I'll for sure  
> find
> something small. ;-)
>>> What i think, is that there aren't any problem with biopython in  
>>> wrapping
>>> the blastpgp process and maybe, but i'm not sure, the difference  
>>> in the
>>> output could be related to small differences in the parameter of  
>>> the process
>>> (or in the environment... or in the .ncbirc file).
>>>
>>> I always was able to  observe  the identity  between the blastpgp  
>>> output
>>> via shell (bash) and the output of the popen wrapper.
>>>
>>
>> If you saw "Query 0" output at the command line (shell), then that is
>> worth knowing.

All I can say is that this is not what I observe.
1. When I run exactly the same blastpgp search directly from the shell
(I capture the full command line issued in the background by the
python script with a 'ps -a | grep blastpgp'), I have never found
the 'Query: 0' lines.
2. When I run the search from within the python script and use
'nohup', the problem is reproducible, not random.
3. If the script is run without 'nohup', that is, if the shell keeps
full control of both standard error and output, then again, the
problem seems to disappear. I say 'seems' because I haven't tried with
my longest (more than 1300 aa) sequences.
4. When, from within the python script, I use, as Peter suggested, the
BlastpgpCommandline class to ask blastpgp to send the output to a file
(the -o option), the problem disappears irrespective of whether or not
I use 'nohup'.

Therefore, in my opinion, the problem is not with blastpgp but with
the handling of its output by python or biopython.

>>
> i think so.
>>> Miguel, could you check if really everything is identical? Because  
>>> i'm
>>> really surprised of such a strange behaviour....
>>
>> Maybe the environment variables are different or something?

Command options are absolutely the same, see above. I am surprised  
too, but I don't think blastpgp is sensitive to any environment  
variable and I don't see how they could change from an in-script to a  
standalone run.

>>
>>> Despite, according to me there aren't any problem in biopython,  
>>> and maybe,
>>> Miguel will be able to discover some differences in the way  
>>> blastpgp is
>>> launched, i would suggest to develop a patch (i could submit  
>>> mine), that
>>> could remove "Query 0" lines.

I couldn't find any differences, so I'm afraid I can't help... I'm
still testing the script; I will let you know if I find this problem
again.

>>>
>> Could you upload your "Query 0" patch to Bug 2927?
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>>
> Now i'm wuite busy, because i'm working on a different project and  
> i've
> to manage deliveries...
> but i will for sure upload my patch ASAP.
>>
>>> I aplogize if i understanded the problem wrongly and for the fact  
>>> that
>>> i'm entering in the discussion in this moment (maybe when the
>>> discussion is finished)...
>>>
>>
>> Well I don't (yet) understand what the problem is either ;)
>>
>> Peter
>>
> Ciao
> andrea


Best,



-- Miguel






From ibdeno at gmail.com  Thu Oct 15 13:04:33 2009
From: ibdeno at gmail.com (Miguel Ortiz Lombardia)
Date: Thu, 15 Oct 2009 15:04:33 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <4AD64602.9060603@biodec.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>	
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>	
	<320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>	
		
	<320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>	
	<7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>	
		
	<320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>	
	<408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>	
	<4AD5E001.6070506@biodec.com>
	<320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
	<4AD5E810.5090607@biodec.com>
	<4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>
	<4AD64602.9060603@biodec.com>
Message-ID: <13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>


On 14 Oct 2009, at 23:43, Andrea wrote:

> Miguel Ortiz Lombardia wrote:
>> On 14 Oct 2009, at 17:02, Andrea wrote:
>>> Peter wrote:
>>>> On Wed, Oct 14, 2009 at 3:28 PM, Andrea  wrote:
>>>>
>>>>> Hi to everybody,
>>>>> I work with blast quite often and i could say i run hundreds of
>>>>> thousand
>>>>> of blastpgp. The "Query 0" outpt of blastpgp, is quite common for
>>>>> me, and
>>>>> i wrote a patch to my code, to remove these "nasty" lines, before
>>>>> passing
>>>>> the output to the parser.
>>>>>
>>>>> I found these type of lines in at least 1-2% of my runs. And i'm
>>>>> fully sure
>>>>> that i found them either in the output of blast via shell and in
>>>>> the output
>>>>> of blast via Biopython.
>>>>>
>>>>> The problem, according to me, is in the blastpgp algorithm and  
>>>>> maybe
>>>>> could be managed in biopython (as i did in my code), cutting out  
>>>>> these
>>>>> "Query 0" lines, because from the point of view of the alignments,
>>>>> they don't have any sense. It seems that blastpgp, wants to show
>>>>> which is the part of the target sequence align to the query  
>>>>> before the
>>>>> starting point of the query itself (something like opening a gap,
>>>>> at the
>>>>> beginning of the query).
>>>>> And this happens "sometimes", and without any apparent reason.
>>>>>
>>>>
>>>> Andrea - do you have any small example output files with this
>>>> problem? If it does occur fairly often (1 to 2% of the time), then
>>>> we should try and update the parser to cope. Miguel's example
>>>> is useful for testing while working on a bug fix, but too big to
>>>> include as part the unit tests.
>>>>
>>>>
>>> mmm... i've to search. I've some "cache" of gzipped blastpgp  
>>> outputs.
>>> But I'm not
>>> sure i've the original (maybe already patched).... waht I'm sure, is
>>> that in the
>>> next month I'm going to run almost 100.000 blasptpg so I'll for sure
>>> find
>>> something small. ;-)
>>>>> What i think, is that there aren't any problem with biopython in
>>>>> wrapping
>>>>> the blastpgp process and maybe, but i'm not sure, the difference  
>>>>> in
>>>>> the
>>>>> output could be related to small differences in the parameter of
>>>>> the process
>>>>> (or in the environment... or in the .ncbirc file).
>>>>>
>>>>> I always was able to  observe  the identity  between the blastpgp
>>>>> output
>>>>> via shell (bash) and the output of the popen wrapper.
>>>>>
>>>>
>>>> If you saw "Query 0" output at the command line (shell), then  
>>>> that is
>>>> worth knowing.
>>
>> All I can say is that this is not what I observe.
>> 1. When I send directly from the shell exactly the same blastpgp
>> search ( I capture the full command line issued in the background by
>> the python script with a 'ps -a | grep blastpgp' ) I have never find
>> the 'Query: 0' lines.
>> 2. When I send the search from within the python script and use
>> 'nohup', the problem is reproducible, not random.
> yes, i'm sure is reproducible. I  mean that what I've observed wasn't
> random on one sequence, but maybe along
> many sequences...
>> 3. If the script is sent without 'nohup', that is, if the shell keeps
>> full control of both standard error and output, then again, the
>> problem seems to disappear. I say 'seems' because I haven't tried  
>> with
>> my longest ( more than 1300 aa ) sequences.
>> 4. When, from within the python script I use, as Peter suggested, the
>> BlastpgpCommandline class to ask blastpgp to send the output to a  
>> file
>> ( the -o option ) the problem disappears irrespectively whether I use
>> or not 'nohup'.
>>
>> Therefore, in my opinion, the problem is not with blastpgp but with
>> the handling of its output by python or biopython.
>>
> I'm really curious. What you have is very strange, but I believe you
> fully.
>
> Is it possible to have:
> your database,
> your .bashrc,
> the sequence,
> the exact command line,
> the version of blastpgp (2.2.18?),
> the other things you use (matrix....),
> the different possibilities you tried.... (nohup/python/shell)?
> It should be reproducible.
>
> Have you tried to observe the behaviour of the blastpgp process with
> "strace", especially at the beginning?
>
>
>>>>
>>> i think so.
>>>>> Miguel, could you check if really everything is identical?  
>>>>> Because i'm
>>>>> really surprised of such a strange behaviour....
>>>>
>>>> Maybe the environment variables are different or something?
>>
>> Command options are absolutely the same, see above. I am surprised
>> too, but I don't think blastpgp is sensitive to any environment
>> variable and I don't see how they could change from an in-script to a
>> standalone run.
> I think only to .bashrc.
>>
>>>>
>>>>> Despite, according to me there aren't any problem in biopython,  
>>>>> and
>>>>> maybe,
>>>>> Miguel will be able to discover some differences in the way
>>>>> blastpgp is
>>>>> launched, i would suggest to develop a patch (i could submit  
>>>>> mine),
>>>>> that
>>>>> could remove "Query 0" lines.
>>
>> I couldn't find any differences, so I'm afraid I can't help... I'm
>> still testing the script, I will let you know if I find again this
>> problem.
> I will try to find the problem in my sequences (I would say that it is
> quite common)... and if I find it, I will try again with the same
> parameters from the shell...
>>
>>>>>
>>>> Could you upload your "Query 0" patch to Bug 2927?
>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>>>>
>>> Now i'm wuite busy, because i'm working on a different project and  
>>> i've
>>> to manage deliveries...
>>> but i will for sure upload my patch ASAP.
>>>>
>>>>> I aplogize if i understanded the problem wrongly and for the  
>>>>> fact that
>>>>> i'm entering in the discussion in this moment (maybe when the
>>>>> discussion is finished)...
>>>>>
>>>>
>>>> Well I don't (yet) understand what the problem is either ;)
>>>>
>>>> Peter
>>>>
>>> Ciao
>>> andrea
>>
>>
>> Best,
>>
>>
>>
>> -- Miguel
>>
>>
> thanks.
> Ciao
> Andrea

Hi!

Some new findings that contradict my previous perception of the problem.
Tonight my script failed again after stumbling upon the same problem  
for a different sequence. I have now investigated more carefully and  
found:

1. The problem (a line with 'Query: 0 ---' that impaired parsing of  
the blastpgp output) was encountered in all these cases:

a) nohup myscript.py [some script options] sequences.fasta >&  
myscript.log &
b) myscript.py [some script options] sequences.fasta
c) /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/nr -i
U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5
-h 0.001 -p blastpgp

That is, for the first time I was able to reproduce the problem from a  
standalone run of blastpgp.

2. The problem disappears with a previous version of blastpgp  
(2.2.18). Using this version, all these cases work:

a) nohup myscript.py [some script options] sequences.fasta >&  
myscript.log &
b) myscript.py [some script options] sequences.fasta
c) /usr/local/blast-2.2.18/bin/blastpgp -d /opt/BlastDBs/nr -i
U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5
-h 0.001 -p blastpgp

So it would seem that, as Andrea suggested, this is a bug in
blastpgp, more precisely in versions after blastpgp-2.2.18.

3. In this particular case, I notice that the problem happens with a  
sequence containing low complexity region(s). Now, I had thought that  
the default in blastpgp was to filter those sequences out! I'm running  
the original script again with blastpgp-2.2.22 with the filter on (-F  
T) to see if the problem persists.

I will write to the blast-help address at the ncbi to let them know  
about the problem.
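
To check quickly whether a given plain-text report is affected, one could count the offending lines before handing the file to the parser. This is only a minimal sketch: it assumes the lines begin exactly with 'Query: 0' as described above, and the function name count_query0_lines and the sample text are made up for illustration (the sample is not real blastpgp output).

```python
import io


def count_query0_lines(handle):
    """Count lines in a plain-text report that start with 'Query: 0'."""
    return sum(1 for line in handle if line.startswith("Query: 0"))


# Synthetic stand-in for a report; in practice this would be the file
# written with the -o option (tmp.bl.txt in the commands above).
report = io.StringIO(
    "Query: 0   ------\n"
    "                 \n"
    "Sbjct: 1   MKTAYI 6\n"
    "\n"
    "Query: 1   MKVLAA 6\n"
    "           MK  A\n"
    "Sbjct: 7   MKTIAY 12\n"
)
print(count_query0_lines(report))
```

Running the same check over the outputs of both blastpgp versions would show which reports contain the extra lines without involving the Biopython parser at all.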

Best,


-- Miguel






From andrea at biodec.com  Thu Oct 15 15:03:38 2009
From: andrea at biodec.com (Andrea)
Date: Thu, 15 Oct 2009 17:03:38 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>	
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>	
	<320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>	
		
	<320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>	
	<7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>	
		
	<320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>	
	<408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>	
	<4AD5E001.6070506@biodec.com>
	<320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
	<4AD5E810.5090607@biodec.com>
	<4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>
	<4AD64602.9060603@biodec.com>
	<13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>
	<4AD72978.4030900@biodec.com>
	<05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
Message-ID: <4AD739CA.6090403@biodec.com>

Miguel Ortiz Lombardia wrote:
>
> On 15 Oct 2009, at 15:54, Andrea wrote:
>
>> Miguel Ortiz Lombardia wrote:
>>>
>>> On 14 Oct 2009, at 23:43, Andrea wrote:
>>>
>>>> Miguel Ortiz Lombardia wrote:
>>>>> On 14 Oct 2009, at 17:02, Andrea wrote:
>>>>>> Peter wrote:
>>>>>>> On Wed, Oct 14, 2009 at 3:28 PM, Andrea  wrote:
>>>>>>>
>>>>>>>> Hi to everybody,
>>>>>>>> I work with blast quite often and i could say i run hundreds of
>>>>>>>> thousand
>>>>>>>> of blastpgp. The "Query 0" outpt of blastpgp, is quite common for
>>>>>>>> me, and
>>>>>>>> i wrote a patch to my code, to remove these "nasty" lines, before
>>>>>>>> passing
>>>>>>>> the output to the parser.
>>>>>>>>
>>>>>>>> I found these type of lines in at least 1-2% of my runs. And i'm
>>>>>>>> fully sure
>>>>>>>> that i found them either in the output of blast via shell and in
>>>>>>>> the output
>>>>>>>> of blast via Biopython.
>>>>>>>>
>>>>>>>> The problem, according to me, is in the blastpgp algorithm and
>>>>>>>> maybe
>>>>>>>> could be managed in biopython (as i did in my code), cutting out
>>>>>>>> these
>>>>>>>> "Query 0" lines, because from the point of view of the alignments,
>>>>>>>> they don't have any sense. It seems that blastpgp, wants to show
>>>>>>>> which is the part of the target sequence align to the query
>>>>>>>> before the
>>>>>>>> starting point of the query itself (something like opening a gap,
>>>>>>>> at the
>>>>>>>> beginning of the query).
>>>>>>>> And this happens "sometimes", and without any apparent reason.
>>>>>>>>
>>>>>>>
>>>>>>> Andrea - do you have any small example output files with this
>>>>>>> problem? If it does occur fairly often (1 to 2% of the time), then
>>>>>>> we should try and update the parser to cope. Miguel's example
>>>>>>> is useful for testing while working on a bug fix, but too big to
>>>>>>> include as part the unit tests.
>>>>>>>
>>>>>>>
>>>>>> mmm... i've to search. I've some "cache" of gzipped blastpgp
>>>>>> outputs.
>>>>>> But I'm not
>>>>>> sure i've the original (maybe already patched).... waht I'm sure, is
>>>>>> that in the
>>>>>> next month I'm going to run almost 100.000 blasptpg so I'll for sure
>>>>>> find
>>>>>> something small. ;-)
>>>>>>>> What i think, is that there aren't any problem with biopython in
>>>>>>>> wrapping
>>>>>>>> the blastpgp process and maybe, but i'm not sure, the
>>>>>>>> difference in
>>>>>>>> the
>>>>>>>> output could be related to small differences in the parameter of
>>>>>>>> the process
>>>>>>>> (or in the environment... or in the .ncbirc file).
>>>>>>>>
>>>>>>>> I always was able to  observe  the identity  between the blastpgp
>>>>>>>> output
>>>>>>>> via shell (bash) and the output of the popen wrapper.
>>>>>>>>
>>>>>>>
>>>>>>> If you saw "Query 0" output at the command line (shell), then
>>>>>>> that is
>>>>>>> worth knowing.
>>>>>
>>>>> All I can say is that this is not what I observe.
>>>>> 1. When I send directly from the shell exactly the same blastpgp
>>>>> search ( I capture the full command line issued in the background by
>>>>> the python script with a 'ps -a | grep blastpgp' ) I have never find
>>>>> the 'Query: 0' lines.
>>>>> 2. When I send the search from within the python script and use
>>>>> 'nohup', the problem is reproducible, not random.
>>>> yes, i'm sure is reproducible. I  mean that what I've observed wasn't
>>>> random on one sequence, but maybe along
>>>> many sequences...
>>>>> 3. If the script is sent without 'nohup', that is, if the shell keeps
>>>>> full control of both standard error and output, then again, the
>>>>> problem seems to disappear. I say 'seems' because I haven't tried
>>>>> with
>>>>> my longest ( more than 1300 aa ) sequences.
>>>>> 4. When, from within the python script I use, as Peter suggested, the
>>>>> BlastpgpCommandline class to ask blastpgp to send the output to a
>>>>> file
>>>>> ( the -o option ) the problem disappears irrespectively whether I use
>>>>> or not 'nohup'.
>>>>>
>>>>> Therefore, in my opinion, the problem is not with blastpgp but with
>>>>> the handling of its output by python or biopython.
>>>>>
>>>> I'm really curious. What you have is very strange, but i believe you
>>>> fully.
>>>>
>>>> Is there the possibility to have:
>>>> your database,
>>>> your .bashrc
>>>> the sequence
>>>> the exact command line.
>>>> the versione of blastpgp
>>>> the versione of blastpgp (2.2.18 ?)
>>>> the other things you use (matrix.... )
>>>> the different possibilities you try....( nohup/python/shell )
>>>> I should be reprodcible.
>>>>
>>>> Have you tried to observe the behaviour of the blastpgp process with a
>>>> "strace" expecially at the
>>>> beginning?
>>>>
>>>>
>>>>>>>
>>>>>> i think so.
>>>>>>>> Miguel, could you check if really everything is identical?
>>>>>>>> Because i'm
>>>>>>>> really surprised of such a strange behaviour....
>>>>>>>
>>>>>>> Maybe the environment variables are different or something?
>>>>>
>>>>> Command options are absolutely the same, see above. I am surprised
>>>>> too, but I don't think blastpgp is sensitive to any environment
>>>>> variable and I don't see how they could change from an in-script to a
>>>>> standalone run.
>>>> I think only to .bashrc.
>>>>>
>>>>>>>
>>>>>>>> Despite, according to me there aren't any problem in biopython,
>>>>>>>> and
>>>>>>>> maybe,
>>>>>>>> Miguel will be able to discover some differences in the way
>>>>>>>> blastpgp is
>>>>>>>> launched, i would suggest to develop a patch (i could submit
>>>>>>>> mine),
>>>>>>>> that
>>>>>>>> could remove "Query 0" lines.
>>>>>
>>>>> I couldn't find any differences, so I'm afraid I can't help... I'm
>>>>> still testing the script, I will let you know if I find again this
>>>>> problem.
>>>> I will try to find the problem in my sequences (but i could say
>>>> that is
>>>> quite common)... and if i will
>>>> find i will try with the same parameters and the shell...
>>>>>
>>>>>>>>
>>>>>>> Could you upload your "Query 0" patch to Bug 2927?
>>>>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>>>>>>>
>>>>>> Now i'm wuite busy, because i'm working on a different project and
>>>>>> i've
>>>>>> to manage deliveries...
>>>>>> but i will for sure upload my patch ASAP.
>>>>>>>
>>>>>>>> I aplogize if i understanded the problem wrongly and for the fact
>>>>>>>> that
>>>>>>>> i'm entering in the discussion in this moment (maybe when the
>>>>>>>> discussion is finished)...
>>>>>>>>
>>>>>>>
>>>>>>> Well I don't (yet) understand what the problem is either ;)
>>>>>>>
>>>>>>> Peter
>>>>>>>
>>>>>> Ciao
>>>>>> andrea
>>>>>
>>>>>
>>>>> Best,
>>>>>
>>>>>
>>>>>
>>>>> -- Miguel
>>>>>
>>>>>
>>>> thanks.
>>>> Ciao
>>>> Andrea
>>>
>>> Hi!
>>>
>>> Some new findings that contradict my previous perception of the
>>> problem.
>>> Tonight my script failed again after stumbling upon the same problem
>>> for a different sequence. I have now investigated more carefully and
>>> found:
>>>
>>> 1. The problem (a line with 'Query: 0 ---' that impaired parsing of
>>> the blastpgp output) was encountered in all these cases:
>>>
>>> a) nohup myscript.py [some script options] sequences.fasta >&
>>> myscript.log &
>>> b) myscript.py [some script options] sequences.fasta
>>> c) /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/nr -i
>>> U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5
>>> -h 0.001 -p blastpgp
>>>
>>> That is, for the first time I was able to reproduce the problem from a
>>> standalone run of blastpgp.
>>>
>>> 2. The problem disappears with a previous version of blastpgp
>>> (2.2.18). Using this version, all these cases work:
>>>
>>> a) nohup myscript.py [some script options] sequences.fasta >&
>>> myscript.log &
>>> b) myscript.py [some script options] sequences.fasta
>>> c) /usr/local/blast-2.2.18/bin/blastpgp -d /opt/BlastDBs/nr -i
>>> U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5
>>> -h 0.001 -p blastpgp
>>>
>>> So, it would seem that, as Andrea suggested, this is a bug in
>>> blastpgp, to be more precise, after blastpgp-2.2.18.
>>>
>>> 3. In this particular case, I notice that the problem happens with a
>>> sequence containing low complexity region(s). Now, I had thought that
>>> the default in blastpgp was to filter those sequences out! I'm running
>>> the original script again with blastpgp-2.2.22 with the filter on (-F
>>> T) to see if the problem persists.
>>>
>>> I will write to the blast-help address at the ncbi to let them know
>>> about the problem.
>>>
>>> Best,
>>>
>>>
>>> -- Miguel
>>>
>>>
>> Hi,
>> Thanks for your updates! I can say one thing:
>> in the past I've used these three versions of blastpgp:
>>  - 2.2.15
>>  - 2.2.18
>>  - 2.2.19
>> and I found the "Query 0" problem in all of them, but if one
>> of them fails (I mean, gives "Query 0" output), the others may not fail
>> at all (most probably they will not give the "Query 0" output).
>>
>> Another interesting thing is that, with the three versions, the same
>> database, and the same parameters, the output is quite different...
>> ...sorry... very different...
>>
>> I'm also sure that it can also happen with the low complexity region(s)
>> filter set to "True".
>> What I observe is that there are no parameters that make it
>> disappear. It just disappears from one sequence and appears in
>> another.... In other words, changing parameters makes it "move"
>> between sequences.
>>
>> I've never used blastpgp 2.2.22, so I cannot say anything about it.
>>
>> Thanks
>> Andrea
>
>
> Then it looks like something more weird than what I thought...
> Andrea, would you mind if I send your e-mail to the blast people? Or
> perhaps you can do it yourself... I wrote to blast-help at ncbi.nlm.nih.gov
If you can, that would be a help for me. I hope they will reply.
I can also send an email, but if you have....
>
> I suspect they will tell us to use the XML output, but then, not all
> info I need seems to go there...
I think the same, and I suspect the XML output doesn't suffer from the
same problem.
>
> Thanks a lot!
>
>
To you!!
> -- Miguel
>
>
And as for my patch, it is not really a patch; I've checked now. To be fully
independent of NCBIStandalone.py, I didn't write a patch for it. It is a
"patch" only in the sense that I remove four lines from the blastpgp output,
starting from the "Query 0" one, and then I submit the "new output" to the
parser. In this way I'm reading the file twice (so it's not a good idea), but
I don't mind if NCBIStandalone.py changes, because I'm fully independent of
it.

This is my "simple code":

# This is not a patch, but it works. It means that if we find a way to
# remove the four lines starting from "Query 0", the problem is really
# solved (I no longer have any parser problems).
# "lines" is the list returned by a readlines() call on the blastpgp output.
# The returned list has to be converted back into a handle object
# before it is given to the parser.
def removeQuery0lines(lines):
    newlines = []
    skip = 0  # number of lines still to drop
    for line in lines:
        if line.startswith('Query: 0'):
            skip = 4  # drop this line and the three lines after it
        if skip:
            skip -= 1
            continue
        newlines.append(line)
    return newlines
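
The comment above says the filtered list has to be turned back into a handle before parsing. One possible way to do that is sketched below, assuming the report is small enough to hold in memory. The helper names strip_query0_blocks and as_handle are made up for this example, the filter is restated so the snippet runs on its own, and the sample lines are synthetic, not real blastpgp output.

```python
import io


def strip_query0_blocks(lines, block_size=4):
    """Drop each 'Query: 0' line together with the lines that follow it.

    block_size counts the 'Query: 0' line itself, so the default of 4
    removes that line plus the next three.
    """
    kept = []
    skip = 0
    for line in lines:
        if line.startswith("Query: 0"):
            skip = block_size
        if skip:
            skip -= 1
            continue
        kept.append(line)
    return kept


def as_handle(lines):
    """Wrap a list of text lines in a file-like object."""
    return io.StringIO("".join(lines))


# Synthetic example input, shaped like one affected alignment block.
raw = [
    "Query: 0   ------\n",
    "                 \n",
    "Sbjct: 1   MKTAYI 6\n",
    "\n",
    "Query: 1   MKVLAA 6\n",
    "           MK  A\n",
    "Sbjct: 7   MKTIAY 12\n",
]

handle = as_handle(strip_query0_blocks(raw))
cleaned = handle.read()
print(cleaned)
```

After a real run, the handle would be passed to the parser in place of the original file (rewinding it first with handle.seek(0) if it has already been read).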


It would be interesting to develop a patch that works inside the parser.
I will try to work on it in November, because I cannot now.
The right function to modify should be (inside NCBIStandalone.py):

def _scan_hsp_alignment(self, uhandle, consumer):
        # Query: 11 GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
        #           GRGVS+         TC    Y  + + V GGG+ + EE   L     +   I R+
        # Sbjct: 12 GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
        #
        # Query: 64 AEKILIKR 71
        #              I +K
        # Sbjct: 70 PNIIQLKD 77
        #

        while 1:
            # Blastn adds an extra line filled with spaces before Query
            attempt_read_and_call(uhandle, consumer.noevent, start='     ')
            read_and_call(uhandle, consumer.query, start='Query')
            read_and_call(uhandle, consumer.align, start='     ')
            read_and_call(uhandle, consumer.sbjct, start='Sbjct')
            read_and_call_while(uhandle, consumer.noevent, blank=1)
            line = safe_peekline(uhandle)
            # Alignment continues if I see a 'Query' or the spaces for Blastn.
            if not (line.startswith('Query') or line.startswith('     ')):
                break

changing it to:

def _scan_hsp_alignment(self, uhandle, consumer):
        # Query: 11 GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
        #           GRGVS+         TC    Y  + + V GGG+ + EE   L     +   I R+
        # Sbjct: 12 GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
        #
        # Query: 64 AEKILIKR 71
        #              I +K
        # Sbjct: 70 PNIIQLKD 77
        #
        while 1:
            # Blastn adds an extra line filled with spaces before Query
            attempt_read_and_call(uhandle, consumer.noevent, start='     ')
            # Remove Query 0 start (It is only at the beginning...)
            q0_count = attempt_read_and_call(uhandle, consumer.noevent,
                                             start='Query: 0')
            if q0_count:
                # if "Query 0" remove its alignment
                read_and_call(uhandle, consumer.noevent, start='     ')
                read_and_call(uhandle, consumer.noevent, start='Sbjct')
                read_and_call_while(uhandle, consumer.noevent, blank=1)
            # Remove Query 0 end
            read_and_call(uhandle, consumer.query, start='Query')
            read_and_call(uhandle, consumer.align, start='     ')
            read_and_call(uhandle, consumer.sbjct, start='Sbjct')
            read_and_call_while(uhandle, consumer.noevent, blank=1)
            line = safe_peekline(uhandle)
            # Alignment continues if I see a 'Query' or the spaces for Blastn.
            if not (line.startswith('Query') or line.startswith('     ')):
                break

BUT, I'm not sure about the patch and I haven't tried it at all... so I cannot
submit it... It needs to be tried and tested!!!!
And I'm also not sure whether this is the right place to patch....!!!!
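
Since the change above is untested, the control flow can at least be tried out in isolation. The sketch below is a simplified stand-in, not Biopython's code: PeekableLines and scan_hsp_alignment are hypothetical names, the class only imitates a handle with readline() and peekline(), and the sample lines are synthetic. It mirrors the idea of the proposed change: when a 'Query: 0' line is next, discard it together with its match line, its Sbjct line and the blank lines after them, then read the normal Query/match/Sbjct triple.

```python
class PeekableLines:
    """Tiny stand-in for a handle that supports readline() and peekline()."""

    def __init__(self, lines):
        self._lines = list(lines)
        self._pos = 0

    def peekline(self):
        # Return the next line without consuming it ("" at end of input).
        return self._lines[self._pos] if self._pos < len(self._lines) else ""

    def readline(self):
        line = self.peekline()
        if line:
            self._pos += 1
        return line


def skip_blank_lines(handle):
    while handle.peekline() and not handle.peekline().strip():
        handle.readline()


def scan_hsp_alignment(handle):
    """Collect (query, match, sbjct) triples, skipping any 'Query: 0' triple."""
    triples = []
    while True:
        if handle.peekline().startswith("Query: 0"):
            # Discard the spurious query line, its match line and its Sbjct line.
            for _ in range(3):
                handle.readline()
            skip_blank_lines(handle)
        query = handle.readline()
        match = handle.readline()
        sbjct = handle.readline()
        if not (query.startswith("Query") and sbjct.startswith("Sbjct")):
            raise ValueError("unexpected alignment block near %r" % query)
        triples.append((query, match, sbjct))
        skip_blank_lines(handle)
        # The alignment continues only if another 'Query' line follows.
        if not handle.peekline().startswith("Query"):
            break
    return triples


# Synthetic input: one spurious block followed by two normal blocks.
sample = [
    "Query: 0   ------\n",
    "                 \n",
    "Sbjct: 1   MKTAYI 6\n",
    "\n",
    "Query: 1   MKVLAA 6\n",
    "           MK  A\n",
    "Sbjct: 7   MKTIAY 12\n",
    "\n",
    "Query: 7   GIKRLS 12\n",
    "           G+ R S\n",
    "Sbjct: 13  GVDRES 18\n",
]

triples = scan_hsp_alignment(PeekableLines(sample))
print(len(triples))
```

A real test would still have to go through NCBIStandalone.py with genuine blastpgp output; this only exercises the order of the skip and read steps.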




I hope this helps....
Miguel, do you have time to try and test it?

Thanks a lot.
Andrea



From biopython at maubp.freeserve.co.uk  Thu Oct 15 15:15:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Oct 2009 16:15:30 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <4AD739CA.6090403@biodec.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<4AD5E001.6070506@biodec.com>
	<320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
	<4AD5E810.5090607@biodec.com>
	<4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>
	<4AD64602.9060603@biodec.com>
	<13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>
	<4AD72978.4030900@biodec.com>
	<05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
	<4AD739CA.6090403@biodec.com>
Message-ID: <320fb6e00910150815h268b588cx696915143da3f097@mail.gmail.com>

Hi guys,

So we still don't understand exactly what triggers this,
but it affects multiple versions of BLAST, and multiple
ways of calling blastpgp.

I think we should update the Biopython PSI parser
to tolerate (i.e. ignore) these "Query: 0" lines. It
would be very useful to have a few more examples
(ideally small files so we can include them with the
test suite), covering a few recent versions of BLAST.

You can email medium sized files to me personally
(NOT to the mailing list), and smaller files can be
uploaded to Bug 2927 (which I will reopen):
http://bugzilla.open-bio.org/show_bug.cgi?id=2927

Peter


From ibdeno at gmail.com  Thu Oct 15 15:33:59 2009
From: ibdeno at gmail.com (Miguel Ortiz Lombardia)
Date: Thu, 15 Oct 2009 17:33:59 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <4AD739CA.6090403@biodec.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>	
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>	
	<320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>	
		
	<320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>	
	<7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>	
		
	<320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>	
	<408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>	
	<4AD5E001.6070506@biodec.com>
	<320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
	<4AD5E810.5090607@biodec.com>
	<4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>
	<4AD64602.9060603@biodec.com>
	<13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>
	<4AD72978.4030900@biodec.com>
	<05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
	<4AD739CA.6090403@biodec.com>
Message-ID: 


On 15 Oct 2009, at 17:03, Andrea wrote:

> Miguel Ortiz Lombardia wrote:
>>
>> On 15 Oct 2009, at 15:54, Andrea wrote:
>>
>>> Miguel Ortiz Lombardia wrote:
>>>>
>>>> On 14 Oct 2009, at 23:43, Andrea wrote:
>>>>
>>>>> Miguel Ortiz Lombardia wrote:
>>>>>> On 14 Oct 2009, at 17:02, Andrea wrote:
>>>>>>> Peter wrote:
>>>>>>>> On Wed, Oct 14, 2009 at 3:28 PM, Andrea   
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hi everybody,
>>>>>>>>> I work with BLAST quite often, and I have run hundreds of thousands
>>>>>>>>> of blastpgp searches. The "Query 0" output of blastpgp is quite
>>>>>>>>> common for me, and I wrote a patch in my code to remove these
>>>>>>>>> "nasty" lines before passing the output to the parser.
>>>>>>>>>
>>>>>>>>> I find lines of this type in at least 1-2% of my runs, and I am
>>>>>>>>> fully sure that I found them both in the output of blast run from
>>>>>>>>> the shell and in the output of blast run via Biopython.
>>>>>>>>>
>>>>>>>>> The problem, in my opinion, is in the blastpgp algorithm, and maybe
>>>>>>>>> it could be managed in Biopython (as I did in my code) by cutting
>>>>>>>>> out these "Query 0" lines, because from the point of view of the
>>>>>>>>> alignments they don't make any sense. It seems that blastpgp wants
>>>>>>>>> to show which part of the target sequence aligns to the query
>>>>>>>>> before the starting point of the query itself (something like
>>>>>>>>> opening a gap at the beginning of the query).
>>>>>>>>> And this happens "sometimes", without any apparent reason.
>>>>>>>>>
>>>>>>>>
>>>>>>>> Andrea - do you have any small example output files with this
>>>>>>>> problem? If it does occur fairly often (1 to 2% of the time), then
>>>>>>>> we should try and update the parser to cope. Miguel's example
>>>>>>>> is useful for testing while working on a bug fix, but too big to
>>>>>>>> include as part of the unit tests.
>>>>>>>>
>>>>>>>>
>>>>>>> Hmm... I have to search. I have a "cache" of gzipped blastpgp
>>>>>>> outputs, but I'm not sure I have the originals (they may already be
>>>>>>> patched). What I am sure of is that in the next month I'm going to
>>>>>>> run almost 100,000 blastpgp searches, so I will certainly find
>>>>>>> something small. ;-)
>>>>>>>>> I think there isn't any problem with how Biopython wraps the
>>>>>>>>> blastpgp process, and maybe, but I'm not sure, the difference in
>>>>>>>>> the output could be related to small differences in the parameters
>>>>>>>>> of the process (or in the environment... or in the .ncbirc file).
>>>>>>>>>
>>>>>>>>> I have always observed identical blastpgp output via the shell
>>>>>>>>> (bash) and via the popen wrapper.
>>>>>>>>>
>>>>>>>>
>>>>>>>> If you saw "Query 0" output at the command line (shell), then that
>>>>>>>> is worth knowing.
>>>>>>
>>>>>> All I can say is that this is not what I observe.
>>>>>> 1. When I run exactly the same blastpgp search directly from the
>>>>>> shell (I capture the full command line issued in the background by
>>>>>> the python script with a 'ps -a | grep blastpgp'), I never find
>>>>>> the 'Query: 0' lines.
>>>>>> 2. When I run the search from within the python script and use
>>>>>> 'nohup', the problem is reproducible, not random.
>>>>> Yes, I'm sure it is reproducible. I mean that what I've observed
>>>>> wasn't random for one sequence, but maybe across many sequences...
>>>>>> 3. If the script is run without 'nohup', that is, if the shell keeps
>>>>>> full control of both standard error and output, then again the
>>>>>> problem seems to disappear. I say 'seems' because I haven't tried
>>>>>> with my longest (more than 1300 aa) sequences.
>>>>>> 4. When, from within the python script, I use the BlastpgpCommandline
>>>>>> class, as Peter suggested, to ask blastpgp to send the output to a
>>>>>> file (the -o option), the problem disappears irrespective of whether
>>>>>> or not I use 'nohup'.
>>>>>>
>>>>>> Therefore, in my opinion, the problem is not with blastpgp but with
>>>>>> the handling of its output by python or biopython.
>>>>>>
>>>>> I'm really curious. What you see is very strange, but I believe you
>>>>> fully.
>>>>>
>>>>> Would it be possible to have:
>>>>> your database,
>>>>> your .bashrc,
>>>>> the sequence,
>>>>> the exact command line,
>>>>> the version of blastpgp (2.2.18?),
>>>>> the other things you use (matrix...),
>>>>> the different possibilities you tried (nohup/python/shell)?
>>>>> It should be reproducible.
>>>>>
>>>>> Have you tried observing the behaviour of the blastpgp process with
>>>>> "strace", especially at the beginning?
>>>>>
>>>>>
>>>>>>>>
>>>>>>> I think so.
>>>>>>>>> Miguel, could you check whether everything really is identical?
>>>>>>>>> I'm really surprised by such strange behaviour...
>>>>>>>>
>>>>>>>> Maybe the environment variables are different or something?
>>>>>>
>>>>>> Command options are absolutely the same, see above. I am surprised
>>>>>> too, but I don't think blastpgp is sensitive to any environment
>>>>>> variable and I don't see how they could change from an in-script to a
>>>>>> standalone run.
>>>>> I can only think of .bashrc.
>>>>>>
>>>>>>>>
>>>>>>>>> Although in my opinion there isn't any problem in Biopython, and
>>>>>>>>> Miguel may be able to discover some difference in the way
>>>>>>>>> blastpgp is launched, I would suggest developing a patch (I could
>>>>>>>>> submit mine) that removes the "Query 0" lines.
>>>>>>
>>>>>> I couldn't find any differences, so I'm afraid I can't help... I'm
>>>>>> still testing the script; I will let you know if I find this problem
>>>>>> again.
>>>>> I will try to find the problem in my sequences (I would say it is
>>>>> quite common)... and if I find it I will try the same parameters
>>>>> from the shell...
>>>>>>
>>>>>>>>>
>>>>>>>> Could you upload your "Query 0" patch to Bug 2927?
>>>>>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>>>>>>>>
>>>>>>> I'm quite busy right now, because I'm working on a different project
>>>>>>> and I have deliveries to manage...
>>>>>>> but I will certainly upload my patch ASAP.
>>>>>>>>
>>>>>>>>> I apologize if I have misunderstood the problem, and for joining
>>>>>>>>> the discussion at this point (maybe when the discussion is
>>>>>>>>> finished)...
>>>>>>>>>
>>>>>>>>
>>>>>>>> Well I don't (yet) understand what the problem is either ;)
>>>>>>>>
>>>>>>>> Peter
>>>>>>>>
>>>>>>> Ciao
>>>>>>> andrea
>>>>>>
>>>>>>
>>>>>> Best,
>>>>>>
>>>>>>
>>>>>>
>>>>>> -- Miguel
>>>>>>
>>>>>>
>>>>> thanks.
>>>>> Ciao
>>>>> Andrea
>>>>
>>>> Hi!
>>>>
>>>> Some new findings that contradict my previous perception of the
>>>> problem.
>>>> Tonight my script failed again after stumbling upon the same problem
>>>> for a different sequence. I have now investigated more carefully and
>>>> found:
>>>>
>>>> 1. The problem (a line with 'Query: 0 ---' that impaired parsing of
>>>> the blastpgp output) was encountered in all these cases:
>>>>
>>>> a) nohup myscript.py [some script options] sequences.fasta >&
>>>> myscript.log &
>>>> b) myscript.py [some script options] sequences.fasta
>>>> c) /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/nr -i
>>>> U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5
>>>> -h 0.001 -p blastpgp
>>>>
>>>> That is, for the first time I was able to reproduce the problem from a
>>>> standalone run of blastpgp.
>>>>
>>>> 2. The problem disappears with a previous version of blastpgp
>>>> (2.2.18). Using this version, all these cases work:
>>>>
>>>> a) nohup myscript.py [some script options] sequences.fasta >&
>>>> myscript.log &
>>>> b) myscript.py [some script options] sequences.fasta
>>>> c) /usr/local/blast-2.2.18/bin/blastpgp -d /opt/BlastDBs/nr -i
>>>> U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5
>>>> -h 0.001 -p blastpgp
>>>>
>>>> So, it would seem that, as Andrea suggested, this is a bug in
>>>> blastpgp or, to be more precise, in versions after blastpgp-2.2.18.
>>>>
>>>> 3. In this particular case, I notice that the problem happens with a
>>>> sequence containing low complexity region(s). Now, I had thought that
>>>> the default in blastpgp was to filter those sequences out! I'm running
>>>> the original script again with blastpgp-2.2.22 with the filter on
>>>> (-F T) to see if the problem persists.
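
The standalone comparison in points 1 and 2 could be scripted roughly as follows. This is only a sketch, assuming Python 3 and that both blastpgp versions are installed at the paths quoted above. The options are copied from those command lines, while `build_command`, `has_query0`, `compare` and the per-version output file names are hypothetical helpers introduced for illustration.

```python
import subprocess

# Options copied from the command lines quoted above (everything except -o).
OPTIONS = [
    "-d", "/opt/BlastDBs/nr", "-i", "U7.fasta", "-m", "0",
    "-v", "500", "-b", "1000", "-a", "6", "-Q", "U7.nr.5.pssm",
    "-j", "5", "-h", "0.001", "-p", "blastpgp",
]


def build_command(version, outfile):
    """Build the blastpgp argument list for one installed version."""
    exe = "/usr/local/blast-%s/bin/blastpgp" % version
    return [exe] + OPTIONS + ["-o", outfile]


def has_query0(lines):
    """True if any line of the plain-text output starts with 'Query: 0'."""
    return any(line.startswith("Query: 0") for line in lines)


def compare(versions=("2.2.18", "2.2.22")):
    """Run each version and report whether its output has 'Query: 0' lines."""
    results = {}
    for version in versions:
        outfile = "tmp.bl.%s.txt" % version  # hypothetical file naming
        subprocess.check_call(build_command(version, outfile))
        with open(outfile) as handle:
            results[version] = has_query0(handle)
    return results
```

Calling `compare()` would return a dictionary keyed by version, which makes it easy to repeat the check over several query sequences.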
>>>>
>>>> I will write to the blast-help address at the ncbi to let them know
>>>> about the problem.
>>>>
>>>> Best,
>>>>
>>>>
>>>> -- Miguel
>>>>
>>>>
>>> Hi,
>>> Thanks for your updates! I can say one thing:
>>> I've used these three versions of blastpgp in the past:
>>> - 2.2.15
>>> - 2.2.18
>>> - 2.2.19
>>> and I found the "Query 0" problem in all of them; but if one of them
>>> fails on a sequence (I mean, gives "Query 0" output), the others may
>>> not fail at all (most probably they do not give the "Query 0" output).
>>>
>>> Another interesting thing is that with the three versions, the same
>>> database, and the same parameters, the output is quite different...
>>> ...sorry... very different...
>>>
>>> I'm also sure that it can happen with the low complexity region(s)
>>> filter set to "True" as well.
>>> What I observe is that there are no parameters that make it disappear.
>>> It just disappears from one sequence and appears in another... in
>>> other words, changing parameters makes it "move" between sequences.
>>>
>>> I've never used blastpgp 2.2.22, so I cannot say anything about it.
>>>
>>> Thanks
>>> Andrea
>>
>>
>> Then it looks like something weirder than I thought...
>> Andrea, would you mind if I send your e-mail to the blast people? Or
>> perhaps you can do it yourself... I wrote to blast-help at ncbi.nlm.nih.gov
> If you can, that would be a help to me. I hope they will reply.
> I can also send an email, but if you already have...

I will do that, no problem

>>
>> I suspect they will tell us to use the XML output, but then, not all
>> info I need seems to go there...
> I think the same, and I suspect the XML output doesn't suffer from the
> same problem.

For me the XML is not an option, since the NCBIXML parser does not really
support PSI-BLAST searches:
it can't get information on the rounds, convergence... If you have a
look at NCBIXML.py you see a lot of XXX TODO PSI...

>>
>> Thanks a lot!
>>
>>
> To you!!
>> -- Miguel
>>
>>
> And as for my patch, it is not really a patch; I've checked now. To be
> fully independent of NCBIStandalone.py, I didn't write a patch for it. It
> is a "patch" only in the sense that I remove four lines from the blastpgp
> output, starting from the "Query 0" one, and then submit the "new output"
> to the parser.
> This way I'm reading the file twice (so it's not a good idea), but I
> don't mind if NCBIStandalone.py changes, because I'm fully independent
> of it.
>
> This is my "simple code":
>
> ## THIS IS NOT A PATCH. BUT IT WORKS.
> ## THIS MEANS THAT IF WE FIND THE WAY
> ## TO REMOVE FOUR LINES STARTING
> ## FROM "Query 0" THE PROBLEM IS REALLY
> ## SOLVED (NOW I DON'T HAVE PARSER
> ## PROBLEMS AT ALL).
> ## lines is a list derived from a readlines() call on the
> ## output of blastpgp.
> ## newlines has to be converted back into a handle
> ## object.
> def removeQuery0lines(lines):
>        newlines = []
>        count = 0
>        for l in lines:
>                if count == 4: count = 0
>                if count != 0: count+=1
>                if l.startswith('Query: 0'): count = 1
>                if count == 0: newlines.append(l)
>        return newlines
>

Thanks!
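
The line filter quoted above can also hand its result straight back to a parser as a file-like object. A minimal sketch, assuming Python 3 and that each "Query: 0" record is exactly four lines long (the offending line plus the three that follow), as in Andrea's code. The function name `strip_query0` and the sample text are made up for illustration and are not real blastpgp output.

```python
from io import StringIO


def strip_query0(handle, block_len=4):
    """Return a new handle with every 'Query: 0' block removed.

    Assumes each block is block_len lines long, starting at the line
    that begins with 'Query: 0' (an assumption taken from the code above).
    """
    kept = []
    skip = 0  # number of lines still to drop
    for line in handle:
        if line.startswith("Query: 0"):
            skip = block_len
        if skip:
            skip -= 1
            continue
        kept.append(line)
    return StringIO("".join(kept))


# Made-up sample: one 'Query: 0' block followed by a normal block.
sample = (
    "Query: 0   ---\n"
    "           \n"
    "Sbjct: 1   MKV\n"
    "\n"
    "Query: 1   MKV 3\n"
    "           MKV\n"
    "Sbjct: 4   MKV 6\n"
)
cleaned = strip_query0(StringIO(sample))
```

The returned `cleaned` object can be passed wherever the parser expects a handle, so the list never has to be written back to disk.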

>
> It would be interesting to develop a patch that works inside the parser.
> I will try to work on it in November, because I cannot now.
> The right function to modify should be (inside NCBIStandalone.py):
>
> def _scan_hsp_alignment(self, uhandle, consumer):
>        # Query: 11 GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
>        #           GRGVS+         TC    Y  + + V GGG+ + EE   L     +   I R+
>        # Sbjct: 12 GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
>        #
>        # Query: 64 AEKILIKR 71
>        #              I +K
>        # Sbjct: 70 PNIIQLKD 77
>        #
>
>        while 1:
>            # Blastn adds an extra line filled with spaces before Query
>            attempt_read_and_call(uhandle, consumer.noevent, start='     ')
>            read_and_call(uhandle, consumer.query, start='Query')
>            read_and_call(uhandle, consumer.align, start='     ')
>            read_and_call(uhandle, consumer.sbjct, start='Sbjct')
>            read_and_call_while(uhandle, consumer.noevent, blank=1)
>            line = safe_peekline(uhandle)
>            # Alignment continues if I see a 'Query' or the spaces for Blastn.
>            if not (line.startswith('Query') or line.startswith('     ')):
>                break
>
> changing it to:
>
> def _scan_hsp_alignment(self, uhandle, consumer):
>        # Query: 11 GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
>        #           GRGVS+         TC    Y  + + V GGG+ + EE   L     +   I R+
>        # Sbjct: 12 GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
>        #
>        # Query: 64 AEKILIKR 71
>        #              I +K
>        # Sbjct: 70 PNIIQLKD 77
>        #
>        while 1:
>            # Blastn adds an extra line filled with spaces before Query
>            attempt_read_and_call(uhandle, consumer.noevent, start='     ')
>            # Remove Query 0 start (It is only at the beginning...)
>            q0_count = attempt_read_and_call(uhandle, consumer.noevent,
>                                             start='Query: 0')
>            if q0_count:
>                # if "Query 0" remove its alignment
>                read_and_call(uhandle, consumer.noevent, start='     ')
>                read_and_call(uhandle, consumer.noevent, start='Sbjct')
>                read_and_call_while(uhandle, consumer.noevent, blank=1)
>            # Remove Query 0 end
>            read_and_call(uhandle, consumer.query, start='Query')
>            read_and_call(uhandle, consumer.align, start='     ')
>            read_and_call(uhandle, consumer.sbjct, start='Sbjct')
>            read_and_call_while(uhandle, consumer.noevent, blank=1)
>            line = safe_peekline(uhandle)
>            # Alignment continues if I see a 'Query' or the spaces for Blastn.
>            if not (line.startswith('Query') or line.startswith('     ')):
>                break
>
> BUT, I'm not sure about the patch and I haven't tried it at all... so I
> cannot submit it... It needs to be tried and tested!
> And I'm also not sure whether this is the right place to patch!
>
>
>
>
> I hope this helps...
> Miguel, do you have time to try and test it?
>

I'm afraid not in the next 6 weeks...

Best,



-- Miguel






From biopython at maubp.freeserve.co.uk  Thu Oct 15 15:39:15 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Oct 2009 16:39:15 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: 
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
	<4AD5E810.5090607@biodec.com>
	<4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>
	<4AD64602.9060603@biodec.com>
	<13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>
	<4AD72978.4030900@biodec.com>
	<05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
	<4AD739CA.6090403@biodec.com>
	
Message-ID: <320fb6e00910150839s445f70fbx1998560c46389a94@mail.gmail.com>

You don't have to include *all* the previous email in the quote ;)

On Thu, Oct 15, 2009 at 4:33 PM, Miguel Ortiz Lombardia
 wrote:
>>>
>>> I suspect they will tell us to use the XML output, but then, not all
>>> info I need seems to go there...
>>
>> I think the same, and I suspect the XML output doesn't suffer from the
>> same problem.
>
> For me the XML is not an option, since the NCBIXML parser does not really
> support PSI-BLAST searches:
> it can't get information on the rounds, convergence... If you have a look at
> NCBIXML.py you see a lot of XXX TODO PSI...

There may well be some things missing in our parser, but last time I checked,
the XML file itself was missing lots of information found in the plain
text output.

Peter


From andrea at biodec.com  Thu Oct 15 15:39:48 2009
From: andrea at biodec.com (Andrea)
Date: Thu, 15 Oct 2009 17:39:48 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: 
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>	
	<320fb6e00910130441v7c170a86g3e3dff3145611690@mail.gmail.com>	
	<320fb6e00910130446v3f0cbeecha434b458f5703724@mail.gmail.com>	
		
	<320fb6e00910130636i715873e9w8cc3f12ffb83b0f1@mail.gmail.com>	
	<7AFF985A-4DDE-4B72-A81E-26516BDA689F@gmail.com>	
		
	<320fb6e00910140537h1d9d71f5i97417266c542ae3b@mail.gmail.com>	
	<408372EF-748A-4089-A6F6-28B8E1F88B4B@gmail.com>	
	<4AD5E001.6070506@biodec.com>
	<320fb6e00910140746j15124dccsa901b9cbee784bdc@mail.gmail.com>
	<4AD5E810.5090607@biodec.com>
	<4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>
	<4AD64602.9060603@biodec.com>
	<13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>
	<4AD72978.4030900@biodec.com>
	<05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
	<4AD739CA.6090403@biodec.com>
	
Message-ID: <4AD74244.2070603@biodec.com>

Miguel Ortiz Lombardia wrote:
>
> Le 15 oct. 09 ? 17:03, Andrea a ?crit :
>
>> Miguel Ortiz Lombardia ha scritto:
>>>
>>> Le 15 oct. 09 ? 15:54, Andrea a ?crit :
>>>
>>>> Miguel Ortiz Lombardia ha scritto:
>>>>>
>>>>> Le 14 oct. 09 ? 23:43, Andrea a ?crit :
>>>>>
>>>>>> Miguel Ortiz Lombardia ha scritto:
>>>>>>> Le 14 oct. 09 ? 17:02, Andrea a ?crit :
>>>>>>>> Peter ha scritto:
>>>>>>>>> On Wed, Oct 14, 2009 at 3:28 PM, Andrea 
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi to everybody,
>>>>>>>>>> I work with blast quite often and i could say i run hundreds of
>>>>>>>>>> thousand
>>>>>>>>>> of blastpgp. The "Query 0" outpt of blastpgp, is quite common
>>>>>>>>>> for
>>>>>>>>>> me, and
>>>>>>>>>> i wrote a patch to my code, to remove these "nasty" lines,
>>>>>>>>>> before
>>>>>>>>>> passing
>>>>>>>>>> the output to the parser.
>>>>>>>>>>
>>>>>>>>>> I found these type of lines in at least 1-2% of my runs. And i'm
>>>>>>>>>> fully sure
>>>>>>>>>> that i found them either in the output of blast via shell and in
>>>>>>>>>> the output
>>>>>>>>>> of blast via Biopython.
>>>>>>>>>>
>>>>>>>>>> The problem, according to me, is in the blastpgp algorithm and
>>>>>>>>>> maybe
>>>>>>>>>> could be managed in biopython (as i did in my code), cutting out
>>>>>>>>>> these
>>>>>>>>>> "Query 0" lines, because from the point of view of the
>>>>>>>>>> alignments,
>>>>>>>>>> they don't have any sense. It seems that blastpgp, wants to show
>>>>>>>>>> which is the part of the target sequence align to the query
>>>>>>>>>> before the
>>>>>>>>>> starting point of the query itself (something like opening a
>>>>>>>>>> gap,
>>>>>>>>>> at the
>>>>>>>>>> beginning of the query).
>>>>>>>>>> And this happens "sometimes", and without any apparent reason.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Andrea - do you have any small example output files with this
>>>>>>>>> problem? If it does occur fairly often (1 to 2% of the time),
>>>>>>>>> then
>>>>>>>>> we should try and update the parser to cope. Miguel's example
>>>>>>>>> is useful for testing while working on a bug fix, but too big to
>>>>>>>>> include as part the unit tests.
>>>>>>>>>
>>>>>>>>>
>>>>>>>> mmm... i've to search. I've some "cache" of gzipped blastpgp
>>>>>>>> outputs.
>>>>>>>> But I'm not
>>>>>>>> sure i've the original (maybe already patched).... waht I'm
>>>>>>>> sure, is
>>>>>>>> that in the
>>>>>>>> next month I'm going to run almost 100.000 blasptpg so I'll for
>>>>>>>> sure
>>>>>>>> find
>>>>>>>> something small. ;-)
>>>>>>>>>> What i think, is that there aren't any problem with biopython in
>>>>>>>>>> wrapping
>>>>>>>>>> the blastpgp process and maybe, but i'm not sure, the
>>>>>>>>>> difference in
>>>>>>>>>> the
>>>>>>>>>> output could be related to small differences in the parameter of
>>>>>>>>>> the process
>>>>>>>>>> (or in the environment... or in the .ncbirc file).
>>>>>>>>>>
>>>>>>>>>> I always was able to  observe  the identity  between the
>>>>>>>>>> blastpgp
>>>>>>>>>> output
>>>>>>>>>> via shell (bash) and the output of the popen wrapper.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If you saw "Query 0" output at the command line (shell), then
>>>>>>>>> that is
>>>>>>>>> worth knowing.
>>>>>>>
>>>>>>> All I can say is that this is not what I observe.
>>>>>>> 1. When I send directly from the shell exactly the same blastpgp
>>>>>>> search ( I capture the full command line issued in the
>>>>>>> background by
>>>>>>> the python script with a 'ps -a | grep blastpgp' ) I have never
>>>>>>> find
>>>>>>> the 'Query: 0' lines.
>>>>>>> 2. When I send the search from within the python script and use
>>>>>>> 'nohup', the problem is reproducible, not random.
>>>>>> yes, i'm sure is reproducible. I  mean that what I've observed
>>>>>> wasn't
>>>>>> random on one sequence, but maybe along
>>>>>> many sequences...
>>>>>>> 3. If the script is sent without 'nohup', that is, if the shell
>>>>>>> keeps
>>>>>>> full control of both standard error and output, then again, the
>>>>>>> problem seems to disappear. I say 'seems' because I haven't tried
>>>>>>> with
>>>>>>> my longest ( more than 1300 aa ) sequences.
>>>>>>> 4. When, from within the python script I use, as Peter
>>>>>>> suggested, the
>>>>>>> BlastpgpCommandline class to ask blastpgp to send the output to a
>>>>>>> file
>>>>>>> ( the -o option ) the problem disappears irrespectively whether
>>>>>>> I use
>>>>>>> or not 'nohup'.
>>>>>>>
>>>>>>> Therefore, in my opinion, the problem is not with blastpgp but with
>>>>>>> the handling of its output by python or biopython.
>>>>>>>
>>>>>> I'm really curious. What you have is very strange, but i believe you
>>>>>> fully.
>>>>>>
>>>>>> Is there the possibility to have:
>>>>>> your database,
>>>>>> your .bashrc
>>>>>> the sequence
>>>>>> the exact command line.
>>>>>> the versione of blastpgp
>>>>>> the versione of blastpgp (2.2.18 ?)
>>>>>> the other things you use (matrix.... )
>>>>>> the different possibilities you try....( nohup/python/shell )
>>>>>> I should be reprodcible.
>>>>>>
>>>>>> Have you tried to observe the behaviour of the blastpgp process
>>>>>> with a
>>>>>> "strace" expecially at the
>>>>>> beginning?
>>>>>>
>>>>>>
>>>>>>>>>
>>>>>>>> i think so.
>>>>>>>>>> Miguel, could you check if really everything is identical?
>>>>>>>>>> Because i'm
>>>>>>>>>> really surprised of such a strange behaviour....
>>>>>>>>>
>>>>>>>>> Maybe the environment variables are different or something?
>>>>>>>
>>>>>>> Command options are absolutely the same, see above. I am surprised
>>>>>>> too, but I don't think blastpgp is sensitive to any environment
>>>>>>> variable and I don't see how they could change from an in-script
>>>>>>> to a
>>>>>>> standalone run.
>>>>>> I think only to .bashrc.
>>>>>>>
>>>>>>>>>
>>>>>>>>>> Despite, according to me there aren't any problem in biopython,
>>>>>>>>>> and
>>>>>>>>>> maybe,
>>>>>>>>>> Miguel will be able to discover some differences in the way
>>>>>>>>>> blastpgp is
>>>>>>>>>> launched, i would suggest to develop a patch (i could submit
>>>>>>>>>> mine),
>>>>>>>>>> that
>>>>>>>>>> could remove "Query 0" lines.
>>>>>>>
>>>>>>> I couldn't find any differences, so I'm afraid I can't help... I'm
>>>>>>> still testing the script, I will let you know if I find again this
>>>>>>> problem.
>>>>>> I will try to find the problem in my sequences (but i could say
>>>>>> that is
>>>>>> quite common)... and if i will
>>>>>> find i will try with the same parameters and the shell...
>>>>>>>
>>>>>>>>>>
>>>>>>>>> Could you upload your "Query 0" patch to Bug 2927?
>>>>>>>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2927
>>>>>>>>>
>>>>>>>> Now i'm wuite busy, because i'm working on a different project and
>>>>>>>> i've
>>>>>>>> to manage deliveries...
>>>>>>>> but i will for sure upload my patch ASAP.
>>>>>>>>>
>>>>>>>>>> I aplogize if i understanded the problem wrongly and for the
>>>>>>>>>> fact
>>>>>>>>>> that
>>>>>>>>>> i'm entering in the discussion in this moment (maybe when the
>>>>>>>>>> discussion is finished)...
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Well I don't (yet) understand what the problem is either ;)
>>>>>>>>>
>>>>>>>>> Peter
>>>>>>>>>
>>>>>>>> Ciao
>>>>>>>> andrea
>>>>>>>
>>>>>>>
>>>>>>> Best,
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> -- Miguel
>>>>>>>
>>>>>>>
>>>>>> thanks.
>>>>>> Ciao
>>>>>> Andrea
>>>>>
>>>>> Hi!
>>>>>
>>>>> Some new findings that contradict my previous perception of the
>>>>> problem.
>>>>> Tonight my script failed again after stumbling upon the same problem
>>>>> for a different sequence. I have now investigated more carefully and
>>>>> found:
>>>>>
>>>>> 1. The problem (a line with 'Query: 0 ---' that impaired parsing of
>>>>> the blastpgp output) was encountered in all these cases:
>>>>>
>>>>> a) nohup myscript.py [some script options] sequences.fasta >&
>>>>> myscript.log &
>>>>> b) myscript.py [some script options] sequences.fasta
>>>>> c) /usr/local/blast-2.2.22/bin/blastpgp -d /opt/BlastDBs/nr -i
>>>>> U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5
>>>>> -h 0.001 -p blastpgp
>>>>>
>>>>> That is, for the first time I was able to reproduce the problem
>>>>> from a
>>>>> standalone run of blastpgp.
>>>>>
>>>>> 2. The problem disappears with a previous version of blastpgp
>>>>> (2.2.18). Using this version, all these cases work:
>>>>>
>>>>> a) nohup myscript.py [some script options] sequences.fasta >&
>>>>> myscript.log &
>>>>> b) myscript.py [some script options] sequences.fasta
>>>>> c) /usr/local/blast-2.2.18/bin/blastpgp -d /opt/BlastDBs/nr -i
>>>>> U7.fasta -m 0 -o tmp.bl.txt -v 500 -b 1000 -a 6 -Q U7.nr.5.pssm -j 5
>>>>> -h 0.001 -p blastpgp
>>>>>
>>>>> So, it would seem that, as Andrea suggested, this is a bug in
>>>>> blastpgp, to be more precise, after blastpgp-2.2.18.
>>>>>
>>>>> 3. In this particular case, I notice that the problem happens with a
>>>>> sequence containing low complexity region(s). Now, I had thought that
>>>>> the default in blastpgp was to filter those sequences out! I'm
>>>>> running
>>>>> the original script again with blastpgp-2.2.22 with the filter on (-F
>>>>> T) to see if the problem persists.
>>>>>
>>>>> I will write to the blast-help address at the ncbi to let them know
>>>>> about the problem.
>>>>>
>>>>> Best,
>>>>>
>>>>>
>>>>> -- Miguel
>>>>>
>>>>>
>>>> Hi,
>>>> Thanks for your updates!!!. I can say one thing:
>>>> I've used in the past these three versione of blastpgp:
>>>> - 2.2.15
>>>> - 2.2.18
>>>> - 2.2.19
>>>> and i found the "Query 0" problem in all of them, but, if one
>>>> of them fails (i mean, gives "Query 0" output) the other may not fail
>>>> at all (they most probably not give the "Query 0" output).
>>>>
>>>> Another interesting things is that, with the three version, the same
>>>> database, and the same parameters, the output is quite different...
>>>> ...sorry.. very different...
>>>>
>>>> I'm also sure that it could happens also with the complexity region(s)
>>>> filter "True".
>>>> What i observe, is that there aren't parameters that make it
>>>> disappear. It
>>>> just disappear from a sequence, and it will appear in another.... in
>>>> other
>>>> word, changing parameters, make it "moving"  between sequences.
>>>>
>>>> I've never used blastpgp 2.2.22. So i cannot say anything about it.
>>>>
>>>> Thanks
>>>> Andrea
>>>
>>>
>>> Then it looks like something more weird than what I thought...
>>> Andrea, would you mind if I send your e-mail to the blast people? Or
>>> perhaps you can do it yourself... I wrote to
>>> blast-help at ncbi.nlm.nih.gov
>> If you can, for me is an help. I hope they will reply.
>> I can also send and email, buti f you have....
>
> I will do that, no problem
>
>>>
>>> I suspect they will tell us to use the XML output, but then, not all
>>> info I need seems to go there...
>> i think the same, and i suspect the XML output doesn't suffer of the
>> same problem.
>
> For me the XML is a no issue, since the NCBIXML parser does not really
> support PSI-BLAST searches:
> it can't get information on the rounds, convergence... If you have a
> look to NCBIXML.py you see a lot of XXX TODO PSI...
>
>>>
>>> Thanks a lot!
>>>
>>>
>> To you!!
>>> -- Miguel
>>>
>>>
>> And for my patch, is not a patch.I've checked now. To be fully
>> independent
>> from NcbiStandalone.py i didn't write a patch for it. I wrote a patch
>> in the sense that actually i remove from the blastpgp output, four
>> lines, starting
>> from the "Query 0" one, and then i submit the "new output" to the
>> parser.
>> In this way i'm reading the file twice (so it's not a good idea), but i
>> don't mind
>> if the NcbiStandalone.py change, because I'm fully independent from it.
>>
>> This is my "simple code":
>>
>> ## THIS IS NOT A PATCH. BUT IT WORKS.
>> ## THIS MEANS THAT IF WE FIND THE WAY
>> ## TO REMOVE FOUR LINES STARTING
>> ## FROM "Query 0" THE PROBLEM IS REALLY
>> ## SOLVED (NOW I DON'T HAVE PARSER
>> ## PROBLEMS AT ALL).
>> ## lines is a list derived from a  readlines() call of the
>> ## output of blastpgp.
>> ## newlines has to be reconverted into an handle
>> ## object.
>> def removeQuery0lines(lines):
>>        newlines = []
>>        count = 0
>>        for l in lines:
>>                if count == 4: count = 0
>>                if count != 0: count+=1
>>                if l.startswith('Query: 0'): count = 1
>>                if count == 0: newlines.append(l)
>>        return newlines
>>
>
> Thanks!
>
>>
>> It should be interesting to develope a patch that  works  inside the
>> parser.
>> I will try to work on it, in November, becaue now i cannot.
>> The right function to manipulate it should be (inside
>> NCBIStandalone.py):
>>
>> def _scan_hsp_alignment(self, uhandle, consumer):
>>        # Query: 11
>> GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
>>        #           GRGVS+         TC    Y  + + V GGG+ + EE   L    
>> +   I R+
>>        # Sbjct: 12
>> GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
>>        #
>>        # Query: 64 AEKILIKR 71
>>        #              I +K
>>        # Sbjct: 70 PNIIQLKD 77
>>        #
>>
>>        while 1:
>>            # Blastn adds an extra line filled with spaces before Query
>>            attempt_read_and_call(uhandle, consumer.noevent,
>> start='     ')
>>            read_and_call(uhandle, consumer.query, start='Query')
>>            read_and_call(uhandle, consumer.align, start='     ')
>>            read_and_call(uhandle, consumer.sbjct, start='Sbjct')
>>            read_and_call_while(uhandle, consumer.noevent, blank=1)
>>            line = safe_peekline(uhandle)
>>            # Alignment continues if I see a 'Query' or the spaces for
>> Blastn.
>>            if not (line.startswith('Query') or line.startswith('    
>> ')):
>>                break
>>
>> changing it in:
>>
>> def _scan_hsp_alignment(self, uhandle, consumer):
>>        # Query: 11
>> GRGVSACA-------TCDGFFYRNQKVAVIGGGNTAVEEALYLSNIASEVHLIHRRDGF
>>        #           GRGVS+         TC    Y  + + V GGG+ + EE   L    
>> +   I R+
>>        # Sbjct: 12
>> GRGVSSVVRRCIHKPTCKE--YAVKIIDVTGGGSFSAEEVQELREATLKEVDILRKVSG
>>        #
>>        # Query: 64 AEKILIKR 71
>>        #              I +K
>>        # Sbjct: 70 PNIIQLKD 77
>>        #
>>        while 1:
>>            # Blastn adds an extra line filled with spaces before Query
>>            attempt_read_and_call(uhandle, consumer.noevent, start='     ')
>>            # Remove Query 0 start (It is only at the beginning...)
>>            q0_count = attempt_read_and_call(uhandle, consumer.noevent,
>>                                             start='Query: 0')
>>            if q0_count:
>>                # if "Query 0" remove its alignment
>>                read_and_call(uhandle, consumer.noevent, start='     ')
>>                read_and_call(uhandle, consumer.noevent, start='Sbjct')
>>                read_and_call_while(uhandle, consumer.noevent, blank=1)
>>            # Remove Query 0 end
>>            read_and_call(uhandle, consumer.query, start='Query')
>>            read_and_call(uhandle, consumer.align, start='     ')
>>            read_and_call(uhandle, consumer.sbjct, start='Sbjct')
>>            read_and_call_while(uhandle, consumer.noevent, blank=1)
>>            line = safe_peekline(uhandle)
>>            # Alignment continues if I see a 'Query' or the spaces for Blastn.
>>            if not (line.startswith('Query') or line.startswith('     ')):
>>                break
>>
>> BUT I'm not sure about the patch and I haven't tried it at all, so I
>> cannot submit it. It needs to be tried and tested!
>> I'm also not sure whether this is the right place to patch.
>>
>>
>>
>>
>> I hope this could help....
>> Miguel, have you time to try and test?
>>
>
> I'm afraid not in the next 6 weeks...
>
> Best,
>
>
>
> -- Miguel
>
>
So I will try in 3 weeks... ;-)
And, as Peter suggested, we will move the discussion to

http://bugzilla.open-bio.org/show_bug.cgi?id=2927

with some examples....

Ciao
Andrea




From natassa_g_2000 at yahoo.com  Thu Oct 15 16:00:28 2009
From: natassa_g_2000 at yahoo.com (natassa)
Date: Thu, 15 Oct 2009 09:00:28 -0700 (PDT)
Subject: [Biopython] Adaptor trimmer and dimers
Message-ID: <355533.31188.qm@web52001.mail.re2.yahoo.com>

Hallo Biopythoners, 
I followed the recent thread about adaptor trimming, which I intend to do on Illumina runs, but I am not sure where exactly on GitHub I can find Brad Chapman's trimming code AFTER the modifications he made based on that thread. I'd like to test that code, which looks very appealing to me if it computes a global alignment and allows a certain leniency, e.g. a number of mismatches. The link in Brad Chapman's original post on the trimmer points to a non-Biopython GitHub repository (sorry if I have misunderstood these things!), and I have the impression it has not been updated with the above (and other) features discussed in the thread.

On the same topic, I would like to ask about people's experience with detecting adaptor dimers. I have just started considering the issue, and my understanding is that Illumina technology at least is mostly prone to adapter dimers, rather than adapter fragments within the reads. This was confirmed by the company that sequenced my samples. So I was surprised to find no discussion of dimers and no obvious adaptation of scripts for detecting them. Maybe I am wrong?
I tested a Perl script that detects 'adapter-only sequences', but when I tried to visually inspect those to see whether they represent dimers, I realized the importance of doing a global alignment ;-), since the script does a local one. The fraction of adapter-only sequences, if those are the dimers I am looking for, is small, so I'd be happy to filter them out. But I am not sure about this, and lacking a way to detect such dimers, I would happily give a trimmer a go, though not a very aggressive one!
Do you think adapter trimming is critical? What fraction of your Illumina reads contained adapters?
Sorry for the overflow of questions!
Many thanks, 
Anastasia 
Anastasia Gioti
Post-Doc, Evolutionary Biology Department
Uppsala University
Norbyvägen 18D
SE-752 36 UPPSALA
anastasia.gioti at ebc.uu.se
Tel: +46-18-471 2837
Fax: +46-18-471 6310







      


From biopython at maubp.freeserve.co.uk  Thu Oct 15 16:09:33 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Oct 2009 17:09:33 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <025E6C33-9CCE-4319-907B-6C39289C014C@gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>
	<4AD64602.9060603@biodec.com>
	<13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>
	<4AD72978.4030900@biodec.com>
	<05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
	<4AD739CA.6090403@biodec.com>
	
	<320fb6e00910150839s445f70fbx1998560c46389a94@mail.gmail.com>
	<025E6C33-9CCE-4319-907B-6C39289C014C@gmail.com>
Message-ID: <320fb6e00910150909i6cb01f59le2021300f70b6458@mail.gmail.com>

CC'd back to mailing list

On Thu, Oct 15, 2009 at 4:51 PM, Miguel Ortiz Lombardia
 wrote:
> On 15 Oct 09, at 17:39, Peter wrote:
>
>>> For me the XML is a no issue, since the NCBIXML parser does not really
>>> support PSI-BLAST searches:
>>> it can't get information on the rounds, convergence... If you have a look
>>> to NCBIXML.py you see a lot of XXX TODO PSI...
>>
>> There may well be some things missing in our parser, but last time I
>> checked, the XML file itself was missing lots of information found in
>> the plain text output.
>>
>> Peter
>
> I am sending you an XML file from a PSI-BLAST run that converged. In it you
> can see, for example, information about the iteration number and convergence.
> It's just 59 KB; I can upload it to bug 2927, but I suspect you would prefer
> that I didn't, since this is a new issue (XML parser).
>
> IMHO, the NCBIXML parser should behave (concerning PSI-BLAST) just like
> NCBIStandalone.PSIBlastParser. Perhaps an NCBIXML.PSIBlastParser should be
> created? Just an idea...

Michiel also thinks the PSI BLAST XML parser could be better, see:

http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006780.html
...
http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006791.html
http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006802.html

Can you file a new bug about PSI-BLAST XML parsing (and attach that example)
please? I'd have to look over the new PSI-BLAST XML files before having an
informed opinion.

Peter



From biopython at maubp.freeserve.co.uk  Thu Oct 15 16:20:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Oct 2009 17:20:47 +0100
Subject: [Biopython] Adaptor trimmer and dimers
In-Reply-To: <355533.31188.qm@web52001.mail.re2.yahoo.com>
References: <355533.31188.qm@web52001.mail.re2.yahoo.com>
Message-ID: <320fb6e00910150920y6ada0463s60ce0b4e5f788449@mail.gmail.com>

On Thu, Oct 15, 2009 at 5:00 PM, natassa  wrote:
> Hallo Biopythoners,
> I followed a recent thread conversation about adaptor trimming,
> which I intend to do on Illumina runs, and I am not sure I know
> where exactly in github I could find Brad Chapman's code for
> trimming AFTER modifications that he has done based on the
> thread conversation. ...

I guess you mean Brad's August Blog Post:
http://bcbio.wordpress.com/2009/08/09/trimming-adaptors-from-short-read-sequences/
and the following mailing list thread which included some tips on
speeding up the Biopython side of things:
http://lists.open-bio.org/pipermail/biopython/2009-August/005417.html

For anyone else interested, there are some simple examples in the
tutorial (using SeqRecord slicing - elegant and simple, but a bit slow):
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-primer
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-adaptor

And I did a blog post about low level FASTQ handling for speed
at the cost of flexibility and simplicity (using some of the same
ideas from the August mailing list discussion):
http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
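
On the question of tolerating a number of mismatches: the following is
only a minimal sketch in plain Python. It is not Brad's code and it is
not a global alignment. It slides the adaptor along the read, counts
mismatches position by position (no gaps), and cuts the read at the
first acceptable hit. The function names, the max_mismatches and
min_overlap parameters, and the example sequences are all made up for
illustration.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def trim_adaptor(read, adaptor, max_mismatches=1, min_overlap=4):
    """Return the read with the adaptor (and everything after it) removed.

    The adaptor may run off the 3' end of the read; in that case only the
    overlapping prefix of the adaptor is compared. No gaps are allowed.
    """
    for pos in range(len(read)):
        window = read[pos:pos + len(adaptor)]
        if len(window) < min_overlap:
            break  # remaining overlap is too short to be trusted
        if count_mismatches(window, adaptor[:len(window)]) <= max_mismatches:
            return read[:pos]
    return read  # no adaptor found, leave the read untouched


# Hypothetical adaptor and reads, for illustration only.
adaptor = "GATCGGAAGAGC"
exact = trim_adaptor("ACGTACGTTTG" + "GATCGGAAG", adaptor, max_mismatches=0)
fuzzy = trim_adaptor("TTTTTTTT" + "GATCGCAAGAGC", adaptor, max_mismatches=1)
```

Because it never opens gaps, this approach will miss adaptors that
contain insertions or deletions; that is where a proper alignment
would be needed.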

Peter


From ibdeno at gmail.com  Thu Oct 15 16:24:24 2009
From: ibdeno at gmail.com (Miguel Ortiz Lombardia)
Date: Thu, 15 Oct 2009 18:24:24 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <320fb6e00910150909i6cb01f59le2021300f70b6458@mail.gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<4CDC51BD-7953-47DE-910C-F1F50F0C3275@gmail.com>
	<4AD64602.9060603@biodec.com>
	<13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>
	<4AD72978.4030900@biodec.com>
	<05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
	<4AD739CA.6090403@biodec.com>
	
	<320fb6e00910150839s445f70fbx1998560c46389a94@mail.gmail.com>
	<025E6C33-9CCE-4319-907B-6C39289C014C@gmail.com>
	<320fb6e00910150909i6cb01f59le2021300f70b6458@mail.gmail.com>
Message-ID: 


On 15 Oct 09, at 18:09, Peter wrote:
>>
>> IMHO, the NCBIXML parser should behave (concerning PSI-BLAST) just like
>> NCBIStandalone.PSIBlastParser. Perhaps an NCBIXML.PSIBlastParser should be
>> created? Just an idea...
>
> Michiel also thinks the PSI BLAST XML parser could be better, see:
>
> http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006780.html
> ...
> http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006791.html
> http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006802.html
>

Sorry to have missed them.
I still believe that the logic behind the
NCBIStandalone.PSIBlastParser is correct or, at least, useful. But I
could change my mind if you think otherwise.

The XML file that I sent to Peter came from blastpgp 2.2.22. It seems  
to me that it is a proper XML file, not a concatenation.

> Can you file a new bug about PSI-BLAST XML parsing (and attach that
> example) please? I'd have to look over the new PSI-BLAST XML files
> before having an informed opinion.


I have filed the bug:

http://bugzilla.open-bio.org/show_bug.cgi?id=2929

and have uploaded the XML from blastpgp v. 2.2.22 mentioned above.

Best,


-- Miguel






From biopython at maubp.freeserve.co.uk  Thu Oct 15 16:32:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Oct 2009 17:32:06 +0100
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: 
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>
	<4AD72978.4030900@biodec.com>
	<05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
	<4AD739CA.6090403@biodec.com>
	
	<320fb6e00910150839s445f70fbx1998560c46389a94@mail.gmail.com>
	<025E6C33-9CCE-4319-907B-6C39289C014C@gmail.com>
	<320fb6e00910150909i6cb01f59le2021300f70b6458@mail.gmail.com>
	
Message-ID: <320fb6e00910150932x5cdef3c0k5ad9b28eea51930f@mail.gmail.com>

On Thu, Oct 15, 2009 at 5:24 PM, Miguel Ortiz Lombardia
 wrote:
>
> On 15 Oct 09, at 18:09, Peter wrote:
>>>
>>> IMHO, the NCBIXML parser should behave (concerning PSI-BLAST) just like
>>> NCBIStandalone.PSIBlastParser. Perhaps an NCBIXML.PSIBlastParser should be
>>> created? Just an idea...
>>
>> Michiel also thinks the PSI BLAST XML parser could be better, see:
>>
>> http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006780.html
>> ...
>> http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006791.html
>> http://lists.open-bio.org/pipermail/biopython-dev/2009-September/006802.html
>>
>
> Sorry to have missed them.

They were on the dev list, so that makes sense.

> I still believe that the logic behind the NCBIStandalone.PSIBlastParser is
> correct or, at least, useful. But I could change my mind if you think
> otherwise.

The idea behind the NCBIStandalone.PSIBlastParser plain text parser, and
its object structure, makes sense.

> The XML file that I sent to Peter came from blastpgp 2.2.22. It seems to me
> that it is a proper XML file, not a concatenation.
>
>> Can you file a new bug about PSI-BLAST XML parsing (and attach that
>> example) please? I'd have to look over the new PSI-BLAST XML files
>> before having an informed opinion.
>
> I have filed the bug:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2929
> and have upload the XML from blastpgp v. 2.2.22 mentioned above.

Lovely - thank you. So that is a single query, with 3 iterations.
What would be *really* nice, is a multiple query file (say three
queries, each needing just a few iterations to keep the file small).

Peter



From ibdeno at gmail.com  Thu Oct 15 16:52:45 2009
From: ibdeno at gmail.com (Miguel Ortiz Lombardia)
Date: Thu, 15 Oct 2009 18:52:45 +0200
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <320fb6e00910150932x5cdef3c0k5ad9b28eea51930f@mail.gmail.com>
References: <52B983D8-CA0E-4C04-A272-7DA2011D14FC@gmail.com>
	<13B32636-1A38-4E0E-96A1-62D75B6AAEFC@gmail.com>
	<4AD72978.4030900@biodec.com>
	<05361598-93AF-4328-BE4B-25D4300DCDE0@gmail.com>
	<4AD739CA.6090403@biodec.com>
	
	<320fb6e00910150839s445f70fbx1998560c46389a94@mail.gmail.com>
	<025E6C33-9CCE-4319-907B-6C39289C014C@gmail.com>
	<320fb6e00910150909i6cb01f59le2021300f70b6458@mail.gmail.com>
	
	<320fb6e00910150932x5cdef3c0k5ad9b28eea51930f@mail.gmail.com>
Message-ID: <1FA32B70-84CC-4012-97F5-9B67D56BADC6@gmail.com>


On 15 Oct 09, at 18:32, Peter wrote:
>
>> I still believe that the logic behind the NCBIStandalone.PSIBlastParser
>> is correct or, at least, useful. But I could change my mind if you think
>> otherwise.
>
> The idea behind the NCBIStandalone.PSIBlastParser plain text parser, and
> its object structure, makes sense.
>

Good!

>> I have filed the bug:
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2929
>> and have upload the XML from blastpgp v. 2.2.22 mentioned above.
>
> Lovely - thank you. So that is a single query, with 3 iterations.
> What would be *really* nice, is a multiple query file (say three
> queries, each needing just a few iterations to keep the file small).


I have never used a multiple query file... Do you mean starting from a
multiple-alignment file with the -B option?

-- Miguel






From pengyu.ut at gmail.com  Thu Oct 15 21:17:26 2009
From: pengyu.ut at gmail.com (Peng Yu)
Date: Thu, 15 Oct 2009 16:17:26 -0500
Subject: [Biopython] How to get sequences upstream of TSS of genes?
Message-ID: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com>

I have a set of genes. I want to get the 5kb sequence that is upstream
of the TSS's of each gene.

I have the following specific questions. Could somebody help me? Thank you!

Which database can I access to get the mouse genome?
Given a gene name, what function should I call to get the gene's location?


From carlos.borroto at gmail.com  Thu Oct 15 21:18:17 2009
From: carlos.borroto at gmail.com (Carlos Javier Borroto)
Date: Thu, 15 Oct 2009 17:18:17 -0400
Subject: [Biopython] How to construct a SeqRecord with the info in the
	SeqFeatures type mRNA or CDS?
Message-ID: <65d4b7fc0910151418q30347842yb48e6508a2cab544@mail.gmail.com>

Hi,

I want to construct a SeqRecord whose sequence is made from the sum of
the Locations of the SubFeatures I get from a SeqFeature of type mRNA or
CDS. Does Biopython already have something to do this? It looks like
something many people may want, but it is proving to be kind of difficult
to implement manually, so I'm wondering if it is already there?

I read in the tutorial that you can slice a SeqRecord, but I can't
find a reference on how to form a SeqRecord from several different
slices, something like:

new_record = record[1:200] + record[400:600]

thanks in advance,
-- 
Carlos Javier Borroto
Baltimore, MD
Google Voice: (410) 929 4020


From biopython at maubp.freeserve.co.uk  Thu Oct 15 21:35:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Oct 2009 22:35:52 +0100
Subject: [Biopython] How to construct a SeqRecord with the info in the
	SeqFeatures type mRNA or CDS?
In-Reply-To: <65d4b7fc0910151418q30347842yb48e6508a2cab544@mail.gmail.com>
References: <65d4b7fc0910151418q30347842yb48e6508a2cab544@mail.gmail.com>
Message-ID: <320fb6e00910151435k26f48292jb0eaab5661d83759@mail.gmail.com>

On Thu, Oct 15, 2009 at 10:18 PM, Carlos Javier Borroto
 wrote:
> Hi,
>
> I want to construct a SeqRecord with the sequence make from the sum of
> the Locations of the SubFeatures I get from a SeqFeature type mRNA or
> CDS. Does biopython has something already to do this? It looks like
> something many people may want, but is proving to be king of difficult
> to implement manually, so I'm wondering if is already there?

There isn't anything built in now, partly because to do it properly
means coping with a lot of possible fuzzy locations and joins.
I can go into more detail, but it would help to know what kind
of organisms you are working with. For prokaryotes and viruses,
CDS locations are (usually) trivial so you just need the start, end
and strand.

> I read in the tutorial that you can splice a SeqRecord, but I can't
> find a reference to how to form a SeqRecord from several different
> splicing, something like:
>
> new_record = record[1:200] + record[400:600]

That isn't built in, but is something I've been working on that
might be in Biopython in future. Do you fancy trying some
experimental code?

http://github.com/peterjc/biopython/tree/seqrecords
http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html

Peter


From biopython at maubp.freeserve.co.uk  Thu Oct 15 21:42:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Oct 2009 22:42:41 +0100
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com>
References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com>
Message-ID: <320fb6e00910151442v4a96cbd6j7a03d3f397b9c264@mail.gmail.com>

On Thu, Oct 15, 2009 at 10:17 PM, Peng Yu  wrote:
> I have a set of genes. I want to get the 5kb sequence that is upstream
> of the TSS's of each gene.
>
> I have the following specific questions. Could somebody help me? Thank you!
>
> Which database I can access to get mouse genome?
> Give a gene name what function I should call to get the gene's location?

I am not familiar with mouse specific databases.

My first instinct would be to download the GenBank files for
all the mouse chromosomes via FTP from the NCBI. You
can parse these with Biopython, and pull out the gene of
interest. Then using the gene's strand and the start/end
location, you can deduce the coordinates to the upstream
region, and take this section from the chromosome sequence
(and reverse complement if on the reverse strand).
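
As a rough sketch of the coordinate arithmetic just described, here is
a plain Python version working on strings rather than Biopython
objects. The function names and the 0-based, end-exclusive coordinate
convention (which is what Python slicing uses) are assumptions for
illustration, and the gene start is simply taken as a stand-in for
the TSS.

```python
def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string (A, C, G, T, N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def upstream_region(chromosome, start, end, strand, length=5000):
    """Return up to `length` bases upstream of a gene.

    start and end are 0-based, end-exclusive positions on the forward
    strand; strand is +1 or -1. The region is clipped at the chromosome
    ends, and is returned in the orientation of the gene.
    """
    if strand == 1:
        # Forward strand: upstream means lower coordinates.
        left = max(0, start - length)
        return chromosome[left:start]
    if strand == -1:
        # Reverse strand: upstream means higher coordinates, and the
        # result must be reverse complemented to read 5' to 3'.
        right = min(len(chromosome), end + length)
        return reverse_complement(chromosome[end:right])
    raise ValueError("strand must be +1 or -1")


# Toy 12 bp "chromosome" with a gene occupying positions 3 to 6.
toy = "AAACCCGGGTTT"
forward_flank = upstream_region(toy, 3, 6, 1, length=3)
reverse_flank = upstream_region(toy, 3, 6, -1, length=3)
```

With real data the chromosome would be the sequence of a parsed
GenBank record, and the start, end and strand would come from the
location of the gene feature.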

Peter


From biopython at maubp.freeserve.co.uk  Thu Oct 15 21:48:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 15 Oct 2009 22:48:20 +0100
Subject: [Biopython] How to construct a SeqRecord with the info in the
	SeqFeatures type mRNA or CDS?
In-Reply-To: <320fb6e00910151435k26f48292jb0eaab5661d83759@mail.gmail.com>
References: <65d4b7fc0910151418q30347842yb48e6508a2cab544@mail.gmail.com>
	<320fb6e00910151435k26f48292jb0eaab5661d83759@mail.gmail.com>
Message-ID: <320fb6e00910151448i125bf77emb8dafcf30d9fdd1a@mail.gmail.com>

On Thu, Oct 15, 2009 at 10:35 PM, Peter wrote:
> On Thu, Oct 15, 2009 at 10:18 PM, Carlos Javier Borroto wrote:
>> Hi,
>>
>> I want to construct a SeqRecord with the sequence make from the sum of
>> the Locations of the SubFeatures I get from a SeqFeature type mRNA or
>> CDS. Does biopython has something already to do this? It looks like
>> something many people may want, but is proving to be king of difficult
>> to implement manually, so I'm wondering if is already there?
>
> There isn't anything built in now, partly because to do it properly
> means coping with a lot of possible fuzzy locations and joins.
> I can go into more detail, but it would help to know what kind
> of organisms are you working with? For prokaryotes and viruses,
> CDS locations are (usually) trivial so you just need the start, end
> and strand.

There is a partly tested function called get_feature_nuc in the
unit test file test_SeqIO_features.py, which takes a SeqFeature
and the parent Seq object. In fact looking at it now, some of
the comments look out of date (I think I fixed the GenBank
parser to cope with mixed strand features ...). This might do
what you want - but as I said, it needs more testing.

It had crossed my mind (as you can tell from the comments)
that this could be added to Biopython proper at some point.
One idea was as a method of the SeqRecord object, which
would take a SeqFeature (or just the integer index of the
desired feature in the SeqRecord's list of features).
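
For the simple case where all the parts of a feature lie on the same
strand, the idea can be sketched in plain Python as below. This is not
the get_feature_nuc function from the unit tests; it is a simplified
illustration that ignores fuzzy locations and mixed-strand joins, and
the function names and the list-of-tuples representation of the parts
are assumptions made here.

```python
def reverse_complement(seq):
    """Reverse complement of an unambiguous DNA string (A, C, G, T, N)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def extract_joined(parent, parts, strand=1):
    """Concatenate several regions of parent into one sequence.

    parts is a list of (start, end) pairs, 0-based and end-exclusive,
    given on the forward strand. For a reverse strand feature the
    forward-strand pieces are joined first and the whole result is then
    reverse complemented, which mirrors a GenBank location such as
    complement(join(...)).
    """
    joined = "".join(parent[start:end] for start, end in sorted(parts))
    if strand == -1:
        return reverse_complement(joined)
    return joined


# Toy parent sequence with two "exons" at 0..3 and 6..9.
toy_parent = "AAACCCGGGTTT"
forward_cds = extract_joined(toy_parent, [(0, 3), (6, 9)])
reverse_cds = extract_joined(toy_parent, [(0, 3), (6, 9)], strand=-1)
```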

Peter


From mjldehoon at yahoo.com  Fri Oct 16 01:04:20 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 15 Oct 2009 18:04:20 -0700 (PDT)
Subject: [Biopython] Problems parsing with PSIBlastParser
In-Reply-To: <1FA32B70-84CC-4012-97F5-9B67D56BADC6@gmail.com>
Message-ID: <737542.47267.qm@web62401.mail.re1.yahoo.com>

Last time I checked (which was a few weeks ago), a multiple-query PSIBlast search gives a file consisting of concatenated XML files. The problem is in the design of Blast XML output. For a single-query PSIBlast, the fields under  are used to store the output of the PSIBlast iterations. For multiple-query regular Blast, the same fields are used to store the search results of each query. With multiple-query PSIBlast, there is then no way to store the output in the current XML format.
I've been meaning to write to NCBI about this, but I haven't gotten round to it yet. Will do so this weekend.
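
In the meantime, one possible workaround for such concatenated output
is to split it back into separate documents before parsing. A minimal
sketch, assuming each document begins with its own <?xml declaration;
the function name and the toy <report> documents are invented for
illustration and are not real BLAST output.

```python
import xml.etree.ElementTree as ET


def split_concatenated_xml(text, marker="<?xml"):
    """Split a string holding several XML documents back to back.

    Every occurrence of marker is taken as the start of a new document.
    Returns a list of strings, one per document (empty if no marker).
    """
    starts = []
    pos = text.find(marker)
    while pos != -1:
        starts.append(pos)
        pos = text.find(marker, pos + 1)
    ends = starts[1:] + [len(text)]
    return [text[a:b] for a, b in zip(starts, ends)]


# Two toy documents glued together, standing in for concatenated output.
combined = (
    '<?xml version="1.0"?>\n<report><query>one</query></report>\n'
    '<?xml version="1.0"?>\n<report><query>two</query></report>\n'
)
documents = split_concatenated_xml(combined)
queries = [ET.fromstring(doc).findtext("query") for doc in documents]
```

Each piece could then be wrapped in a file-like object and handed to
whichever XML parser is in use.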

--Michiel.

--- On Thu, 10/15/09, Miguel Ortiz Lombardia  wrote:

> From: Miguel Ortiz Lombardia 
> Subject: Re: [Biopython] Problems parsing with PSIBlastParser
> To: "Peter" 
> Cc: "Biopython Mailing List" 
> Date: Thursday, October 15, 2009, 12:52 PM
> 
> On 15 Oct 09, at 18:32, Peter wrote:
> > 
> >> I still believe that the logic behind the
> NCBIStandalone.PSIBlastparser is
> >> correct or, at least, useful. But I could change
> my mind if you think
> >> otherwise.
> > 
> > The idea of the NCBIStandalone.PSIBlastparser plain
> text parser, and
> > its object structure makes sense.
> > 
> 
> Good!
> 
> >> I have filed the bug:
> >> http://bugzilla.open-bio.org/show_bug.cgi?id=2929
> >> and have upload the XML from blastpgp v. 2.2.22
> mentioned above.
> > 
> > Lovely - thank you. So that is a single query, with 3
> iterations.
> > What would be *really* nice, is a multiple query file
> (say three
> > queries, each needing just a few iterations to keep
> the file small).
> 
> 
> Never used multiple query file... Do you mean starting from
> a multiple-alignment file with the -B option?
> 
> -- Miguel
> 
> 
> 
> 
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
> 


      



From biopython at maubp.freeserve.co.uk  Fri Oct 16 08:11:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 16 Oct 2009 09:11:45 +0100
Subject: [Biopython] How to construct a SeqRecord with the info in the
	SeqFeatures type mRNA or CDS?
In-Reply-To: <320fb6e00910151435k26f48292jb0eaab5661d83759@mail.gmail.com>
References: <65d4b7fc0910151418q30347842yb48e6508a2cab544@mail.gmail.com>
	<320fb6e00910151435k26f48292jb0eaab5661d83759@mail.gmail.com>
Message-ID: <320fb6e00910160111t4f999350we0ef349dc454902a@mail.gmail.com>

On Thu, Oct 15, 2009 at 10:35 PM, Peter  wrote:
> On Thu, Oct 15, 2009 at 10:18 PM, Carlos Javier Borroto wrote:
>> I read in the tutorial that you can splice a SeqRecord, but I can't
>> find a reference to how to form a SeqRecord from several different
>> splicing, something like:
>>
>> new_record = record[1:200] + record[400:600]
>
> That isn't built in, but is something I've been working on that
> might be in Biopython in future. Do you fancy trying some
> experimental code?
>
> http://github.com/peterjc/biopython/tree/seqrecords
> http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006851.html

What I should have added yesterday was how you would solve
this with Biopython as it is now (e.g. Biopython 1.52):

from Bio.SeqRecord import SeqRecord
# Slice and join the Seq objects, then wrap the result in a new SeqRecord
new_record = SeqRecord(record.seq[1:200] + record.seq[400:600])
new_record.id = record.id  # if this makes sense
new_record.name = record.name  # if this makes sense
...

Dealing with complex annotation however is (currently) more
complicated - hence the code I was working on.

Peter


From dalloliogm at gmail.com  Fri Oct 16 08:29:46 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 16 Oct 2009 10:29:46 +0200
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com>
References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com>
Message-ID: <5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com>

On Thu, Oct 15, 2009 at 11:17 PM, Peng Yu  wrote:
> I have a set of genes. I want to get the 5kb sequence that is upstream
> of the TSS's of each gene.

You can do that with biomart:
- http://www.ensembl.org/biomart/martview/a90f00892a48e04d438f762f551bf48a/a90f00892a48e04d438f762f551bf48a

Select Ensembl 56 as the database and Mus musculus as the species, go to
Filters and fill in the 'Id list limit' form to add the required gene IDs,
then go to Attributes, select Sequences and check 'Upstream Flank -
5000'.

As for doing that in python, I am not sure there are python interfaces
to BioMart. Galaxy (http://main.g2.bx.psu.edu/) is written in python,
so they must have written a library for that somewhere, but I don't
know their code.

If you use R (remember that you can mix Python and R with rpy2) there
is a nice package in Bioconductor called biomaRt.


> I have the following specific questions. Could somebody help me? Thank you!
>
> Which database I can access to get mouse genome?
> Give a gene name what function I should call to get the gene's location?
>



-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it



From pengyu.ut at gmail.com  Fri Oct 16 14:52:00 2009
From: pengyu.ut at gmail.com (Peng Yu)
Date: Fri, 16 Oct 2009 09:52:00 -0500
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com>
References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com>
	<5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com>
Message-ID: <366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com>

On Fri, Oct 16, 2009 at 3:29 AM, Giovanni Marco Dall'Olio
 wrote:
> On Thu, Oct 15, 2009 at 11:17 PM, Peng Yu  wrote:
>> I have a set of genes. I want to get the 5kb sequence that is upstream
>> of the TSS's of each gene.
>
> You can do that with biomart:
> - http://www.ensembl.org/biomart/martview/a90f00892a48e04d438f762f551bf48a/a90f00892a48e04d438f762f551bf48a
>
> select Ensembl56 as database, Mus Musculus as species, go to Filters
> and fill the 'Id list limit' form to add the required geneIds, then go
> to Attributes, select Sequences and then check 'Upstream Flank -
> 5000'.

I have gene names (for example, Krt83). Which gene IDs should I choose?

> As for doing that in python, I am not sure there are python interfaces
> to BioMart. Galaxy (http://main.g2.bx.psu.edu/) is written in python,
> so they must have written a library for that somewhere, but I don't
> know their code.
>
> If you use R (remember that you can mix python and R with rpy2) there
> is a nice module in bioconductor called BioMart.
>
>
>> I have the following specific questions. Could somebody help me? Thank you!
>>
>> Which database I can access to get mouse genome?
>> Give a gene name what function I should call to get the gene's location?
>>
>
>
>
> --
> Giovanni Dall'Olio, phd student
> Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)
>
> My blog on bioinformatics: http://bioinfoblog.it
>



From mailinglist.honeypot at gmail.com  Fri Oct 16 14:55:19 2009
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Fri, 16 Oct 2009 10:55:19 -0400
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com>
References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com>
	<5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com>
	<366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com>
Message-ID: <40B70639-3317-4289-97C1-940EDFD16BEF@gmail.com>

Hi,

On Oct 16, 2009, at 10:52 AM, Peng Yu wrote:

> On Fri, Oct 16, 2009 at 3:29 AM, Giovanni Marco Dall'Olio
>  wrote:
>> On Thu, Oct 15, 2009 at 11:17 PM, Peng Yu   
>> wrote:
>>> I have a set of genes. I want to get the 5kb sequence that is  
>>> upstream
>>> of the TSS's of each gene.
>>
>> You can do that with biomart:
>> - http://www.ensembl.org/biomart/martview/a90f00892a48e04d438f762f551bf48a/a90f00892a48e04d438f762f551bf48a
>>
>> select Ensembl56 as database, Mus Musculus as species, go to Filters
>> and fill the 'Id list limit' form to add the required geneIds, then  
>> go
>> to Attributes, select Sequences and then check 'Upstream Flank -
>> 5000'.
>
> I have gene names (for example, Krt83) what geneIDs shall I choose?

Since you're on Ensembl's web site, I'd imagine Ensembl gene IDs might
be a good bet, no? :-)

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
   |  Memorial Sloan-Kettering Cancer Center
   |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



From dalloliogm at gmail.com  Fri Oct 16 15:24:55 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 16 Oct 2009 17:24:55 +0200
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <40B70639-3317-4289-97C1-940EDFD16BEF@gmail.com>
References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> 
	<5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com> 
	<366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com> 
	<40B70639-3317-4289-97C1-940EDFD16BEF@gmail.com>
Message-ID: <5aa3b3570910160824h7a569570x60a113909b8ea93f@mail.gmail.com>

On Fri, Oct 16, 2009 at 4:55 PM, Steve Lianoglou
 wrote:
> Hi,
>
> On Oct 16, 2009, at 10:52 AM, Peng Yu wrote:
>
>> On Fri, Oct 16, 2009 at 3:29 AM, Giovanni Marco Dall'Olio
>>  wrote:
>>>
>>
>> I have gene names (for example, Krt83) what geneIDs shall I choose?
>
> Since your on ensembl's web site, I'd imagine ensembl gene id's might be a
> good bet, no? :-)

Exactly, but if you look at the form more carefully you will see that
there is a menu from which you can choose the type of gene ID, for
example: Ensembl, KEGG, NCBI, etc.

Note: I didn't send you the official BioMart link. The right one is:
- http://www.ensembl.org/biomart/martview




> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>    |  Memorial Sloan-Kettering Cancer Center
>    |  Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>
>



-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it



From pengyu.ut at gmail.com  Fri Oct 16 15:44:55 2009
From: pengyu.ut at gmail.com (Peng Yu)
Date: Fri, 16 Oct 2009 10:44:55 -0500
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <5aa3b3570910160824h7a569570x60a113909b8ea93f@mail.gmail.com>
References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com>
	<5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com>
	<366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com>
	<40B70639-3317-4289-97C1-940EDFD16BEF@gmail.com>
	<5aa3b3570910160824h7a569570x60a113909b8ea93f@mail.gmail.com>
Message-ID: <366c6f340910160844x14e9d532o1b236d28b22cea4@mail.gmail.com>

On Fri, Oct 16, 2009 at 10:24 AM, Giovanni Marco Dall'Olio
 wrote:
> On Fri, Oct 16, 2009 at 4:55 PM, Steve Lianoglou
>  wrote:
>> Hi,
>>
>> On Oct 16, 2009, at 10:52 AM, Peng Yu wrote:
>>
>>> On Fri, Oct 16, 2009 at 3:29 AM, Giovanni Marco Dall'Olio
>>>  wrote:
>>>>
>>>
>>> I have gene names (for example, Krt83) what geneIDs shall I choose?
>>
>> Since your on ensembl's web site, I'd imagine ensembl gene id's might be a
>> good bet, no? :-)
>
> exactly, but if you look at the form more carefully you will see that
> there is a menu from which you can choose the type of geneId, for
> example: ensembl, kegg, ncbi, etc...
>
> note: I didn't send you the ufficial biomart's link. The right one is:
> - http://www.ensembl.org/biomart/martview

My question was how to figure out what type of gene ID 'Krt83' is.
I tried 'Ensembl Gene ID(s)' in the menu and put 'Krt83' in the box
below it, but I got an empty mart_export.txt file.


From mailinglist.honeypot at gmail.com  Fri Oct 16 15:56:03 2009
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Fri, 16 Oct 2009 11:56:03 -0400
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <366c6f340910160844x14e9d532o1b236d28b22cea4@mail.gmail.com>
References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com>
	<5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com>
	<366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com>
	<40B70639-3317-4289-97C1-940EDFD16BEF@gmail.com>
	<5aa3b3570910160824h7a569570x60a113909b8ea93f@mail.gmail.com>
	<366c6f340910160844x14e9d532o1b236d28b22cea4@mail.gmail.com>
Message-ID: <3A3213F6-0D6D-413D-9487-D7F0EF24BB83@gmail.com>

Hi,

On Oct 16, 2009, at 11:44 AM, Peng Yu wrote:

> My question was how to figure what type of geneID it was for 'Krt83'?
> I tried 'Ensembl Gene ID(s)' in the menu and put 'Krt83' in the box
> below it. But I get an empty mart_export.txt file.


I'm guessing your filters are set wrong.

Try with:
  * FILTER set to: MGI symbol
  * ATTRIBUTES set to: Ensembl Gene ID, Ensembl Transcript ID, MGI Symbol

You'd get:

Ensembl Gene ID	Ensembl Transcript ID	MGI symbol
ENSMUSG00000047641	ENSMUST00000108897	Krt83
ENSMUSG00000047641	ENSMUST00000081945	Krt83

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
   |  Memorial Sloan-Kettering Cancer Center
   |  Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



From dalloliogm at gmail.com  Fri Oct 16 15:57:05 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 16 Oct 2009 17:57:05 +0200
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <366c6f340910160844x14e9d532o1b236d28b22cea4@mail.gmail.com>
References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> 
	<5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com> 
	<366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com> 
	<40B70639-3317-4289-97C1-940EDFD16BEF@gmail.com>
	<5aa3b3570910160824h7a569570x60a113909b8ea93f@mail.gmail.com> 
	<366c6f340910160844x14e9d532o1b236d28b22cea4@mail.gmail.com>
Message-ID: <5aa3b3570910160857x7ee6a137u327d3da0adad15fa@mail.gmail.com>

On Fri, Oct 16, 2009 at 5:44 PM, Peng Yu  wrote:
> On Fri, Oct 16, 2009 at 10:24 AM, Giovanni Marco Dall'Olio
>  wrote:
>> On Fri, Oct 16, 2009 at 4:55 PM, Steve Lianoglou
>>  wrote:
>>> Hi,
>>>
>>> On Oct 16, 2009, at 10:52 AM, Peng Yu wrote:
>>>
>>>> On Fri, Oct 16, 2009 at 3:29 AM, Giovanni Marco Dall'Olio
>>>>  wrote:
>>>>>
>>>>
>>>> I have gene names (for example, Krt83) what geneIDs shall I choose?
>>>
>>> Since you're on ensembl's web site, I'd imagine ensembl gene id's might be a
>>> good bet, no? :-)
>>
>> exactly, but if you look at the form more carefully you will see that
>> there is a menu from which you can choose the type of geneId, for
>> example: ensembl, kegg, ncbi, etc...
>>
>> note: I didn't send you the official biomart link. The right one is:
>> - http://www.ensembl.org/biomart/martview
>
> My question was how to figure what type of geneID it was for 'Krt83'?
> I tried 'Ensembl Gene ID(s)' in the menu and put 'Krt83' in the box
> below it. But I get an empty mart_export.txt file.

Human Ensembl gene IDs all start with 'ENSG0...' (mouse ones start with
'ENSMUSG0...'). Your Krt83 should be an EntrezGene id:
- http://www.ensembl.org/Homo_sapiens/Search/Details?_C=eJwFwdEJgDAMBcA3inSBKqKIA7iA*gepEcXQ1JA6v3ck4Az6Mg4!9yoOevGYT32T1Ira7hzdmOewaYmrVksceRgD6Lp9qSLoWvxUeBcJ&_c=%2b15428165997832314387&_c=%2b18088233473301975577






-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it


From pengyu.ut at gmail.com  Sun Oct 18 15:44:58 2009
From: pengyu.ut at gmail.com (Peng Yu)
Date: Sun, 18 Oct 2009 10:44:58 -0500
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <3A3213F6-0D6D-413D-9487-D7F0EF24BB83@gmail.com>
References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com>
	<5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com>
	<366c6f340910160752u2f6eef3gbc6f90c7aa9b534d@mail.gmail.com>
	<40B70639-3317-4289-97C1-940EDFD16BEF@gmail.com>
	<5aa3b3570910160824h7a569570x60a113909b8ea93f@mail.gmail.com>
	<366c6f340910160844x14e9d532o1b236d28b22cea4@mail.gmail.com>
	<3A3213F6-0D6D-413D-9487-D7F0EF24BB83@gmail.com>
Message-ID: <366c6f340910180844o5924ea98v1e840a6e19150c17@mail.gmail.com>

On Fri, Oct 16, 2009 at 10:56 AM, Steve Lianoglou
 wrote:
> Hi,
>
> On Oct 16, 2009, at 11:44 AM, Peng Yu wrote:
>
>> My question was how to figure what type of geneID it was for 'Krt83'?
>> I tried 'Ensembl Gene ID(s)' in the menu and put 'Krt83' in the box
>> below it. But I get an empty mart_export.txt file.
>
>
> I'm guessing your filters are set wrong.
>
> Try with:
>   * FILTER set to: MGI symbol
>   * ATTRIBUTES set to: Ensembl Gene ID, Ensembl Transcript ID, MGI Symbol
>
> You'd get:
>
> Ensembl Gene ID	Ensembl Transcript ID	MGI symbol
> ENSMUSG00000047641	ENSMUST00000108897	Krt83
> ENSMUSG00000047641	ENSMUST00000081945	Krt83

It seems that the Ensembl website cannot report both the MGI symbol and
the 5 kb upstream sequences in the same export. Is that true? If so,
I will probably have to write a short program to combine the results.
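
A minimal sketch of such a combining program, in Python 3 and using only the
standard library. It assumes (this is not stated in the thread) that the
symbol table was exported as tab-separated text with the column names shown
in Steve's reply, and that the upstream sequences were exported as FASTA with
the Ensembl gene ID as the first '|'-separated field of each header. The
function names and the sequence are made up for illustration.

```python
import csv
from io import StringIO


def read_symbols(handle):
    """Map Ensembl gene ID -> MGI symbol from a tab-separated export."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["Ensembl Gene ID"]: row["MGI symbol"] for row in reader}


def read_fasta(handle):
    """Map the first '|'-separated header field -> sequence."""
    sequences = {}
    gene_id = None
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            gene_id = line[1:].split("|")[0]
            sequences[gene_id] = ""
        elif line and gene_id is not None:
            sequences[gene_id] += line
    return sequences


def combine(symbols, sequences):
    """Return sorted (symbol, gene ID, sequence) tuples for shared gene IDs."""
    shared = sorted(set(symbols) & set(sequences))
    return [(symbols[g], g, sequences[g]) for g in shared]


# Stand-ins for the two mart_export files (the sequence is invented).
symbol_export = StringIO(
    "Ensembl Gene ID\tEnsembl Transcript ID\tMGI symbol\n"
    "ENSMUSG00000047641\tENSMUST00000108897\tKrt83\n"
    "ENSMUSG00000047641\tENSMUST00000081945\tKrt83\n"
)
sequence_export = StringIO(">ENSMUSG00000047641\nACGTACGT\nTTGA\n")

combined = combine(read_symbols(symbol_export), read_fasta(sequence_export))
for symbol, gene_id, seq in combined:
    print(symbol, gene_id, seq)
```

With real exports, the two StringIO objects would be replaced by open file
handles; keying both tables on the gene ID keeps the join independent of the
row order in either file.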



From natassa_g_2000 at yahoo.com  Mon Oct 19 10:03:18 2009
From: natassa_g_2000 at yahoo.com (natassa)
Date: Mon, 19 Oct 2009 03:03:18 -0700 (PDT)
Subject: [Biopython] Adaptor trimmer and dimers
In-Reply-To: <320fb6e00910150920y6ada0463s60ce0b4e5f788449@mail.gmail.com>
Message-ID: <693756.78143.qm@web52011.mail.re2.yahoo.com>


Thanks Peter, 
I've gone through these posts already, so my question was whether a global alignment script exists; Brad Chapman's script does a local alignment. Also, I would mostly be interested in discarding adapter-dimer reads, and I do not find any adaptation of his code to detect those, unless I am wrong. I would also like to discard their pairs, as I am feeding the reads to the Velvet assembler, which takes the paired-read information into account for scaffolding. 
I can try to write up something integrating the above features; I was just wondering if there is anything out there already and whether people find this a sensible approach.
Kind regards, 
Anastasia 
--- On Thu, 10/15/09, Peter  wrote:

From: Peter 
Subject: Re: [Biopython] Adaptor trimmer and dimers
To: "natassa" 
Cc: biopython at lists.open-bio.org
Date: Thursday, October 15, 2009, 12:20 PM

On Thu, Oct 15, 2009 at 5:00 PM, natassa  wrote:
> Hallo Biopythoners,
> I followed a recent thread conversation about adaptor trimming,
> which I intend to do on Illumina runs, and I am not sure I know
> where exactly in github I could find Brad Chapman's code for
> trimming AFTER modifications that he has done based on the
> thread conversation. ...

I guess you mean Brad's August Blog Post:
http://bcbio.wordpress.com/2009/08/09/trimming-adaptors-from-short-read-sequences/
and the following mailing list thread which included some tips on
speeding up the Biopython side of things:
http://lists.open-bio.org/pipermail/biopython/2009-August/005417.html

For anyone else interested, there are some simple examples in the
tutorial (using SeqRecord slicing - elegant and simple, but a bit slow):
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-primer
http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:FASTQ-slicing-off-adaptor

And I did a blog post about low level FASTQ handling for speed
at the cost of flexibility and simplicity (using some of the same
ideas from the August mailing list discussion):
http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

Peter



      


From chapmanb at 50mail.com  Mon Oct 19 11:24:41 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 19 Oct 2009 07:24:41 -0400
Subject: [Biopython] Adaptor trimmer and dimers
In-Reply-To: <693756.78143.qm@web52011.mail.re2.yahoo.com>
References: <320fb6e00910150920y6ada0463s60ce0b4e5f788449@mail.gmail.com>
	<693756.78143.qm@web52011.mail.re2.yahoo.com>
Message-ID: <20091019112441.GA72523@sobchak.mgh.harvard.edu>

Hi Anastasia;

> I ve gone through these posts already, so my question was whether
> a global alignment script exists-Brad Chapman's script does a
> local alignment.

I found that local alignments behaved better in terms of trimming,
but if you want global alignments it's easy to change. Edit line 42
of the script from:

pairwise2.align.localms

to:

pairwise2.align.globalms


> Also, I would be mostly interested in discarding
> adapter-dimer reads and I do not find any adaptation on his code to
> detect those, unless I am wrong.. 

You should get back an empty or very short read, which you can then
discard in your script.

> I would also like to discard their
> pairs, as I am inputting those to velvet assembler which takes into
> account the pair-read information for scaffolding. 

This is also something you can do after calling the trimmer. Read
each end of the pair, trim both sequences and then check that they
pass your size threshold. If both pass, then write them to the file
you'll be using for assembly:

adaptor = "GATC"
num_errors = 2
size_thresh = 17
pair1 = read_seq()
pair2 = read_seq()
trim1 = trim_adaptor(pair1, adaptor, num_errors)
trim2 = trim_adaptor(pair2, adaptor, num_errors)

if len(trim1) >= size_thresh and len(trim2) >= size_thresh:
    write_pair(trim1, trim2)
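
The pair handling outlined above can be sketched as a runnable example
(Python 3, standard library only). Note that trim_adaptor here is a
simplified stand-in that only finds exact adaptor matches and keeps the
sequence in front of them; it is not the alignment-based trimmer from the
blog post. The filter_pairs helper, the toy reads and the threshold of 5 are
hypothetical.

```python
def trim_adaptor(seq, adaptor):
    """Cut the read at the first exact adaptor match (no mismatches)."""
    pos = seq.find(adaptor)
    return seq if pos < 0 else seq[:pos]


def filter_pairs(pairs, adaptor, size_thresh):
    """Trim both reads of each pair; keep the pair only if both stay long."""
    kept = []
    for read1, read2 in pairs:
        trim1 = trim_adaptor(read1, adaptor)
        trim2 = trim_adaptor(read2, adaptor)
        if len(trim1) >= size_thresh and len(trim2) >= size_thresh:
            kept.append((trim1, trim2))
    return kept


adaptor = "GATC"
pairs = [
    ("ACGTTGCAAGATCGG", "TTGACCATTG"),  # adaptor near the end of read 1
    ("GATCGATCGG", "ACCTTGGAAC"),       # read 1 starts with the adaptor
    ("ACGATCTTTT", "CCCCCAAAAA"),       # read 1 too short after trimming
]
kept = filter_pairs(pairs, adaptor, size_thresh=5)
print(kept)
```

A read that starts with the adaptor (as an adaptor dimer would) trims down to
an empty string, so the length check discards it together with its mate.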

Hope this helps,
Brad




From fkauff at biologie.uni-kl.de  Mon Oct 19 13:44:39 2009
From: fkauff at biologie.uni-kl.de (Frank Kauff)
Date: Mon, 19 Oct 2009 15:44:39 +0200
Subject: [Biopython] Combine nexus files but not concatenating them
In-Reply-To: <320fb6e00910080154m2e75a38eh7e120e807103773b@mail.gmail.com>
References: 	
	<320fb6e00910051242n54d6229fyacc23119401715e@mail.gmail.com>	
		
	<320fb6e00910051331u2f27c7ecmde744d29fb3ba2ab@mail.gmail.com>	
		
	<320fb6e00910070229n1b78542dj82998de13cf7eed7@mail.gmail.com>	
	
	<320fb6e00910080154m2e75a38eh7e120e807103773b@mail.gmail.com>
Message-ID: <4ADC6D47.2010409@biologie.uni-kl.de>

Hi all,

unfortunately, morphological data types and mixed data types are 
currently unsupported. For no special reason - I just never bothered to 
implement them... I think it's not trivial, though, because one would 
have to store the data type for each individual character in some way, 
which would probably mean to significantly change the data structure 
that is currently used to hold the alignment data...

With regard to splitting up a nexus file - yes, if there is a data 
partition defined, the individual subdivisions can be saved as 
individual nexus files with

mynexusinstance.write_nexus_data_partitions(charpartition='name_of_partition')

Please see the method for further details of customization. Otherwise, 
one could save the characters defined in a character set as nexus using

mynexusinstance.write_nexus_data(filename='charsetxy.nex',exclude=[c for 
c in range(mynexusinstance.nchar) if c not in 
mynexusinstance.charsets['name_of_charset_i_want_to_save']])
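
The exclude list in that call is simply the complement of the character set.
A small stand-alone sketch of that step (Python 3, no Biopython needed),
assuming the charsets are held as lists of zero-based column indices; the
helper name and the toy charsets, modelled on the a1/a2 example earlier in
the thread, are hypothetical:

```python
def columns_to_exclude(nchar, keep):
    """Return every column index in range(nchar) that is not in keep."""
    keep = set(keep)  # set membership keeps this fast for long alignments
    return [c for c in range(nchar) if c not in keep]


# Toy stand-ins for mynexusinstance.nchar and mynexusinstance.charsets.
nchar = 6
charsets = {"a1": [0, 1, 2], "a2": [3, 4, 5]}

exclude = columns_to_exclude(nchar, charsets["a2"])
print(exclude)
```

The resulting list is what would be passed as the exclude argument in the
call above.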

Cheers,
Frank


On 10/08/2009 10:54 AM, Peter wrote:
> On Thu, Oct 8, 2009 at 12:23 AM, Denzel Li  wrote:
>    
>> Hi Peter:
>> Regarding the Nexus datatype supported in Bio:Nexus:Nexus, I mean nexus like
>> the following, where the datatype is a "mixing" of "standard" and "DNA".
>> According to the function Bio:Nexus:Nexus._format (line 696), these
>> datatypes are not supported yet. I am just wondering does the team has the
>> plan to support these data types.
>>      
> Oh right - in your example, the digits encode morphology, but they could
> also be phenotypes, or some other characteristic like gene copy number.
>
> As to Bio.Nexus supporting this, hopefully Frank or Cymon can comment.
>
> If Bio.Nexus did support this, then from the Bio.AlignIO point of view, with
> the current object structure we'd have to use a sequence object (holding
> both the digits, and the DNA) for the sequence strings (e.g. for s1 in your
> example, Seq("10010ACGT")) with a generic single letter alphabet. This
> would lose the fact that the first five characters are digits, but the rest are
> DNA. This isn't ideal, and would probably cause trouble for Nexus output
> (writing such alignments).
>
> Would you want to try and deal with such "mixed" alignments via the
> Bio.AlignIO interface?
>
> Peter
>
>    

-- 
J-Prof. Dr. Frank Kauff
Molecular Phylogenetics
FB Biologie, 13/276
TU Kaiserslautern
Postfach 3049
67653 Kaiserslautern

Tel. +49 (0)631 205-2562
Fax. +49 (0)631 205-2998
email: fkauff at biologie.uni-kl.de
skype: frank.kauff



From mike.thon at gmail.com  Mon Oct 19 17:35:49 2009
From: mike.thon at gmail.com (Michael Thon)
Date: Mon, 19 Oct 2009 19:35:49 +0200
Subject: [Biopython] parsing an in memory sequence string with SeqIO
Message-ID: <945EB0E1-4D74-4DF0-8B17-CD9A23CA6C56@gmail.com>

I have been looking at the documentation but I can't figure out how to
parse a string of text in a Python variable into a SeqRecord object.
The main function (SeqIO.parse) requires a file handle. I'm getting
the text from a web server POST request and it seems a little
inefficient to write it to a file before parsing it with Biopython.
Maybe there is some way in Python to create a handle to a variable?
Thanks
Mike


From kellrott at gmail.com  Mon Oct 19 17:48:17 2009
From: kellrott at gmail.com (Kyle Ellrott)
Date: Mon, 19 Oct 2009 10:48:17 -0700
Subject: [Biopython] parsing an in memory sequence string with SeqIO
In-Reply-To: <945EB0E1-4D74-4DF0-8B17-CD9A23CA6C56@gmail.com>
References: <945EB0E1-4D74-4DF0-8B17-CD9A23CA6C56@gmail.com>
Message-ID: 

Try the StringIO interface. http://docs.python.org/library/stringio.html
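
A minimal sketch of the idea in Python 3, where the class lives in the io
module. To keep the example free of dependencies, read_fasta below is a
hypothetical stand-in for the parser; with Biopython installed, the same
handle can be passed to SeqIO.parse(handle, "fasta") instead. The posted
text is invented.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (title, sequence) tuples from any file-like handle."""
    title, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


# Text as it might arrive in a POST request body.
posted_text = ">seq1 example\nACGTAC\nGGT\n>seq2\nTTGA\n"

# StringIO wraps the string in an object that behaves like an open file.
handle = StringIO(posted_text)
records = list(read_fasta(handle))
print(records)
```

Nothing is written to disk; the handle reads straight from the string in
memory.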

Kyle

On Mon, Oct 19, 2009 at 10:35 AM, Michael Thon  wrote:
> I have been looking at the documentation but I can't figure out how to parse
> a string of text in a python variable into a SeqRecord object.  the main
> function (SeqIO.parse) requires a file handle.  I'm getting the text from a
> web server POST request and it seems a little inefficient to write it to a
> file before I do parsing with biopython.  Maybe there is some way in python
> to create a handle to a variable?
> Thanks
> Mike



From mikelisanke at gmail.com  Mon Oct 19 19:37:10 2009
From: mikelisanke at gmail.com (Mike Lisanke)
Date: Mon, 19 Oct 2009 15:37:10 -0400
Subject: [Biopython] Windows installer does not find Python 2.63 with
	multiple pythons
Message-ID: <8c5c6e580910191237w54e857bfs1107ba86df05db7c@mail.gmail.com>

I had Python 3.0 installed prior to attempting a Biopython install. I
installed Python 2.6 to its own directory, and a proper registry entry was
made in HKEY_LOCAL_MACHINE\SOFTWARE\Python; however, the Biopython installer
cannot find the Python 2.6 install. Is there a problem with having multiple
Python installs? Thanks.

-- 
Best regards,

Mike


From biopython at maubp.freeserve.co.uk  Mon Oct 19 21:29:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 19 Oct 2009 22:29:12 +0100
Subject: [Biopython] Windows installer does not find Python 2.63 with
	multiple pythons
In-Reply-To: <8c5c6e580910191237w54e857bfs1107ba86df05db7c@mail.gmail.com>
References: <8c5c6e580910191237w54e857bfs1107ba86df05db7c@mail.gmail.com>
Message-ID: <320fb6e00910191429l481223d8iae9fbe78702fff70@mail.gmail.com>

On Mon, Oct 19, 2009 at 8:37 PM, Mike Lisanke  wrote:
> I had Python 3.0 installed prior to attempting a bio-python install. I
> installed Python 2.6 to its own directory, and a proper registry entry was
> made in HKEY_LOCAL_MACHINE\SOFTWARE\Python, however;
> the bio-python can not find the Python 2.6 install. Is there a problem
> having multiple python installs? Thanks.

On my Windows machine I have Python 2.4, 2.5 and 2.6 all co-existing
fine (and I used to have 2.3 as well). These were all default installs to
C:\Python26 etc, and I didn't have to do anything funny to the registry.
I can try and remember to check the registry settings on my machine
if you like... but for now I can only suggest you might try uninstalling
Python 2.6, perhaps clean the registry, and then reinstall Python 2.6.

Peter

P.S.

I haven't tried putting Python 3.0 on my Windows machine (not that
I would bother, I would go straight to Python 3.1 now that it is out).


From tevang3 at gmail.com  Tue Oct 20 10:44:45 2009
From: tevang3 at gmail.com (Thomas Evangelidis)
Date: Tue, 20 Oct 2009 13:44:45 +0300
Subject: [Biopython] search Entrez with boolean operators
Message-ID: <833e1faf0910200344x29e1d636ra35325433b6263b7@mail.gmail.com>

Dear all,

is it possible to set the term parameter in Bio.Entrez.esearch() so that
it will search Entrez using boolean operators? I tried several
combinations myself with no luck. For instance, let's say I want to query All
Fields of PubMed using this whole phrase (not individual words): "ABC efflux
transporter". How should I write it?

thanks in advance.


From biopython at maubp.freeserve.co.uk  Tue Oct 20 10:53:59 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Oct 2009 11:53:59 +0100
Subject: [Biopython] search Entrez with boolean operators
In-Reply-To: <833e1faf0910200344x29e1d636ra35325433b6263b7@mail.gmail.com>
References: <833e1faf0910200344x29e1d636ra35325433b6263b7@mail.gmail.com>
Message-ID: <320fb6e00910200353x2ce754edkdc197f8cfc6ece21@mail.gmail.com>

On Tue, Oct 20, 2009 at 11:44 AM, Thomas Evangelidis  wrote:
> Dear all,
>
> is it possible to set the term parameter in Bio.Entrez.esearch() accordingly
> so that it will search Entrez using boolean operators? I tried myself
> several combinations with no luck.

You can use AND in upper case, e.g.

abc[title] AND efflux[title] AND transporter[title]
abc[all] AND efflux[all] AND transporter[all]
abc AND efflux AND transporter

> For instance lets say I want to query All
> Fields of PubMed using this whole phrase (not individual words): "ABC efflux
> transporter", how should I write it?

For phrases, you need quote characters - you can try this on the NCBI
Entrez webpage, e.g.

"ABC efflux transporter"
"ABC efflux transporter"[all]

Note that these give no hits!

Remember in Python there are at least two ways to build a string with
quotes in it,
for example single-quote double-quote text double-quote single-quote:

>>> search = '"ABC efflux transporter"'
>>> print search
"ABC efflux transporter"

Or, sticking with all double quotes you must escape some:

>>> search = "\"ABC efflux transporter\""
>>> print search
"ABC efflux transporter"

Peter
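
The search strings discussed above can also be assembled programmatically.
A small sketch in Python 3 (standard library only); the helper names phrase
and all_of are hypothetical. The resulting string is what would be passed as
the term argument of Bio.Entrez.esearch(). The urlencode call is only there
to show what the quotes and brackets look like once escaped in a URL, which
the library should take care of itself.

```python
from urllib.parse import urlencode


def phrase(text, field=None):
    """Wrap text in double quotes, optionally adding a [field] tag."""
    quoted = '"%s"' % text
    return quoted + ("[%s]" % field if field else "")


def all_of(words, field=None):
    """Join words with an upper-case AND, optionally tagging each one."""
    tagged = ["%s[%s]" % (w, field) if field else w for w in words]
    return " AND ".join(tagged)


phrase_term = phrase("ABC efflux transporter", "all")
and_term = all_of(["abc", "efflux", "transporter"], "title")
query = urlencode({"db": "pubmed", "term": phrase_term})

print(phrase_term)
print(and_term)
print(query)
```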


From pengyu.ut at gmail.com  Tue Oct 20 15:33:08 2009
From: pengyu.ut at gmail.com (Peng Yu)
Date: Tue, 20 Oct 2009 10:33:08 -0500
Subject: [Biopython] Making the tutorial more concise
Message-ID: <366c6f340910200833k5447fca3yda5a804c889a3733@mail.gmail.com>

I feel that the document can be made more concise.

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc73

For example, on the above link, it says
"Hey, everybody loves BLAST right? I mean, geez, how can get it get
any easier to do comparisons between one of your sequences and every
other sequence in the known world?"

I think this can be deleted. Or it could simply state what Chapter 7
is about at the beginning.


From biopython at maubp.freeserve.co.uk  Tue Oct 20 15:41:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 20 Oct 2009 16:41:21 +0100
Subject: [Biopython] Making the tutorial more concise
In-Reply-To: <366c6f340910200833k5447fca3yda5a804c889a3733@mail.gmail.com>
References: <366c6f340910200833k5447fca3yda5a804c889a3733@mail.gmail.com>
Message-ID: <320fb6e00910200841r6081dcd4ga0c661a14fc7aa6f@mail.gmail.com>

On Tue, Oct 20, 2009 at 4:33 PM, Peng Yu  wrote:
> I feel that the document can be made more concise.
>
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc73
>
> For example, on the above link, it says
> "Hey, everybody loves BLAST right? I mean, geez, how can get it get
> any easier to do comparisons between one of your sequences and every
> other sequence in the known world?"
>
> I think this can be delete. Or it can be simply stated what Chapter 7
> is about at the beginning.

I agree - I think that might have been Brad's casual writing style ;)

I am planning to re-write the BLAST chapter soon, partly due to
Biopython switching to using command line wrappers in module
Bio.Blast.Applications with subprocess, but also we will want to
support the new BLAST+ tools from the NCBI (different command
line argument names etc).

Peter


From lueck at ipk-gatersleben.de  Tue Oct 20 16:01:41 2009
From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=)
Date: Tue, 20 Oct 2009 18:01:41 +0200
Subject: [Biopython] Making the tutorial more concise
References: <366c6f340910200833k5447fca3yda5a804c889a3733@mail.gmail.com>
Message-ID: <007a01ca519e$9cd68eb0$1022a8c0@ipkgatersleben.de>

From my point of view, I like such comments. They make a tutorial less dry 
;-)

Anyway, I can only thank all the people who wrote this nice tutorial. It 
has already helped me a lot and I don't mind small jokes ;-)

Nice evening!
Stefanie





From fufezan at uni-muenster.de  Wed Oct 21 07:25:01 2009
From: fufezan at uni-muenster.de (Christian Fufezan)
Date: Wed, 21 Oct 2009 09:25:01 +0200
Subject: [Biopython] Biopython & p3d
Message-ID: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de>

Hello Biopython,

we (Michael Specht & I) recently published p3d, a Python module for
structural bioinformatics, and were wondering whether it would be a
good thing if it could join the Biopython project. We understand that
Biopython already has a PDB parser, but we programmed an alternative
version since we found the Bio.PDB syntax to be rather un-Pythonic.
One example is shown below:

Biopython:

def test6(structure):
	'''get protein surrounding (5) of NAG'''
	bucket = set()
	atom_list=Selection.unfold_entities(structure,'A')
	ns = NeighborSearch(atom_list)
	for model in structure.get_list():
		for chain in model.get_list():
			for residue in chain.get_list():
				if residue.get_resname() == 'NAG':
					for atom in residue.get_list():
						centre = atom.get_coord()
						R = 5.0
						neighbor_list = ns.search(centre,R)
						neighbors = Selection.unfold_entities(neighbor_list,'A')
						for atom2 in neighbors:
							if 'O' in atom2.get_name():
								bucket.add(atom2)
	print '     found',len(bucket),' oxygens around NAG'
	return

p3d:

def test6(pdb):
	''' protein surrounding (5) of resname NAG'''
	bgl = pdb.query('resname NAG')
	bucket = pdb.query('protein and oxygen and within 5 of ',bgl)
	print '     found',len(bucket),' oxygens around NAG'
	return

Certainly, Biopython's PDB module has its advantages and there is no way
p3d could replace it, but both modules have their strengths :) The
fact that Biopython's PDB module uses a KD tree written in C, whereas we
wrote ours in Python, makes certain queries to the protein structure faster
in Biopython; however, if the query involves more complex demands,
multiple loops are inevitable in Biopython, whereas p3d offers a
human-readable query function that combines all aspects. The link to our
publication is:
http://www.biomedcentral.com/1471-2105/10/258

Looking forward to hearing from you. Maybe one can also envision a
new combined module with all the advantages together.

Kind regards

Christian Fufezan




From biopython at maubp.freeserve.co.uk  Wed Oct 21 09:18:17 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Oct 2009 10:18:17 +0100
Subject: [Biopython] Biopython & p3d
In-Reply-To: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de>
References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de>
Message-ID: <320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com>

On Wed, Oct 21, 2009 at 8:25 AM, Christian Fufezan
 wrote:
> Hello Biopython,
>
> we ( Michael Specht & I ) published recently p3d, a python module for
> structural bioinformatics and were wondering if it wouldn't be a good good
> thing if could join the Biopython project. We understand that Biopython has
> already a PDB parser but we programmed an alternative version since we found
> the Biopython.pdb syntax to be too non-pythonian. One example why is shown
> below:
>
> Biopython:
>
> def test6(structure):
> 	'''get protein surrounding (5) of NAG'''
> 	bucket = set()
> 	atom_list=Selection.unfold_entities(structure,'A')
> 	ns = NeighborSearch(atom_list)
> 	for model in structure.get_list():
> 		for chain in model.get_list():
> 			for residue in chain.get_list():

I'm not very familiar with the NeighborSearch code, but
I'm pretty sure the above for loops can be just:

for model in structure:
    for chain in model:
        for residue in chain:
            ...

And regarding detecting oxygen atoms, I think there is
a patch on bugzilla to record the (relatively) new element
column from the PDB file (which will help to tell e.g. Hg,
mercury, apart from hydrogen).

Still, I would agree with you that some parts of Bio.PDB
are not very pythonic - there are too many methods named get_*()
which could be replaced with properties. This is something
we could evolve gradually (add new properties, keep the
old methods in place but gradually deprecate them).

Specific suggestions would be welcome.
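
As an illustration of that gradual route, here is a sketch of a property
living alongside a deprecated getter. The Atom class below is a hypothetical
toy, not the real Bio.PDB class, and the warning text is invented.

```python
import warnings


class Atom(object):
    """Toy atom with a new-style property and an old-style getter."""

    def __init__(self, name, coord):
        self._name = name
        self._coord = coord

    @property
    def name(self):
        """New-style read-only access: atom.name"""
        return self._name

    def get_name(self):
        """Old-style access, kept so existing scripts continue to work."""
        warnings.warn("get_name() is deprecated; use the .name property",
                      DeprecationWarning, stacklevel=2)
        return self.name


atom = Atom("OG", (1.0, 2.0, 3.0))

# Record warnings so the deprecation notice can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old_style = atom.get_name()

print(atom.name, old_style, len(caught))
```

Old scripts calling get_name() keep working but are nudged towards the
property, and the method can be dropped after a few releases.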

> def test6(pdb):
> 	''' protein surrounding (5) of resname NAG'''
> 	bgl = pdb.query('resname NAG')
> 	bucket = pdb.query('protein and oxygen and within 5 of ',bgl)
> 	print '     found',len(bucket),' oxygens around NAG'
> 	return
>
> Certainly, Biopythons PDB module has its advantages and the is no way p3d
> could replace it, but both modules have their advantages :) The fact that
> biopythons.pdb parser uses a KTree written in C and we wrote one in python
> makes certain queries to the protein structure faster in Biopyhton; however
> if the query involves more complex demands, multiple loops are inevitable in
> biopython, whereas p3d offers a human readable query function that combines
> all aspects. The link to our publication is:
> http://www.biomedcentral.com/1471-2105/10/258

I remember skim reading it a month ago or so. I remember the final line of
the abstract was a very strong opinion ("a perfect tool"), and I was rather
surprised the reviewers and editor let you keep it - regardless of any bias
I might feel towards Biopython ;)

> Looking forward to hear from you, maybe one can also envision a
> combined module with a new all advantages together.

That would be a good outcome.

From the snippet of code and the examples in the paper, the big feature
you have that Bio.PDB lacks is "fancy selections", and that is certainly
something which could be improved in Biopython.

It is interesting you have implemented (invented?) a string based language
with logical and, within etc. In some ways it reminds me of the selection
formulae in VMD - have you used that 3D visualisation tool?

This also reminds me of the SQL language for database selections, and
how classical SQL code with Python just used SQL statements within
Python strings. Have you ever used SQLAlchemy, and looked at how
they handle SQL statements like filters, ands, ors, etc with a clever
object based interface? Perhaps something like that could work for
a 3D structure query API.
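
To make the object-based idea concrete, here is a rough sketch in Python 3
(standard library only) of selections that combine with &, | and ~ rather
than being written as a query string. Everything in it is hypothetical: the
Query class, the helper functions and the dictionary-based atoms are
illustrations, not part of Bio.PDB or p3d, and the distance test is brute
force where a real implementation would use a KD tree.

```python
class Query(object):
    """A composable atom filter wrapping a single test function."""

    def __init__(self, test):
        self.test = test

    def __and__(self, other):
        return Query(lambda atom: self.test(atom) and other.test(atom))

    def __or__(self, other):
        return Query(lambda atom: self.test(atom) or other.test(atom))

    def __invert__(self):
        return Query(lambda atom: not self.test(atom))

    def filter(self, atoms):
        return [atom for atom in atoms if self.test(atom)]


def resname(name):
    return Query(lambda atom: atom["resname"] == name)


def element(symbol):
    return Query(lambda atom: atom["element"] == symbol)


def within(radius, centres):
    """Select atoms no further than radius from any atom in centres."""
    limit = radius * radius

    def test(atom):
        return any(
            sum((p - q) ** 2 for p, q in zip(atom["coord"], c["coord"])) <= limit
            for c in centres
        )

    return Query(test)


atoms = [
    {"resname": "NAG", "element": "C", "coord": (0.0, 0.0, 0.0)},
    {"resname": "SER", "element": "O", "coord": (3.0, 0.0, 0.0)},
    {"resname": "SER", "element": "O", "coord": (9.0, 0.0, 0.0)},
    {"resname": "GLY", "element": "N", "coord": (1.0, 1.0, 1.0)},
]

nag = resname("NAG").filter(atoms)
near_oxygens = (element("O") & within(5.0, nag) & ~resname("NAG")).filter(atoms)
print(len(near_oxygens))
```

Because each condition is an ordinary Python object, selections can be
stored, reused and combined without parsing a query string.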

Regards,

Peter



From natassa_g_2000 at yahoo.com  Wed Oct 21 09:54:26 2009
From: natassa_g_2000 at yahoo.com (natassa)
Date: Wed, 21 Oct 2009 02:54:26 -0700 (PDT)
Subject: [Biopython] Adaptor trimmer and dimers
In-Reply-To: <20091019112441.GA72523@sobchak.mgh.harvard.edu>
Message-ID: <843737.47817.qm@web52003.mail.re2.yahoo.com>

Brad, 
Thank you for the tips. I adapted your code a bit to handle pairs (that is, I have both read 1 and read 2 of a pair in the same file, and if I find the adaptor in either read of the pair, I discard the pair). I also had to add an additional test for the length of the alignment output, as I got an IndexError in the cases where the adapter does not align at all. I am not sure I got this part right; I looked a bit at the related Biopython alignment code, and that is what I concluded. 
My main problem now is the performance of this script: on a file of 19 million reads of 76 bp it has been running for more than 12 hours! So I copy my code here and would be very grateful if someone could point out parts where it could be sped up. Also, Brad, could you check this extra test line in the handle_adaptor function? 
I am not very good at Python for sure, but I am also pretty sure this is not an endless loop problem, and I have run out of ideas for how to make it faster (unless I abandon working with SeqRecords). I am seriously thinking of inputting FASTA instead of FASTQ-Illumina files, but for a whole bunch of tests I am running now, being able to work with FASTQ would be ideal...
Hope this is just a silly mistake of mine.
Here is the code:

from Bio import SeqIO
import os
from Bio import pairwise2
from Bio.Seq import Seq


def handle_adaptor(record, adaptor, num_errors):
    '''returns 1 if no adaptor found as exact match or as a pairwise alignment allowing two errors. Otherwise: none'''
    gap_char = '-'
    exact_pos = str(record.seq).find(adaptor)
    #exact match
    if exact_pos >= 0:
        seq_region = str(record.seq[exact_pos:exact_pos+len(adaptor)])
        adapt_region = adaptor
    else:
        if len(pairwise2.align.localms(str(record.seq),str(adaptor), 5.0, -4.0, -9.0, -0.5, one_alignment_only=True, gap_char=gap_char)) ==0:
            #no alignment at all
            return 1
        else:

            if len(pairwise2.align.localms(str(record.seq),str(adaptor), 5.0, -4.0, -9.0, -0.5, one_alignment_only=True, gap_char=gap_char)) >=1:
                seq_a, adaptor_a, score, start, end = pairwise2.align.localms(str(record.seq),str(adaptor), 5.0, -4.0, -9.0, -0.5, one_alignment_only=True,
                                                                              gap_char=gap_char)[0]


                adapt_region = adaptor_a[start:end]
                seq_region = seq_a[start:end]

    matches = sum((1 if s == adapt_region[i] else 0) for i, s in
                  enumerate(seq_region))

    # too many errors --
    if (len(adaptor) - matches) > num_errors:
        return 1


def Handle_shuffledFiles (path, number_of_adaptor, num_errors):
    all_files=os.listdir(path)
    for file in all_files:
        if not file.endswith('fastq'):
            continue
        else:
            if '_afr_' in file :
                print "working on : "+file + "..."

                if number_of_adaptor==1:
                    adaptor='ACACTCTTTCCCTACACGACGCTCTTCCGATCT'
                    output=path+'Adaptor1'+'_removedNat/'+file+'_Clean.txt'
                elif number_of_adaptor==2:
                    adaptor= 'TCTAGCCTTCTCGCCAAGTCGTCCTTACGGCTCTGGC'
                    output=path+'Adaptor2'+'_removedNat/'+file+'_Clean.txt'


                out_handle=open(output, "w")

                iter = SeqIO.parse(open(path+file), "fastq-illumina")
                j=0
                k=0
                try:
                    while 1:
                        rec1 = iter.next()
                        rec2 = iter.next()
                        k=k+1

                        Ad_inR1 = handle_adaptor(rec1, adaptor, num_errors) #returns 1 if no adaptor found or if found with >2 mismatches
                        Ad_inR2 = handle_adaptor(rec2, adaptor, num_errors)

                        if Ad_inR1 and Ad_inR2:
                            j=j+1
                            print 'Counting the %i th pair that has no adaptor ...' %j

                            SeqIO.write([rec1, rec2], out_handle, "fastq-illumina")

                except StopIteration, e:
                    pass


                out_handle.close()
                print '..out of %i pairs total' %k



if __name__ == "__main__":
    path2Fastq="/Users/nat/Data/Illumina/Restricted_forTests/Fastq-Illumina/shuffled/"
    Handle_shuffledFiles(path2Fastq, 1, 2)


Thanks!
Anastasia 

Post-Doc, Evolutionary Biology Department
Uppsala University
Norbyvägen 18D
SE-752 36 UPPSALA
anastasia.gioti at ebc.uu.se
Tel: +46-18-471 2837
Fax: +46-18-471 6310



      


From biopython at maubp.freeserve.co.uk  Wed Oct 21 10:18:09 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Oct 2009 11:18:09 +0100
Subject: [Biopython] Adaptor trimmer and dimers
In-Reply-To: <843737.47817.qm@web52003.mail.re2.yahoo.com>
References: <20091019112441.GA72523@sobchak.mgh.harvard.edu>
	<843737.47817.qm@web52003.mail.re2.yahoo.com>
Message-ID: <320fb6e00910210318v622658daw3133f90761a7ab7d@mail.gmail.com>

On Wed, Oct 21, 2009 at 10:54 AM, natassa  wrote:
>
> My main problem now is performance of this script: On a file of
> 19 million reads of 76 bp it is running for more than 12 hours!
> So I copy here my code and would be very grateful if someone
> could indicate parts where it could be sped up.

The best way to answer that is to run some profiling yourself.
I would just make a small test file, and profile that.
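
A minimal profiling sketch, using only the standard library. The names
slow_step and process_reads are hypothetical stand-ins for the per-read
alignment call and the main loop of the script; on real code you would
pass your own function and a small test file instead:

```python
import cProfile
import io
import pstats


def slow_step(seq):
    # Stand-in for an expensive per-read operation such as an alignment.
    return sum(1 for a in seq for b in seq if a == b)


def process_reads(reads):
    # Stand-in for the main loop over a small test file.
    return [slow_step(read) for read in reads]


def profile_report(func, *args):
    """Run func(*args) under cProfile and return the report as text."""
    profiler = cProfile.Profile()
    profiler.runcall(func, *args)
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the most expensive call chains come first.
    stats.sort_stats("cumulative").print_stats(10)
    return buffer.getvalue()


if __name__ == "__main__":
    print(profile_report(process_reads, ["ACGT" * 19] * 200))
```

The functions near the top of the report are the ones worth optimising
first.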

> I am not very good in python for sure, but I am also pretty sure
> this is not an endless loop problem and I have run out of ideas
> how to make it faster (unless I abandon working with Seq Records).
> I am seriously thinking of inputting Fastas instead of Fastq-illumina
> files, but for a whole bunch of tests I am running now, being
> able to work with Fastq would be ideal...

You are using Bio.SeqIO to parse the FASTQ files, but you don't
use the quality scores at all. Therefore it would be faster to use
FASTA files, or keep working with FASTQ files but switch from
using SeqRecords to simple strings as described here:

http://lists.open-bio.org/pipermail/biopython/2009-August/005430.html
http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

Peter


From mjldehoon at yahoo.com  Wed Oct 21 10:15:35 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 21 Oct 2009 03:15:35 -0700 (PDT)
Subject: [Biopython] Biopython & p3d
In-Reply-To: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de>
Message-ID: <416618.94041.qm@web62407.mail.re1.yahoo.com>

I think that we should avoid the situation that there are two PDB modules in Biopython. Can we somehow merge Bio.PDB and p3d? Take the best features of p3d and add them to Bio.PDB, or vice versa. If that is not possible, I think we should make a choice between Bio.PDB and p3d.

--Michiel.

--- On Wed, 10/21/09, Christian Fufezan  wrote:

> From: Christian Fufezan 
> Subject: [Biopython] Biopython & p3d
> To: biopython at biopython.org
> Cc: "Michael Specht" 
> Date: Wednesday, October 21, 2009, 3:25 AM
> Hello Biopython,
> 
> we (Michael Specht & I) recently published p3d, a
> python module for structural bioinformatics and were
> wondering if it wouldn't be a good thing if it could join
> the Biopython project. We understand that Biopython has
> already a PDB parser but we programmed an alternative
> version since we found the Biopython.pdb syntax to be too
> non-pythonian. One example why is shown below:
> 
> Biopython:
> 
> def test6(structure):
>     '''get protein surrounding (5) of NAG'''
>     bucket = set()
>     atom_list=Selection.unfold_entities(structure,'A')
>     ns = NeighborSearch(atom_list)
>     for model in structure.get_list():
>         for chain in model.get_list():
>             for residue in chain.get_list():
>                 if residue.get_resname() == 'NAG':
>                     for atom in residue.get_list():
>                         centre = atom.get_coord()
>                         R = 5.0
>                         neighbor_list = ns.search(centre,R)
>                         neighbors = Selection.unfold_entities(neighbor_list,'A')
>                         for atom2 in neighbors:
>                             if 'O' in atom2.get_name():
>                                 bucket.add(atom2)
>     print '     found',len(bucket),' oxygens around NAG'
>     return
> 
> p3d:
> 
> def test6(pdb):
>     ''' protein surrounding (5) of resname NAG'''
>     bgl = pdb.query('resname NAG')
>     bucket = pdb.query('protein and oxygen and within 5 of ',bgl)
>     print '     found',len(bucket),' oxygens around NAG'
>     return
> 
> Certainly, Biopythons PDB module has its advantages and there
> is no way p3d could replace it, but both modules have their
> advantages :) The fact that biopythons.pdb parser uses a
> KTree written in C and we wrote one in python makes certain
> queries to the protein structure faster in Biopython;
> however if the query involves more complex demands, multiple
> loops are inevitable in biopython, whereas p3d offers a
> human readable query function that combines all aspects. The
> link to our publication is:
> http://www.biomedcentral.com/1471-2105/10/258
> 
> Looking forward to hear from you, maybe one can also
> envision a combined module with a new all advantages
> together.
> 
> Kind regards
> 
> Christian Fufezan
> 
> 
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
> 


      



From biopython at maubp.freeserve.co.uk  Wed Oct 21 10:28:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Oct 2009 11:28:56 +0100
Subject: [Biopython] Biopython & p3d
In-Reply-To: <416618.94041.qm@web62407.mail.re1.yahoo.com>
References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de>
	<416618.94041.qm@web62407.mail.re1.yahoo.com>
Message-ID: <320fb6e00910210328p538ef75lb52e53203ec42df9@mail.gmail.com>

On Wed, Oct 21, 2009 at 11:15 AM, Michiel de Hoon  wrote:
> I think that we should avoid the situation that there are two PDB modules
> in Biopython.

Agreed.

> Can we somehow merge Bio.PDB and p3d? Take the best features of p3d
> and add them to Bio.PDB, or vice versa.

That's what I was thinking. Note that Christian and Michael will have to
re-license any such contributions (p3d uses the GNU GPL V2 which is
not compatible).

> If that is not possible, I think we should make a choice between Bio.PDB
> and p3d.

As Christian pointed out, the two have some non-overlapping functionality,
so replacing Bio.PDB with p3d isn't really an option (even if it was
re-licensed).

Peter


From fufezan at uni-muenster.de  Wed Oct 21 10:31:38 2009
From: fufezan at uni-muenster.de (Christian Fufezan)
Date: Wed, 21 Oct 2009 12:31:38 +0200
Subject: [Biopython] Biopython & p3d
In-Reply-To: <320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com>
References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de>
	<320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com>
Message-ID: <905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de>


On 21 Oct 2009, at 11:18, Peter wrote:

> On Wed, Oct 21, 2009 at 8:25 AM, Christian Fufezan
>  wrote:
>> Hello Biopython,
>>
>> we ( Michael Specht & I ) published recently p3d, a python module for
>> structural bioinformatics and were wondering if it wouldn't be a good
>> thing if it could join the Biopython project. We understand that
>> Biopython has
>> already a PDB parser but we programmed an alternative version since  
>> we found
>> the Biopython.pdb syntax to be too non-pythonian. One example why  
>> is shown
>> below:
>>
>> Biopython:
>>
>> def test6(structure):
>>        '''get protein surrounding (5) of NAG'''
>>        bucket = set()
>>        atom_list=Selection.unfold_entities(structure,'A')
>>        ns = NeighborSearch(atom_list)
>>        for model in structure.get_list():
>>                for chain in model.get_list():
>>                        for residue in chain.get_list():
>
> I'm not very familiar with the NeighborSearch code, but
> I'm pretty sure the above for loops can be just:
>
> for model in structure:
>    for chain in model:
>        for residue in chain:
>            ...
>
> And regarding detecting oxygen atoms, I think there is
> a patch on bugzilla to record the (relatively) new atom
> column from the PDB file (which will help with Hg and
> mercury versus hydrogen).
>
> Still, I would agree with you that some parts of Bio.PDB
> are not very pythonic - too many functions names get_*()
> which could be replaced with properties. This is something
> we could evolve gradually (add new properties, keep the
> old methods in place but gradually deprecate them).
>
> Specific suggestions would be welcome.

That's maybe the biggest difference between biopython and p3d, which  
will make it difficult to merge the two modules.
A data structure that is built like that of Biopython.pdb imposes
multiple nested loops and condition queries.
p3d's data structure is not nested and gains strength through a
combination of sets and a BSPTree.
This allows faster and more flexible looping, for example looping over
all alpha and beta-carbons and printing their x-coordinates:

p3d:
for atom in pdb.query('protein and atom type CB or atom type CA'):
	print atom.x

Still I think both methods could exist side by side. Whether that is
efficient, I don't know. Replacing biopython's pdb parser was never
the intention and I think it has features that are really good and fast!

>
>> def test6(pdb):
>>        ''' protein surrounding (5) of resname NAG'''
>>        bgl = pdb.query('resname NAG')
>>        bucket = pdb.query('protein and oxygen and within 5 of ',bgl)
>>        print '     found',len(bucket),' oxygens around NAG'
>>        return
>>
>> Certainly, Biopythons PDB module has its advantages and there is no
>> way p3d could replace it, but both modules have their advantages :)
>> The fact that biopythons.pdb parser uses a KTree written in C and we
>> wrote one in python makes certain queries to the protein structure
>> faster in Biopython; however
>> if the query involves more complex demands, multiple loops are  
>> inevitable in
>> biopython, whereas p3d offers a human readable query function that  
>> combines
>> all aspects. The link to our publication is:
>> http://www.biomedcentral.com/1471-2105/10/258
>
> I remember skim reading it a month ago or so. I remember the final  
> line of
> the abstract was a very strong opinion ("a perfect tool"), and I was  
> rather
> surprised the reviewers and editor let you keep it - regardless of  
> any bias
> I might feel to Biopython ;)
>

I guess it was a selling point ;)


>> Looking forward to hear from you, maybe one can also envision a
>> combined module with a new all advantages together.
>
> That would be a good outcome.
>
> From the snippet of code and the examples in the paper, the big  
> feature
> you have that Bio.PDB lacks is "fancy selections", and that is  
> certainly
> something which could be improved in Biopython.
>
Yes that was one thing that we were really missing. Also the fact that  
biopython requires the unfolded entity to be converted to vectors and  
so forth was a bit complex and we needed fast and direct access to the  
coordinates, the very essence of pdb files.

> It is interesting you have implemented (invented?) a string based  
> language
> with logical and, within etc. In some ways it reminds me of the  
> selection
> formulae in VMD - have you used that 3D visualisation tool?
>

Yes I use VMD a lot and the inspiration came certainly from there.
A few things are however unique in p3d, e.g. first residue of chain A  
and p3d supports residue 15 .. 20 to select a range of residues.

Michael has coded the parser that translates the human readable query
into set operations and functions, and he even implemented a strategy in
which new functions or query types can be built in in no time. E.g.
"ligand containing sulfur" could be implemented in 5 min.
He has truly done a great job on this.

> This also reminds me of the SQL language for database selections, and
> how classical SQL code with Python just used SQL statements within
> Python strings. Have you ever used SQLAlchemy, and looked at how
> they handle SQL statements like filters, ands, ors, etc with a clever
> object based interface? Perhaps something like that could work for
> a 3D structure query API.

That certainly sounds very interesting. It would also allow to  
incorporate the actual pdb files into the database
which would reduce loading and tree building times. Surveys, pattern  
screening could be done very fast. One could also imagine
connecting other pdb databases, such as SCOP, Pfam or web services,  
e.g. PISCES.

Regards,

Christian


From biopython at maubp.freeserve.co.uk  Wed Oct 21 10:37:30 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Oct 2009 11:37:30 +0100
Subject: [Biopython] Biopython & p3d
In-Reply-To: <905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de>
References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de>
	<320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com>
	<905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de>
Message-ID: <320fb6e00910210337r2a3b2eb9n36fef3a16ec02037@mail.gmail.com>

On Wed, Oct 21, 2009 at 11:31 AM, Christian Fufezan
 wrote:
>> This also reminds me of the SQL language for database selections, and
>> how classical SQL code with Python just used SQL statements within
>> Python strings. Have you ever used SQLAlchemy, and looked at how
>> they handle SQL statements like filters, ands, ors, etc with a clever
>> object based interface? Perhaps something like that could work for
>> a 3D structure query API.
>
> That certainly sounds very interesting. It would also allow to incorporate
> the actual pdb files into the database which would reduce loading and
> tree building times. Surveys, pattern screening could be done very fast.
> One could also imagine connecting other pdb databases, such as SCOP,
> Pfam or web services, e.g. PISCES.

I was actually suggesting having a object based API for building search
terms instead of parsing a human friendly string.
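
A rough sketch of what such an object based interface could look like,
in the spirit of SQLAlchemy filters. Everything here is hypothetical
(the Cond class, the resname() and element() helpers and the dictionary
based atoms are made up for illustration); it is neither Bio.PDB nor
p3d code:

```python
class Cond:
    """A composable test on an atom; combine with &, | and ~."""

    def __init__(self, test):
        self.test = test

    def __call__(self, atom):
        return self.test(atom)

    def __and__(self, other):
        return Cond(lambda atom: self(atom) and other(atom))

    def __or__(self, other):
        return Cond(lambda atom: self(atom) or other(atom))

    def __invert__(self):
        return Cond(lambda atom: not self(atom))


def resname(name):
    # Condition: the atom belongs to a residue with this name.
    return Cond(lambda atom: atom["resname"] == name)


def element(symbol):
    # Condition: the atom is of this chemical element.
    return Cond(lambda atom: atom["element"] == symbol)


def select(atoms, cond):
    # Apply a (possibly combined) condition to a flat list of atoms.
    return [atom for atom in atoms if cond(atom)]


atoms = [
    {"resname": "NAG", "element": "O", "name": "O3"},
    {"resname": "NAG", "element": "C", "name": "C1"},
    {"resname": "ALA", "element": "O", "name": "O"},
]
query = element("O") & ~resname("NAG")
print([atom["name"] for atom in select(atoms, query)])
```

The search terms are ordinary Python objects, so they can be stored,
combined and reused without any string parsing.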

But yes, loading a PDB file into a database does have some advantages.

Peter


From biopython at maubp.freeserve.co.uk  Wed Oct 21 11:01:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Oct 2009 12:01:35 +0100
Subject: [Biopython] Biopython & p3d
In-Reply-To: <905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de>
References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de>
	<320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com>
	<905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de>
Message-ID: <320fb6e00910210401l737252deg78de117143395279@mail.gmail.com>

On Wed, Oct 21, 2009 at 11:31 AM, Christian Fufezan
 wrote:
>
> A data structure that is build like that of Biopython.pdb imposes
> multiple nested loops and condition queries.

Not really - see below.

> p3ds data structure is not nested and gains strength through combination
> of sets and BSPTree
> This allows faster and more flexible looping. Looping over all alpha and
> beta-carbons for example and printing x-coordinates
>
> p3d:
> for atom in pdb.query('protein and atom type CB or atom type CA'):
>        print atom.x

The Bio.PDB structure, model or chain object do offer direct access
to a "flat" list of atoms via the get_atoms() method. e.g.

from Bio import PDB
structure = PDB.PDBParser().get_structure("Test", "XXXX.pdb")
for atom in structure.get_atoms() :
	if atom.name in ["CA", "CB"] : print atom.coord

(I'd have to think a bit longer about how in general to restrict this to
proteins, here that is implicit since CA and CB are protein specific)

You can also of course use a list comprehension, e.g. to get all
the x-coordinates (which I guess is what your example does),

from Bio import PDB
structure = PDB.PDBParser().get_structure("Test", "XXXX.pdb")
x_list = [atom.coord[0] for atom in structure.get_atoms() \
             if atom.name in ["CA", "CB"]]

You can also drill down through the nested structure of models,
chains and residues to get to the atoms that way.

To me these are more Pythonic than the clever natural language
parsing in p3d (which seems ideal for a user interface, rather than
a programming API). Biopython might be improved by defining an
atoms property (list or iterator?) instead of the get_atoms() method.

One might also ask for x, y and z properties on the atom object
to provide direct access to the three coordinates as floats. Do
you think this sort of little thing would help improve Bio.PDB?
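
To make the idea concrete, here is a small sketch of x, y and z
properties layered on top of a coordinate triple. The Atom class below
is hypothetical and only illustrates the pattern; it is not the Bio.PDB
implementation:

```python
class Atom:
    """Hypothetical atom class; not the Bio.PDB or p3d implementation."""

    def __init__(self, name, coord):
        self.name = name
        # Keep the x-y-z triple as the primary data.
        self.coord = tuple(float(value) for value in coord)

    @property
    def x(self):
        return self.coord[0]

    @property
    def y(self):
        return self.coord[1]

    @property
    def z(self):
        return self.coord[2]


atom = Atom("CA", (1.5, 2.0, -3.25))
print(atom.x, atom.y, atom.z)
```

The triple stays the single source of truth, and the three properties
are just read-only views onto it.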

> Still I think both methods could exists side by side. If it is efficient - I
> don't know. Replacing biopythons.pdb parser was never the intention
> and I think it has features that are really good and fast!

Yes, it should be possible to offer nice nested access and nice flat
access from the same objects. Internally the current Biopython PDB
structure could perhaps be handled as filtered views of a complete
list of all the atoms (using sets and trees or a database or whatever).
That might make some things faster too.
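
A toy sketch of that idea, with one flat atom list and nested views
built from it on demand. The Structure class and the tuple layout are
assumptions made for illustration only:

```python
from collections import defaultdict


class Structure:
    """Hypothetical container: one flat atom list, nested views on demand."""

    def __init__(self, atoms):
        # Each atom is a (chain_id, residue_number, atom_name) tuple.
        self.atoms = list(atoms)

    def by_chain(self):
        # Group the flat list into {chain_id: [atoms]}.
        chains = defaultdict(list)
        for atom in self.atoms:
            chains[atom[0]].append(atom)
        return dict(chains)

    def by_residue(self, chain_id):
        # Group one chain into {residue_number: [atoms]}.
        residues = defaultdict(list)
        for atom in self.atoms:
            if atom[0] == chain_id:
                residues[atom[1]].append(atom)
        return dict(residues)


structure = Structure([
    ("A", 1, "N"), ("A", 1, "CA"), ("A", 2, "N"), ("B", 1, "N"),
])
print(len(structure.atoms))            # flat access
print(sorted(structure.by_chain()))    # nested access
```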

> Yes that was one thing that we were really missing. Also the fact that
> biopython requires the unfolded entity to be converted to vectors and so
> forth was a bit complex and we needed fast and direct access to the
> coordinates, the very essence of pdb files.

I'm not quite sure what you mean here by "vectors". Could you
be a little more specific? Do you want NumPy style objects or
something else?

Peter



From chapmanb at 50mail.com  Wed Oct 21 12:34:22 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 21 Oct 2009 08:34:22 -0400
Subject: [Biopython] Adaptor trimmer and dimers
In-Reply-To: <843737.47817.qm@web52003.mail.re2.yahoo.com>
References: <20091019112441.GA72523@sobchak.mgh.harvard.edu>
	<843737.47817.qm@web52003.mail.re2.yahoo.com>
Message-ID: <20091021123422.GD72523@sobchak.mgh.harvard.edu>

Hi Anastasia;
Thanks for the additional info.

> I also had to add an additional test for the length of the alignment output,
> as I got an index Error for the cases the adapter does not align at
> all. 

Good catch on this. I updated the trimming code to handle that case:

http://github.com/chapmanb/bcbb/blob/master/align/adaptor_trim.py

> My main problem now is performance of this script: On a file of 19
> million reads of 76 bp it is running for more than 12 hours! So I copy
> here my code and would be very grateful if someone could indicate parts
> where it could be sped up. 

Peter had a good suggestion on profiling. The Python profile module
is quick to learn and can quickly point you in the direction of the
most used functions:

http://docs.python.org/library/profile.html

Based on reading your code there are a couple of things that stick
out to me:

- You are calling the pairwise2 alignment 3 times. You should call
  this once, assign the alignment information to a variable, and then
  perform your if/else tests on that. The updated trimming code above 
  is a good example of doing this.

- You are slicing SeqRecord objects, and then never using the sliced
  records. Your code doesn't look like adaptor trimming, but rather
  filtering out reads without a sequence. If you don't need the
  trimmed record, pass a string (str(rec1.seq) and str(rec2.seq)) to
  the handle_adaptor function instead of the record; the slicing is
  then done on a much simpler object and you avoid the substantial 
  overhead of slicing up quality scores that are never used.
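
The two points above can be sketched as follows: work on plain strings,
run the alignment once, and branch on the stored result. This is only an
illustration. The ungapped_align function is a hypothetical stand-in
that returns tuples of the same shape the original script unpacks
(seq_a, adaptor_a, score, start, end); it is not pairwise2, and in real
use you would pass a wrapper around your own alignment call instead:

```python
def ungapped_align(seq, adaptor):
    """Stand-in aligner: best ungapped placement of adaptor on seq."""
    best = None
    for start in range(len(seq) - len(adaptor) + 1):
        window = seq[start:start + len(adaptor)]
        score = sum(1 for s, a in zip(window, adaptor) if s == a)
        if best is None or score > best[1]:
            best = (start, score)
    if best is None:
        return []  # adaptor longer than the read: no alignment at all
    start, score = best
    end = start + len(adaptor)
    # Pad the adaptor so both strings share the same coordinates.
    adaptor_a = "-" * start + adaptor + "-" * (len(seq) - end)
    return [(seq, adaptor_a, score, start, end)]


def lacks_adaptor(seq, adaptor, num_errors, align):
    """Return True if adaptor is absent from seq within num_errors."""
    if seq.find(adaptor) >= 0:
        return False  # exact match, adaptor present
    alignments = align(seq, adaptor)  # the only alignment call
    if not alignments:
        return True
    seq_a, adaptor_a, score, start, end = alignments[0]
    matches = sum(1 for s, a in zip(seq_a[start:end], adaptor_a[start:end])
                  if s == a)
    return (len(adaptor) - matches) > num_errors


if __name__ == "__main__":
    print(lacks_adaptor("TTTTACGTACGTTTTT", "ACGAACGT", 2, ungapped_align))
```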

If you end up needing trimmed fastq sequences, here is how I would
reimplement your basic logic with the trimmer and Peter's suggestion:

from Bio.SeqIO.QualityIO import FastqGeneralIterator
from adaptor_trim import trim_adaptor_w_qual

in_file = "test.fastq"
out_file = "trimmed.fastq"

in_handle = open(in_file)
out_handle = open(out_file, "w")
iterator = FastqGeneralIterator(in_handle)
adaptor = "AAAAAAAAAAAAAAAAAAAA"
num_errors = 2
while 1:
    try:
        title1, seq1, qual1 = iterator.next()
        title2, seq2, qual2 = iterator.next()
    except StopIteration:
        break

    tseq1, tqual1 = trim_adaptor_w_qual(seq1, qual1, adaptor, num_errors)
    tseq2, tqual2 = trim_adaptor_w_qual(seq2, qual2, adaptor, num_errors)

    # if neither has the adaptor
    if len(tseq1) == len(seq1) and len(tseq2) == len(seq2):
        out_handle.write("@%s\n%s\n+\n%s\n" % (title1, tseq1, tqual1))
        out_handle.write("@%s\n%s\n+\n%s\n" % (title2, tseq2, tqual2))
out_handle.close()
in_handle.close()

Hope this helps,
Brad


From biopython at maubp.freeserve.co.uk  Wed Oct 21 16:16:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Oct 2009 17:16:22 +0100
Subject: [Biopython] Deprecating Bio.Clustalw?
Message-ID: <320fb6e00910210916l5d39aa2eje322f2a01e9ac020@mail.gmail.com>

Dear all,

In our most recent release, Biopython 1.52, Bio.Clustalw was
declared obsolete. This is just a label to indicate that it will at
some point be deprecated (issue a warning when used) and
later it will be removed completely.

The module provides two features - parsing Clustal alignments,
and calling the clustalw command line tool.

Bio.AlignIO took over the role for parsing alignments a year
and a half ago with Biopython 1.46 (June 2008).

More recently, Bio.Align.Applications took over the role for calling
ClustalW in Biopython 1.51 (August 17, 2009) as part of an on
going standardisation of our command line wrappers using the
built in Python module subprocess.

I recognise that Bio.Clustalw has been widely used, and
there are likely to be many existing scripts out there using it.
Does leaving this module as "obsolete" for Biopython 1.53,
and deprecating it in Biopython 1.54 sound like a good plan?

If anyone is using it heavily, please say so - especially if you
try and update your code to use Bio.AlignIO or subprocess
and Bio.Align.Applications.

Peter


From fufezan at uni-muenster.de  Wed Oct 21 18:22:48 2009
From: fufezan at uni-muenster.de (Christian Fufezan)
Date: Wed, 21 Oct 2009 20:22:48 +0200
Subject: [Biopython] Biopython & p3d
In-Reply-To: <320fb6e00910210401l737252deg78de117143395279@mail.gmail.com>
References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de>
	<320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com>
	<905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de>
	<320fb6e00910210401l737252deg78de117143395279@mail.gmail.com>
Message-ID: 

>> A data structure that is build like that of Biopython.pdb imposes
>> multiple nested loops and condition queries.
>
> Not really - see below.

if things get more complicated, there might be a need ....

>> p3ds data structure is not nested and gains strength through  
>> combination
>> of sets and BSPTree
>> This allows faster and more flexible looping. Looping over all  
>> alpha and
>> beta-carbons for example and printing x-coordinates
>>
>> p3d:
>> for atom in pdb.query('protein and atom type CB or atom type CA'):
>>        print atom.x
>
> The Bio.PDB structure, model or chain object do offer direct access
> to a "flat" list of atoms via the get_atoms() method. e.g.
>
> from Bio import PDB
> structure = PDB.PDBParser().get_structure("Test", "XXXX.pdb")
> for atom in structure.get_atoms() :
> 	if atom.name in ["CA", "CB"] : print atom.coord
>
> (I'd have to think a bit longer about how in general to restrict  
> this to
> proteins, here that is implicit since CA and CB are protein specific)
>

That would be the second condition to check ... if the search should  
be limited to certain atoms of chain A and C then one would require  
another check. Personally, I can not see the advantages of a nested  
structure, but then I am not an expert.

> You can also of course use a list comprehension, e.g. to get all
> the x-coordinates (which I guess is what your example does),
>
> from Bio import PDB
> structure = PDB.PDBParser().get_structure("Test", "XXXX.pdb")
> x_list = [atom.coord[0] for atom in structure.get_atoms() \
>             if atom.name in ["CA", "CB"]]
>
> You can also drill down through the nested structure of models,
> chains and residues to get to the atoms that way.
>
> To me these are more Pythonic than the clever natural language
> parsing in p3d (which seems ideal for a user interface, rather than
> a programming API).

That is, I guess, a matter of taste. I am happy if an API helps me to  
reach my goal fast.
x_list = [atom.x for atom in pdb.query('protein and atom type CB or  
atom type CA')]
seems more intuitive and clearer than atom.coord[0] for atom in  
structure.get_atoms() if atom.name in ["CA", "CB"].
But I guess that's a matter of taste. Pythonian for me is readable  
source code. But again, that's a matter of taste.

If things get more complex, then the power of a human readable
interface becomes clearer.
For example, consider that you want to get all ALAs that are within a
distance range of a point in space.
In p3d, one can define the point in space by a p3d.vector.Vector, let's
say V1, and then form a query
similar to "within 20 of V1 and not within 10 of V1".
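
As a rough illustration of what such a shell query boils down to, here
is a brute-force sketch using plain sets. The within() helper and the
dictionary of coordinates are hypothetical and do not reflect how p3d
implements it (p3d uses a tree for the spatial search):

```python
import math


def within(atoms, centre, radius):
    """Return the ids of all atoms no further than radius from centre."""
    selected = set()
    for atom_id, (x, y, z) in atoms.items():
        distance = math.sqrt((x - centre[0]) ** 2 +
                             (y - centre[1]) ** 2 +
                             (z - centre[2]) ** 2)
        if distance <= radius:
            selected.add(atom_id)
    return selected


atoms = {"near": (5.0, 0.0, 0.0),
         "shell": (15.0, 0.0, 0.0),
         "far": (25.0, 0.0, 0.0)}
v1 = (0.0, 0.0, 0.0)
# "within 20 of V1 and not within 10 of V1" as a set difference
print(within(atoms, v1, 20) - within(atoms, v1, 10))
```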

Or all proteinogenic oxygens that are not part of the backbone and  
within 4 Å of a ligand, e.g. ATP.
without knowing what kind of oxygens these could be (i.e. OG1, OG,  
OE1, OD1, OD2, OE2)
one can easily formulate a query in the form of "protein and oxygen  
and not backbone and within 4 of resname ATP"

The query can actually also be resolved to a set of set operations e.g.
for atom in pdb.hash["resid"][20] & pdb.hash["oxygen"][""]:
but the query function is simply too convenient ;)
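
For readers unfamiliar with the idea, a minimal sketch of such an index
of sets follows. The build_index() helper and the atom dictionaries are
hypothetical and much simpler than the actual pdb.hash structure in p3d:

```python
from collections import defaultdict


def build_index(atoms, key):
    """Map each value of atom[key] to the set of matching atom positions."""
    index = defaultdict(set)
    for position, atom in enumerate(atoms):
        index[atom[key]].add(position)
    return index


atoms = [
    {"resid": 20, "element": "O"},
    {"resid": 20, "element": "C"},
    {"resid": 21, "element": "O"},
]
by_resid = build_index(atoms, "resid")
by_element = build_index(atoms, "element")
# Oxygens of residue 20, as an intersection of two precomputed sets
print(by_resid[20] & by_element["O"])
```

Once the sets are built, every combined condition is a cheap set
operation rather than another loop over all atoms.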

> Biopython might be improved by defining an
> atoms property (list or iterator?) instead of the get_atoms() method.
>
agree.  I would argue that p3d's atom/vector class seems the way to go.

> One might also ask for x, y and z properties on the atom object
> to provide direct access to the three coordinates as floats. Do
> you think this sort of little thing would help improve Bio.PDB?
>
yes indeed, that is _the_ information a pdb module should offer  
without any addition.
Better would be even if the atoms are treatable as vectors (see below).
p3d has a series of atom object attributes that are convenient.

>> Still I think both methods could exists side by side. If it is  
>> efficient - I
>> don't know. Replacing biopythons.pdb parser was never the intention
>> and I think it has features that are really good and fast!
>
> Yes, it should be possible to offer nice nested access and nice flat
> access from the same objects. Internally the current Biopython PDB
> structure could perhaps be handled as filtered views of a complete
> list of all the atoms (using sets and trees or a database or  
> whatever).
> That might make some things faster too.

I agree to some extent. As above, I can only say that I cannot see the  
advantage of a nested data structure.
Maybe you can explain with an example where drilling through the  
nested structure could come in handy.

>> Yes that was one thing that we were really missing. Also the fact  
>> that
>> biopython requires the unfolded entity to be converted to vectors  
>> and so
>> forth was a bit complex and we needed fast and direct access to the
>> coordinates, the very essence of pdb files.
>
> I'm not quite sure what you mean here by "vectors". Could you
> be a little more specific? Do you want NumPy style objects or
> something else?


In p3d the atom objects are vectors, so writing a structural
alignment script is straightforward (see e.g. http://p3d.fufezan.net/index.php?title=alignByATP).
Finding the geometric centre of the protein, a residue, a chain
or a custom set is simply
centre = p3d.vector.Vector()
for atom in atoms:
	centre += atom
centre = centre/len(atoms)

So the distance between two atoms is the length of their difference, e.g.
atomA.distanceTo(atomB) will yield the same as abs(atomA-atomB)

Yes similar to a NumPy object, but without the big NumPy overhead and  
more specific to atoms, e.g. atom.resid, atom.chain, atom.beta, atom.x.
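
A minimal sketch of a vector class with this behaviour is shown below.
It is a hypothetical illustration only, not p3d's actual
p3d.vector.Vector class, and the method name distance_to is made up:

```python
import math


class Vector:
    """Hypothetical 3D vector supporting +, -, / and abs()."""

    def __init__(self, x=0.0, y=0.0, z=0.0):
        self.x, self.y, self.z = float(x), float(y), float(z)

    def __add__(self, other):
        return Vector(self.x + other.x, self.y + other.y, self.z + other.z)

    def __sub__(self, other):
        return Vector(self.x - other.x, self.y - other.y, self.z - other.z)

    def __truediv__(self, scalar):
        return Vector(self.x / scalar, self.y / scalar, self.z / scalar)

    def __abs__(self):
        # Euclidean length of the vector.
        return math.sqrt(self.x ** 2 + self.y ** 2 + self.z ** 2)

    def distance_to(self, other):
        return abs(self - other)


def centroid(vectors):
    """Geometric centre of a non-empty list of vectors."""
    total = Vector()
    for vector in vectors:
        total = total + vector
    return total / len(vectors)


a, b = Vector(0, 0, 0), Vector(3, 4, 0)
print(a.distance_to(b), abs(a - b))
```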




From biopython at maubp.freeserve.co.uk  Wed Oct 21 22:14:10 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Oct 2009 23:14:10 +0100
Subject: [Biopython] Biopython & p3d
In-Reply-To: 
References: <2034B111-0AB9-4FE8-B1F9-62E753DA0FE8@uni-muenster.de>
	<320fb6e00910210218n39b746bu6c5fd826efe6097f@mail.gmail.com>
	<905F76E9-4BA4-4297-86E0-E6BDDDEEC624@uni-muenster.de>
	<320fb6e00910210401l737252deg78de117143395279@mail.gmail.com>
	
Message-ID: <320fb6e00910211514r952732fw6c6d48f9be0e46c9@mail.gmail.com>

On Wed, Oct 21, 2009 at 7:22 PM, Christian Fufezan wrote:
>> Biopython might be improved by defining an atom
>> property (list or iterator?) instead of the get_atoms() method.
>
> agree.  I would argue that p3d's atom/vector class seems the way to go.

We can probably have similar things for chains etc. Any other
views on this? I never liked the get_* and set_* methods in
Bio.PDB myself, and using Python properties seems more
natural here (they may not have existed when Bio.PDB was
first started - I'd have to check).
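A minimal sketch of the property idea, using a toy class rather than the real Bio.PDB code. The class name Residue and the attribute name atoms are only illustrative assumptions here; the point is that a property can sit alongside the existing get_atoms() method without breaking old scripts.

```python
class Residue(object):
    """Toy container for illustration; NOT the real Bio.PDB.Residue class."""

    def __init__(self, name, atoms):
        self.name = name
        self._atoms = list(atoms)

    def get_atoms(self):
        """Old-style accessor, kept so that existing scripts still work."""
        return iter(self._atoms)

    @property
    def atoms(self):
        """Property-style access; returns a new list of the atoms."""
        return list(self._atoms)


residue = Residue("GLY", ["N", "CA", "C", "O"])
print(residue.atoms)
print(list(residue.get_atoms()) == residue.atoms)
```

Returning a copy from the property means callers cannot accidentally modify the internal list; whether that is the right trade-off for Bio.PDB is an open design question.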

[We should probably break out specific suggestions like this
into new mailing list threads, and CC people like Thomas H.]

>> One might also ask for x, y and z properties on the atom object
>> to provide direct access to the three coordinates as floats. Do
>> you think this sort of little thing would help improve Bio.PDB?
>>
> yes indeed, that is _the_ information a pdb module should offer
> without any addition. Better would be even if the atoms are
> treatable as vectors (see below). p3d has a series of atom
> object attributes that are convenient.

I would argue that the x-y-z triple (which Biopython has) is
more important than separate x, y, and z floats. We seem
to agree here.

The Biopython atom's coord property is an x-y-z triple (as a
one dimensional numpy array). The Bio.PDB code also
defines its own vector objects on top of this, but my memory
of the details is hazy here. As I recall, I personally stuck
with the numpy objects in my scripts using Bio.PDB.

>> Yes, it should be possible to offer nice nested access and nice flat
>> access from the same objects. Internally the current Biopython PDB
>> structure could perhaps be handled as filtered views of a complete
>> list of all the atoms (using sets and trees or a database or whatever).
>> That might make some things faster too.
>
> I agree to some extent. As above, I can only say that I
> cannot see the advantage of a nested data structure.
> Maybe you can explain with an example where drilling
> through the nested structure could come in handy.

The drill down is great for selecting a particular residue or
chain (or for NMR, a particular model). It is also good for
looping over these structures - e.g. to process psi/phi
angles along a protein backbone.
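To make the nested versus flat discussion concrete, here is a small sketch under stated assumptions: the Atom tuple and the build_nested helper are hypothetical and are not part of Bio.PDB or p3d. It shows one flat list of atoms being exposed both as a drill-down structure and as a flat, filterable list.

```python
from collections import namedtuple

# Hypothetical minimal atom record: chain identifier, residue number, atom name.
Atom = namedtuple("Atom", ["chain", "resid", "name"])

atoms = [
    Atom("A", 1, "N"), Atom("A", 1, "CA"),
    Atom("A", 2, "N"), Atom("A", 2, "CA"),
    Atom("B", 1, "N"),
]


def build_nested(flat_atoms):
    """Build a chain -> residue -> atom name mapping from a flat atom list."""
    nested = {}
    for atom in flat_atoms:
        nested.setdefault(atom.chain, {}).setdefault(atom.resid, {})[atom.name] = atom
    return nested


nested = build_nested(atoms)

# Nested access: drill down to one particular atom, or loop along a chain.
ca = nested["A"][2]["CA"]
chain_a_residues = sorted(nested["A"])

# Flat access: filter the same atoms without caring about the hierarchy.
flat_ca = [atom for atom in atoms if atom.name == "CA"]

print(ca)
print(chain_a_residues)
print(len(flat_ca))
```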

>>> Yes that was one thing that we were really missing. Also the fact that
>>> biopython requires the unfolded entity to be converted to vectors and so
>>> forth was a bit complex and we needed fast and direct access to the
>>> coordinates, the very essence of pdb files.
>>
>> I'm not quite sure what you mean here by "vectors". Could you
>> be a little more specific? Do you want NumPy style objects or
>> something else?
>
> In p3d the atom objects are vectors,

I don't immediately see what the intention is here. What does
"adding" or "subtracting" two atom/vector objects give you? A
new non-atom vector would be my guess? What about
multiplying by a scalar? Again, getting a non-atom vector
object back makes most sense.
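One way to picture the behaviour guessed at above, as a sketch only: the Vector and Atom classes below are hypothetical and are not taken from p3d. Arithmetic on atoms deliberately returns a plain Vector, and abs() gives the length.

```python
import math


class Vector(object):
    """Plain 3D vector (illustrative only)."""

    def __init__(self, x=0.0, y=0.0, z=0.0):
        self.x, self.y, self.z = float(x), float(y), float(z)

    def __add__(self, other):
        return Vector(self.x + other.x, self.y + other.y, self.z + other.z)

    def __sub__(self, other):
        return Vector(self.x - other.x, self.y - other.y, self.z - other.z)

    def __mul__(self, scalar):
        return Vector(self.x * scalar, self.y * scalar, self.z * scalar)

    def __abs__(self):
        return math.sqrt(self.x ** 2 + self.y ** 2 + self.z ** 2)


class Atom(Vector):
    """A vector that also carries atom specific annotation."""

    def __init__(self, name, x, y, z):
        Vector.__init__(self, x, y, z)
        self.name = name


a = Atom("CA", 0.0, 0.0, 0.0)
b = Atom("CB", 3.0, 4.0, 0.0)
diff = b - a
print(type(diff).__name__)  # Vector, not Atom
print(abs(diff))  # 5.0
```

Returning a plain Vector avoids having to invent a name, residue or chain for the "difference" of two atoms.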

> so writing an structural alignment script is straight forward
> (see e.g. http://p3d.fufezan.net/index.php?title=alignByATP).

Structural alignment is not so different in Biopython - just the details. e.g.
http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/

> Or to find the geometric centre of the protein/a residue/ a chain
> or a custom set is simply
> centre = p3d.vector.Vector()
> for atom in atoms:
>     centre += atom
> centre = centre/len(atoms)

And you can do all of that with the NumPy array of three coordinates
accessed via atom.coord - in many respects it is a "vector". For
example, with a typical Bio.PDB Residue object, the geometric
center/centre is just one line:

>>> centre = sum(atom.coord for atom in residue) / len(residue)
>>> centre
array([ -0.21274999,   2.609375  ,  13.95149994], dtype=float32)

The centre of mass would be more interesting to calculate,
but for that we need the atomic masses.
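As a rough sketch of the centre of mass idea, assuming each atom is available as an (element, (x, y, z)) pair: the function name and input layout are hypothetical, and the mass table only covers a few common elements (rounded standard atomic weights).

```python
# Rounded standard atomic weights for a few common elements.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999, "S": 32.06}


def centre_of_mass(atoms):
    """Return the mass weighted centre of (element, (x, y, z)) pairs."""
    total = 0.0
    sum_x = sum_y = sum_z = 0.0
    for element, (x, y, z) in atoms:
        mass = ATOMIC_MASS[element]  # raises KeyError for unknown elements
        total += mass
        sum_x += mass * x
        sum_y += mass * y
        sum_z += mass * z
    return (sum_x / total, sum_y / total, sum_z / total)


# Two carbon atoms: the centre of mass is the midpoint.
print(centre_of_mass([("C", (0.0, 0.0, 0.0)), ("C", (2.0, 0.0, 0.0))]))
```

With Bio.PDB the coordinates would come from atom.coord; where the element symbol comes from depends on what the parser records, so that part is left open here.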

> So distances between two atoms are the length of their subtraction, e.g
> atomA.distanceTo(atomB) will yield the same as abs(atomA-atomB)

I guess your atomA-atomB returns a vector, and abs() gives
its length.

You can get the distance between two Bio.PDB atoms with
atomA-atomB (and you don't need to stick an abs on it either,
because our atoms are not trying to act like vectors - we
can just return a float).

> Yes similar to a NumPy object, but without the big NumPy overhead
> and more specific to atoms, e.g. atom.resid, atom.chain, atom.beta,
> atom.x.

Well, yes, NumPy is a big project, and Bio.PDB is one of the main
bits of Biopython that uses it. But it is very useful for numerical
work, and a good choice here I think. And assuming you *like*
numpy, having the Bio.PDB atom objects expose the x-y-z
coordinates as a simple one dimensional numpy array of floats
is very natural.

You said earlier:
>>> Also the fact that biopython requires the unfolded entity
>>> to be converted to vectors and so forth was a bit complex
>>> and we needed fast and direct access to the coordinates,
>>> the very essence of pdb files.

I disagree. The Biopython atom objects give "fast and direct
access to the coordinates" via the coord property, which is
a one-dimensional numpy array (aka, a vector). For fast
and efficient numerical operations there is no need to
convert this into anything else (although a bespoke vector
object may make things more elegant).

Peter

P.S. This thread is proving quite interesting :)



From biopython at maubp.freeserve.co.uk  Wed Oct 21 22:55:11 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 21 Oct 2009 23:55:11 +0100
Subject: [Biopython] Biopython on Jython
Message-ID: <320fb6e00910211555m4379dd34wfecdd9373cfd0498@mail.gmail.com>

Hi Kyle,

You probably noticed I merged some of your fixes to get (the non C and
non NumPy bits of) Biopython to work on Jython, but not all. Could you
update your github branch to the trunk at some point? That would help
in picking up more of your fixes.

Many of the issues related to large python methods exceeding JVM size
restrictions, something which Jython was going to try and fix in 2.5.1
(but didn't seem to be solved in the release candidate I was trying),
see e.g. http://bugs.jython.org/issue527524
Do you (Kyle) know more about the Jython plans and if/when they
might resolve this? I would prefer to avoid any ugly Jython specific
fixes in Biopython - especially if the next release of Jython may
resolve many of these points.

Thanks,

Peter


From biopython at maubp.freeserve.co.uk  Thu Oct 22 09:15:27 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 22 Oct 2009 10:15:27 +0100
Subject: [Biopython] Biopython on Jython
In-Reply-To: 
References: <320fb6e00910211555m4379dd34wfecdd9373cfd0498@mail.gmail.com>
	
Message-ID: <320fb6e00910220215u323dd237r28e2b8a15b651e43@mail.gmail.com>

Hi all,

I probably should have started this thread with a more general question,
is anyone other than Kyle interested in running Biopython under Jython?
http://lists.open-bio.org/pipermail/biopython/2009-October/005734.html

Some of the fixes this required are minor things that will also help with
other Python variants like IronPython (e.g. unit tests shouldn't make any
assumptions about the order of dictionary keys), and are worthwhile in
their own right. Others (as discussed below) are less general...
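A small illustration of the dictionary ordering point, as a sketch: count_letters is a made-up stand-in function, not Biopython code. The test compares sorted keys and whole dictionaries, so it does not depend on the order in which a given Python implementation happens to iterate over the keys.

```python
import unittest


def count_letters(sequence):
    """Count how often each letter occurs (stand-in for code under test)."""
    counts = {}
    for letter in sequence:
        counts[letter] = counts.get(letter, 0) + 1
    return counts


class CountLettersTest(unittest.TestCase):
    def test_counts(self):
        counts = count_letters("GATTACA")
        # Fragile alternative (avoid): comparing list(counts) or str(counts)
        # against a fixed string, which bakes in one particular key order.
        self.assertEqual(sorted(counts), ["A", "C", "G", "T"])
        self.assertEqual(counts, {"A": 3, "C": 1, "G": 1, "T": 2})
```

The test case can be run with "python -m unittest" in the usual way.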

On Thu, Oct 22, 2009 at 5:47 AM, Kyle Ellrott  wrote:
>
>> You probably noticed I merged some of your fixes to get (the non C and
>> non NumPy bits of) Biopython to work on Jython, but not all. Could you
>> update your github branch to the trunk at some point? That would help
>> in picking up more of your fixes.
>
> I've tried to keep my branch up to speed with the mainline. But I didn't
> branch my work from master, so it may be harder to extract...

True, but I can probably manage.

>> Many of the issues related to large python methods exceeding JVM size
>> restrictions, something which Jython was going to try and fix in 2.5.1
>> (but didn't seem to be solved in the release candidate I was trying),
>> see e.g. http://bugs.jython.org/issue527524
>> Do you (Kyle) know about more about the Jython plans and if/when they
>> might resolve this? I would prefer to avoid any ugly Jython specific
>> fixes in Biopython - especially if the next release of Jython may
>> resolve many of these points.
>
> One of the main Jython developers pointed this possible solution out to me.
> From his email:
>
>> You may be interested to know that one of the things on my development
>> backlog is to complete a Python bytecode compiler so that we can run
>> arbitrarily long methods. This works because Jython 2.5.0 includes a VM to
>> run Python bytecode (org.python.core.PyBytecode).

That sounds like what I have seen references to online, originally
targeted for Jython 2.5.1 but which seems to have slipped.

>> In a pinch, you could do
>> the same thing too now by creating a .pyc file with CPython instead of the
>> $py.class file, then using "import pycimport" in a startup script to install
>> that as a custom importer. It's not terribly convenient however for
>> distribution, unfortunately.
>
> It sounds like it would make the Jython BioPython code more 'hacky'.

The pycimport thing does sound messy, I agree.

> I managed to isolate all of the 'large method code' that was in BioPython.
> The easiest way to fix those problems was to take large functions and split
> them into 'a', 'b', 'c', etc. functions.

Yes, and for things like the unit tests I don't mind this. For some of the
main code, the fix really didn't help with the readability of the code -
which is why I am hoping the Python bytecode compiler in Jython
happens soon.

> One other side project to watch out for is ctypes for Jython. I've heard
> several of the Jython developers talking about it. And if they get it to
> work, C modules written for python, wrapped with the ctypes module,
> may be able to work in Jython.

That would be good.

Another issue is NumPy on Jython, where even a slow compatibility
library would be useful to us for getting Bio.PDB to work on Jython.
Things like Bio.Cluster, which interface with the NumPy C code, are
of course not so feasible. I noticed you added something to the
Biopython setup.py on your branch to assume NumPy will not be
available under Jython (and not prompt the user about it being
missing). I should merge that into the trunk...
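For illustration only, a sketch of how a setup script might detect Jython and skip the NumPy check. This is not the code from Kyle's branch; the function names are made up, and it relies on the assumption that Jython reports a sys.platform value starting with "java".

```python
import sys


def is_jython():
    """Guess whether we are running under Jython (assumption: sys.platform
    starts with 'java' there, e.g. 'java1.6.0')."""
    return sys.platform.startswith("java")


def numpy_available():
    """Return True if NumPy can be imported; assume it cannot under Jython."""
    if is_jython():
        # No point prompting the user about a missing NumPy on Jython.
        return False
    try:
        import numpy  # noqa: F401 (only testing that the import works)
    except ImportError:
        return False
    return True


print("Jython:", is_jython())
print("NumPy available:", numpy_available())
```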

Peter



From natassa_g_2000 at yahoo.com  Thu Oct 22 09:29:35 2009
From: natassa_g_2000 at yahoo.com (natassa)
Date: Thu, 22 Oct 2009 02:29:35 -0700 (PDT)
Subject: [Biopython] Adaptor trimmer and dimers
In-Reply-To: <20091021123422.GD72523@sobchak.mgh.harvard.edu>
Message-ID: <258333.91161.qm@web52007.mail.re2.yahoo.com>

Hi Brad, 
Thank you very much for your comments! 

Peter had a good suggestion on profiling. The Python profile module
is quick to learn and can quickly point you in the direction of the
most used functions:
http://docs.python.org/library/profile.html


I looked at the profile module, but I am still not sure what input I should give to cProfile (my module name?). It is a syntax comprehension problem for now, but I am sure I'll solve it ;-)
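On the question of what to pass to cProfile, here is a minimal sketch in current Python 3 syntax. The trim_reads function and its data are hypothetical stand-ins for the real script. The profiler is switched on around the call of interest and the report is sorted by cumulative time.

```python
import cProfile
import io
import pstats


def trim_reads(reads, adaptor):
    """Stand-in for the real filtering function (hypothetical)."""
    return [read for read in reads if adaptor not in read]


profiler = cProfile.Profile()
profiler.enable()
kept = trim_reads(["ACGTACGT", "TTTTGGGG", "ACGTTTTT"] * 1000, "GGGG")
profiler.disable()

# Collect the report in a string so it can be printed or saved.
report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(10)
print(report.getvalue())
```

Alternatively a whole script can be profiled without editing it by running "python -m cProfile -s cumulative my_script.py", where my_script.py is a placeholder for your own file name.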


- You are calling the pairwise2 alignment 3 times. You should call
  this once, assign the alignment information to a variable, and then
  perform your if/else tests on that. The updated trimming code above
  is a good example of doing this.
Thanks! I forgot to clean up the code after I sorted out this index error - this was my 'dirty' version from when I was trying to understand the issue.


- You are slicing SeqRecord objects, and then never using the sliced
  records. Your code doesn't look like adaptor trimming, but rather
  filtering out reads without a sequence. If you don't need the
  trimmed record, pass a string (str(rec1.seq) and str(rec2.seq)) to
  the handle_adaptor function instead of the record; the slicing is
  then done on a much simpler object and you avoid the substantial
  overhead of slicing up quality scores that are never used.

Again, not very clean code, as I have been oscillating between trimming and removing for some days now. I finally decided that if I don't have a big proportion of nearly exact (max 2 errors) matches to the adaptor in my reads, I may just discard them, as trimming a 33/37 bp adaptor from a 55 bp read does not leave much anyway.
You were right about passing a string to the function; I had not realised that passing the whole record would be so much heavier. The revised script (for removing, but taking into account all your suggestions, so using the general iterator) still runs for a very long time, unfortunately without a profiler - I need to understand this module better.
Thanks for all suggestions!
Anastasia


Anastasia Gioti
Post-Doc, Evolutionary Biology Department
Uppsala University
Norbyvägen 18D
SE-752 36 UPPSALA
anastasia.gioti at ebc.uu.se
Tel: +46-18-471 2837
Fax: +46-18-471 6310



      


From mavata at gmail.com  Thu Oct 22 09:45:13 2009
From: mavata at gmail.com (Manu Tamminen)
Date: Thu, 22 Oct 2009 12:45:13 +0300
Subject: [Biopython] About BLAST parser
Message-ID: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com>

I have a problem with the Biopython BLAST parser. I'm using the parser  
to extract relevant information from an XML result file into a tab- 
separated table. It seems the XML file occasionally contains errors  
that cause the script to abort. This is especially common and annoying  
with sequence alignments that contain thousands of sequences.

Is it possible to write the script so that when an error occurs, the  
script would jump into the next sequence rather than abort completely?  
I will include below an example of such error. This error is about a  
mismatched tag - sometimes the error has also been about a missing tag.

     for blast_record in blast_records:
   File "/Library/Frameworks/Python.framework/Versions/2.6/lib/ 
python2.6/site-packages/Bio/Blast/NCBIXML.py", line 660, in parse
     expat_parser.Parse(text, True) # End of XML record
xml.parsers.expat.ExpatError: mismatched tag: line 82921, column 4

Any help appreciated! Thanks!
Manu


From biopython at maubp.freeserve.co.uk  Thu Oct 22 09:56:32 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 22 Oct 2009 10:56:32 +0100
Subject: [Biopython] About BLAST parser
In-Reply-To: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com>
References: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com>
Message-ID: <320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com>

On Thu, Oct 22, 2009 at 10:45 AM, Manu Tamminen  wrote:
> I have a problem with the Biopython BLAST parser. I'm using the parser to
> extract relevant information from an XML result file into a tab-separated
> table. It seems the XML file occasionally contains errors that cause the
> script to abort. This is especially common and annoying with sequence
> alignments that contain thousands of sequences.
>
> Is it possible to write the script so that when an error occurs, the script
> would jump into the next sequence rather than abort completely? I will
> include below an example of such error. This error is about a mismatched tag
> - sometimes the error has also been about a missing tag.
>
>    for blast_record in blast_records:
>  File
> "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/Bio/Blast/NCBIXML.py",
> line 660, in parse
>    expat_parser.Parse(text, True) # End of XML record
> xml.parsers.expat.ExpatError: mismatched tag: line 82921, column 4

XML is a strict file format in which every opening tag must have
a matching closing tag. If the XML file is truncated or something,
you can have mismatched tags (e.g. an opening tag without its
closing tag), which means the XML file is invalid. This is basically
what that error message is about.
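The error can be reproduced on a tiny example with the expat parser from the Python standard library, which is the library named in the traceback above. This is only a sketch: the tag names hits and hit are made up for illustration and are not real BLAST XML tags.

```python
from xml.parsers import expat


def check_xml(text):
    """Return None if text is well-formed XML, otherwise the expat error."""
    parser = expat.ParserCreate()
    try:
        parser.Parse(text, True)  # True marks the end of the document
    except expat.ExpatError as err:
        return err
    return None


good = "<hits>\n<hit>example</hit>\n</hits>\n"
bad = "<hits>\n<hit>example\n</hits>\n"  # the closing </hit> is missing

print(check_xml(good))
error = check_xml(bad)
print(error)
print("Problem found on line", error.lineno)
```

The line number refers to where the parser noticed the problem, which can be some way after the place where the file was actually damaged.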

I can make some suggestions that may help, but first, are you
running BLAST locally or online? Are you saving the results to
a file, or parsing directly from the handle? How many query
sequences do you have?

Peter



From mavata at gmail.com  Thu Oct 22 10:06:47 2009
From: mavata at gmail.com (Manu Tamminen)
Date: Thu, 22 Oct 2009 13:06:47 +0300
Subject: [Biopython] About BLAST parser
In-Reply-To: <320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com>
References: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com>
	<320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com>
Message-ID: 


Hi Peter! Thanks for your prompt reply! I've run the BLAST analysis on  
a supercomputer cluster, saved the results into a XML file and then  
transferred the output file to my computer. I then run the script on  
my computer to parse the results into a tab separated file. With the  
current dataset I have 1115 sequences of around 500 bp each.
Manu

On Oct 22, 2009, at 12:56 PM, Peter wrote:

> On Thu, Oct 22, 2009 at 10:45 AM, Manu Tamminen   
> wrote:
>> I have a problem with the Biopython BLAST parser. I'm using the  
>> parser to
>> extract relevant information from an XML result file into a tab- 
>> separated
>> table. It seems the XML file occasionally contains errors that  
>> cause the
>> script to abort. This is especially common and annoying with sequence
>> alignments that contain thousands of sequences.
>>
>> Is it possible to write the script so that when an error occurs,  
>> the script
>> would jump into the next sequence rather than abort completely? I  
>> will
>> include below an example of such error. This error is about a  
>> mismatched tag
>> - sometimes the error has also been about a missing tag.
>>
>>    for blast_record in blast_records:
>>  File
>> "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/ 
>> site-packages/Bio/Blast/NCBIXML.py",
>> line 660, in parse
>>    expat_parser.Parse(text, True) # End of XML record
>> xml.parsers.expat.ExpatError: mismatched tag: line 82921, column 4
>
> XML is a strict file format in which every opening tag must have
> a matching closing tag. If the XML file is truncated or something,
> you can have mismatched tags (e.g. an opening tag without its
> closing tag), which means the XML file is invalid. This is basically
> what that error message is about.
>
> I can make some suggestions that may help, but it first are you
> running BLAST locally or online? Are you saving the results to
> a file, or parsing directly from the handle? How many query
> sequences do you have?
>
> Peter


---
Manu Tamminen, M.Sc.
University of Helsinki
Department of Applied Chemistry and Microbiology, Division of  
Microbiology
P.O. Box 56
00014 HELSINKI
FINLAND

tel: +358 (0)9191 57585
fax:  +358 (0)9191 59322
e-mail: manu.tamminen at helsinki.fi
home: http://www.mm.helsinki.fi/~mvtammin/



From biopython at maubp.freeserve.co.uk  Thu Oct 22 10:19:02 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 22 Oct 2009 11:19:02 +0100
Subject: [Biopython] About BLAST parser
In-Reply-To: 
References: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com>
	<320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com>
	
Message-ID: <320fb6e00910220319t55c7401bgc31f1502db1a639b@mail.gmail.com>

On Thu, Oct 22, 2009 at 11:06 AM, Manu Tamminen  wrote:
>
> Hi Peter! Thanks for your prompt reply! I've run the BLAST analysis on a
> supercomputer cluster, saved the results into a XML file and then
> transferred the output file to my computer. I then run the script on my
> computer to parse the results into a tab separated file. With the current
> dataset I have 1115 sequences of around 500 bp each.
> Manu

Based on the Biopython error message, I suspect your XML file is
broken. How big is the XML file (in MB)? There are online XML
validation tools, but uploading a large file is out of the question.
You could also open
the file in a suitable editor, go to the line number given in the Biopython
error message, and look at the file by eye to see if there is anything
obvious.

It is possible that the XML file was corrupted when you copied it to
your local machine (e.g. a network error). You could try zipping it
up, and then copying it again. It is also possible that the XML file
was corrupted on the disk on the cluster (rare, but this can happen).
In this case you might be able to fix the XML by hand, or re-run it.

Alternatively, it is possible that the file is valid, and the Biopython parser
(or the Python library we use internally) has a bug. As long as the
XML file isn't too big (say 10MB), you could email it to me personally
(NOT the mailing list) and I can try and have a look at it.

Personally, I would break up the task into jobs (maybe six jobs of
up to 200 sequences each - or even one sequence per job). On
most clusters this is a good idea anyway, as they can then be
handled by different cluster nodes. For the analysis, you just have
to parse the separate XML files. Any corrupted XML file will then
only affect a few sequences, and checking it or re-running it is
going to be much quicker and easier.
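Splitting the queries into jobs can be sketched with a small generator. The names below are placeholders; in practice the records would be sequence entries read from the query file rather than strings.

```python
def batch_iterator(records, batch_size):
    """Yield lists of up to batch_size records from any iterable."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Final, possibly shorter, batch.
        yield batch


# 1115 placeholder query names, standing in for the real sequences.
queries = ["seq%04d" % i for i in range(1, 1116)]
batches = list(batch_iterator(queries, 200))
print(len(batches), len(batches[-1]))  # 6 115
```

Each batch would then be written to its own query file and submitted as a separate BLAST job, giving one XML output file per batch.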

Peter


From mavata at gmail.com  Thu Oct 22 10:34:55 2009
From: mavata at gmail.com (Manu Tamminen)
Date: Thu, 22 Oct 2009 13:34:55 +0300
Subject: [Biopython] About BLAST parser
In-Reply-To: <320fb6e00910220319t55c7401bgc31f1502db1a639b@mail.gmail.com>
References: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com>
	<320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com>
	
	<320fb6e00910220319t55c7401bgc31f1502db1a639b@mail.gmail.com>
Message-ID: <69C1EBC7-8A0E-47EE-910C-478BE263125C@gmail.com>


With all blast hits included, the output file is around 1 gigabyte.  
Therefore just opening and searching for the broken parts is  
challenging with regular text editors. Furthermore, I'm not very  
familiar with XML syntax and therefore would probably not recognize  
the broken parts.

Breaking down the search into smaller parts sounds like a good idea.  
However, I'm also considering writing a more robust script. Would it  
be possible to make the script ignore the broken entries in the XML  
file and skip into next correct one?

On Oct 22, 2009, at 1:19 PM, Peter wrote:

> On Thu, Oct 22, 2009 at 11:06 AM, Manu Tamminen   
> wrote:
>>
>> Hi Peter! Thanks for your prompt reply! I've run the BLAST analysis  
>> on a
>> supercomputer cluster, saved the results into a XML file and then
>> transferred the output file to my computer. I then run the script  
>> on my
>> computer to parse the results into a tab separated file. With the  
>> current
>> dataset I have 1115 sequences of around 500 bp each.
>> Manu
>
> Based on the Biopython error message, I suspect your XML file is
> broken. How big is the XML file (MB). There are online tools for this,
> but uploading a large file is out of the question. You could also open
> the file in a suitable editor, go to the line number given in the  
> Biopython
> error message, and look at the file by eye to see if there is anything
> obvious.
>
> It is possible that the XML file was corrupted when you copied it to
> your local machine (e.g. a network error). You could try zipping it
> up, and then copying it again. It is also possible that the XML file
> was corrupted on the disk on the cluster (rare, but this can happen).
> In this case you might be able to fix the XML by hand, or re-run it.
>
> Alternatively, it is possible that the file is valid, and the  
> Biopython parser
> (or the Python library we use internally) has a bug. As long as the
> XML file isn't too big (say 10MB), you could email it to me personally
> (NOT the mailing list) and I can try and have a look at it.
>
> Personally, I would break up the task into jobs (maybe six jobs of
> up to 200 sequences each - or even one sequence per job). On
> most clusters this is a good idea anyway, as they can then be
> handled by different cluster nodes. For the analysis, you just have
> to parse the separate XML files. Any corrupted XML file will then
> only affect a few sequences, and checking it or re-running it is
> going to be much quicker and easier.
>
> Peter


---
Manu Tamminen, M.Sc.
University of Helsinki
Department of Applied Chemistry and Microbiology, Division of  
Microbiology
P.O. Box 56
00014 HELSINKI
FINLAND

tel: +358 (0)9191 57585
fax:  +358 (0)9191 59322
e-mail: manu.tamminen at helsinki.fi
home: http://www.mm.helsinki.fi/~mvtammin/



From biopython at maubp.freeserve.co.uk  Thu Oct 22 10:51:45 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 22 Oct 2009 11:51:45 +0100
Subject: [Biopython] About BLAST parser
In-Reply-To: <69C1EBC7-8A0E-47EE-910C-478BE263125C@gmail.com>
References: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com>
	<320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com>
	
	<320fb6e00910220319t55c7401bgc31f1502db1a639b@mail.gmail.com>
	<69C1EBC7-8A0E-47EE-910C-478BE263125C@gmail.com>
Message-ID: <320fb6e00910220351g3a10ce91mef097ceaf0ab5d0@mail.gmail.com>

On Thu, Oct 22, 2009 at 11:34 AM, Manu Tamminen  wrote:
>
> With all blast hits included, the output file is around 1 gigabyte.
> Therefore just opening and searching for the broken parts is challenging
> with regular text editors. Furthermore, I'm not very familiar with XML
> syntax and therefore would probably not recognize the broken parts.

There is probably a neat way to extract a chunk using Unix command
line tools. Or just try something like this in Python:

error_line = 82921
input_handle = open("really_big.xml")
output_handle = open("fragment.txt", "w")
for line_number, line in enumerate(input_handle) :
    if error_line - 1000 < line_number and line_number < error_line + 1000 :
        output_handle.write(line)
input_handle.close()
output_handle.close()

I would still suggest you re-try copying it from the cluster to your
machine, in case it was just a network error corrupting the file.

> Breaking down the search into smaller parts sounds like a good idea.
> However, I'm also considering writing a more robust script. Would it be
> possible to make the script ignore the broken entries in the XML file and
> skip into next correct one?

I think that will be tricky. Part of the idea of XML is that it is a
strictly defined file format, with standards about how to interpret
data and to abort on bad data. Tolerant XML parsers are considered to
be a bad thing.

What should be possible is a simple script that removes the broken
section of the file, giving a (partial) but valid XML file covering most
of the sequences. It might be more effort than just re-doing the search
(in parts this time).
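If the search is re-done in parts, a quick check of each output file before parsing shows which parts need re-running. A sketch using only the standard library follows; the function name and the file names in the usage note are hypothetical.

```python
from xml.parsers import expat


def find_broken_files(filenames):
    """Return (filename, error message) pairs for files that are not
    well-formed XML. Files are streamed, so large files are fine."""
    broken = []
    for filename in filenames:
        parser = expat.ParserCreate()
        try:
            with open(filename, "rb") as handle:
                parser.ParseFile(handle)
        except expat.ExpatError as err:
            broken.append((filename, str(err)))
    return broken
```

For example, find_broken_files(["part1.xml", "part2.xml"]) would return an empty list if both files are well-formed. Only the files it reports need to be copied again or re-run; the rest can go to the Biopython BLAST XML parser as usual. Note this only checks that the XML is well-formed, not that the BLAST results are complete.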

Peter


From mavata at gmail.com  Thu Oct 22 11:10:11 2009
From: mavata at gmail.com (Manu Tamminen)
Date: Thu, 22 Oct 2009 14:10:11 +0300
Subject: [Biopython] About BLAST parser
In-Reply-To: <320fb6e00910220351g3a10ce91mef097ceaf0ab5d0@mail.gmail.com>
References: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com>
	<320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com>
	
	<320fb6e00910220319t55c7401bgc31f1502db1a639b@mail.gmail.com>
	<69C1EBC7-8A0E-47EE-910C-478BE263125C@gmail.com>
	<320fb6e00910220351g3a10ce91mef097ceaf0ab5d0@mail.gmail.com>
Message-ID: <1BC8860E-E03C-4654-A012-CBA9CD889370@gmail.com>

Thanks very much for your help and suggestions! I think I'll manage  
from here on!
Manu

On Oct 22, 2009, at 1:51 PM, Peter wrote:

> On Thu, Oct 22, 2009 at 11:34 AM, Manu Tamminen   
> wrote:
>>
>> With all blast hits included, the output file is around 1 gigabyte.
>> Therefore just opening and searching for the broken parts is  
>> challenging
>> with regular text editors. Furthermore, I'm not very familiar with  
>> XML
>> syntax and therefore would probably not recognize the broken parts.
>
> There is probably a neat way to extract a chunk using Unix command
> line tools. Or just try something like this in Python:
>
> error_line = 82921
> input_handle = open("really_big.xml")
> output_handle = open("fragment.txt", "w")
> for line_number, line in enumerate(input_handle) :
>    if error_line - 1000 < line_number and line_number < error_line + 1000 :
>        output_handle.write(line)
> input_handle.close()
> output_handle.close()
>
> I would still suggest you re-try copying it from the cluster to your
> machine, in case it was just a network error corrupting the machine.
>
>> Breaking down the search into smaller parts sounds like a good idea.
>> However, I'm also considering writing a more robust script. Would  
>> it be
>> possible to make the script ignore the broken entries in the XML  
>> file and
>> skip into next correct one?
>
> I think that will be tricky. Part of idea about XML is it is a  
> strictly defined
> file format where there are standards about how to interpret and abort
> with bad data. Tolerant XML parsers are considered to be a bad thing.
>
> What should be possible is a simple script that removes the broken
> section of the file, giving a (partial) but valid XML file covering  
> most
> of the sequences. It might be more effort than just re-doing the  
> search
> (in parts this time).
>
> Peter



From biopython at maubp.freeserve.co.uk  Thu Oct 22 11:13:22 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 22 Oct 2009 12:13:22 +0100
Subject: [Biopython] About BLAST parser
In-Reply-To: <1BC8860E-E03C-4654-A012-CBA9CD889370@gmail.com>
References: <751C56D3-48C4-43F4-876B-BE2C023DAC48@gmail.com>
	<320fb6e00910220256j2c8e3399s19019d617bfa0420@mail.gmail.com>
	
	<320fb6e00910220319t55c7401bgc31f1502db1a639b@mail.gmail.com>
	<69C1EBC7-8A0E-47EE-910C-478BE263125C@gmail.com>
	<320fb6e00910220351g3a10ce91mef097ceaf0ab5d0@mail.gmail.com>
	<1BC8860E-E03C-4654-A012-CBA9CD889370@gmail.com>
Message-ID: <320fb6e00910220413h107142fdn992cc149e9afc099@mail.gmail.com>

On Thu, Oct 22, 2009 at 12:10 PM, Manu Tamminen  wrote:
>
> Thanks very much for your help and suggestions! I think I'll manage from
> here on!
> Manu

Good luck,

Peter


From biopython at maubp.freeserve.co.uk  Thu Oct 22 11:38:46 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 22 Oct 2009 12:38:46 +0100
Subject: [Biopython] Biopython on Jython
In-Reply-To: <320fb6e00910220215u323dd237r28e2b8a15b651e43@mail.gmail.com>
References: <320fb6e00910211555m4379dd34wfecdd9373cfd0498@mail.gmail.com>
	
	<320fb6e00910220215u323dd237r28e2b8a15b651e43@mail.gmail.com>
Message-ID: <320fb6e00910220438o1f6363a5mb82b00d967491617@mail.gmail.com>

On Thu, Oct 22, 2009 at 10:15 AM, Peter  wrote:
> Hi all,
>
> On Thu, Oct 22, 2009 at 5:47 AM, Kyle Ellrott  wrote:
>>> You probably noticed I merged some of your fixes to get (the non C and
>>> non NumPy bits of) Biopython to work on Jython, but not all. Could you
>>> update your github branch to the trunk at some point? That would help
>>> in picking up more of your fixes.
>>
>> I've tried to keep my branch up to speed with the mainline. But I didn't
>> branch my work from master, so it may be harder to extract...
>
> True, but I can probably manage.

Thanks for updating your branch to the trunk.

I've grabbed the BLAST XML fix (and tweaked it) - thanks.

I also made test_Entrez.py get skipped on Jython (although
I just reused the missing dependency trick). See:
http://bugzilla.open-bio.org/show_bug.cgi?id=2918
http://bugs.jython.org/issue1447

>>> Many of the issues related to large python methods exceeding JVM size
>>> restrictions, something which Jython was going to try and fix in 2.5.1
>>> (but didn't seem to be solved in the release candidate I was trying),
>>> see e.g. http://bugs.jython.org/issue527524
>>> ...

This single issue covers the remaining test failures, and persists
on Jython 2.5.1 (final). They may solve it in the next release,
or I can look again at the work arounds on your branch.

We must of course skip anything requiring C code, or NumPy,
but most of Biopython is looking pretty good on Jython now.
Good work Kyle :)

Peter



From mikelisanke at gmail.com  Thu Oct 22 18:19:42 2009
From: mikelisanke at gmail.com (Mike Lisanke)
Date: Thu, 22 Oct 2009 14:19:42 -0400
Subject: [Biopython] Windows installer does not find Python 2.63 with
	multiple pythons
In-Reply-To: <320fb6e00910191429l481223d8iae9fbe78702fff70@mail.gmail.com>
References: <8c5c6e580910191237w54e857bfs1107ba86df05db7c@mail.gmail.com>
	<320fb6e00910191429l481223d8iae9fbe78702fff70@mail.gmail.com>
Message-ID: <8c5c6e580910221119o1d36426dkfdb2d96d681828c9@mail.gmail.com>

Peter,

The problem was python-2.6.3-amd64, for which there is a
numpy-1.3.0-win-amd64-py2.6 installer. Is there a reason NumPy and BioPython
have a specific dependency to work with the AMD64 build of python?

I had assumed python would be considered the runtime environment for numpy
and biopython and the dependency would only be language level. It's
disappointing to think these problems are only caused by registry check
dependencies in the windows installers of these applications. Thanks.

On Mon, Oct 19, 2009 at 5:29 PM, Peter wrote:

> On Mon, Oct 19, 2009 at 8:37 PM, Mike Lisanke 
> wrote:
> > I had Python 3.0 installed prior to attempting a bio-python install. I
> > installed Python 2.6 to its own directory, and a proper registry entry
> was
> > made in HKEY_LOCAL_MACHINE\SOFTWARE\Python, however;
> > the bio-python can not find the Python 2.6 install. Is there a problem
> > having multiple python installs? Thanks.
>
> On my Windows machine I have Python 2.4, 2.5 and 2.6 all co-existing
> fine (and I used to have 2.3 as well). These were all default installs to
> C:\Python26 etc, and I didn't have to do anything funny to the registry.
> I can try and remember to check the registry settings on my machine
> if you like... but for now I can only suggest you might try uninstalling
> Python 2.6, perhaps clean the registry, and then reinstall Python 2.6.
>
> Peter
>
> P.S.
>
> I haven't tried putting Python 3.0 on my Windows machine (not that
> I would bother, I would go straight to Python 3.1 now that it is out).
>



-- 
Best regards,

Mike


From biopython at maubp.freeserve.co.uk  Thu Oct 22 19:45:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 22 Oct 2009 20:45:04 +0100
Subject: [Biopython] Windows installer does not find Python 2.63 with
	multiple pythons
In-Reply-To: <8c5c6e580910221119o1d36426dkfdb2d96d681828c9@mail.gmail.com>
References: <8c5c6e580910191237w54e857bfs1107ba86df05db7c@mail.gmail.com>
	<320fb6e00910191429l481223d8iae9fbe78702fff70@mail.gmail.com>
	<8c5c6e580910221119o1d36426dkfdb2d96d681828c9@mail.gmail.com>
Message-ID: <320fb6e00910221245w3eff78dcj54aa3eb57990300a@mail.gmail.com>

On Thu, Oct 22, 2009 at 7:19 PM, Mike Lisanke  wrote:
> Peter,
>
> The problem was python-2.6.3-amd64 for which there is a
> numpy-1.3.0-win-amd64-py2.6 installer. Is there a reason
> NumPy and BioPython have a specific dependency to
> work with the AMD64 build of python?

Are you running on 64 bit Windows then? XP or Vista?

It sounds like you are trying to mix 32 and 64 bit versions
of Python.

If you installed the 64 bit version of Python and Numpy,
then you will need a 64 bit compiled version of Biopython
too - but we don't have one of those yet. We'd need a
developer or a volunteer with a 64bit Windows machine
to do this.

You should be able to install a 32 bit version of Python,

http://python.org/ftp/python/2.6.3/python-2.6.3.msi

plus the 32 bit Windows installer for Numpy:

http://sourceforge.net/projects/numpy/files/NumPy/1.3.0/numpy-1.3.0-win32-superpack-python2.6.exe/download

and the 32 bit Windows installer for Biopython:

http://biopython.org/DIST/biopython-1.52.win32-py2.6.exe

(i.e. look for win32 in the filenames, not amd64).
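
If in doubt about which build a given interpreter is, a minimal check
using only the standard library could look like the sketch below. The
helper name python_bitness is made up for illustration, and the code is
written for a current Python 3.

```python
import platform
import struct


def python_bitness():
    """Return 32 or 64 depending on the pointer size of this interpreter."""
    # struct.calcsize("P") is the size in bytes of a C pointer (void *),
    # which is 4 on a 32 bit build and 8 on a 64 bit build.
    return struct.calcsize("P") * 8


if __name__ == "__main__":
    print("Python %s is a %d bit build"
          % (platform.python_version(), python_bitness()))
```

The installers for Python, NumPy and Biopython would all need to match
whatever this reports.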

Peter


From michael.koeris at gmail.com  Fri Oct 23 00:56:16 2009
From: michael.koeris at gmail.com (Michael S. Koeris)
Date: Thu, 22 Oct 2009 20:56:16 -0400
Subject: [Biopython] Querying NCBI
Message-ID: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com>

I don't know if it's the servers today but when I ran this query as a  
regular efetch with 80+ gi numbers it ran for 30+min before i stopped it

handle = Entrez.efetch(db='nucleotide',id=AccNo,rettype='gb')

anyone else experiencing problems?

I also noted that my outbound packet rate dropped to about 4kbp


From mhdhussain at gmail.com  Fri Oct 23 02:16:03 2009
From: mhdhussain at gmail.com (M. Hussain)
Date: Fri, 23 Oct 2009 13:16:03 +1100
Subject: [Biopython] Python Codes for 3rd codon position
Message-ID: 

Hi,



I wonder if anybody could help to write a program to read a file in and
print out the third codon position of two aligned sequences



Thanks


From biopython at maubp.freeserve.co.uk  Fri Oct 23 09:04:56 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 23 Oct 2009 10:04:56 +0100
Subject: [Biopython] Python Codes for 3rd codon position
In-Reply-To: 
References: 
Message-ID: <320fb6e00910230204x3f82a950ieeea1fe4a2b14bad@mail.gmail.com>

On Fri, Oct 23, 2009 at 3:16 AM, M. Hussain  wrote:
> Hi,
>
> I wonder if anybody could help to write a program to read a file in and
> print out the third codon position of two aligned sequences
>
> Thanks

Could you explain in a little more detail what you want to do?
Are your two sequences already aligned? Are there gaps in
the alignment? Showing an example alignment and the data
you want would help greatly.
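
For the simplest possible case, here is a minimal sketch. It assumes the
two sequences are already aligned, contain no gaps, and start in frame
at the first base of a codon; none of that is confirmed by the question.
The function name and the two example sequences are made up.

```python
def third_codon_positions(seq):
    """Return every third letter of an in-frame, ungapped coding sequence."""
    # Index 2 is the third base of the first codon; a step of 3 then
    # moves on one codon at a time.
    return seq[2::3]


# Two made-up, already aligned, gap-free sequences for illustration.
seq_a = "ATGGCCAAA"
seq_b = "ATGGCTAAG"
for name, seq in (("a", seq_a), ("b", seq_b)):
    print(name, third_codon_positions(seq))
```

If the alignment does contain gaps, a plain slice like this no longer
tracks codon positions and the gapped columns would need handling first.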

Regards,

Peter


From biopython at maubp.freeserve.co.uk  Fri Oct 23 09:08:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 23 Oct 2009 10:08:06 +0100
Subject: [Biopython] Querying NCBI
In-Reply-To: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com>
References: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com>
Message-ID: <320fb6e00910230208l6fdbb7e3i44083a9d7db14ecf@mail.gmail.com>

On Fri, Oct 23, 2009 at 1:56 AM, Michael S. Koeris
 wrote:
> I don't know if it's the servers today but when I ran this query as a
> regular efetch with 80+ gi numbers it ran for 30+min before i stopped it
>
> handle = Entrez.efetch(db='nucleotide',id=AccNo,rettype='gb')
>
> anyone else experiencing problems?

I was asleep, so no ;)

Are you sending one single efetch call with 80+ GI numbers, or
are you sending 80+ individual efetch calls, or something in
between? That may make a difference.
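
For the "something in between" option, a rough sketch of grouping the
identifiers into comma separated batches follows. The helper name and
the batch size of 20 are arbitrary choices for illustration; each
yielded string is meant to be passed as the id argument of one efetch
call.

```python
def batched_ids(ids, batch_size=20):
    """Yield comma separated ID strings, at most batch_size IDs in each."""
    ids = [str(i) for i in ids]
    for start in range(0, len(ids), batch_size):
        # One string per batch, e.g. "111,222,333"
        yield ",".join(ids[start:start + batch_size])


# Hypothetical identifiers, just to show the grouping.
for id_string in batched_ids(range(1, 6), batch_size=2):
    print(id_string)
```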

> I also noted that my outbound packet rate dropped to about 4kbp

That suggests a local network issue.

Did you include your email address as the NCBI request?
If they have blocked or throttled your access (if they felt it
was excessive), I would expect them to email you about it.

Peter


From chapmanb at 50mail.com  Fri Oct 23 12:28:43 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 23 Oct 2009 08:28:43 -0400
Subject: [Biopython] Adaptor trimmer and dimers
In-Reply-To: <258333.91161.qm@web52007.mail.re2.yahoo.com>
References: <20091021123422.GD72523@sobchak.mgh.harvard.edu>
	<258333.91161.qm@web52007.mail.re2.yahoo.com>
Message-ID: <20091023122843.GJ72523@sobchak.mgh.harvard.edu>

Hi Anastasia;

> Again, not very clean code as I have been oscillating between
> trimming/removing for some days now. I finally decided that if I
> don't have a big proportion of nearly exact (max 2 errors) matches
> to the adaptor in my reads, I may just discard them, as trimming a
> 33/37 bp adaptor from a 55-bp read does not leave much anyway. 
>
> The revised script
> (for removing, but taking into account all your suggestions, so using
> the general iterator) is still running for very long, 

This was written with the idea that the adaptor would be present in
most of the sequences. This was the case with the data I was using
it on -- expression profiling with short tags -- but does not sound
like what you are tackling here. My approach speeds up the trimming
by avoiding doing local alignments for many reads since an exact
match is often found. Only in cases where the adaptor is missing or
has one or more sequencing errors does the expensive local alignment
need to be done.

If most reads do not have adaptors, then this approach is
algorithmically slow. Doing a local alignment for nearly every read
is going to take time, independent of the implementation. Profiling
this should reveal most of the time is spent in pairwise alignment.

My suggestion would be to use a heuristic seed-based approach similar to
what short query aligners do:

- Break your adaptor into three smaller seed regions of 12bp
- For each read:
  - Do a fast string find() with the seed regions to the read
  - If two or more of the seed regions match exactly, discard the
    read

This will run much quicker and should catch a majority of the cases
where your reads contain the adaptor. Regions with lots of errors, or
errors spaced evenly through the adaptor, will be missed. Making the
code tractable is probably worth the few that you'll let through.
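
The steps above could be sketched roughly as follows. This is only an
illustration under stated assumptions: the function names, the example
adaptor and the reads are all made up, the seeds are taken as three
non-overlapping 12bp windows from the start of the adaptor, and only
exact matches on the given strand are considered.

```python
def split_into_seeds(adaptor, seed_length=12, count=3):
    """Cut the adaptor into `count` non-overlapping seeds of `seed_length`."""
    seeds = [adaptor[i * seed_length:(i + 1) * seed_length]
             for i in range(count)]
    # Ignore a short trailing piece if the adaptor is too short for all seeds.
    return [seed for seed in seeds if len(seed) == seed_length]


def has_adaptor(read, seeds, min_hits=2):
    """Return True if at least `min_hits` seeds occur exactly in the read."""
    hits = sum(1 for seed in seeds if read.find(seed) != -1)
    return hits >= min_hits


# Made-up 37bp adaptor and reads, purely for illustration.
adaptor = "ACGTACGTACGTTTTTGGGGCCCCAAAACCCCGGGGT"
seeds = split_into_seeds(adaptor)
reads = ["GGG" + adaptor + "CATCAT", "GATTACA" * 8]
kept = [read for read in reads if not has_adaptor(read, seeds)]
print("%d of %d reads kept" % (len(kept), len(reads)))
```

Raising or lowering min_hits trades missed adaptors against reads
discarded by chance.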

Hope this helps,
Brad


From michael.koeris at gmail.com  Fri Oct 23 13:11:45 2009
From: michael.koeris at gmail.com (Michael S. Koeris)
Date: Fri, 23 Oct 2009 09:11:45 -0400
Subject: [Biopython] Querying NCBI
In-Reply-To: <320fb6e00910230208l6fdbb7e3i44083a9d7db14ecf@mail.gmail.com>
References: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com>
	<320fb6e00910230208l6fdbb7e3i44083a9d7db14ecf@mail.gmail.com>
Message-ID: <0D6CA88F-775A-4061-8EA2-7A56FCAE6461@gmail.com>

I am submitting 80 single queries - alternatively i can batch them but  
then when I try to parse them out from the records object I get:

 >>> records
>

I don't know if this is a different object because it's batched

 >>> parser = GenBank.RecordParser()
 >>> recordGenBank = parser.parse(records)
Traceback (most recent call last):
   File "", line 1, in 
   File "/Library/Frameworks/Python.framework/Versions/2.6/lib/ 
python2.6/site-packages/Bio/GenBank/__init__.py", line 172, in parse
     self._scanner.feed(handle, self._consumer)
   File "/Library/Frameworks/Python.framework/Versions/2.6/lib/ 
python2.6/site-packages/Bio/GenBank/Scanner.py", line 380, in feed
     misc_lines, sequence_string = self.parse_footer()
   File "/Library/Frameworks/Python.framework/Versions/2.6/lib/ 
python2.6/site-packages/Bio/GenBank/Scanner.py", line 762, in  
parse_footer
     raise ValueError("Premature end of file in sequence data")
ValueError: Premature end of file in sequence data



--
Michael S. Koeris
michael.koeris at gmail.com

On Oct 23, 2009, at 5:08 AM, Peter wrote:

> On Fri, Oct 23, 2009 at 1:56 AM, Michael S. Koeris
>  wrote:
>> I don't know if it's the servers today but when I ran this query as a
>> regular efetch with 80+ gi numbers it ran for 30+min before i  
>> stopped it
>>
>> handle = Entrez.efetch(db='nucleotide',id=AccNo,rettype='gb')
>>
>> anyone else experiencing problems?
>
> I was asleep, so no ;)
>
> Are you sending one single efetch call with 80+ GI numbers, or
> are you sending 80+ individual efetch calls, or something in
> between? That may make a difference.
>
>> I also noted that my outbound packet rate dropped to about 4kbp
>
> That suggests a local network issue.
>
> Did you include your email address as the NCBI request?
> If they have blocked or throttled your access (if they felt it
> was excessive), I would expect them to email you about it.
>
> Peter



From biopython at maubp.freeserve.co.uk  Fri Oct 23 14:33:35 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 23 Oct 2009 15:33:35 +0100
Subject: [Biopython] Querying NCBI
In-Reply-To: <0D6CA88F-775A-4061-8EA2-7A56FCAE6461@gmail.com>
References: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com>
	<320fb6e00910230208l6fdbb7e3i44083a9d7db14ecf@mail.gmail.com>
	<0D6CA88F-775A-4061-8EA2-7A56FCAE6461@gmail.com>
Message-ID: <320fb6e00910230733m726d573bhc0d052ff143e6c7e@mail.gmail.com>

On Fri, Oct 23, 2009 at 2:11 PM, Michael S. Koeris
 wrote:
>
> I am submitting 80 single queries - alternatively i can batch them but then
> when I try to parse them out from the records object I get:
>
>>>> records
> >

That looks like records is a URL handle object - probably you've
mixed up your variable names.

> I don't know if this is a different object because it's batched
>
>>>> parser = GenBank.RecordParser()
>>>> recordGenBank = parser.parse(records)
> Traceback (most recent call last):
> ...
> line 762, in parse_footer
>    raise ValueError("Premature end of file in sequence data")
> ValueError: Premature end of file in sequence data

That suggests either a parser bug, or simply a network error meaning
the file was truncated.

As you are trying to download 80 queries, I would strongly recommend
you download them directly to files, and then parse the files. This also
means you'll only need to do the downloading once as you work on
the rest of the script (whatever you are trying to do with the data).

Peter



From biopython at maubp.freeserve.co.uk  Fri Oct 23 14:43:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 23 Oct 2009 15:43:40 +0100
Subject: [Biopython] Querying NCBI
In-Reply-To: <8F8DA214-8AC3-41B0-9D38-2AB7864205DC@gmail.com>
References: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com>
	<320fb6e00910230208l6fdbb7e3i44083a9d7db14ecf@mail.gmail.com>
	<0D6CA88F-775A-4061-8EA2-7A56FCAE6461@gmail.com>
	<320fb6e00910230733m726d573bhc0d052ff143e6c7e@mail.gmail.com>
	<8F8DA214-8AC3-41B0-9D38-2AB7864205DC@gmail.com>
Message-ID: <320fb6e00910230743k46b301ebl85189c5434a65b82@mail.gmail.com>

On Fri, Oct 23, 2009 at 3:35 PM, Michael S. Koeris
 wrote:
>
> That's a good idea how do I do that though?

Something like this:

from Bio import Entrez
Entrez.email = "michael.koeris at gmail.com"
gi = "12345678"
out_handle = open("%s.gbk" % gi, "w")
network_handle = Entrez.efetch(db="nucleotide", id=gi, rettype="gb")
for line in network_handle : out_handle.write(line)
out_handle.close()
network_handle.close()

Stick that in a for loop if you want a separate file for each record.
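
As a rough sketch of that loop, written for a current Python 3 rather
than the Python 2 of the snippet above: the function name
download_records and its fetch argument are made up for illustration.
Here fetch is assumed to be any callable that takes one identifier and
returns an open handle yielding text lines. Files that already exist
are skipped, so the download only happens once.

```python
import os


def download_records(ids, fetch, out_dir="."):
    """Save one .gbk file per ID, skipping IDs that were already saved."""
    filenames = []
    for gi in ids:
        filename = os.path.join(out_dir, "%s.gbk" % gi)
        if not os.path.isfile(filename):
            # Write to a temporary name first, so an interrupted download
            # is not mistaken for a complete file on the next run.
            partial = filename + ".part"
            network_handle = fetch(gi)
            try:
                with open(partial, "w") as out_handle:
                    for line in network_handle:
                        out_handle.write(line)
            finally:
                network_handle.close()
            os.replace(partial, filename)
        filenames.append(filename)
    return filenames
```

With Biopython, fetch could be something along the lines of
lambda gi: Entrez.efetch(db="nucleotide", id=gi, rettype="gb"), after
setting Entrez.email as above, assuming the returned handle gives text
rather than bytes.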

Is the Biopython tutorial not clear enough on this?

Peter


From biopython at maubp.freeserve.co.uk  Fri Oct 23 14:56:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 23 Oct 2009 15:56:18 +0100
Subject: [Biopython] Querying NCBI
In-Reply-To: 
References: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com>
	<320fb6e00910230208l6fdbb7e3i44083a9d7db14ecf@mail.gmail.com>
	<0D6CA88F-775A-4061-8EA2-7A56FCAE6461@gmail.com>
	<320fb6e00910230733m726d573bhc0d052ff143e6c7e@mail.gmail.com>
	<8F8DA214-8AC3-41B0-9D38-2AB7864205DC@gmail.com>
	<320fb6e00910230743k46b301ebl85189c5434a65b82@mail.gmail.com>
	
Message-ID: <320fb6e00910230756t22d54402p5dba99a2c689e521@mail.gmail.com>

On Fri, Oct 23, 2009 at 3:48 PM, Michael S. Koeris
 wrote:
>
> Thanks much!
>
> The tutorial actually just mentions parsing out from direct queries on page
> 91. Could be useful to mention this approach to speed up queries.
>

Which version of the tutorial do you have? I'm looking at page 91 in the
current PDF (included with Biopython 1.52) and that is the start of the
section on EFetch. At the end of that section (bottom of page 93, start
of page 94) is an example checking if a GenBank file exists locally, and
if not, downloading it.

http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
http://biopython.org/DIST/docs/tutorial/Tutorial.html

I'm hoping you are looking at an older version, but if not, maybe we
can re-order that section or something to make it clearer. Feedback
on documentation is very useful.

Peter

P.S. Please CC the mailing list.


From biopython at maubp.freeserve.co.uk  Fri Oct 23 15:08:29 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 23 Oct 2009 16:08:29 +0100
Subject: [Biopython] Querying NCBI
In-Reply-To: 
References: <263904DC-B930-4E1B-9C6C-510C83473640@gmail.com>
	<320fb6e00910230208l6fdbb7e3i44083a9d7db14ecf@mail.gmail.com>
	<0D6CA88F-775A-4061-8EA2-7A56FCAE6461@gmail.com>
	<320fb6e00910230733m726d573bhc0d052ff143e6c7e@mail.gmail.com>
	<8F8DA214-8AC3-41B0-9D38-2AB7864205DC@gmail.com>
	<320fb6e00910230743k46b301ebl85189c5434a65b82@mail.gmail.com>
	
	<320fb6e00910230756t22d54402p5dba99a2c689e521@mail.gmail.com>
	
Message-ID: <320fb6e00910230808x3d5cc7cepe68f0f5c233e9132@mail.gmail.com>

On Fri, Oct 23, 2009 at 4:06 PM, Michael S. Koeris
 wrote:
>
> On Oct 23, 2009, at 10:56 AM, Peter wrote:
>> I'm hoping you are looking at an older version, but if not, maybe we
>> can re-order that section or something to make it clearer. Feedback
>> on documentation is very useful.
>>
>> Peter
>
> Yeah i must be looking at an older one - that example in the new version
> is pretty clear!
>
> thanks again

OK - great.

Peter


From mikelisanke at gmail.com  Fri Oct 23 15:27:55 2009
From: mikelisanke at gmail.com (Mike Lisanke)
Date: Fri, 23 Oct 2009 11:27:55 -0400
Subject: [Biopython] Fwd: Windows installer does not find Python 2.63 with
	multiple pythons
In-Reply-To: <8c5c6e580910230826o44861568ld70c135a094b0694@mail.gmail.com>
References: <8c5c6e580910191237w54e857bfs1107ba86df05db7c@mail.gmail.com>
	<320fb6e00910191429l481223d8iae9fbe78702fff70@mail.gmail.com>
	<8c5c6e580910221119o1d36426dkfdb2d96d681828c9@mail.gmail.com>
	<320fb6e00910221245w3eff78dcj54aa3eb57990300a@mail.gmail.com>
	<8c5c6e580910230826o44861568ld70c135a094b0694@mail.gmail.com>
Message-ID: <8c5c6e580910230827t1711f8dcub1b8b2d05c552c4f@mail.gmail.com>

---------- Forwarded message ----------
From: Mike Lisanke 
Date: Fri, Oct 23, 2009 at 11:26 AM
Subject: Re: [Biopython] Windows installer does not find Python 2.63 with
multiple pythons
To: Peter 


Peter,

Yes. I got a clue when I saw Numpy (which worked (has an AMD64 build)), and
failed when switched to an earlier python level (2.6 -> 2.5). Numpy only
has a Win32 installer, and it reported the same failure with the
python-2.5-amd64 registry values.

If I can, I will prepare (the libraries?) for a Biopython-2.x-AMD64 package.
I haven't installed a C/C++ build environment on my windows machine (yet),
but; I'm adept at Linux and Windows C/C++ development. And, I'd like to have
a 64bit Biopython. From your email, I now assume Biopython is not
strictly python code (which should run on whatever python is installed).

I'll dig into the source + documentation, but you probably can give me the
short answer. Does this build from a GCC on windows (e.g. Cygwin or
GnuWin32), or a Microsoft build environment (e.g. Visual C++)? And, I assume
it is not cross-platform prepared from Linux (e.g. fake-root)? Thanks.

On Thu, Oct 22, 2009 at 3:45 PM, Peter wrote:

> On Thu, Oct 22, 2009 at 7:19 PM, Mike Lisanke 
> wrote:
> > Peter,
> >
> > The problem was python-2.6.3-amd64 for which there is a
> > numpy-1.3.0-win-amd64-py2.6 installer. Is there a reason
> > NumPy and BioPython have a specific dependency to
> > work with the AMD64 build of python?
>
> Are you running on 64 bit Windows then? XP or Vista?
>
> It sounds like you are trying to mix 32 and 64 bit versions
> of Python.
>
> If you installed the 64 bit version of Python and Numpy,
> then you will need a 64 bit compiled version of Biopython
> too - but we don't have one of those yet. We'd need a
> developer or a volunteer with a 64bit Windows machine
> to do this.
>
> You should be able to install a 32 bit version of Python,
>
> http://python.org/ftp/python/2.6.3/python-2.6.3.msi
>
> plus the 32 bit Windows installer for Numpy:
>
>
> http://sourceforge.net/projects/numpy/files/NumPy/1.3.0/numpy-1.3.0-win32-superpack-python2.6.exe/download
>
> and the 32 bit Windows installer for Biopython:
>
> http://biopython.org/DIST/biopython-1.52.win32-py2.6.exe
>
> (i.e. look for win32 in the filenames, not amd64).
>
> Peter
>



-- 
Best regards,

Mike



-- 
Best regards,

Mike


From peter at maubp.freeserve.co.uk  Fri Oct 23 15:47:55 2009
From: peter at maubp.freeserve.co.uk (Peter)
Date: Fri, 23 Oct 2009 16:47:55 +0100
Subject: [Biopython] Fwd: Windows installer does not find Python 2.63
	with multiple pythons
In-Reply-To: <8c5c6e580910230827t1711f8dcub1b8b2d05c552c4f@mail.gmail.com>
References: <8c5c6e580910191237w54e857bfs1107ba86df05db7c@mail.gmail.com>
	<320fb6e00910191429l481223d8iae9fbe78702fff70@mail.gmail.com>
	<8c5c6e580910221119o1d36426dkfdb2d96d681828c9@mail.gmail.com>
	<320fb6e00910221245w3eff78dcj54aa3eb57990300a@mail.gmail.com>
	<8c5c6e580910230826o44861568ld70c135a094b0694@mail.gmail.com>
	<8c5c6e580910230827t1711f8dcub1b8b2d05c552c4f@mail.gmail.com>
Message-ID: <320fb6e00910230847m163960ceneeea268880c88bf2@mail.gmail.com>

On Fri, Oct 23, 2009 at 4:27 PM, Mike Lisanke  wrote:
>
> Peter,
>
> Yes. I got a clue when I saw Numpy (which worked (has an AMD64 build)), and
> failed when switched to an earlier python level (2.6 -> 2.5). Numpy only
> has a Win32 installer, and it reported the same failure with the
> python-2.5-amd64 registry values.

That makes sense.

> If I can, I will prepare (the libraries?) for a Biopython-2.x-AMD64 package.
> I haven't installed a C/C++ build environment on my windows machine (yet),
> but; I'm adept at Linux and Windows C/C++ development. And, I'd like to have
> a 64bit Biopython. From your email, I now assume Biopython is not
> strictly python code (which should run on whatever python is installed).

That is correct - Biopython includes some C code (like NumPy).

> I'll dig into the source + documentation, but you probably can give me the
> short answer. Does this build from a GCC on windows (e.g. Cygwin or
> GnuWin32), or a Microsoft build environment (e.g. Visual C++)? And, I assume
> it is not cross-platform prepared from Linux (e.g. fake-root)? Thanks.

We compile the Biopython Windows 32 bit Installers on a 32 bit Windows
XP machine. The compiler depends on which version of Python you want
to use. See the "Installing from source on Windows" section of this document:
http://biopython.org/DIST/docs/install/Installation.html
http://biopython.org/DIST/docs/install/Installation.pdf

You may be the first person to try this on 64 bit Windows. At least,
no-one has responded to my email to the dev list yesterday:
http://lists.open-bio.org/pipermail/biopython-dev/2009-October/006901.html

Peter


From ap12 at sanger.ac.uk  Fri Oct 23 15:57:53 2009
From: ap12 at sanger.ac.uk (Anne Pajon)
Date: Fri, 23 Oct 2009 16:57:53 +0100
Subject: [Biopython] fasta-m10 al_start and al_end?
Message-ID: <82A8DBBC-6964-4433-AF9C-EF049160A2DF@sanger.ac.uk>

Dear,

I am using Biopython to parse a fasta alignment file:

     alignments = AlignIO.parse(open("fastaresults/78_Spneumoniae_ATCC700669/all_bases_435_1055_cds.fres"),
                                "fasta-m10", seq_count=2)
     for alignment in alignments:

         record_query = alignment[0]
         record_match = alignment[1]

         print alignment._annotations["sw_score"], alignment._annotations["sw_ident"]
         print record_query.annotations["original_length"]
         # print record_query.annotations["al_start"], record_query.annotations["al_end"]

I would like to print the start/end of each aligned sequences.

I can see in Bio.AlignIO.FastaIO.next() that sq_len is stored in  
annotations:
	record.annotations["original_length"] = int(query_annotation["sq_len"])
but I cannot find a way of accessing al_start and al_end.

Thanks in advance for your help.
Kind regards,
Anne.
--
Dr Anne Pajon - Pathogen Genomics
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


From biopython at maubp.freeserve.co.uk  Fri Oct 23 18:40:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 23 Oct 2009 19:40:12 +0100
Subject: [Biopython] fasta-m10 al_start and al_end?
In-Reply-To: <82A8DBBC-6964-4433-AF9C-EF049160A2DF@sanger.ac.uk>
References: <82A8DBBC-6964-4433-AF9C-EF049160A2DF@sanger.ac.uk>
Message-ID: <320fb6e00910231140w46243f9bp9751b6476e3e55b0@mail.gmail.com>

On Fri, Oct 23, 2009 at 4:57 PM, Anne Pajon  wrote:
> Dear,
>
> I am using Biopython to parse a fasta alignment file:
>
>    alignments =
> AlignIO.parse(open("fastaresults/78_Spneumoniae_ATCC700669/all_bases_435_1055_cds.fres"),
> "fasta-m10", seq_count=2)
>    for alignment in alignments:
>
>        record_query = alignment[0]
>        record_match = alignment[1]
>
>        print alignment._annotations["sw_score"],
> alignment._annotations["sw_ident"]
>        print record_query.annotations["original_length"]
>        # print record_query.annotations["al_start"],
> record_query.annotations["al_end"]
>
> I would like to print the start/end of each aligned sequences.
>
> I can see in Bio.AlignIO.FastaIO.next() that sq_len is stored in
> annotations:
>        record.annotations["original_length"] =
> int(query_annotation["sq_len"])
> but I cannot find a way of accessing al_start and al_end.
>
> Thanks in advance for your help.
> Kind regards,
> Anne.

Hi Anne,

That's a good question, but the answer may be a little
disappointing.

That information isn't currently recorded in the SeqRecord,
partly because at the time I didn't need it, but mainly I was
undecided about if the start location should be converted
into python counting or not (zero based versus one based).
What would you prefer? My inclination is python counting.
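
To make the difference concrete, here is a minimal sketch of the
conversion. It assumes the reported coordinates are one-based and
inclusive with start <= end, which is not stated above; the function
name is made up for illustration.

```python
def one_based_to_slice(start, end):
    """Convert a one-based inclusive range into a Python slice."""
    # Python counts from zero, so the start moves down by one. The
    # inclusive one-based end is numerically the same as the exclusive
    # zero-based end, so it is left unchanged.
    return slice(start - 1, end)


letters = "ABCDEFGHIJ"
# One-based positions 3 to 5 inclusive are the letters C, D and E.
print(letters[one_based_to_slice(3, 5)])
```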

Peter

P.S. Most of the alignment level annotation is recorded,
but is currently hidden in a "private" property (leading
underscore). You can access this, but be warned that this
will change in future - Improving the alignment object is
something I am working on for a future release.



From biopython at maubp.freeserve.co.uk  Mon Oct 26 10:04:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Oct 2009 10:04:21 +0000
Subject: [Biopython] fasta-m10 al_start and al_end?
In-Reply-To: <1E613DD7-3CF3-4F12-8C61-D12440F5AE1D@sanger.ac.uk>
References: <82A8DBBC-6964-4433-AF9C-EF049160A2DF@sanger.ac.uk>
	<320fb6e00910231140w46243f9bp9751b6476e3e55b0@mail.gmail.com>
	<1E613DD7-3CF3-4F12-8C61-D12440F5AE1D@sanger.ac.uk>
Message-ID: <320fb6e00910260304j417fa20ep2b353b6e74ca055d@mail.gmail.com>

On Fri, Oct 23, 2009 at 11:00 PM, Anne Pajon  wrote:
>
> Hi Peter,
>
> Thanks for your fast answer.
>
> I've already discovered the _annotations and I am prepared to update my
> code as soon as a better solution is provided.

Good.

> Concerning the al_start and al_end, I am looking for a solution very soon,
> as I am working on an annotation pipeline prototype in python. What would be
> your recommendation? Writing a parser myself, using another tool (but which
> one?), or helping storing this information in SeqRecord in biopython as it
> is almost there. Thanks to let me know.

I would rather not add them directly to the SeqRecord annotations
dictionary because that will make doing something meaningful with
slicing (the SeqRecord, or in future the Alignment) much harder. I
think the best way to handle these is in the Alignment object, but
this isn't really supported at the moment.

Are you happy to run a development version of Biopython, or at least
to update the file Bio/AlignIO/FastaIO.py? I'm thinking in the short
term we can record these bits of information as private properties of
the SeqRecord, i.e. _al_start and _al_end

Would that suit you for now?

Peter


From biopython at maubp.freeserve.co.uk  Mon Oct 26 14:17:50 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Oct 2009 14:17:50 +0000
Subject: [Biopython] fasta-m10 al_start and al_end?
In-Reply-To: <320fb6e00910260304j417fa20ep2b353b6e74ca055d@mail.gmail.com>
References: <82A8DBBC-6964-4433-AF9C-EF049160A2DF@sanger.ac.uk>
	<320fb6e00910231140w46243f9bp9751b6476e3e55b0@mail.gmail.com>
	<1E613DD7-3CF3-4F12-8C61-D12440F5AE1D@sanger.ac.uk>
	<320fb6e00910260304j417fa20ep2b353b6e74ca055d@mail.gmail.com>
Message-ID: <320fb6e00910260717r50974e1epb7ee94d41ff5aa6c@mail.gmail.com>

On Mon, Oct 26, 2009 at 10:04 AM, Peter  wrote:
> On Fri, Oct 23, 2009 at 11:00 PM, Anne Pajon  wrote:
>>
>> Hi Peter,
>>
>> Thanks for your fast answer.
>>
>> I've already discovered the _annotations and I am prepared to update my
>> code as soon as a better solution is provided.
>
> Good.
>
>> Concerning the al_start and al_end, I am looking for a solution very soon,
>> as I am working on an annotation pipeline prototype in python. What would be
>> your recommendation? Writing a parser myself, using another tool (but which
>> one?), or helping storing this information in SeqRecord in biopython as it
>> is almost there. Thanks to let me know.
>
> I would rather not add them directly to the SeqRecord annotations
> dictionary because that will make doing something meaningful with
> slicing (the SeqRecord, or in future the Alignment) much harder. I
> think the best way to handle these is in the Alignment object, but
> this isn't really supported at the moment.
>
> Are you happy to run a development version of Biopython, or at least
> to update the file Bio/AlignIO/FastaIO.py? I'm thinking in the short
> term we can record these bits of information as private properties of
> the SeqRecord, i.e. _al_start and _al_end

Make that _al_start and _al_stop (to match the field names used in
the FASTA output). This change is in the repository now, which you
can grab via github.  See http://www.biopython.org/wiki/SourceCode

As with any "private" variables (leading underscore), they are not
really intended for public use, but should at least solve your
immediate requirement for now.

Peter


From eric.talevich at gmail.com  Mon Oct 26 15:44:23 2009
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 26 Oct 2009 11:44:23 -0400
Subject: [Biopython] fasta-m10 al_start and al_end?
Message-ID: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>

>
> On Fri, Oct 23, 2009 at 4:57 PM, Anne Pajon  wrote:
> > Dear,
> >
> > I am using Biopython to parse a fasta alignment file:
> >
> ...
> >
> > I would like to print the start/end of each aligned sequences.
> >
> > I can see in Bio.AlignIO.FastaIO.next() that sq_len is stored in
> > annotations:
> >        record.annotations["original_length"] =
> > int(query_annotation["sq_len"])
> > but I cannot find a way of accessing al_start and al_end.
> >
> > Thanks in advance for your help.
> > Kind regards,
> > Anne.
>
> Hi Anne,
>
> That's a good question, but the answer may be a little
> disappointing.
>
> That information isn't currently recorded in the SeqRecord,
> partly because at the time I didn't need it, but mainly I was
> undecided about if the start location should be converted
> into python counting or not (zero based versus one based).
> What would you prefer? My inclination is python counting.
>
> Peter
>
> P.S. Most of the alignment level annotation is recorded,
> but is currently hidden in a "private" property (leading
> underscore). You can access this, but be warned that this
> will change in future - Improving the alignment object is
> something I am working on for a future release.
>
>
Hi Peter,

Here's +1 for Python counting. That would match SeqFeature and the
ProteinDomain class in Bio.Tree.PhyloXML.

While we're on this topic -- I have some unpublished code for rendering an
alignment object in HTML, with plans for colorization, conservation
profiles, etc. I rolled my own alignment class since the one in
Bio.Align.Generic didn't have the attributes (start, end, selected columns)
for a particular file format I was parsing. It's not urgent, but at some
point could you publish your plans for the Alignment classes so I (and
probably others) can stay/become compatible?

Thanks,
Eric


From biopython at maubp.freeserve.co.uk  Mon Oct 26 16:07:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 26 Oct 2009 16:07:04 +0000
Subject: [Biopython] fasta-m10 al_start and al_end?
In-Reply-To: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
References: <3f6baf360910260844g2bcbec57y747ad65a59325588@mail.gmail.com>
Message-ID: <320fb6e00910260907i47e23a0akb313344df4dfecb6@mail.gmail.com>

On Mon, Oct 26, 2009 at 3:44 PM, Eric Talevich  wrote:
> Hi Peter,
>
> Here's +1 for Python counting. That would match SeqFeature and the
> ProteinDomain class in Bio.Tree.PhyloXML.
>
> While we're on this topic -- I have some unpublished code for rendering an
> alignment object in HTML, with plans for colorization, conservation
> profiles, etc. I rolled my own alignment class since the one in
> Bio.Align.Generic didn't have the attributes (start, end, selected columns)
> for a particular file format I was parsing. It's not urgent, but at some
> point could you publish your plans for the Alignment classes so I (and
> probably others) can stay/become compatible?

My rough work in progress is on github - at the moment I'm still trying
things out, and don't assume anything is set in stone. If you want to
have a play with this code, feedback is very welcome - probably best
on the dev list rather than here. See:

http://github.com/peterjc/biopython/tree/seqrecords

(a lot of the alignment things I want to support, like slicing and adding
are very closely linked to doing the same operations to SeqRecords)

Peter


From yvan.strahm at bccs.uib.no  Tue Oct 27 09:41:43 2009
From: yvan.strahm at bccs.uib.no (Yvan Strahm)
Date: Tue, 27 Oct 2009 10:41:43 +0100
Subject: [Biopython] how to validate fasta format
Message-ID: <4AE6C057.9050604@bccs.uib.no>

Hello All,

Is it possible to validate a sequence format, for example while the sequence is parsed by 
SeqIO.parse and using IUPAC.py? Or should I try to search for illegal characters in .seq?

Cheers,
yvan


From biopython at maubp.freeserve.co.uk  Tue Oct 27 10:08:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Oct 2009 10:08:41 +0000
Subject: [Biopython] how to validate fasta format
In-Reply-To: <4AE6C057.9050604@bccs.uib.no>
References: <4AE6C057.9050604@bccs.uib.no>
Message-ID: <320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>

On Tue, Oct 27, 2009 at 9:41 AM, Yvan Strahm  wrote:
> Hello All,
>
> Is it possible to validate a sequence format, for example while the sequence
> is parsed by SeqIO.parse and using IUPAC.py? Or should I try to search for
> illegal characters in .seq?
>
> Cheers,
> yvan

It depends on what you mean by validate - if you want to check for
specific letters against a whitelist, then currently you would have to
look at the letters in the sequence. I would use sets for this. e.g.

from Bio import SeqIO

wanted = set("ACGT")
# handle is an already opened FASTA file
for record in SeqIO.parse(handle, "fasta"):
    if not wanted.issuperset(str(record.seq)):
        print("Bad: %s" % record.id)

Making the Seq object validate against explicit alphabets (where
the allowed letters are given) is something I have wondered about
for the future.
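
The same whitelist idea can be tried without Biopython. This is only a
sketch under stated assumptions: read_fasta and bad_records are made-up
helper names, the reader is far simpler than SeqIO, and folding the
sequence to upper case before the check is a design choice you may not
want:

```python
def read_fasta(lines):
    """Yield (identifier, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # The first word of the header is taken as the identifier.
            words = line[1:].split()
            name = words[0] if words else ""
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def bad_records(lines, allowed="ACGT"):
    """Return identifiers of records with letters outside the whitelist."""
    wanted = set(allowed)
    return [name for name, seq in read_fasta(lines)
            if not wanted.issuperset(seq.upper())]


example = [">ok", "ACGT", "acgt", ">bad", "ACGN"]
print(bad_records(example))
```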

Peter


From yvan.strahm at bccs.uib.no  Tue Oct 27 12:03:11 2009
From: yvan.strahm at bccs.uib.no (Yvan Strahm)
Date: Tue, 27 Oct 2009 13:03:11 +0100
Subject: [Biopython] how to validate fasta format
In-Reply-To: <320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
References: <4AE6C057.9050604@bccs.uib.no>
	<320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
Message-ID: <4AE6E17F.2030407@bccs.uib.no>


Peter wrote:
> On Tue, Oct 27, 2009 at 9:41 AM, Yvan Strahm  wrote:
>> Hello All,
>>
>> Is it possible to validate a sequence format, for example while the sequence
>> is parsed by SeqIO.parse and using IUPAC.py? Or should I try to search for
>> illegal characters in .seq?
>>
>> Cheers,
>> yvan
> 
> It depends on what you mean by validate - if you want to check for
> specific letters against a whitelist, then currently you would have to
> look at the letters in the sequence. I would use sets for this. e.g.
> 
> wanted = set("ACGT")
> for record in SeqIO.parse(handle, "fasta") :
>     if not wanted.issuperset(record.seq) :
>          print "Bad: %s" % record.id
> 
> Making the Seq object validate against explicit alphabets (where
> the allowed letters are given) is something I have wondered about
> for the future.
> 
> Peter

Thanks for the quick reply.

Yes by validating I mainly meant check for the correct alphabet in the Seq object but also the 
correct header's format. So I guess, I have to trust the user.... ;-)
thanks again
yvan



From biopython at maubp.freeserve.co.uk  Tue Oct 27 12:36:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Oct 2009 12:36:52 +0000
Subject: [Biopython] how to validate fasta format
In-Reply-To: <4AE6E17F.2030407@bccs.uib.no>
References: <4AE6C057.9050604@bccs.uib.no>
	<320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
	<4AE6E17F.2030407@bccs.uib.no>
Message-ID: <320fb6e00910270536v4a2a4c22xc734085dfc5c6091@mail.gmail.com>

On Tue, Oct 27, 2009 at 12:03 PM, Yvan Strahm  wrote:
> Yes by validating I mainly meant check for the correct alphabet in the Seq
> object but also the correct header's format. So I guess, I have to trust the
> user.... ;-)

The FASTA header is basically free format - almost anything is valid,
although some tools object to things like pipes and underscores.
You will need to test the data in terms of your own criteria.

Peter


From biopython at maubp.freeserve.co.uk  Tue Oct 27 13:20:58 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Oct 2009 13:20:58 +0000
Subject: [Biopython] how to validate fasta format
In-Reply-To: <1256649260.5941.7.camel@Neo>
References: <4AE6C057.9050604@bccs.uib.no>
	<320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
	<1256649260.5941.7.camel@Neo>
Message-ID: <320fb6e00910270620ofd3cca2pc59fd30b86dab7f7@mail.gmail.com>

On Tue, Oct 27, 2009 at 1:14 PM, Steve Darnell  wrote:
>
> Greetings,
>
> This particular thread addresses a topic we've revisited lately,
> ambiguity codes (particularly in the amino acid alphabet). I would like
> to query the group for their opinion of the remaining 6 characters after
> you remove the 20 standard amino acids. Here's our list:
>
> B - Asn or Asp
> J - Ile or Leu
> O - ???
> U - seleno-Cys
> X - Any
> Z - Gln or Glu

Your list is incomplete. According to the Biopython
ExtendedIUPACProtein alphabet docstring, which is based on the IUPAC
standards or recommendations:

    B = "Asx";  Aspartic acid (D) or Asparagine (N)
    X = "Xxx";  Unknown or 'other' amino acid
    Z = "Glx";  Glutamic acid (E) or Glutamine (Q)
    J = "Xle";  Leucine (L) or Isoleucine (I), used in mass-spec (NMR)
    U = "Sec";  Selenocysteine
    O = "Pyl";  Pyrrolysine

In practice, X is also often used to mean any amino acid or a stop
codon too (although this really would benefit from a more explicit
character in my personal opinion).
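
For anyone wanting to test sequences against these letters, a minimal
sketch follows. The names STANDARD, EXTENDED and classify_letters are
invented for this example and are not taken from Bio.Alphabet; the table
just restates the six codes listed above:

```python
# The 20 standard amino acid one-letter codes.
STANDARD = set("ACDEFGHIKLMNPQRSTVWY")

# The six remaining letters and their three-letter codes, as listed above.
EXTENDED = {"B": "Asx", "X": "Xxx", "Z": "Glx",
            "J": "Xle", "U": "Sec", "O": "Pyl"}


def classify_letters(sequence):
    """Split the letters used in a sequence into three sorted lists."""
    standard, extended, unknown = set(), set(), set()
    for letter in sequence.upper():
        if letter in STANDARD:
            standard.add(letter)
        elif letter in EXTENDED:
            extended.add(letter)
        else:
            unknown.add(letter)
    return sorted(standard), sorted(extended), sorted(unknown)


print(classify_letters("MKUXB*"))
```

Anything in the third list (such as a stop symbol) is outside both
alphabets and needs a decision of your own.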

Peter



From darnells at dnastar.com  Tue Oct 27 13:14:20 2009
From: darnells at dnastar.com (Steve Darnell)
Date: Tue, 27 Oct 2009 08:14:20 -0500
Subject: [Biopython] how to validate fasta format
In-Reply-To: <320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
References: <4AE6C057.9050604@bccs.uib.no>
	<320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
Message-ID: <1256649260.5941.7.camel@Neo>

Greetings,

This particular thread addresses a topic we've revisited lately,
ambiguity codes (particularly in the amino acid alphabet).  I would like
to query the group for their opinion of the remaining 6 characters after
you remove the 20 standard amino acids.  Here's our list:

B - Asn or Asp
J - Ile or Leu
O - ???
U - seleno-Cys
X - Any
Z - Gln or Glu

~Steve


On Tue, 2009-10-27 at 10:08 +0000, Peter wrote:
> On Tue, Oct 27, 2009 at 9:41 AM, Yvan Strahm  wrote:
> > Hello All,
> >
> > Is it possible to validate a sequence format, for example while the sequence
> > is parsed by SeqIO.parse and using IUPAC.py? Or should I try to search for
> > illegal characters in .seq?
> >
> > Cheers,
> > yvan
> 
> It depends on what you mean by validate - if you want to check for
> specific letters against a whitelist, then currently you would have to
> look at the letters in the sequence. I would use sets for this. e.g.
> 
> wanted = set("ACGT")
> for record in SeqIO.parse(handle, "fasta") :
>     if not wanted.issuperset(record.seq) :
>          print "Bad: %s" % record.id
> 
> Making the Seq object validate against explicit alphabets (where
> the allowed letters are given) is something I have wondered about
> for the future.
> 
> Peter
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython



From dalloliogm at gmail.com  Tue Oct 27 13:41:36 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 27 Oct 2009 14:41:36 +0100
Subject: [Biopython] how to validate fasta format
In-Reply-To: <320fb6e00910270536v4a2a4c22xc734085dfc5c6091@mail.gmail.com>
References: <4AE6C057.9050604@bccs.uib.no>
	<320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com> 
	<4AE6E17F.2030407@bccs.uib.no>
	<320fb6e00910270536v4a2a4c22xc734085dfc5c6091@mail.gmail.com>
Message-ID: <5aa3b3570910270641r47615ad7wf51e42cca7d846a3@mail.gmail.com>

On Tue, Oct 27, 2009 at 1:36 PM, Peter wrote:

> On Tue, Oct 27, 2009 at 12:03 PM, Yvan Strahm 
> wrote:
> > Yes by validating I mainly meant check for the correct alphabet in the
> Seq
> > object but also the correct header's format. So I guess, I have to trust
> the
> > user.... ;-)
>
> The FASTA header is basically free format - almost anything is valid,
> although some tools object to things like pipes and underscores.
> You will need to test the data in terms of your own criteria.
>
>

In principle it is as you say, but if you want to implement a validator, I
would take into account that:
- many programs fail if the first character after the '>' is a space
- the first word after the '>' is usually considered to be the name of
the sequence; further descriptions must be separated by spaces or '|'
- the sequence is continuous and should not be interrupted by blank lines







-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Tue Oct 27 14:07:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Oct 2009 14:07:05 +0000
Subject: [Biopython] how to validate fasta format
In-Reply-To: <5aa3b3570910270641r47615ad7wf51e42cca7d846a3@mail.gmail.com>
References: <4AE6C057.9050604@bccs.uib.no>
	<320fb6e00910270308s2b51893fucd3be6307946e2b5@mail.gmail.com>
	<4AE6E17F.2030407@bccs.uib.no>
	<320fb6e00910270536v4a2a4c22xc734085dfc5c6091@mail.gmail.com>
	<5aa3b3570910270641r47615ad7wf51e42cca7d846a3@mail.gmail.com>
Message-ID: <320fb6e00910270707w7a9ab424m43564e2de1acbe46@mail.gmail.com>

On Tue, Oct 27, 2009 at 1:41 PM, Giovanni Marco Dall'Olio
 wrote:
>
> In principle is as you say, but if you want to implement a validator, I
> would take into account that:
> - many programs fail if the first character after the '>' is a space

Good point. I'd interpret that as a record without a name/identifier,
but with a description. We should double check Biopython does
handle this gracefully.

> - the first word after the '>' is usually considered as being the name of
> the sequence; further descriptions must be separed by spaces or '|'

I'm not sure what you mean about the pipe (|) in descriptions - this
is basically a case of anything is allowed, but some tools are fussy.

> - the sequence is continuous and it should not be interrupted by blank lines

I think according to the original FASTA tools, blank lines are fine.
But again, some tools are fussy. Here Biopython should tolerate
this on input, and not do it on output.

i.e. FASTA "validation" always depends on what you are doing it
for. As another example, when preparing data for TMHMM it is sensible to
impose a minimum length on the sequence - but a short or
even zero length sequence is valid in FASTA files in general.
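
The criteria discussed in this thread could be collected into a small
checker along these lines. It is a sketch only: check_fasta is a made-up
name, the three checks are just the ones raised above (space after '>',
blank lines inside a record, an optional minimum length), and every one of
them is a project-specific choice rather than a rule of the FASTA format:

```python
def check_fasta(lines, min_length=0):
    """Return a list of (line_number, message) warnings for FASTA lines."""
    problems = []
    header_line = None    # line number of the current record's header
    length = 0            # residues seen so far in the current record
    pending_blank = None  # blank line that may turn out to be inside a record

    for number, raw in enumerate(lines, 1):
        line = raw.rstrip("\r\n")
        if line.startswith(">"):
            if header_line is not None and length < min_length:
                problems.append(
                    (header_line, "sequence shorter than %d" % min_length))
            if len(line) == 1 or line[1:2].isspace():
                problems.append((number, "no identifier directly after '>'"))
            header_line, length, pending_blank = number, 0, None
        elif not line.strip():
            if header_line is not None:
                pending_blank = number
        else:
            if header_line is None:
                problems.append((number, "sequence data before first header"))
                continue
            if pending_blank is not None:
                # Only report a blank line if more sequence follows it.
                problems.append((pending_blank, "blank line inside record"))
                pending_blank = None
            length += len(line.strip())

    if header_line is not None and length < min_length:
        problems.append((header_line, "sequence shorter than %d" % min_length))
    return problems


sample = [">seq1 first", "ACGT", "", "ACGT", "> seq2", "AC", ">seq3", "ACGTACGT"]
for line_number, message in check_fasta(sample, min_length=4):
    print("line %d: %s" % (line_number, message))
```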

Peter


From bassbabyface at yahoo.com  Tue Oct 27 15:12:13 2009
From: bassbabyface at yahoo.com (Ben O'Loghlin)
Date: Wed, 28 Oct 2009 02:12:13 +1100
Subject: [Biopython] Entrez.read return value is typed as a string??
Message-ID: <01aa01ca5717$dec90220$9c5b0660$@com>

Hi all,

I'm new to BioPython, having spent < 4 hours playing with it, and I'm mighty
impressed with what it can do for me once I get it working. Unfortunately
I've spent about 3.5 of those hours inanely grappling with Entrez.read, so I
turn to more experienced BioPythoneers for assistance.

I'm trying to use Entrez to extract and manipulate records from PubMed, and
I'm stumped. I was expecting the return value of Entrez.read to be a
structured object, and instead it seems to return a string which would
require further parsing to do anything useful with.

I'm not sure if this is the expected output and I have misunderstood, or if
PubMed is just returning results in unexpected formats which break the
parser in Entrez.read, or if Bio just doesn't work after midnight (2:06 am
Australian EST).

Is anyone able/willing to assist? The goal here is to have some way of
extracting individual fields from the returned records, e.g. print out the
Abstract for PMID 17206916.

I'm using BioPython 1.5.2 and Python 2.6.4 on Vista. Script and output
below...

Many thanks in advance,
Ben

#########################################################################
#  Biotest.py
#########################################################################
from Bio import Entrez

PMID = "17206916"
database = "pubmed"

# Fetch the full article details
handle1 = Entrez.efetch(db=database, id=PMID)
full = handle1.read()
print "\nProperties of full record object: "
print type(full)
print
print full[0:180]

#Fetch and print the summary details
handle2 = Entrez.esummary(db=database, id=PMID)
summary = handle2.read()
print "\nProperties of summary record object: "
print type(summary)
print
print summary[0:300]
#########################################################################


#########################################################################
#  Output from Biotest.py
#########################################################################

C:\Data\Personal\Dev\Python\PubMed>c:\Python26\python.exe biotest.py

Properties of full record object:
<type 'str'>

PmFetch response
Pubmed-entry ::= {
  pmid 17206916,
  medent {
    em std {
      year 2007,
      month 1,
      day 8
    },
    ci

Properties of summary record object:
<type 'str'>





        17206916
        2006
        
References: <01aa01ca5717$dec90220$9c5b0660$@com>
Message-ID: <320fb6e00910270842p4ae53933k91f5ce16a7b4d7e2@mail.gmail.com>

On Tue, Oct 27, 2009 at 3:12 PM, Ben O'Loghlin  wrote:
> Hi all,
>
> I'm new to BioPython, having spent < 4 hours playing with it, and I'm mighty
> impressed with what it can do for me once I get it working. Unfortunately
> I've spent about 3.5 of those hours inanely grappling with Entrez.read, so I
> turn to more experienced BioPythoneers for assistance.

Oh dear - were you working through the Entrez chapter in the Tutorial?
If not, what were you looking at?

> I'm trying to use Entrez to extract and manipulate records from PubMed, and
> I'm stumped. I was expecting the return value of Entrez.read to be a
> structured object, and instead it seems to return a string which would
> require further parsing to do anything useful with.

That doesn't sound right. The Bio.Entrez.read() should take a handle,
in XML format, and return a nested collection of python objects.

> I'm not sure if this is the expected output and I have misunderstood, or if
> PubMed is just returning results in unexpected formats which break the
> parser in Entrez.read, or if Bio just doesn't work after midnight (2:06 am
> Australian EST).
>
> Is anyone able/willing to assist? The goal here is to have some way of
> extracting individual fields from the returned records, e.g. print out the
> Abstract for PMID 17206916.

First of all, handles give access to data via the read() and other methods,
like readline()

>>> from Bio import Entrez
>>> handle = Entrez.efetch(db="pubmed", id="17206916")
>>> print handle.readline()
PmFetch response

So you see by default, the NCBI is returning HTML. We can ask for XML:

>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
>>> print handle.readline()


You could parse this with Bio.Entrez.read() if you wanted to:

>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
>>> record = Entrez.read(handle)
>>> print record
[{u'MedlineCitation': ... ]

Or, rather than XML designed for a computer to parse, you could ask for
the plain text MEDLINE format,

>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="text", rettype="medline")
>>> print handle.read()
PMID- 17206916
OWN - NLM
STAT- MEDLINE
DA  - 20070108
DCOM- 20070130
...
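
To show how simple the MEDLINE layout is, here is a rough sketch of
reading such text into a dictionary using only the standard library.
Biopython ships a Bio.Medline module for real work; parse_medline below is
a made-up name, it assumes each field starts with a tag of up to four
characters followed by "- ", and the TI line in the sample is invented:

```python
def parse_medline(text):
    """Parse one MEDLINE-format record into a dict of tag -> list of values."""
    record = {}
    tag = None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[:4].strip() and line[4:6] == "- ":
            # A new field: tag in the first four columns, then "- ".
            tag = line[:4].strip()
            record.setdefault(tag, []).append(line[6:].strip())
        elif tag is not None and line.startswith(" "):
            # An indented line continues the previous field.
            record[tag][-1] += " " + line.strip()
    return record


sample = (
    "PMID- 17206916\n"
    "OWN - NLM\n"
    "STAT- MEDLINE\n"
    "DA  - 20070108\n"
    "TI  - An invented title that wraps\n"
    "      onto a second line.\n"
)
fields = parse_medline(sample)
print(fields["PMID"][0])
```

Values are kept in lists because some tags can occur more than once in a
record.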

Does that help?

Peter


From biopython at maubp.freeserve.co.uk  Tue Oct 27 15:51:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 27 Oct 2009 15:51:12 +0000
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <320fb6e00910270842p4ae53933k91f5ce16a7b4d7e2@mail.gmail.com>
References: <01aa01ca5717$dec90220$9c5b0660$@com>
	<320fb6e00910270842p4ae53933k91f5ce16a7b4d7e2@mail.gmail.com>
Message-ID: <320fb6e00910270851n3db7984dv861b9d225ead878e@mail.gmail.com>

On Tue, Oct 27, 2009 at 3:42 PM, Peter  wrote:
> On Tue, Oct 27, 2009 at 3:12 PM, Ben O'Loghlin  wrote:
>> I'm trying to use Entrez to extract and manipulate records from PubMed, and
>> I'm stumped. I was expecting the return value of Entrez.read to be a
>> structured object, and instead it seems to return a string which would
>> require further parsing to do anything useful with.
>
> That doesn't sound right. The Bio.Entrez.read() should take a handle,
> in XML format, and return a nested collection of python objects.

I think I've worked out what you may have been doing wrong - trying
to feed HTML into Bio.Entrez.read(). I would have expected a helpful
error message, but it returns an empty string. I've filed Bug 2938.

http://bugzilla.open-bio.org/show_bug.cgi?id=2938

Michiel - could you take a look at this please?

Thanks,

Peter


From danielchubb at gmail.com  Tue Oct 27 18:55:39 2009
From: danielchubb at gmail.com (Daniel Chubb)
Date: Tue, 27 Oct 2009 18:55:39 +0000
Subject: [Biopython] Bio.PDB.ResidueDepth help
Message-ID: 

Hi, I'm trying to calculate residue depth using this module and I'd really
appreciate it if someone could help me make some sense out of the output.


Here is some code:

>>> from Bio.PDB import *
>>> parser=PDBParser()
>>> structure=parser.get_structure("scr",'/.../d1t3ta3.pdb')
>>> model=structure[0]
>>> rd=ResidueDepth(model, '/.../d1t3ta3.pdb')
>>> for i in rd:
...     print i

I then get this output:

...

 ...
(, (941.50269996685836,
938.52026632473292))
(, (943.30248293205898,
935.73449250166789))
(, (956.22610923774971,
929.58401500468858))
(, (946.39762766474189,
929.1969204628009))
(, (980.35736194344759,
952.50174666095472))
(, (943.33749438200709,
941.41471544399076))
(, (1005.0456481617543,
1021.4687548192563))
(, (998.26228815878574,
1014.7065537464257))
(, (954.34720196525564,
933.69587405187428))
(, (865.68049599904009,
859.80537822913527))
(, (888.74360153732255,
871.36588689619543))
(, (887.82610875300952,
870.97697239966283))
(, (882.65307575266002,
870.71143243803749))
(, (1038.6138896432872,
986.73921610486354))
(, (1036.0337702261368,
984.51578671438835))

....

As I understand it, the two values in the tuple (e.g. (941.50269996685836,
938.52026632473292) for residue 1) are residue depth and Ca depth. But
those values don't seem to make sense to me. Are they not supposed to be in
Angstroms? They range in my output from about 865 to 1200, I would expect
some to be 0 (or around that).

Could anyone point out what has gone wrong/what I'm doing wrong?


Thanks a lot for the help

Daniel Chubb





From laszlo at vpac.org  Tue Oct 27 22:23:22 2009
From: laszlo at vpac.org (Laszlo Kun)
Date: Wed, 28 Oct 2009 09:23:22 +1100 (EST)
Subject: [Biopython] KOBAS - KEGG Orthology Based Annotation System XML file
 empty problem
In-Reply-To: <973378923.5269591256682126088.JavaMail.root@mail.vpac.org>
Message-ID: <1915267259.5269661256682202048.JavaMail.root@mail.vpac.org>

Dear All,

I am trying to install the KOBAS software for a user. The installation apparently
succeeded, but after about 3 hours the job fell over with the error message:

======================
[rossh at tango Ov_KOBAS]$ cat NY.e789941
Traceback (most recent call last):
File "/usr/local/python/2.6.2-gcc/bin/blast2ko.py", line 90, in <module>
annots = dict([ (i.query, i) for i in annotator.annotate() ])
File
"/usr/local/python/2.6.2-gcc/lib/python2.6/site-packages/kobas/annot.py",
line 151, in annotate
for record in self.reader:
File
"/usr/local/python/2.6.2-gcc/lib/python2.6/site-packages/Bio/Blast/NCBIXML.py",
line 605, in parse
raise ValueError("Your XML file was empty")
ValueError: Your XML file was empty

=============================


The script appears to have completed the blast section against the KOBAS
database, but has fallen over on the annotation pass.

I haven't come across this error before.

Thanks again for your help.

cheers,
Laszlo 


From biopython at maubp.freeserve.co.uk  Wed Oct 28 10:48:24 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 28 Oct 2009 10:48:24 +0000
Subject: [Biopython] KOBAS - KEGG Orthology Based Annotation System XML
	file empty problem
In-Reply-To: <1915267259.5269661256682202048.JavaMail.root@mail.vpac.org>
References: <973378923.5269591256682126088.JavaMail.root@mail.vpac.org>
	<1915267259.5269661256682202048.JavaMail.root@mail.vpac.org>
Message-ID: <320fb6e00910280348u5bdb7860te59db883ae995362@mail.gmail.com>

On Tue, Oct 27, 2009 at 10:23 PM, Laszlo Kun  wrote:
> Dear All,
>
> I am trying to install for a user the KOBAS software, which is
> done apparently, but after about 3 hours is felt over with
> the error message:
>
> ======================
> [rossh at tango Ov_KOBAS]$ cat NY.e789941
> Traceback (most recent call last):
> File "/usr/local/python/2.6.2-gcc/bin/blast2ko.py", line 90, in <module>
> annots = dict([ (i.query, i) for i in annotator.annotate() ])
> File
> "/usr/local/python/2.6.2-gcc/lib/python2.6/site-packages/kobas/annot.py",
> line 151, in annotate
> for record in self.reader:
> File
> "/usr/local/python/2.6.2-gcc/lib/python2.6/site-packages/Bio/Blast/NCBIXML.py",
> line 605, in parse
> raise ValueError("Your XML file was empty")
> ValueError: Your XML file was empty
>
> =============================
>
> The script appears to have completed the blast section
> against the KOBAS database, but has fallen over on
> the annotation pass.
>
> I haven't come across this error before.
>
> Thanks again for your help.
>
> cheers,
> Laszlo

Hi Laszlo,

Have you previously ever had KOBAS working? I would
guess this is your first attempt...

The error message from Biopython seems quite clear,
KOBAS is trying to parse an empty XML file. This may
have been due to a problem calling BLAST - which
they probably do via Biopython. Have you checked
your installation of standalone NCBI blast (i.e. the
command line tool blastall) is working? I don't know
what NCBI databases are needed, probably nr.
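
One way to catch this early, sketched here with only the standard library,
is to look at the BLAST XML output before it reaches the parser. The
function name and file names are made up, and how KOBAS names its
intermediate files is not known here, so treat this as an illustration of
the check rather than a patch:

```python
import os
import tempfile


def blast_output_problem(path):
    """Return None if the file looks like XML, else a short reason."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        return "empty"
    with open(path) as handle:
        start = handle.read(100).lstrip()
    if not start.startswith("<?xml"):
        return "not XML"
    return None


# Demonstration with throwaway files (names are made up).
with tempfile.TemporaryDirectory() as folder:
    empty = os.path.join(folder, "empty.xml")
    open(empty, "w").close()
    good = os.path.join(folder, "good.xml")
    with open(good, "w") as handle:
        handle.write('<?xml version="1.0"?>\n<BlastOutput></BlastOutput>\n')
    results = [
        blast_output_problem(os.path.join(folder, "absent.xml")),
        blast_output_problem(empty),
        blast_output_problem(good),
    ]
print(results)
```

An "empty" result points at the BLAST step itself rather than at the
parsing code.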

Unfortunately, there is another issue here too...

KOBAS is described here:

Mao et al. (2005) Bioinformatics 21(19) pp. 3787-93
http://dx.doi.org/10.1093/bioinformatics/bti430

Wu et al. (2006) Nucleic Acids Research 34
http://dx.doi.org/10.1093/nar/gkl167

The link given in the original paper seems to be dead now:
http://genome.cbi.pku.edu.cn/download.html

Their second paper gives http://kobas.cbi.pku.edu.cn/
which includes links to download their source code.
I had a quick look at this (KOBAS 1.1.0), and it has
not been updated recently. As you are using Python
2.6, you'll see some harmless deprecation warnings
about the sets module (a trivial issue to fix).

What version of Biopython do you have installed?

Their website says they need Biopython 1.24 or later,
but this isn't true. Their file fasta.py uses Biopython's
Bio.SeqIO module which was added in Biopython 1.43.
Their file annot.py uses Bio.Blast.NCBIXML.parse
function, which was also added in Biopython 1.43.

Also, and perhaps most importantly (as mentioned in
the first paper) they are using Martel for parsing KEGG.
We have dropped Martel, and Biopython 1.50 was
the last release to include it. I'm not sure at what
point in the pipeline they use KEGG, but I guess
this will cause trouble after the BLAST step. We
*could* provide the final version of Martel as a
separate standalone package - I'd need to find
half a day free. Note I would strongly recommend
using mxTextTools version 2 (not version 3), as
some of the unicode related API changes in version 3
are known to cause subtle problems with
Martel as used in older versions of Biopython.

I think you (or Biopython) need to get in touch with
the KOBAS authors. They can at least tell us what
version of Biopython they used to develop KOBAS
1.1.0. Also, they may have already updated their
code for the webservice, and just not updated the
download files.

Regards,

Peter


From pengyu.ut at gmail.com  Wed Oct 28 22:20:32 2009
From: pengyu.ut at gmail.com (Peng Yu)
Date: Wed, 28 Oct 2009 17:20:32 -0500
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com>
References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com>
	<5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com>
Message-ID: <366c6f340910281520w227bbe88o914f889f693588f0@mail.gmail.com>

On Fri, Oct 16, 2009 at 3:29 AM, Giovanni Marco Dall'Olio
 wrote:
> On Thu, Oct 15, 2009 at 11:17 PM, Peng Yu  wrote:
>> I have a set of genes. I want to get the 5kb sequence that is upstream
>> of the TSS's of each gene.
>
> You can do that with biomart:
> - http://www.ensembl.org/biomart/martview/a90f00892a48e04d438f762f551bf48a/a90f00892a48e04d438f762f551bf48a
>
> select Ensembl56 as database, Mus Musculus as species, go to Filters
> and fill the 'Id list limit' form to add the required geneIds, then go
> to Attributes, select Sequences and then check 'Upstream Flank -
> 5000'.

If I want both 5kb upstream of TSS and .5kb downstream of TSS, is
there a way to do so?


> As for doing that in python, I am not sure there are python interfaces
> to BioMart. Galaxy (http://main.g2.bx.psu.edu/) is written in python,
> so they must have written a library for that somewhere, but I don't
> know their code.
>
> If you use R (remember that you can mix python and R with rpy2) there
> is a nice module in bioconductor called BioMart.
>
>
>> I have the following specific questions. Could somebody help me? Thank you!
>>
>> Which database I can access to get mouse genome?
>> Give a gene name what function I should call to get the gene's location?
>
>
>
> --
> Giovanni Dall'Olio, phd student
> Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)
>
> My blog on bioinformatics: http://bioinfoblog.it
>



From bassbabyface at yahoo.com  Thu Oct 29 03:19:09 2009
From: bassbabyface at yahoo.com (Ben O'Loghlin)
Date: Thu, 29 Oct 2009 14:19:09 +1100
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <320fb6e00910270842p4ae53933k91f5ce16a7b4d7e2@mail.gmail.com>
References: <01aa01ca5717$dec90220$9c5b0660$@com>
	<320fb6e00910270842p4ae53933k91f5ce16a7b4d7e2@mail.gmail.com>
Message-ID: <001901ca5846$96f69d60$c4e3d820$@com>

Hi Peter,

Many thanks for your post, you cleared up a world of confusion for me.

A few answers/comments:

>> Oh dear - were you working through the Entrez chapter in the Tutorial?
>> If not, what were you looking at?

No, I didn't find the tutorial until you mentioned it. I came across
BioPython by Googling "python pubmed", the most relevant hit on the first
screenful seemed to be the first one, at
http://baoilleach.blogspot.com/2008/02/searching-pubmed-with-python.html.

This brief blog describes access via the Bio.EUtils package which seems to
have disappeared, and it took me about 45 mins to realise that it was no
longer in the distro and to track down Bio.Entrez.

Then Googling BioPython Entrez, the first hit took me to the documentation
(I missed spotting the tutorial link!) and all subsequent attempts were
based on reading this doco and the source code, and scratching my head and
trying random things.

>So you see by default, the NCBI is returning HTML. We can ask for XML:
>
>>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
>>>> print handle.readline()
>

This all makes sense now, I wasn't aware of the different 'retmode' options.
The Bio.Entrez.efetch() documentation points me to
http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html, which
doesn't mention the 'retmode' or 'rettype' parameters. In fact I couldn't
find any explicit reference to it in the Tutorial either, just the use of
'rettype=text' in one of the example code snippets.

I subsequently tracked down this page
http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html which
does at least indicate the different rettypes and retmodes available.
 
>You could parse this with Bio.Entrez.read() if you wanted to:
>
>>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
>>>> record = Entrez.read(handle)
>>>> print record
>[{u'MedlineCitation': ... ]

I'm interested in using this format, however I don't understand how to
read/write fields and subtrees of the object type
'Bio.Entrez.Parser.ListElement' returned by Entrez.read(handle) with retmode
XML. 

I'm finding it hard to track down references to this [{u'x':['y']}] object
format in Python, possibly due to the fact that I can't get Google to
search for strings like [{u'. I am however appreciative that there appears
to be a u'SpaceFlightMission' tag in Pubmed's default rettype. :)

I'm also a little confused about why handle.read() returns a string in XML
format whereas Entrez.read(handle) returns the
Bio.Entrez.Parser.ListElement. In fact I only knew about this latter method
from your email, since the example in the Bio.Entrez doco only uses the
handle.read() syntax, and doesn't mention that there's any distinction, nor
which might be more appropriate for which task. 

> Does that help?

Immensely.

If you (or any other Bio.Wizards) have the time and the inclination to help
me further, I'd be very grateful for any thoughts relevant to my ponderings
above.

Thanks again,

Ben




From mjldehoon at yahoo.com  Thu Oct 29 03:50:07 2009
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Wed, 28 Oct 2009 20:50:07 -0700 (PDT)
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <001901ca5846$96f69d60$c4e3d820$@com>
Message-ID: <109726.94290.qm@web62408.mail.re1.yahoo.com>



--- On Wed, 10/28/09, Ben O'Loghlin  wrote:
> >>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
> >>>> record = Entrez.read(handle)
> >>>> print record
> >[{u'MedlineCitation': ... ]
> 
> I'm interested in using this format, however I don't
> understand how to
> read/write fields and subtrees of the object type
> 'Bio.Entrez.Parser.ListElement' returned by
> Entrez.read(handle) with retmode XML. 
> 
> I'm finding it hard to track down references to this
> [{u'x':['y']}] object format in Python ...

Look at the outermost two brackets [].
You can treat this object as a Python list.

So if record = [{u'x':['y']}],
then record[0] = {u'x':['y']}

Now look at the two outermost braces {}.
You can treat record[0] as a dictionary.
So record[0]['x'] will return ['y'].
Which can then be treated as a Python list.
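
The same steps on something shaped like a PubMed record might look as
follows. This is a sketch on hand-written data: only the top-level
'MedlineCitation' key comes from the output shown above, while the nested
keys are assumptions modelled on PubMed XML element names and the values
are made up. On a real record, calling keys() at each level shows what is
actually there:

```python
# Hand-written stand-in for the object returned by Entrez.read(handle).
record = [
    {
        "MedlineCitation": {
            "PMID": "17206916",
            "Article": {
                "ArticleTitle": "A made-up title",
                "Abstract": {"AbstractText": ["A made-up abstract."]},
            },
        }
    }
]

first = record[0]                        # outer [] : a list
citation = first["MedlineCitation"]      # {} : a dictionary
print(sorted(citation.keys()))           # see which fields exist
abstract = citation["Article"]["Abstract"]["AbstractText"][0]
print(abstract)
```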

--Michiel.


      


From dejmail at gmail.com  Thu Oct 29 04:53:32 2009
From: dejmail at gmail.com (Liam Thompson)
Date: Thu, 29 Oct 2009 06:53:32 +0200
Subject: [Biopython] losing information
Message-ID: 

hi everyone

I'm running a simple script to remove genbank records from a GB file
that I have identified as undesirable. The only
problem is that when the script is run, all the annotation info (CDS
etc) for entries is lost, only the sequence and ID is kept.
I was wondering if there is an option I am missing, or if I am using
an incorrect variable type somewhere. I just
can't seem to get all the info written.

from Bio import SeqIO

outhandle = open("HBV_seqs.gb", "w")
inhandle = open("all_hbv_seqs_reannotated.gb", "rU")
newrecords = []
badlist = list(open("deletionrecords.txt", "rU"))
badrecord=[]

for items in badlist:
    # strip() removes the newline without cutting a final unterminated line
    badrecord.append(items.strip())

for record in SeqIO.parse(inhandle, "genbank"):
    if record.name not in badrecord:
        newrecords.append(record)

print "writing records..."
SeqIO.write(newrecords, outhandle, "genbank")
print "writing done"
outhandle.close()


I would appreciate any pointers.

Thanks
Liam


From dalloliogm at gmail.com  Thu Oct 29 09:21:15 2009
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Thu, 29 Oct 2009 10:21:15 +0100
Subject: [Biopython] How to get sequences upstream of TSS of genes?
In-Reply-To: <366c6f340910281520w227bbe88o914f889f693588f0@mail.gmail.com>
References: <366c6f340910151417o77390302n9a8ec3f8b2da8e61@mail.gmail.com> 
	<5aa3b3570910160129g3a5e2f0fr1297db95d779e807@mail.gmail.com> 
	<366c6f340910281520w227bbe88o914f889f693588f0@mail.gmail.com>
Message-ID: <5aa3b3570910290221t289b8e90sa3b722da7e4d5ded@mail.gmail.com>

I suppose it is Flank(Transcript), with upstream=5000 and downstream=5000
-
http://www.ensembl.org/biomart/martview/7675ba9923b086fb5d3a76f753cd5c98/7675ba9923b086fb5d3a76f753cd5c98

it seems you have to execute the query two times, one for upstream and one
for downstream.


On Wed, Oct 28, 2009 at 11:20 PM, Peng Yu  wrote:

> On Fri, Oct 16, 2009 at 3:29 AM, Giovanni Marco Dall'Olio
>  wrote:
> > On Thu, Oct 15, 2009 at 11:17 PM, Peng Yu  wrote:
> >> I have a set of genes. I want to get the 5kb sequence that is upstream
> >> of the TSS's of each gene.
> >
> > You can do that with biomart:
> > -
> http://www.ensembl.org/biomart/martview/a90f00892a48e04d438f762f551bf48a/a90f00892a48e04d438f762f551bf48a
> >
> > select Ensembl56 as database, Mus Musculus as species, go to Filters
> > and fill the 'Id list limit' form to add the required geneIds, then go
> > to Attributes, select Sequences and then check 'Upstream Flank -
> > 5000'.
>
> If I want both 5kb upstream of TSS and .5kb downstream of TSS, is
> there a way to do so?
>
>
> > As for doing that in python, I am not sure there are python interfaces
> > to BioMart. Galaxy (http://main.g2.bx.psu.edu/) is written in python,
> > so they must have written a library for that somewhere, but I don't
> > know their code.
> >
> > If you use R (remember that you can mix python and R with rpy2) there
> > is a nice package in Bioconductor called biomaRt.
> >
> >
> >> I have the following specific questions. Could somebody help me? Thank
> you!
> >>
> >> Which database I can access to get mouse genome?
> >> Give a gene name what function I should call to get the gene's location?
> >> _______________________________________________
> >> Biopython mailing list  -  Biopython at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biopython
> >>
> >
> >
> >
> > --
> > Giovanni Dall'Olio, phd student
> > Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)
> >
> > My blog on bioinformatics: http://bioinfoblog.it
> >
>
>



-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it


From biopython at maubp.freeserve.co.uk  Thu Oct 29 10:13:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 10:13:04 +0000
Subject: [Biopython] losing information
In-Reply-To: 
References: 
Message-ID: <320fb6e00910290313x48c8de1ew37efe425b632fa87@mail.gmail.com>

On Thu, Oct 29, 2009 at 4:53 AM, Liam Thompson  wrote:
> hi everyone
>
> I'm running a simple script to remove genbank records from
> a GB file that I have identified as undesirable. The only
> problem is that when the script is run, all the annotation
> info (CDS etc) for entries is lost, only the sequence and ID
> is kept. I was wondering if there is an option I am missing,
> or if I am using an incorrect variable type somewhere. I just
> can't seem to get all the info written.

I guess since you are losing the CDS features you have an
old version of Biopython. From 1.51 onwards we do write
out the feature table, see:
http://www.biopython.org/wiki/SeqIO#File_Formats

However, using Bio.SeqIO to parse and write GenBank files
is still lossy. References are not (yet) written out for example.

There are alternatives: Internally Bio.SeqIO is using
Bio.GenBank to parse the files, and this offers two parsers,
one giving SeqRecord objects (used by SeqIO), and one
giving GenBank specific Records. This latter parser should
do a better job of preserving the data on output.

That said, I would approach your problem in a very different
way. I would NOT parse the file into objects at all - I would
just loop over the lines, toggling between desired or not,
and outputting the lines for desired records as is. This
assumes your criteria for "desired" is simple to define,
e.g. a list of LOCUS identifiers.

Peter


From biopython at maubp.freeserve.co.uk  Thu Oct 29 10:29:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 10:29:43 +0000
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <001901ca5846$96f69d60$c4e3d820$@com>
References: <01aa01ca5717$dec90220$9c5b0660$@com>
	<320fb6e00910270842p4ae53933k91f5ce16a7b4d7e2@mail.gmail.com>
	<001901ca5846$96f69d60$c4e3d820$@com>
Message-ID: <320fb6e00910290329g78330f61qc5b524b4ba4f5f81@mail.gmail.com>

On Thu, Oct 29, 2009 at 3:19 AM, Ben O'Loghlin  wrote:
>
> Hi Peter,
>
> Many thanks for your post, you cleared up a world of confusion for me.
>
> A few answers/comments:
>
>>> Oh dear - were you working though the Entrez chapter in the Tutorial?
>>> If not, what were you looking at?
>
> No, I didn't find the tutorial until you mentioned it.

Did you look at the Biopython website at all? We do try and highlight
the Tutorial as it is the primary documentation, especially for newcomers.
Perhaps you can suggest how to make it more prominent? A fresh set
of eyes can give useful perspective.

> I came across
> BioPython by Googling "python pubmed", the most relevant hit on the first
> screenful seemed to be the first one, at
> http://baoilleach.blogspot.com/2008/02/searching-pubmed-with-python.html.
>
> This brief blog describes access via the Bio.EUtils package which seems to
> have disappeared, and it took me about 45 mins to realise that it was no
> longer in the distro and to track down Bio.Entrez.

Deprecations are recorded in the DEPRECATED file included with the
source code, the latest version can be viewed here:
http://github.com/biopython/biopython/blob/master/DEPRECATED

The removal of Bio.EUtils happened in Biopython 1.52, and was in this
case also noted in the NEWS file, but not the actual release notice:
http://github.com/biopython/biopython/blob/master/NEWS
http://news.open-bio.org/news/2009/09/biopython-release-152/

> Then Googling BioPython Entrez, the first hit took me to the documentation
> (I missed spotting the tutorial link!) and all subsequent attempts were
> based on reading this doco and the source code, and scratching my head and
> trying random things.

Do you mean the API documentation, available via Python through the help
command and viewable online here:

http://www.biopython.org/DIST/docs/api/Bio.Entrez-module.html

You can probably tell we put more effort into the Tutorial as an introduction
document.

>>So you see by default, the NCBI is returning HTML. We can ask for XML:
>>
>>>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
>>>>> print handle.readline()
>>
>
> This all makes sense now, I wasn't aware of the different 'retmode' options.
> The Bio.Entrez.efetch() documentation points me to
> http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html, which
> doesn't mention the 'retmode' or 'rettype' parameters. In fact I couldn't
> find any explicit reference to it in the Tutorial either, just the use of
> 'rettype=text' in one of the example code snippets.
>
> I subsequently tracked down this page
> http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html which
> does at least indicate the different rettypes and retmodes available.

I agree the NCBI Entrez documentation is very unhelpful to beginners.
We do try and make this easier in our tutorial, but perhaps "retmode"
and "rettype" need to be discussed more in the EFetch section (they
are mentioned a little later in the chapter in the context of other formats).
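
To make the role of the two parameters concrete, here is a rough sketch
(written for Python 3) of how they end up in the request sent to the NCBI.
The helper efetch_url is hypothetical and not part of Biopython, and the
base address is the E-utilities one as I understand it today. Bio.Entrez
passes keyword arguments such as retmode and rettype through as URL
parameters in much the same way.

```python
from urllib.parse import urlencode

# Base address of the NCBI EFetch service (assumed current location).
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(db, record_id, retmode=None, rettype=None):
    """Build an EFetch URL. Hypothetical helper for illustration only."""
    params = {"db": db, "id": record_id}
    if retmode is not None:
        params["retmode"] = retmode   # container format, e.g. xml or text
    if rettype is not None:
        params["rettype"] = rettype   # which view of the record to return
    return EFETCH_URL + "?" + urlencode(params)


# Same PubMed record, once as XML and once as a plain text abstract.
print(efetch_url("pubmed", "17206916", retmode="xml"))
print(efetch_url("pubmed", "17206916", retmode="text", rettype="abstract"))
```

Nothing is downloaded here, the sketch only prints the two addresses so
the effect of each parameter is visible.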

>>You could parse this with Bio.Entrez.read() if you wanted to:
>>
>>>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
>>>>> record = Entrez.read(handle)
>>>>> print record
>>[{u'MedlineCitation': ... ]
>
> I'm interested in using this format, however I don't understand how to
> read/write fields and subtrees of the object type
> 'Bio.Entrez.Parser.ListElement' returned by Entrez.read(handle) with retmode
> XML.
>
> I'm finding it hard to track down references to this [{u'x':['y']}] object
> format in Python, possibly due to the fact that I can't get Google to
> search for strings like [{u'. I am however appreciative that there appears
> to be a u'SpaceFlightMission' tag in Pubmed's default rettype. :)

Michiel has tried to answer this. Are you familiar with the basic Python
datatypes?

> I'm also a little confused about why handle.read() returns a string in XML
> format whereas Entrez.read(handle) returns the
> Bio.Entrez.Parser.ListElement. In fact I only knew about this latter method
> from your email, since the example in the Bio.Entrez doco only uses the
> handle.read() syntax, and doesn't mention that there's any distinction, nor
> which might be more appropriate for which task.

In handle.read(), read is a method of an object called handle, in this
case a handle to a network connection.

In Entrez.read(), read is a function of the Entrez module.
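
The same distinction can be seen with the standard library alone. This is
only an analogy, written for Python 3: io.StringIO stands in for the network
handle and json.load plays the role of Entrez.read, so none of the names
below come from Biopython.

```python
import io
import json

# A file-like object standing in for the network handle.
handle = io.StringIO('[{"x": ["y"]}]')

raw = handle.read()          # method of the handle object: returns raw text
print(type(raw).__name__)

handle.seek(0)               # rewind so the data can be read a second time
parsed = json.load(handle)   # function of the json module: returns objects
print(type(parsed).__name__, parsed[0]["x"])
```

So handle.read() is the right choice when you want the raw text (for
example to save it to a file), while a parsing function is the right choice
when you want to pull out individual fields.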

In Python, xxx.yyy() means either the "yyy" method of object "xxx" (where
"xxx" is a variable), or the "yyy" could be a function or class of the module
"xxx".

>> Does that help?
>
> Immensely.
>
> If you (or any other Bio.Wizards) have the time and the inclination to help
> me further, I'd be very grateful for any thoughts relevant to my ponderings
> above.

I would suggest you read through some Python introductions, and then
go through the Biopython tutorial again. We have to assume our readers
know a bit of Python - and my guess is from your questions that many
of your issues are with Python itself rather than Biopython. But you are
learning :)

Peter


From dejmail at gmail.com  Thu Oct 29 10:52:23 2009
From: dejmail at gmail.com (Liam Thompson)
Date: Thu, 29 Oct 2009 12:52:23 +0200
Subject: [Biopython] losing information
In-Reply-To: <320fb6e00910290313x48c8de1ew37efe425b632fa87@mail.gmail.com>
References: 
	<320fb6e00910290313x48c8de1ew37efe425b632fa87@mail.gmail.com>
Message-ID: 

Hi Peter

Thanks for the helpful reply as always. I upgraded to 1.51 from 1.49,
but it made no difference, the information is still lost. You are right
that it would be better not to write the data to file, and just check
over the file, and I will try to incorporate this into the next few
functions I'm adding.

Let me attempt the Bio.GenBank feature.

Regards
Liam

On Thu, Oct 29, 2009 at 12:13 PM, Peter  wrote:
> On Thu, Oct 29, 2009 at 4:53 AM, Liam Thompson  wrote:
>> hi everyone
>>
>> I'm running a simple script to remove genbank records from
>> a GB file that I have identified as undesirable. The only
>> problem is that when the script is run, all the annotation
>> info (CDS etc) for entries is lost, only the sequence and ID
>> is kept. I was wondering if there is an option I am missing,
>> or if I am using an incorrect variable type somewhere. I just
>> can't seem to get all the info written.
>
> I guess since you are losing the CDS features you have an
> old version of Biopython. From 1.51 onwards we do write
> out the feature table, see:
> http://www.biopython.org/wiki/SeqIO#File_Formats
>
> However, using Bio.SeqIO to parse and write GenBank files
> is still lossy. References are not (yet) written out for example.
>
> There are alternatives: Internally Bio.SeqIO is using
> Bio.GenBank to parse the files, and this offers two parsers,
> one giving SeqRecord objects (used by SeqIO), and one
> giving GenBank specific Records. This latter parser should
> do a better job of preserving the data on output.
>
> That said, I would approach your problem in a very different
> way. I would NOT parse the file into objects at all - I would
> just loop over the lines, toggling between desired or not,
> and outputting the lines for desired records as is. This
> assumes your criteria for "desired" is simple to define,
> e.g. a list of LOCUS identifiers.
>
> Peter
>


From biopython at maubp.freeserve.co.uk  Thu Oct 29 11:07:09 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 11:07:09 +0000
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <320fb6e00910290329g78330f61qc5b524b4ba4f5f81@mail.gmail.com>
References: <01aa01ca5717$dec90220$9c5b0660$@com>
	<320fb6e00910270842p4ae53933k91f5ce16a7b4d7e2@mail.gmail.com>
	<001901ca5846$96f69d60$c4e3d820$@com>
	<320fb6e00910290329g78330f61qc5b524b4ba4f5f81@mail.gmail.com>
Message-ID: <320fb6e00910290407r15e1c7d5h246de938a8229ad@mail.gmail.com>

On Thu, Oct 29, 2009 at 10:29 AM, Peter  wrote:
> On Thu, Oct 29, 2009 at 3:19 AM, Ben O'Loghlin  wrote:
>> This all makes sense now, I wasn't aware of the different 'retmode' options.
>> The Bio.Entrez.efetch() documentation points me to
>> http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html, which
>> doesn't mention the 'retmode' or 'rettype' parameters. In fact I couldn't
>> find any explicit reference to it in the Tutorial either, just the use of
>> 'rettype=text' in one of the example code snippets.
>>
>> I subsequently tracked down this page
>> http://www.ncbi.nlm.nih.gov/corehtml/query/static/efetchlit_help.html which
>> does at least indicate the different rettypes and retmodes available.
>
> I agree the NCBI Entrez documentation is very unhelpful to beginners.
> We do try and make this easier in our tutorial, but perhaps "retmode"
> and "rettype" need to be discussed more on the EFetch section (they
> are mentioned a little later in the chapter in the context of other formats)

I've tried to make the EFetch section of the Biopython tutorial clearer
for the next release - thanks for the feedback.

Peter


From biopython at maubp.freeserve.co.uk  Thu Oct 29 11:09:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 11:09:39 +0000
Subject: [Biopython] losing information
In-Reply-To: 
References: 
	<320fb6e00910290313x48c8de1ew37efe425b632fa87@mail.gmail.com>
	
Message-ID: <320fb6e00910290409s4470ec7ufb15e0556c6d4d89@mail.gmail.com>

On Thu, Oct 29, 2009 at 10:52 AM, Liam Thompson  wrote:
> Hi Peter
>
> Thanks for the helpful reply as always. I upgraded to 1.51 from 1.49,
> but it made no difference, the information is still lost.

That is curious. Could you tell us a specific GenBank record showing
this problem (e.g. an accession number or a URL)?

By the way - Biopython 1.52 has been out for a month, although I
don't recall any major changes in the GenBank support right now.

> You are right that it would be better not to write the data to file, and just
> check over the file, and I will try to incorporate this into the next few
> functions I'm adding.

That would be best I think.

> Let me attempt the Bio.GenBank feature

If you really want to. The API is a bit different to Bio.SeqIO ;)

Peter


From biopython at maubp.freeserve.co.uk  Thu Oct 29 12:15:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 12:15:36 +0000
Subject: [Biopython] losing information
In-Reply-To: 
References: 
	<320fb6e00910290313x48c8de1ew37efe425b632fa87@mail.gmail.com>
	
	<320fb6e00910290409s4470ec7ufb15e0556c6d4d89@mail.gmail.com>
	
Message-ID: <320fb6e00910290515n72c99ec2ye9f3c6ab61361b1e@mail.gmail.com>

On Thu, Oct 29, 2009 at 11:48 AM, Liam Thompson  wrote:
> Hi Peter
>
> There are 2000 records, but they all behave the same way
>
> I have attached 2 files, to show just 2 of them change.
>
> Thanks
> Liam

The mailing list doesn't like attachments, but I got them and
had a look. This is odd. I just tried a conversion using 1.52+
(i.e. the latest code in the repository) with:

from Bio import SeqIO
count = SeqIO.convert("original.txt", "gb", "new.txt", "gb")
print "Converted %i records" % count

or, equivalently for pre-Biopython 1.52 you can use:

from Bio import SeqIO
records = SeqIO.parse(open("original.txt"), "gb")
handle = open("new.txt", "w")
count = SeqIO.write(records, handle,  "gb")
handle.close()
print "Converted %i records" % count

See this blog post introducing the convert function:
http://news.open-bio.org/news/2009/09/biopython-convert-function/

Either way, I am seeing the features preserved (although
some of the qualifiers are in a different order). As I said
before, I thought this would work on 1.51 too - but maybe
I was wrong. Could you upgrade to 1.52 and retry?

Peter


From biopython at maubp.freeserve.co.uk  Thu Oct 29 14:04:20 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 14:04:20 +0000
Subject: [Biopython] losing information
In-Reply-To: <320fb6e00910290313x48c8de1ew37efe425b632fa87@mail.gmail.com>
References: 
	<320fb6e00910290313x48c8de1ew37efe425b632fa87@mail.gmail.com>
Message-ID: <320fb6e00910290704n605aaf4fr56af80e5463eb35c@mail.gmail.com>

On Thu, Oct 29, 2009 at 10:13 AM, Peter  wrote:
> On Thu, Oct 29, 2009 at 4:53 AM, Liam Thompson  wrote:
>> hi everyone
>>
>> I'm running a simple script to remove genbank records from
>> a GB file that I have identified as undesirable. ...
>
> ...
>
> That said, I would approach your problem in a very different
> way. I would NOT parse the file into objects at all - I would
> just loop over the lines, toggling between desired or not,
> and outputting the lines for desired records as is. This
> assumes your criteria for "desired" is simple to define,
> e.g. a list of LOCUS identifiers.

If you can just look at the LOCUS line, this is very easy in
Python (you don't need Biopython at all). It will also be very
fast as there is no complicated parsing and object creation.
e.g.

wanted = set(["AB493847", "AB493848"])
inp_handle = open("original.txt")
out_handle = open("new.txt", "w")
save = False
for line in inp_handle :
    if line.startswith("LOCUS") : #start of record
        save = line.split()[1] in wanted
    if save :
        out_handle.write(line)
    if line.strip() == "//" : #end of record
        save = False
inp_handle.close()
out_handle.close()

I've written this using a set of good record identifiers. If you have a
list of bad records, just switch round the "in" check.
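
For the "bad list" case, the switched-round check might look like the
sketch below (Python 3). The function name filter_genbank and the two
tiny records are made up for illustration. The identifiers are cleaned
with strip(), which is what you would want when reading them one per line
from a text file.

```python
def filter_genbank(lines, unwanted):
    """Yield the lines of every record whose LOCUS name is not unwanted."""
    save = False
    for line in lines:
        if line.startswith("LOCUS"):        # start of record
            save = line.split()[1] not in unwanted
        if save:
            yield line
        if line.strip() == "//":            # end of record
            save = False


# Made-up two record input, just to exercise the function.
example = [
    "LOCUS       AB000001   10 bp\n",
    "ORIGIN\n",
    "//\n",
    "LOCUS       AB000002   10 bp\n",
    "ORIGIN\n",
    "//\n",
]
# Identifiers as they might come from a text file, one per line.
unwanted = set(name.strip() for name in ["AB000002\n"])

kept = list(filter_genbank(example, unwanted))
print("".join(kept), end="")
```

Because the function accepts any iterable of lines, an open file handle
can be passed in place of the example list.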

If you need to access something like the annotation, or the sequence,
then it does make sense to parse the records - but keep a copy of
the raw GenBank record as a string to use for output. One way to
do this is to use StringIO.
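
A rough sketch of that idea follows, written for Python 3 where StringIO
lives in the io module. The generator raw_records and the sample text are
made up for illustration. The Biopython call is left as a comment so the
snippet runs without Biopython installed.

```python
import io


def raw_records(handle):
    """Yield each GenBank record in the handle as one raw string."""
    lines = []
    for line in handle:
        lines.append(line)
        if line.strip() == "//":      # end of record
            yield "".join(lines)
            lines = []


# Made-up input; in practice this would be an open GenBank file.
text = "LOCUS       AB000001\n//\nLOCUS       AB000002\n//\n"

kept = []
for raw in raw_records(io.StringIO(text)):
    # With Biopython installed, the same string could be parsed via
    # SeqIO.read(io.StringIO(raw), "genbank") to check the annotation.
    name = raw.split()[1]
    if name != "AB000002":
        kept.append(raw)              # keep the untouched original text

print("".join(kept), end="")
```

The point of the design is that the parsed object is only used to make
the decision, while the output is always the original text, so nothing
is lost on writing.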

Peter


From bassbabyface at yahoo.com  Thu Oct 29 14:59:45 2009
From: bassbabyface at yahoo.com (Ben O'Loghlin)
Date: Fri, 30 Oct 2009 01:59:45 +1100
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <109726.94290.qm@web62408.mail.re1.yahoo.com>
References: <001901ca5846$96f69d60$c4e3d820$@com>
	<109726.94290.qm@web62408.mail.re1.yahoo.com>
Message-ID: <005001ca58a8$75a41cc0$60ec5640$@com>

Thanks Michiel.

What is the function of the 'u' in the string discussed below? That's the
bit that's got me confused.

Best regards,
Ben

p.s. assistance on this list is fast and useful. Nice!

-----Original Message-----
From: Michiel de Hoon [mailto:mjldehoon at yahoo.com] 
Sent: Thursday, 29 October 2009 2:50 PM
To: 'Peter'; Ben O'Loghlin
Cc: biopython at biopython.org
Subject: Re: [Biopython] Entrez.read return value is typed as a string??



--- On Wed, 10/28/09, Ben O'Loghlin  wrote:
> >>>> handle = Entrez.efetch(db="pubmed", id="17206916", retmode="XML")
> >>>> record = Entrez.read(handle)
> >>>> print record
> >[{u'MedlineCitation': ... ]
> 
> I'm interested in using this format, however I don't
> understand how to
> read/write fields and subtrees of the object type
> 'Bio.Entrez.Parser.ListElement' returned by
> Entrez.read(handle) with retmode XML. 
> 
> I'm finding it hard to track down references to this
> [{u'x':['y']}] object format in Python ...

Look at the outermost two brackets [].
You can treat this object as a Python list.

So if record = [{u'x':['y']}],
then record[0] = {u'x':['y']}

Now look at the two outermost braces {}.
You can treat record[0] as a dictionary.
So record[0]['x'] will return ['y'].
Which can then be treated as a Python list.

--Michiel.


      




From biopython at maubp.freeserve.co.uk  Thu Oct 29 15:37:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 29 Oct 2009 15:37:21 +0000
Subject: [Biopython] Entrez.read return value is typed as a string??
In-Reply-To: <005001ca58a8$75a41cc0$60ec5640$@com>
References: <001901ca5846$96f69d60$c4e3d820$@com>
	<109726.94290.qm@web62408.mail.re1.yahoo.com>
	<005001ca58a8$75a41cc0$60ec5640$@com>
Message-ID: <320fb6e00910290837w5861226dsb0a9f4a9fb4acd1f@mail.gmail.com>

On Thu, Oct 29, 2009 at 2:59 PM, Ben O'Loghlin  wrote:
> Thanks Michiel.
>
> What is the function of the 'u' in the string discussed below?
> That's the bit that's got me confused.
>
> Best regards,
> Ben
>
> p.s. assistance on this list is fast and useful. Nice!

Again, it's a bit of Python basics rather than anything Biopython
specific. The u is for unicode, thus "fred" gives a normal string
while u"fred" gives a unicode string. Unless you are working
with non-ASCII characters (e.g. letters with accents) you
won't have to worry about this. Python 3 gets rid of the dichotomy
by using unicode for all strings.
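
A quick way to convince yourself, shown here as a sketch for Python 3
(version 3.3 or later, where the u prefix is accepted again). The
dictionary is a made-up stand-in for a parsed record.

```python
plain = "fred"
prefixed = u"fred"

# In Python 3 both literals create the same str type and compare equal.
print(type(plain) is type(prefixed), plain == prefixed)

# A key written with the prefix is found by a key written without it.
record = {u"MedlineCitation": "..."}
print(record["MedlineCitation"])
```

For plain ASCII keys like these, the lookup without the prefix also works
on Python 2, which is why the u can usually be ignored when reading the
printed output.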

Peter