[Biopython] Biopython Digest, Vol 82, Issue 3

Austin Davis-Richardson harekrishna at gmail.com
Wed Oct 7 20:11:03 UTC 2009


I'm confused now.  In the latest version

http://github.com/biopython/biopython/commit/1fff8038e4fa9e2643851a70118e3227ccbea44e

missing values are empty strings, so if I did something like

record = Entrez.read(handle)

myList = []
for item in record:
    # append() is a method call; "append +=" raises an error
    myList.append(item['TaxId'])

myList should be something like:
[ '1234', '2434', '', '9970' ]
where myList[2] is the result of a missing value

However, when I run my script, I find no blank entries, even though I
know that some of the records should have missing values.
This screws things up later when I zip tax IDs with their
corresponding accession numbers:

zip(accessions, taxids)
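
To narrow down where the alignment breaks, a per-batch check along these
lines might help. This is only a sketch: requested_ids and check_batch are
hypothetical names, and plain dicts stand in for the summaries that
Entrez.read() returns.

```python
def check_batch(requested_ids, summaries):
    """Return (requested, returned, blank) counts for one batch.

    Assumption: each summary is dict-like and may carry a 'TaxId' key.
    """
    blank = sum(1 for s in summaries if s.get("TaxId", "") == "")
    return len(requested_ids), len(summaries), blank


# Plain dicts used here in place of real Entrez.read() output.
requested_ids = ["11", "22", "33", "44"]
summaries = [
    {"TaxId": "1234"},
    {"TaxId": "2434"},
    {"TaxId": ""},
    {"TaxId": "9970"},
]
print(check_batch(requested_ids, summaries))
```

If the returned count is smaller than the requested count, records are
being dropped rather than left blank, which would explain why the zip
ends up misaligned.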

I'm all for using '1' (root) or '-1' for missing values.
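
Whatever value is chosen, the substitution could also be done on the
caller's side. A minimal sketch, assuming each summary is dict-like with an
optional 'TaxId' key; MISSING_TAXID and pair_with_taxids are hypothetical
names, not part of Biopython:

```python
MISSING_TAXID = "-1"  # hypothetical sentinel; "1" (root) would also work


def pair_with_taxids(accessions, summaries, missing=MISSING_TAXID):
    """Pair each accession with its TaxId, substituting a sentinel for blanks.

    Raises ValueError if the two inputs differ in length, so a dropped
    record is noticed instead of silently shifting the pairing.
    """
    if len(accessions) != len(summaries):
        raise ValueError(
            "got %d summaries for %d accessions" % (len(summaries), len(accessions))
        )
    pairs = []
    for accession, summary in zip(accessions, summaries):
        # str() copes with both integer-like and string TaxId values.
        taxid = str(summary.get("TaxId", "")).strip()
        pairs.append((accession, taxid if taxid else missing))
    return pairs


# Plain dicts used here in place of real Entrez.read() output.
pairs = pair_with_taxids(
    ["ACC1", "ACC2", "ACC3"],
    [{"TaxId": "1234"}, {"TaxId": ""}, {"TaxId": 9970}],
)
print(pairs)
```

Checking the lengths before zipping matters because zip() silently
truncates to the shorter input.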


2009/10/7  <biopython-request at lists.open-bio.org>:
>
> Today's Topics:
>
>   1. Skipping over blank/erroneous Entrez.esummary() results
>      (Austin Davis-Richardson)
>   2. Re: Skipping over blank/erroneous Entrez.esummary() results
>      (Michiel de Hoon)
>   3. Re: Combine nexus files but not concatenating them (Peter)
>   4. Re: Skipping over blank/erroneous Entrez.esummary() results
>      (Peter)
>   5. Re: Skipping over blank/erroneous Entrez.esummary() results
>      (Brad Chapman)
>   6. Re: Skipping over blank/erroneous Entrez.esummary() results
>      (Michiel de Hoon)
>   7. Re: Skipping over blank/erroneous Entrez.esummary() results
>      (Brad Chapman)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 6 Oct 2009 17:07:52 -0400
> From: Austin Davis-Richardson <harekrishna at gmail.com>
> Subject: [Biopython] Skipping over blank/erroneous Entrez.esummary()
>        results
> To: biopython at lists.open-bio.org
> Message-ID:
>        <d8e68faf0910061407v90f050dw1c16f2f5f97aa697 at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Howdy,
>
> I'm using BioPython to generate a table of accession numbers and their
> corresponding TaxIDs.  The fastest way I can do this is 20 at a time
> (20 per 3 seconds rather than 1 per 3 seconds).
>
> However, this results in a problem.
>
> Whenever my script receives a result from NCBI that is blank, such as
> there being no value for TaxID, BioPython crashes with the error:
>
>  File "taxcollector3.py", line 39, in getTaxID
>    record = Entrez.read(handle)
>  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py",
> line 259, in read
>    record = handler.run(handle)
>  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
> line 90, in run
>    self.parser.ParseFile(handle)
>  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
> line 191, in endElement
>    value = IntegerElement(value)
> ValueError: invalid literal for int() with base 10: ''
>
>
> My code looks like this, where gids is a string of comma-separated GIDs
> (I get the GIDs from the accession numbers using
> Entrez.esearch(db="nucleotide", rettype="text", term=accessions)):
>
> handle = Entrez.esummary(db="nucleotide", id=gids)
> record = Entrez.read(handle)
>
>
> The only solution I can come up with is searching one at a time, but
> this is very slow.  (I have about 300,000 accession numbers)
>
> Does anyone know perhaps a patch or a solution for this?  Or maybe an
> easier way to get a TaxID from an accession number?
>
> Thanks,
> Austin Davis-Richardson
>
>
> ------------------------------
>
> Message: 2
> Date: Tue, 6 Oct 2009 19:11:36 -0700 (PDT)
> From: Michiel de Hoon <mjldehoon at yahoo.com>
> Subject: Re: [Biopython] Skipping over blank/erroneous
>        Entrez.esummary()       results
> To: biopython at lists.open-bio.org,       Austin Davis-Richardson
>        <harekrishna at gmail.com>
> Message-ID: <362834.37683.qm at web62401.mail.re1.yahoo.com>
> Content-Type: text/plain; charset=iso-8859-1
>
> You could try the following (with biopython 1.52):
>
> handle = Entrez.esummary(db="nucleotide", id=gids)
> records = Entrez.parse(handle)
> while True:
>    try:
>        record = records.next()
>    except StopIteration:
>        break
>    except:
>        print "Skipping record"
>
>
> We should probably modify Bio.Entrez so that empty "integer" values are treated correctly.
>
>
> --Michiel.
>
> ------------------------------
>
> Message: 3
> Date: Wed, 7 Oct 2009 10:29:36 +0100
> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: Re: [Biopython] Combine nexus files but not concatenating
>        them
> To: Denzel Li <denzel.dz.li at gmail.com>
> Cc: Biopython Mailing List <biopython at lists.open-bio.org>
> Message-ID:
>        <320fb6e00910070229n1b78542dj82998de13cf7eed7 at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Wed, Oct 7, 2009 at 4:22 AM, Denzel Li <denzel.dz.li at gmail.com> wrote:
>> Hi Peter:
>> Thank you for the help. Both functions work well. By the way, will
>> "standard" datatype or "mixed" datatype be supported in Bio:Nexus:Nexus?
>>
>> Best,
>> Denzel
>
> Hi Denzel,
>
> I CC'd the list - please try to keep replies there.
>
> I'm glad Bio.Nexus is working well for you.
>
> Regarding the finer details of the NEXUS file format and the Biopython
> code, I am not an expert - we need Frank or Cymon to comment. If
> you could give us a couple of examples of what you are asking for it
> would probably be much clearer (to me at least).
>
> Regards,
>
> Peter
>
>
> ------------------------------
>
> Message: 4
> Date: Wed, 7 Oct 2009 12:17:30 +0100
> From: Peter <biopython at maubp.freeserve.co.uk>
> Subject: Re: [Biopython] Skipping over blank/erroneous
>        Entrez.esummary()       results
> To: Michiel de Hoon <mjldehoon at yahoo.com>
> Cc: biopython at lists.open-bio.org,       Austin Davis-Richardson
>        <harekrishna at gmail.com>
> Message-ID:
>        <320fb6e00910070417w26236a62ifece2e2610256609 at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> On Wed, Oct 7, 2009 at 3:11 AM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>>
>> We should probably modify Bio.Entrez so that empty "integer" values are treated correctly.
>>
>
> Does "correctly" mean a default value? I see Brad has just committed a
> change to use -1 in this case, but perhaps None is also a good choice?
> Can we alternatively leave this bit of the data structure empty?
>
> Peter
>
>
> ------------------------------
>
> Message: 5
> Date: Wed, 7 Oct 2009 07:17:37 -0400
> From: Brad Chapman <chapmanb at 50mail.com>
> Subject: Re: [Biopython] Skipping over blank/erroneous
>        Entrez.esummary()       results
> To: Austin Davis-Richardson <harekrishna at gmail.com>
> Cc: biopython at lists.open-bio.org
> Message-ID: <20091007111737.GC84267 at sobchak.mgh.harvard.edu>
> Content-Type: text/plain; charset=us-ascii
>
> Hi Austin;
>
>> I'm using BioPython to generate a table of accession numbers and their
>> corresponding TaxIDs.  The fastest way I can do this is 20 at a time
>> (20 per 3 seconds rather than 1 per 3 seconds).
>>
>> However, this results in a problem.
>>
>> whenever my script receives a result from NCBI that is blank such as
>> there being no value for TaxID, BioPython crashes with the error:
>>
>>   File "taxcollector3.py", line 39, in getTaxID
>>     record = Entrez.read(handle)
>>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py",
>> line 259, in read
>>     record = handler.run(handle)
>>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
>> line 90, in run
>>     self.parser.ParseFile(handle)
>>   File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py",
>> line 191, in endElement
>>     value = IntegerElement(value)
>> ValueError: invalid literal for int() with base 10: ''
>
> In addition to Michiel's workaround, I checked in a small change
> which could at least circumvent the error you are reporting:
>
> http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279
>
> It affects only one file, so if you don't want to pull the latest
> from GitHub, you can download just that file and replace it in your
> Biopython library:
>
> http://github.com/biopython/biopython/blob/master/Bio/Entrez/Parser.py
>
> Ideally, we should have a test case to cover this. Could you let us
> know specific GIs that are causing the problem? The group of 20 is
> fine if you haven't narrowed it further than that. This'll also help
> us check if there are any other problems with these records.
>
> Thanks for reporting this,
> Brad
>
>
> ------------------------------
>
> Message: 6
> Date: Wed, 7 Oct 2009 05:19:01 -0700 (PDT)
> From: Michiel de Hoon <mjldehoon at yahoo.com>
> Subject: Re: [Biopython] Skipping over blank/erroneous
>        Entrez.esummary()       results
> To: Austin Davis-Richardson <harekrishna at gmail.com>,    Brad Chapman
>        <chapmanb at 50mail.com>
> Cc: biopython at lists.open-bio.org
> Message-ID: <826538.32828.qm at web62406.mail.re1.yahoo.com>
> Content-Type: text/plain; charset=iso-8859-1
>
>> In addition to Michiel's workaround, I checked in a small
>> change
>> which could at least circumvent the error you are
>> reporting:
>>
>> http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279
>
> Sorry, but that change introduces two bugs. First, we should be able to distinguish between -1 and missing values. More importantly, we want to be able to add attributes to value. Since -1 is an integer instead of an object, it won't allow that.
>
> Can you revert this change?
>
> --Michiel
>
> ------------------------------
>
> Message: 7
> Date: Wed, 7 Oct 2009 08:32:27 -0400
> From: Brad Chapman <chapmanb at 50mail.com>
> Subject: Re: [Biopython] Skipping over blank/erroneous
>        Entrez.esummary()       results
> To: Michiel de Hoon <mjldehoon at yahoo.com>
> Cc: Austin Davis-Richardson <harekrishna at gmail.com>,
>        biopython at lists.open-bio.org
> Message-ID: <20091007123227.GD84267 at sobchak.mgh.harvard.edu>
> Content-Type: text/plain; charset=us-ascii
>
> Peter and Michiel;
>
>> > In addition to Michiel's workaround, I checked in a small
>> > change which could at least circumvent the error you are
>> > reporting:
>> >
>> > http://github.com/biopython/biopython/commit/4dca8a24f62a1c28556d4e58f34db66f4b099279
>
> Peter:
>> Does "correctly" mean a default value? I see Brad has just committed a
>> change to use -1 in this case, but perhaps None is also a good choice?
>> Can we alternatively leave this bit of the data structure empty?
>
> Michiel:
>> Sorry, but that change introduces two bugs. First, we should be able
>> to distinguish between -1 and missing values. More importantly, we
>> want to be able to add attributes to value. Since -1 is an integer
>> instead of an object, it won't allow that.
>>
>> Can you revert this change?
>
> Thanks guys -- not the best choice. How do you feel about just passing
> it along as an empty string and only doing the integer conversion if we
> actually have data to convert?
>
> http://github.com/biopython/biopython/commit/1fff8038e4fa9e2643851a70118e3227ccbea44e
>
> So now missing values are empty strings, as passed, instead of any
> sort of integer interpretation of them.
>
> Brad
>
>
> ------------------------------
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>
> End of Biopython Digest, Vol 82, Issue 3
> ****************************************
>



-- 
AGDR



