[Biopython] Skipping over blank/erroneous Entrez.esummary() results

Tue Oct 6 21:07:52 UTC 2009

Howdy,

I'm using Biopython to generate a table of accession numbers and their
corresponding TaxIDs.  The fastest way I can do this is to request 20
IDs at a time (20 per 3 seconds rather than 1 per 3 seconds).
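
For reference, the batching step can be sketched roughly as follows.
This is an illustrative helper, not the actual script: the name
batch_ids and the default size of 20 are assumptions, and it only
builds the comma-separated strings without contacting NCBI.

```python
def batch_ids(ids, size=20):
    """Yield comma-separated strings of at most `size` IDs each.

    Hypothetical helper: the name and the default size are illustrative.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(ids), size):
        # Convert each ID to str so integer GIDs can be joined as well.
        yield ",".join(str(i) for i in ids[start:start + size])
```

Each yielded string could then be passed as the id argument of
Entrez.esummary, with a pause between requests.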

However, this results in a problem.

Whenever my script receives a result from NCBI that contains a blank
field, such as a record with no value for TaxID, Biopython crashes
with the error:

  File "taxcollector3.py", line 39, in getTaxID
    record = Entrez.read(handle)
  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/__init__.py", line 259, in read
    record = handler.run(handle)
  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", line 90, in run
    self.parser.ParseFile(handle)
  File "/Users/audy/Downloads/biopython-1.52/build/lib.macosx-10.6-universal-2.6/Bio/Entrez/Parser.py", line 191, in endElement
    value = IntegerElement(value)
ValueError: invalid literal for int() with base 10: ''

My code looks like this, where gids is a string of comma-separated GIDs
(I get the GIDs from the accession numbers using
Entrez.esearch(db="nucleotide", rettype="text", term=accessions)):

    handle = Entrez.esummary(db="nucleotide", id=gids)
    record = Entrez.read(handle)

The only workaround I can come up with is querying one ID at a time,
but this is very slow (I have about 300,000 accession numbers).
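
A possible middle ground, sketched here and untested against NCBI, is
to keep the batches of 20 and split a batch in half only when parsing
it fails, so that just the offending IDs end up being queried alone.
The names summarize_with_fallback and entrez_fetch are hypothetical,
and the sketch assumes the failure surfaces as the ValueError shown in
the traceback above.

```python
def summarize_with_fallback(ids, fetch):
    """Return (records, failed_ids) for a list of IDs.

    `fetch` is any callable that takes a list of IDs and returns a list
    of records, raising ValueError when the batch cannot be parsed.
    """
    try:
        return list(fetch(ids)), []
    except ValueError:
        if len(ids) <= 1:
            # A single ID that still fails is recorded and skipped.
            return [], list(ids)
        # Otherwise retry each half separately to isolate the bad IDs.
        mid = len(ids) // 2
        left_ok, left_bad = summarize_with_fallback(ids[:mid], fetch)
        right_ok, right_bad = summarize_with_fallback(ids[mid:], fetch)
        return left_ok + right_ok, left_bad + right_bad


def entrez_fetch(ids):
    """Hypothetical adapter around the two Entrez calls shown above."""
    from Bio import Entrez  # imported here so the helper above stays dependency-free

    handle = Entrez.esummary(db="nucleotide", id=",".join(ids))
    try:
        return Entrez.read(handle)
    finally:
        handle.close()
```

Calling summarize_with_fallback(batch, entrez_fetch) on each batch
would return the records that parsed plus a list of IDs to look at by
hand.  Every split costs extra requests, so the 3-second delay between
requests would still need to be respected inside fetch.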

Does anyone know of a patch or a workaround for this?  Or maybe an
easier way to get a TaxID from an accession number?

Thanks,
Austin Davis-Richardson