[Biopython-dev] test_SeqIO_online failure

Mon May 27 21:38:30 UTC 2013

Hi Peter,

On Mon, May 27, 2013 at 2:05 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hi Philip,
>
> On Mon, May 27, 2013 at 3:27 AM, Phillip Garland <pgarland at gmail.com> wrote:
>> The fasta formatted record is fine, the problem seems to come after
>> requesting and reading the genbank-formatted record for the protein
>> with GI:16130152.
>>
>> It looks like the record was modified a few days ago:
>>
>> LOCUS       NP_416719                367 aa            linear   CON 24-MAY-2013
>>
>> and ends with
>>
>> CONTIG      join(WP_000865568.1:1..367)\n//\n\n'
>>
>> instead of
>>
>> ORIGIN and the sequence data.
>>
>> Is this a problem with the genbank record that should be reported to
>> NCBI, or is SeqIO supposed to handle the record as it is by fetching
>> the sequence from the linked contig, or is the test doing the wrong
>> thing by using rettype="gb" instead of rettype="gbwithparts"?
>
> Interesting - it looks like the NCBI made a change to Entrez and
> where previously this record had included the sequence with
> rettype="gb" now we have to ask for it explicitly with the longer
> rettype="gbwithparts" - my guess is this is now happening on
> more records.
>
> Note it does not affect all records, consider this example in our
> Tutorial which seems unchanged:
>
>   from Bio import Entrez
>   Entrez.email = "A.N.Other at example.com"     # Always tell NCBI who you are
>   handle = Entrez.efetch(db="nucleotide", id="186972394",
> rettype="gb", retmode="text")
>   print handle.read()
>
> Curious.
>
>> Here's the test output:
>>
>> pgarland at cradle:~/Hacking/Source/Biology/biopython/Tests$ python
>> run_tests.py test_SeqIO_online.py
>> Python version: 2.7.5 (default, May 20 2013, 11:51:12)
>> [GCC 4.7.3]
>> Operating system: posix linux2
>> test_SeqIO_online ... FAIL
>> ======================================================================
>> FAIL: test_protein_16130152 (test_SeqIO_online.EntrezTests)
>> Bio.Entrez.efetch(protein, 16130152, ...)
>> ----------------------------------------------------------------------
>> Traceback (most recent call last):
>>   File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py",
>> line 77, in <lambda>
>>     method = lambda x : x.simple(d, f, e, l, c)
>>   File "/home/pgarland/Hacking/Source/Biology/biopython/Tests/test_SeqIO_online.py",
>> line 65, in simple
>>     self.assertEqual(seguid(record.seq), checksum)
>> AssertionError: 'NT/aFiTXyD/7KixizZ9sq2FcniU' != 'fCjcjMFeGIrilHAn6h+yju267lg'
>>
>> ----------------------------------------------------------------------
>> Ran 1 test in 10.010 seconds
>>
>> FAILED (failures = 1)
>
> I'd noticed this on Friday but hadn't looked into why the sequence was
> different (and sometimes Entrez errors are transient). Thanks for
> exploring this :)
>
> Would you like to submit a pull request to update test_SeqIO_online.py
> or should I just go ahead and change the rettype?
>
> It would be sensible to review all the Entrez examples in the Tutorial,
> to perhaps make more use of 'gbwithparts' rather than 'gb'?
>
> Thanks,
>
> Peter

The slight problem with just replacing "gb" with "gbwithparts" is that
SeqIO doesn't take "gbwithparts" as an option for the file format. So
in test_SeqIO_online.py, you have this code:

            handle = Entrez.efetch(db=database, id=entry, rettype=f,
retmode="text")
            record = SeqIO.read(handle, f)

which is a natural way to write the test (because it tests fasta and
genbank files), but will currently fail if f is "gbwithparts", b/c
SeqIO doesn't accept "gbwithparts" as a file format specifier. My
guess is that most existing code hardcodes the rettype and SeqIO file
format specifier, so we could just test for gbwithparts prior to
calling SeqIO.read:

  handle = Entrez.efetch(db=database, id=entry, rettype=f, retmode="text")
            if f == "gbwithparts":
                f = "gb"
            record = SeqIO.read(handle, f)

I submitted a pull request with a minimal patch that does this.

For code like this, it would be cleaner if SeqIO accepted,
"gbwithparts" as an alias for "genbank", just like "gb" is, but I
don't know if it's a common pattern enough to bother.

If records like this are becoming more common, then "gbwithparts"
should be clearly documented in the biopython tutorial, though
"gbwithparts" isn't clearly explained in NCBI's Entrez docs AFAICT. It
seems safer to always use "gbwithparts" at this point, at least when
you want the sequence.

~Phillip