[Biopython-dev] Bio.File

Peter Cock p.j.a.cock at googlemail.com
Thu Sep 8 11:25:17 EDT 2011


On Thu, Sep 8, 2011 at 3:49 PM, Michiel de Hoon <mjldehoon at yahoo.com> wrote:
>
> No, we shouldn't rely on an HTTP return code. The idea is that only
> the parser can know if the output returned by NCBI is valid, as in:
>
> handle = Entrez.efetch(...something...)
> try:
>    record = Entrez.read(handle)
> except Exception:
>    # NCBI returned something invalid, or at least
>    # something that we don't know how to parse

In theory, yes, but quite often parsers look for certain
patterns and if you feed them something else they may
just say "no data". For example, the GenBank parser
ignores anything before the LOCUS line (in order to
cope with the free text header in the large multi-record
files on the NCBI FTP site). As a side effect, you can
give it almost any plain text file and the parser won't
raise an error - it will just say no GenBank records
found.
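
A minimal sketch of that side effect, using a toy stand-in
rather than the real Bio.GenBank code (the function name and
the record handling are hypothetical, for illustration only):

```python
from io import StringIO


def iter_toy_genbank_records(handle):
    """Yield each record as a list of lines (toy stand-in, not Bio.GenBank)."""
    record = None
    for line in handle:
        if line.startswith("LOCUS"):
            # Start of a record; anything seen before this was skipped.
            record = [line]
        elif record is not None:
            record.append(line)
            if line.startswith("//"):  # GenBank record terminator
                yield record
                record = None
        # Lines before the first LOCUS fall through and are silently ignored.


# An error page from a server contains no LOCUS line, so the parser
# reports "no records" instead of raising an exception.
error_page = StringIO("Error: something went wrong\nPlease try again later\n")
records = list(iter_toy_genbank_records(error_page))
print(len(records))  # prints 0
```

The caller cannot distinguish "the server sent an error page"
from "the query matched nothing" here, which is the problem
with leaving all the checking to the parser.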

>> If the server could be relied on to always give an
>> HTTP error code this wouldn't be needed:
>>
>> https://github.com/peterjc/biopython/blob/togows/Bio/TogoWS/__init__.py
>>
>
> I don't like this approach much, as it depends on exactly
> what the error message looks like, and misses any other
> problems, such as incomplete output. There will be a
> certain false positive rate, with return values that pass
> the checking of the first 10 lines but are still unusable.

Yes, in theory the server should detect and handle
errors nicely - but there are sometimes bugs in web
services. I certainly recall getting HTTP return code
200 (OK) with invalid data from both the NCBI and
TogoWS.

> Even worse, the false positive rate can suddenly go up
> if the server maintainers decide to change anything in
> their error messages.

The checks are deliberately designed to avoid false
positives - at the cost of missing some errors early.

> This kind of checking should be
> done by the parser, which can tell you exactly if the
> data are valid, or if not, what is wrong with it.

That isn't always possible, since so many bioinformatics
file formats are so vague that validation is hard.

I accept checking the first 10 lines for common errors
specific to that webservice is inelegant, but it is practical.
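
As a rough sketch of the idea (not the actual TogoWS branch
code; the function name and the error markers below are
hypothetical, and a real list would be specific to each
web service):

```python
from io import StringIO
from itertools import chain, islice

# Hypothetical error markers, chosen narrowly to avoid false positives.
ERROR_PREFIXES = ("Error:", "<html", "<!DOCTYPE")


def check_first_lines(handle, max_lines=10):
    """Peek at the first few lines and raise if they look like an error page.

    Returns an iterator over all the lines (peeked lines included), so the
    parser downstream still sees the complete output.
    """
    head = list(islice(handle, max_lines))
    for line in head:
        if line.lstrip().startswith(ERROR_PREFIXES):
            raise IOError("Server returned an error page: %s" % line.strip())
    # Replay the peeked lines, then continue with the rest of the handle.
    return chain(head, handle)


lines = check_first_lines(StringIO("LOCUS X\nORIGIN\n//\n"))
print("".join(lines), end="")
```

Anything that does not match a known marker is passed through
untouched, so the parser still has the final say. Note this
returns a line iterator rather than a true file handle, which
is fine for parsers that only iterate over lines.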

[Some of those TogoWS checks are probably superfluous
right now, I'm still polishing the error handling - some of
which will rely on TogoWS itself catching more conditions]

Regards,

Peter


