[Biopython-dev] Name problem in BLAST parser?

Mon Aug 4 05:49:50 EDT 2008

On Sun, Aug 3, 2008 at 11:35 PM, Sebastian Bassi <sbassi at gmail.com> wrote:
> Hello,
>
> >>> from Bio.Blast import NCBIXML
> >>> blast_records = NCBIXML.parse(res)
> >>> record = blast_records.next()
> >>> record.database_length
> >>> record.num_letters_in_database
> 39588516
>
> So if we are going to retrieve the database length field, why call it
> num_letters_in_database? I guess that the reply is: This field is
> called '<Statistics_db-len>' in the XML but most people know it as
> 'Number of letters in database' as it is displayed in the HTML BLAST
> output.

Good question.  I think the name was picked in the plain text parser,
and maintained in the XML parser.  However, things are more
complicated...

Strangely, the plain text BLAST format contains this information three
times (!): once in the header (for each query) as "total letters", then
again at the end of the file in the database report as "Number of
letters in database", and once more in the parameters as "Length of
database", e.g.
http://bugzilla.open-bio.org/attachment.cgi?id=676

...
Database: Leigo
           4,535,438 sequences; 1,573,298,872 total letters
...
  Database: Leigo
    Posted date:  Jan 22, 2007 11:26 AM
  Number of letters in database: 1,573,298,872
  Number of sequences in database:  4,535,438
...
Length of database: 1,573,298,872
...
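
To make the triplication concrete, here is a minimal sketch that pulls
all three numbers out of the excerpt above with regular expressions and
checks that they agree.  The patterns and the find_database_sizes helper
are assumptions written for this illustration only; this is not how the
Biopython plain text parser is implemented.

```python
import re

# The excerpt quoted above, with the elided parts left as "...".
excerpt = """\
Database: Leigo
           4,535,438 sequences; 1,573,298,872 total letters
...
  Database: Leigo
    Posted date:  Jan 22, 2007 11:26 AM
  Number of letters in database: 1,573,298,872
  Number of sequences in database:  4,535,438
...
Length of database: 1,573,298,872
"""

# Hypothetical patterns, keyed by the attribute name each value ends up in.
PATTERNS = {
    "database_letters": r"sequences;\s+([\d,]+)\s+total letters",
    "num_letters_in_database": r"Number of letters in database:\s+([\d,]+)",
    "database_length": r"Length of database:\s+([\d,]+)",
}


def find_database_sizes(text):
    """Return a dict of attribute name -> integer for each pattern found."""
    sizes = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        if match:
            # Strip the thousands separators before converting.
            sizes[name] = int(match.group(1).replace(",", ""))
    return sizes


sizes = find_database_sizes(excerpt)
# One distinct value means the three places agree.
all_agree = len(set(sizes.values())) == 1
```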

The Bio.Blast.Record.Header class defines "database_letters" (this is
repeated for every query), Bio.Blast.Record.DatabaseReport defines
"num_letters_in_database", and Bio.Blast.Record.Parameters defines
"database_length" (where the names reflect the NCBI strings).  The
record class, Bio.Blast.Record.Blast, inherits from all three, so it
ends up with "database_letters", "database_length" and
"num_letters_in_database" (all coming from different bits of a plain
text BLAST file).  I am assuming that these three numbers should agree,
but the design allows for the possibility that they do not (I would
have used a single name and checked that they were the same).
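
As a rough sketch of that design, the classes below are simplified
stand-ins for the three classes just described and for the record class
that combines them; the real classes carry many more attributes.  The
database_size helper is hypothetical and only illustrates the "single
name, checked for agreement" alternative.

```python
class Header:
    def __init__(self):
        # Filled from the "... total letters" line in each query's header.
        self.database_letters = None


class DatabaseReport:
    def __init__(self):
        # Filled from "Number of letters in database: ...".
        self.num_letters_in_database = None


class Parameters:
    def __init__(self):
        # Filled from "Length of database: ...".
        self.database_length = None


class Record(Header, DatabaseReport, Parameters):
    """Stand-in record class that inherits all three attributes."""

    def __init__(self):
        Header.__init__(self)
        DatabaseReport.__init__(self)
        Parameters.__init__(self)


def database_size(record):
    """Hypothetical helper: return the one agreed value, or None if unset.

    Raises ValueError if the attributes that are filled in disagree.
    """
    values = set(
        value
        for value in (
            record.database_letters,
            record.num_letters_in_database,
            record.database_length,
        )
        if value is not None
    )
    if len(values) > 1:
        raise ValueError("Inconsistent database sizes: %r" % sorted(values))
    return values.pop() if values else None
```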

From a quick check, in the XML output the database length is found
only in the statistics block (repeated for each query), as you stated,
called '<Statistics_db-len>'.  As this is per-query, the closest match
to the original trio is the one in each query's header,
"database_letters", but the initial XML parser instead mapped it to
"num_letters_in_database".
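
For illustration, here is a small sketch that reads that element with
the standard library rather than with NCBIXML.  The XML fragment is a
cut-down, hand-written example (a real BLAST XML file has many more
elements), and database_length_per_query is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Hand-written minimal fragment; only the element discussed here is shown.
fragment = """\
<Iteration>
  <Iteration_stat>
    <Statistics>
      <Statistics_db-len>39588516</Statistics_db-len>
    </Statistics>
  </Iteration_stat>
</Iteration>
"""


def database_length_per_query(xml_text):
    """Return one database length per statistics block found in the XML."""
    root = ET.fromstring(xml_text)
    return [int(element.text) for element in root.iter("Statistics_db-len")]


lengths = database_length_per_query(fragment)
```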

> That's OK, but why have an empty "database_length" attribute?
> I am thinking of two solutions for this:
> 1) Just delete the "database_length" attribute.
> 2) Make "database_length" another name for "num_letters_in_database".
> Maybe there is another solution that I am not aware of.

Regarding idea (2), as the plain text parser fills in all three of
"num_letters_in_database", "database_length" and "database_letters"
(from different parts of the file), I think for consistency one could
argue that the XML parser should also fill in all three!  On the other
hand, having the same information in three places is crazy and
un-pythonic.

In the long run perhaps we should deprecate the "database_length" and
"database_letters" attributes of the record class (and make the plain
text parser just check that all three agree)?  This is a variation on
your idea (1).
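
One possible shape for such a deprecation, as a sketch only: keep
"num_letters_in_database" as the real attribute and turn the other two
names into properties that warn and forward to it.  The Record class
here is a stand-in and _deprecated_alias is a hypothetical helper;
neither is existing Biopython code.

```python
import warnings


def _deprecated_alias(old_name, new_name):
    """Build a property that forwards old_name to new_name with a warning."""

    def getter(self):
        warnings.warn(
            "%s is deprecated; use %s instead" % (old_name, new_name),
            DeprecationWarning,
            stacklevel=2,
        )
        return getattr(self, new_name)

    def setter(self, value):
        warnings.warn(
            "%s is deprecated; use %s instead" % (old_name, new_name),
            DeprecationWarning,
            stacklevel=2,
        )
        setattr(self, new_name, value)

    return property(getter, setter)


class Record(object):
    """Stand-in record class with a single real attribute."""

    database_length = _deprecated_alias(
        "database_length", "num_letters_in_database"
    )
    database_letters = _deprecated_alias(
        "database_letters", "num_letters_in_database"
    )

    def __init__(self):
        self.num_letters_in_database = None
```

With this in place, old code reading record.database_length keeps
working but sees a DeprecationWarning pointing at the preferred name.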

Peter