[Biopython-dev] Name problem in BLAST parser?

Peter biopython at maubp.freeserve.co.uk
Mon Aug 4 10:30:46 UTC 2008


> Strangely the plain text BLAST format contains this information three
> times (!), once in the header (for each query) and then again at the
> end of the file in the database report and the parameters "total
> letters" and again as "length of database", e.g.
> http://bugzilla.open-bio.org/attachment.cgi?id=676
>
> ...
> Database: Leigo
>           4,535,438 sequences; 1,573,298,872 total letters
> ...
>  Database: Leigo
>    Posted date:  Jan 22, 2007 11:26 AM
>  Number of letters in database: 1,573,298,872
>  Number of sequences in database:  4,535,438
> ...
> Length of database: 1,573,298,872
> ...

At the suggestion of Leighton (off list) I checked out the -z option
and what this does to the reported database length.

If the -z option is used, only the last of these three values in the
plain text output is changed (tested using standalone BLAST 2.2.18,
which Biopython can parse for single queries).  Using the Biopython
plain text parser, "database_letters" and "num_letters_in_database"
reflect the real database size, while "database_length" reflects the
-z argument (which is used in the statistics).  My naive assumption
that the three values would always be the same has been invalidated.
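
To make the three values concrete, here is a rough sketch that pulls
them out of a plain text report with regular expressions.  It is not
Biopython's parser: the function name, the regular expressions and the
mapping of each report line to a field name are assumptions made for
illustration, and the final "Length of database" figure is a made-up
-z value rather than real output.

```python
import re

# Excerpt modelled on the report quoted above.  The last value is a
# hypothetical -z setting, chosen only to show the fields diverging.
SAMPLE = """\
Database: Leigo
          4,535,438 sequences; 1,573,298,872 total letters
...
  Database: Leigo
    Posted date:  Jan 22, 2007 11:26 AM
  Number of letters in database: 1,573,298,872
  Number of sequences in database:  4,535,438
...
Length of database: 1,000,000
"""

# Assumed mapping from field name to the report line it is read from.
PATTERNS = {
    "database_letters": r"sequences;\s+([\d,]+)\s+total letters",
    "num_letters_in_database": r"Number of letters in database:\s+([\d,]+)",
    "database_length": r"Length of database:\s+([\d,]+)",
}


def database_sizes(report):
    """Return the three database size values found in a report."""
    sizes = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, report)
        # Strip the thousands separators before converting to int.
        sizes[name] = int(match.group(1).replace(",", "")) if match else None
    return sizes


sizes = database_sizes(SAMPLE)
```

With this sample, the first two fields agree on the real database
size and only "database_length" carries the -z value.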

If the -z option is used with XML output, then <Statistics_db-len> is
updated.  As far as I can tell, the "real" database size is not
reported.  This suggests that, to match the old plain text parser, the
field should have been called "database_length" when parsing the XML.
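
For comparison, a minimal sketch of reading that element with the
standard library.  The XML fragment is hand-written and heavily
trimmed, its value is a hypothetical -z setting, and the function name
is made up for this example; it is not how Biopython's XML parser is
implemented.

```python
import xml.etree.ElementTree as ET

# Hand-written, trimmed fragment; real BLAST XML output has many more
# elements around this one.  The value is a hypothetical -z setting.
XML_FRAGMENT = """\
<Iteration_stat>
  <Statistics>
    <Statistics_db-len>1000000</Statistics_db-len>
  </Statistics>
</Iteration_stat>
"""


def statistics_db_len(xml_text):
    """Return the first <Statistics_db-len> value as an int, or None."""
    root = ET.fromstring(xml_text)
    for element in root.iter("Statistics_db-len"):
        return int(element.text)
    return None


# Named to match the plain text parser's "database_length" field.
database_length = statistics_db_len(XML_FRAGMENT)
```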

Peter


