[Biopython-dev] [Bug 2176] XML Blast parser: miscellaneous bug fixes and cleanup

bugzilla-daemon at portal.open-bio.org
Wed Oct 22 16:28:48 UTC 2008


http://bugzilla.open-bio.org/show_bug.cgi?id=2176





------- Comment #13 from biopython-bugzilla at maubp.freeserve.co.uk  2008-10-22 12:28 EST -------
Database Length
===============
I wanted to record my notes on this based on findings reported on the mailing
list.  See this thread:

http://lists.open-bio.org/pipermail/biopython-dev/2008-August/004101.html

The plain text BLAST format reports the database length three times (!): once
in the header (repeated for each query) as "total letters", and twice more at
the end of the file, in the database report as "Number of letters in database"
and in the parameters as "Length of database", e.g.
http://bugzilla.open-bio.org/attachment.cgi?id=676

...
Database: Leigo
          4,535,438 sequences; 1,573,298,872 total letters
...
 Database: Leigo
   Posted date:  Jan 22, 2007 11:26 AM
 Number of letters in database: 1,573,298,872
 Number of sequences in database:  4,535,438
...
Length of database: 1,573,298,872
...
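
As a rough illustration of where the three numbers sit, here is a minimal
sketch that pulls them out of the excerpt above with regular expressions.
The patterns and the function name are assumptions made up for this example
and are not how the Biopython plain text parser works; the dictionary keys
simply reuse the Biopython attribute names discussed below.

```python
import re

# The excerpt quoted above, with the elided ("...") lines dropped.
REPORT = """\
Database: Leigo
          4,535,438 sequences; 1,573,298,872 total letters
 Database: Leigo
   Posted date:  Jan 22, 2007 11:26 AM
 Number of letters in database: 1,573,298,872
 Number of sequences in database:  4,535,438
Length of database: 1,573,298,872
"""

# Hypothetical patterns, one per place the database size is reported.
PATTERNS = {
    "database_letters": r"sequences;\s+([\d,]+) total letters",
    "num_letters_in_database": r"Number of letters in database:\s+([\d,]+)",
    "database_length": r"Length of database:\s+([\d,]+)",
}


def find_database_sizes(text):
    """Return a dict mapping each name to an int, or None if not found."""
    sizes = {}
    for name, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        # Strip the thousands separators before converting to an integer.
        sizes[name] = int(match.group(1).replace(",", "")) if match else None
    return sizes


sizes = find_database_sizes(REPORT)
print(sizes)
```

Without -z all three values agree, which is why the duplication is easy to
overlook.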

The Bio.Blast.Record.Header class defines "database_letters" (repeated for
every query), Bio.Blast.Record.DatabaseReport defines
"num_letters_in_database", and Bio.Blast.Record.Parameters defines
"database_length" (the names reflect the NCBI strings).  The
Bio.Blast.Record.Blast class inherits from all three, so it ends up with
"database_letters", "database_length" and "num_letters_in_database" (each
coming from a different part of a plain text BLAST file).
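
The inheritance described above can be sketched with simplified stand-in
classes.  These are not the Biopython classes themselves: the class bodies,
the None defaults and the name CombinedRecord are assumptions for
illustration only.

```python
class Header:
    def __init__(self):
        # Filled from the per-query header ("... total letters").
        self.database_letters = None


class DatabaseReport:
    def __init__(self):
        # Filled from "Number of letters in database" in the database report.
        self.num_letters_in_database = None


class Parameters:
    def __init__(self):
        # Filled from "Length of database" in the parameters section.
        self.database_length = None


class CombinedRecord(Header, DatabaseReport, Parameters):
    """Hypothetical stand-in for a record that inherits from all three."""

    def __init__(self):
        Header.__init__(self)
        DatabaseReport.__init__(self)
        Parameters.__init__(self)


record = CombinedRecord()
size_attributes = sorted(vars(record))
print(size_attributes)
```

A single record object therefore carries three differently named attributes
for what looks like the same quantity.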

If the -z option is used, only the last of these three values in the plain
text output changes (tested using standalone BLAST 2.2.18, which Biopython
can parse for single queries).  With the Biopython plain text parser,
"database_letters" and "num_letters_in_database" reflect the real database
size, while "database_length" reflects the -z argument (the value used in the
statistics).

If the -z option is used with XML output, then <Statistics_db-len> is updated. 
As far as I can tell, the "real" database size is not reported.  The XML parser
stores this as "num_letters_in_database".
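
For reference, a minimal sketch of reading <Statistics_db-len> directly with
the standard library.  The XML fragment is a trimmed, made-up example (the
value 1000000 stands in for a hypothetical -z argument) and read_db_len is
an illustrative helper, not part of Biopython.

```python
import xml.etree.ElementTree as ET

# Trimmed, illustrative fragment; real BLAST XML output has many more
# elements around and inside <Statistics>.
XML_FRAGMENT = """\
<Iteration_stat>
  <Statistics>
    <Statistics_db-num>4535438</Statistics_db-num>
    <Statistics_db-len>1000000</Statistics_db-len>
  </Statistics>
</Iteration_stat>
"""


def read_db_len(xml_text):
    """Return the first Statistics_db-len value as an int, or None."""
    root = ET.fromstring(xml_text)
    element = next(root.iter("Statistics_db-len"), None)
    if element is None or element.text is None:
        return None
    return int(element.text)


db_len = read_db_len(XML_FRAGMENT)
print(db_len)
```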

So from plain text BLAST we have two pieces of information,

actual database size - "database_letters" and "num_letters_in_database"
specified database size - "database_length"

While for XML BLAST we only get one piece of information,

specified database size - "num_letters_in_database"
while "database_letters" and "database_length" default to None.

This is a horrid mess.  In the short term I propose the XML parser also record
the specified database size as "database_length", and perhaps also as
"database_letters" which would facilitate anyone trying to migrate a script
from the plain text parser to the XML parser.
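
A minimal sketch of what this short term proposal could look like, assuming
a record object that already has "num_letters_in_database" set by the XML
parser.  The helper mirror_database_size and its also_database_letters flag
are hypothetical, and SimpleNamespace is only a stand-in for a real record.

```python
from types import SimpleNamespace


def mirror_database_size(record, also_database_letters=True):
    """Copy the specified database size into the plain text parser's names.

    Attributes that already hold a value are left untouched.
    """
    specified = record.num_letters_in_database
    if getattr(record, "database_length", None) is None:
        record.database_length = specified
    if also_database_letters and getattr(record, "database_letters", None) is None:
        record.database_letters = specified
    return record


# Stand-in for a record produced from XML output: only one attribute is set.
xml_record = SimpleNamespace(
    num_letters_in_database=1000000,  # hypothetical -z value
    database_letters=None,
    database_length=None,
)
mirror_database_size(xml_record)
print(xml_record.database_length, xml_record.database_letters)
```

Note that with this approach "database_letters" would hold the specified size
for XML output but the actual size for plain text output, so the two parsers
would still differ whenever -z is used.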




