[Biopython-dev] SearchIO, was: PEP8 lower case module names?

Wibowo Arindrarto w.arindrarto at gmail.com
Tue Dec 4 13:33:32 UTC 2012


Hi Peter and everyone,

>> I've started work on SearchIO indexing of BGZF files now;
>> enabling it was quite simple (the same code as used for
>> the SeqIO indexing):
>> https://github.com/biopython/biopython/commit/cf063bf6a2dca4d534d00699310548e43bf2e14f
>>
>> Thus far I've only tested this with BLAST XML, but that did
>> require a bit of reworking to avoid doing file offset arithmetic:
>> https://github.com/biopython/biopython/commit/600b231a1817035141c8de80e5689dcfd31290b5
>>
>> I will resume this work later this afternoon, going over all
>> the SearchIO file formats one by one.

Yes, the original indexing code that I wrote did have some less
straightforward arithmetic, as I was trying to adhere to the strict XML
definition (i.e. no matter the whitespace outside of the start and end
elements, indexing will still work). But line-based indexing should work
too (and is simpler) so long as BLAST XML keeps its style (and any later
user modification doesn't introduce any unusual whitespace).
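
To make the line-based idea concrete, here is a minimal sketch using only
the standard library. It is not the actual SearchIO indexer: the function
name index_iterations and the sample data are made up for illustration,
and it assumes each <Iteration> start and end tag sits on its own line.
The point is that every offset comes from handle.tell(), with no offset
arithmetic.

```python
import io


def index_iterations(handle, start=b"<Iteration>", end=b"</Iteration>"):
    """Yield (offset, raw_bytes) for each record in a binary handle.

    Hypothetical helper: offsets are taken from handle.tell() before
    each readline(), never computed by adding lengths together.
    """
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.strip() != start:
            continue  # not the start of a record, keep scanning
        chunks = [line]
        while True:
            line = handle.readline()
            if not line:
                raise ValueError("unexpected end of file inside a record")
            chunks.append(line)
            if line.strip() == end:
                break
        yield offset, b"".join(chunks)


# Made-up sample data standing in for a BLAST XML file.
data = (
    b"<root>\n"
    b"<Iteration>\n  <a>1</a>\n</Iteration>\n"
    b"<Iteration>\n  <a>2</a>\n</Iteration>\n"
    b"</root>\n"
)
records = list(index_iterations(io.BytesIO(data)))
```

Since only tell() is used, the same loop should in principle carry over
to a handle whose offsets are not plain byte counts, which is the case
that matters for BGZF.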

> I've refactored test_SearchIO_index.py to make adding
> additional get_raw tests easier. Proper testing of all the
> formats with BGZF will need some larger test files (over 64k
> before compression) which we probably don't want to
> include in the repository.
>
> However, I also added code to additionally test
> Bio.SearchIO.index_db(...).get_raw(...) as well as your
> original testing of Bio.SearchIO.index(...).get_raw(...)
> alone. These should return the exact same string, and
> that is now working nicely for BLAST XML (and BGZF
> from limited testing), but not on all the formats.
>
> Could you look at the difference in get_raw and the
> record length found during indexing for: blast-tab
> (with comments), hmmscan3-domtab, hmmer3-tab,
> and hmmer3-text?
>
> i.e. Anything where test_SearchIO_index.py is now
> printing a WARNING line when run.

Sure :). Based on a quick initial look, these seem to be due to filler
text (e.g. the BLAST tab format ending with lines like
"# BLAST processed 3 queries"). This text doesn't affect the calculation
results or the values of our objects, but it does add to the text length.
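
A small sketch of how such a mismatch could arise, assuming the filler
line is indeed the cause. The helper record_lengths and the sample data
are made up for illustration and are not the real SearchIO code. It
counts, per record, the length up to the next record start (or end of
file) and the length with the trailing filler line left out.

```python
import io


def record_lengths(handle, start_prefix=b"# BLASTN",
                   filler_prefix=b"# BLAST processed"):
    """Return (offset, span_length, raw_length) tuples, one per record.

    Hypothetical helper: span_length counts every line up to the next
    record start (or end of file); raw_length skips filler lines.
    """
    records = []  # each entry: [offset, span_length, raw_length]
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break  # end of file
        if line.startswith(start_prefix):
            records.append([offset, 0, 0])
        if not records:
            continue  # ignore anything before the first record
        records[-1][1] += len(line)
        if not line.startswith(filler_prefix):
            records[-1][2] += len(line)
    return [tuple(r) for r in records]


# Made-up commented BLAST tabular output with a trailing filler line.
filler = b"# BLAST processed 2 queries\n"
data = (
    b"# BLASTN 2.2.26+\n# Query: q1\n# 1 hits found\nq1\ts1\t99.0\n"
    b"# BLASTN 2.2.26+\n# Query: q2\n# 1 hits found\nq2\ts2\t98.0\n"
    + filler
)
lengths = record_lengths(io.BytesIO(data))
```

In this sketch only the last record differs, and the difference is
exactly the length of the filler line.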

regards,
Bow


