[Biopython-dev] my latest trick
Andrew Dalke
adalke at mindspring.com
Fri Jan 18 07:36:31 EST 2002
The FormatIO system now supports BLAST and WU-BLAST, although it
isn't yet finished (missing a few details, like locations and the
sequence name). Here's what it looks like
First, autodetection of the file format
>>> import Bio
>>> format = Bio.formats["search"].identify("blastp.2.wu")
>>> format.name
'wu-blastp'
>>>
It isn't figuring things out from the extension - it's testing
the contents of the file. Here's proof.
>>> import os
>>> os.symlink("blastp.2.wu", "unknown.dat")
>>> Bio.formats["search"].identify("unknown.dat").name
'wu-blastp'
>>>
And I can load this into memory as a 'Search' object.
>>> from Bio import Search
>>> search = Search.io.read("unknown.dat").next()
# The 'next' is because the FormatIO system assumes multiple
# records. I have an idea of how to fix that.
>>> search.query.description
'YEL060C vacuolar protease B'
>>> search.algorithm.name
'BLASTP'
>>> len(search.hits)
12
>>>
>>> for hit in search.hits:
... print hit.name, hit.description
... print len(hit.hsps), "hsps"
... for hsp in hit.hsps:
... print hsp.query.seq
... print hsp.homology.seq
... print hsp.subject.seq
...
(Lots printed out)
Many of the fields are stored, like
>>> for k, v in search.statistics.items():
... print k, "=", repr(v)
...
total_time = ' 0.95u 0.08s 1.03t Elapsed: 00:00:01'
num_threads = 1
posted_date = ' 3:27 PM PST Mar 9, 1998'
start_time = 'Mon Mar 9 15:57:59 1998'
num_dfa_states = ' 569 (112 KB)'
total_dfa_size = ' 358 KB (384 KB)'
database = ' /tmpblast/PDBUNIQ'
release_date = ' unknown'
format = ' BLAST'
num_sequences_in_database = 2335
num_sequences_satisying_E = 12
end_time = 'Mon Mar 9 15:58:00 1998'
neighborhood_generation_time = ' 0.01u 0.00s 0.01t Elapsed: 00:00:00'
num_letters_in_database = 479690
search_cpu_time = ' 0.90u 0.05s 0.95t Elapsed: 00:00:01'
database_title = ' PDBUNIQ'
>>>
>>> hsp.query.positives
56
>>> hsp.query.frac_positives
0.45528455284552843
>>>
Normalization is still a problem, as you can see from the untrimmed
strings. And I don't quite get everything .. it's pretty tedious.
(There's a few fields in the hit I also don't handle, like what do
I do with 'P(2)' compared to 'P'? Someone with a better technical
understanding of the details of the algorithms needs to help me.
Perhaps in Tuscon.)
Biggest thing missing is failure cases. The data files I found
were all for successful runs.
The format expressions get hairy. The biggest problems occur when
formatY is almost like formatX except for a small change in the
bottom of the expression tree. Then the whole tree needs to be
reconstructed, which is noisy. I'm thinking about possibilities
like
blastn = Martel.replace_group(blastp, "hsp_info", blastn_hsp_info)
All these tree editing methods are ad hoc. I keep wondering what
it would be like to convert the tree to a DOM then use the DOM
methods to manipulate the structure. That'll have to wait for
Martel version 2.
Andrew
dalke at dalkescientific.com
More information about the Biopython-dev
mailing list