[Biojava-l] Parsing a BLAST file
Keith James
kdj@sanger.ac.uk
05 Nov 2001 10:16:44 +0000
>>>>> "David" == David Waring <dwaring@u.washington.edu> writes:
[...]
David> This parses the blast file and builds
David> SequenceDBSearchResults into a list. It is a little bit
David> more complicated than that really. But this complication
David> gives very great functionality. The SearchResultBuilder
David> must have two things that you might not expect, a
David> SequenceDB with all the query sequences that blast was
David> called with, and a SequenceDBInstallation which contains a
David> SequenceDB with the same name as that found in the blast
David> output file; in the demo this is 'genome'. With these
David> things in place you can get both the subject, and query
David> sequences of any hit from the SequenceDBSearchResult. I
David> included a little sample of how to do this below since it
David> is not in the demo.
David> But, you say, this is a blast against some foreign
David> database. How can I have a SequenceDB with all this
David> data? The truth is you do not really need it. You just need
David> an empty SequenceDB with the correct name inside your
David> SequenceDBInstallation. But then of course you can not get
David> the subject sequences from the search result.
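The empty-database trick can be sketched without BioJava itself. The
types below (SeqDb, SeqDbInstallation, resolveSubject) and the hit ID
"hit1" are hypothetical stand-ins invented for illustration, not the
real BioJava interfaces. The sketch only assumes that a builder looks
up the subject database by the name found in the search output and
then asks it for the hit sequence.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a named sequence database (not the BioJava interface).
class SeqDb {
    private final String name;
    private final Map<String, String> seqsById;

    SeqDb(String name, Map<String, String> seqsById) {
        this.name = name;
        this.seqsById = seqsById;
    }

    String getName() { return name; }

    // Returns the sequence for an ID, or null if this database does not hold it.
    String getSequence(String id) { return seqsById.get(id); }
}

// Hypothetical stand-in for an installation: a set of databases keyed by name.
class SeqDbInstallation {
    private final Map<String, SeqDb> dbsByName = new HashMap<String, SeqDb>();

    void add(SeqDb db) { dbsByName.put(db.getName(), db); }

    // Returns null when no database with that name is installed.
    SeqDb getSequenceDB(String name) { return dbsByName.get(name); }
}

class EmptyDbSketch {
    // Mimics what a result builder needs: the named database must exist,
    // but the subject sequence itself is optional.
    static String resolveSubject(SeqDbInstallation inst, String dbName, String hitId) {
        SeqDb db = inst.getSequenceDB(dbName);
        if (db == null) {
            throw new IllegalStateException("No database named '" + dbName + "'");
        }
        return db.getSequence(hitId);
    }

    public static void main(String[] args) {
        SeqDbInstallation inst = new SeqDbInstallation();
        // Empty placeholder carrying only the name that appears in the output.
        inst.add(new SeqDb("genome", Collections.<String, String>emptyMap()));
        // The name lookup succeeds, but no subject sequence is available.
        System.out.println(resolveSubject(inst, "genome", "hit1") == null);
    }
}
```

In this sketch a missing database name is a hard error, while a missing
subject sequence is just a null, which mirrors the trade-off described
in the quoted text.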
Yeah, the added complexity issue has been bugging me since the
bootcamp. I'm just finishing off a dotplot-style viewer for pairwise
comparisons which has to read Blast/Fasta/whatever. As an end-user
application it's got to cope with this robustly (e.g. where the
sequence name of the query/subject or database may not match up with
the search output).
As you say, there are tricks to get round the problem. The tests don't
contain a copy of EMBL (!), but use a dummy SequenceDB in the way you
describe. In cases where a user has said "this was my query, no matter
what your code thinks" I use a SingleSequenceDB (which contains one
sequence and no ID list, so you always get back that sequence whatever
ID you request). It's also possible to compact things like the
SequenceDBInstallation to anonymous inner classes which behave exactly
as you want (such as making assumptions about the identity of
sequences/databases which you wouldn't normally allow).
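
A minimal sketch of both ideas, for illustration only. The interfaces
SimpleSeqDb and SimpleInstallation, the class SingleSeqDb and the helper
anyNameInstallation are hypothetical names made up for this example;
they are not the BioJava classes, and the real ones have more methods.

```java
// Hypothetical minimal interfaces, invented for this sketch (not BioJava's own).
interface SimpleSeqDb {
    String getSequence(String id);
}

interface SimpleInstallation {
    SimpleSeqDb getSequenceDB(String name);
}

// Holds exactly one sequence and returns it whatever ID is requested.
class SingleSeqDb implements SimpleSeqDb {
    private final String sequence;

    SingleSeqDb(String sequence) { this.sequence = sequence; }

    public String getSequence(String id) { return sequence; }
}

class PermissiveLookupSketch {
    // Anonymous inner class: every database name resolves to the same database.
    static SimpleInstallation anyNameInstallation(final SimpleSeqDb db) {
        return new SimpleInstallation() {
            public SimpleSeqDb getSequenceDB(String name) { return db; }
        };
    }

    public static void main(String[] args) {
        SimpleSeqDb query = new SingleSeqDb("ACGT");
        SimpleInstallation inst = anyNameInstallation(query);
        // Neither the database name nor the ID has to match the search output.
        System.out.println(inst.getSequenceDB("whatever").getSequence("no-such-id"));
    }
}
```

The point of the permissive versions is that a name mismatch in the
search output can no longer cause a lookup failure; the cost is that
nothing checks the user's claim about which sequence was the query.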
Keith
--
-= Keith James - kdj@sanger.ac.uk - http://www.sanger.ac.uk/Users/kdj =-
Pathogen Sequencing Unit, Wellcome Trust Sanger Institute, Cambridge, UK