[Biojava-l] Fasta search parsing design
Thomas Down
td2@sanger.ac.uk
Tue, 5 Dec 2000 12:59:56 +0000
On Tue, Dec 05, 2000 at 12:39:02PM +0000, Keith James wrote:
>
> Hi,
Hi...
> There's a few issues about representing the data as biojava.bio.search
> result, hit and subhit objects if I use the interfaces there already.
>
>
> Interface SeqSimilaritySearchResult
>
> getSequenceDB() - there probably won't be a sensible SequenceDB object
> if the search has been done externally. It could return null instead?
> (the interface docs discourage this)
I'm generally a little bit suspicious of returning null values
(they're an easy way to cause bugs), although I'll concede that
in this case it does look kind-of sensible.
On the other hand, you could return a `dummy' SequenceDB which
has a name but doesn't contain any sequences. This feels a little
more correct to me, and has the advantage that it would be easy
to extend if in the future you want to provide a mechanism for
fetching whole sequences from the remote search database (a potentially
quite useful function for some applications).
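
To make the `dummy' database idea concrete, here is a minimal sketch.
NamedSequenceDB is a hypothetical, cut-down interface invented for this
example, not the real BioJava SequenceDB, and DummySequenceDB is likewise
only an illustrative name. The point is that callers get a usable object
with a name instead of null.

```java
import java.util.Collections;
import java.util.NoSuchElementException;
import java.util.Set;

// Hypothetical, cut-down stand-in for a sequence database interface.
// This is NOT the real BioJava SequenceDB; it only illustrates the idea.
interface NamedSequenceDB {
    String getName();
    Set<String> ids();
    String getSequence(String id);
}

// A placeholder database: it knows its name but holds no sequences.
final class DummySequenceDB implements NamedSequenceDB {
    private final String name;

    DummySequenceDB(String name) {
        this.name = name;
    }

    @Override
    public String getName() {
        return name;
    }

    @Override
    public Set<String> ids() {
        // Nothing is stored locally, so the ID set is always empty.
        return Collections.emptySet();
    }

    @Override
    public String getSequence(String id) {
        // A later version could fetch from the remote server here instead.
        throw new NoSuchElementException(
            "No sequence '" + id + "' in placeholder database " + name);
    }
}
```

If remote fetching is added later, only getSequence() and ids() would
need to change; code that merely asks for the database name is unaffected.
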
> getSearcher() - again the interface docs discourage returning
> null. Returning a Searcher could be okay if its getSearchableDBs()
> returns an empty set, indicating that you can't actually run a search
> because the database is external. Maybe omit the Searcher entirely and
> allow getSearcher() to return null?
This seems a shame to me -- SeqSimilaritySearcher looks a potentially
nice interface for `end users' of the code.
I assume we're still talking about cases where we might be launching
searches on remote servers? I could envisage a situation where a
service like this is wrapped up as a SeqSimilaritySearcher with one
of the aforementioned `dummy' SequenceDBs for each database installed
on the server.
Of course, this isn't necessarily `first-pass' functionality -- returning
null for now is fine if you don't need this sort of reflection.
> Interface SeqSimilaritySearchHit
>
> getSubHits() - Fasta hits don't have subhits as such. However, you
> could view them as a case where they are a hit which only ever
> contains one subhit.
I didn't write the interface, but I would assume from the
documentation that in the `no sub-hits' case you are expected
to return a singleton List containing yourself. Certainly,
there doesn't appear to be any harm which could result from
implementing it this way.
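
As a rough sketch of the `singleton list containing yourself' approach:
the Hit interface and FastaHit class below are hypothetical names made up
for this example, and they assume that a hit and its sub-hits can share
one type, which may not match the real interfaces.

```java
import java.util.Collections;
import java.util.List;

// Hypothetical hit interface; the real BioJava interfaces may differ.
interface Hit {
    double getScore();
    List<Hit> getSubHits();
}

// A hit with no internal structure reports itself as its only sub-hit.
final class FastaHit implements Hit {
    private final double score;

    FastaHit(double score) {
        this.score = score;
    }

    @Override
    public double getScore() {
        return score;
    }

    @Override
    public List<Hit> getSubHits() {
        // Immutable one-element list, so callers can always iterate
        // over sub-hits without a special case for Fasta results.
        return Collections.<Hit>singletonList(this);
    }
}
```

Client code which loops over getSubHits() then works the same way for
Fasta and for multi-subhit results.
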
> Fasta search output also contains extra information (several scores,
> positions of the presented alignment in the query and subject
> sequences, percent identity). I was thinking of maybe a sub-interface
> to specify extra methods. Incidentally, we find % id useful, but if
> the alignment is retained you then have the same information in two
> places (via a calculation), which is a bad thing I guess.
Sub-interfacing is fine, if that's useful to you. Another
possibility to look at might be to attach BioJava `Annotation'
objects to these interfaces -- I guess I don't really have
strong feelings either way on this one.
Gerald (as the original developer of these interfaces): any
comments on this?
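
On the duplication worry about % id: one option is not to store it at
all, and to derive it from the retained alignment on demand. The sketch
below assumes the alignment is available as two equal-length gapped
strings with '-' as the gap character, and that % id means identical
columns divided by all alignment columns. Fasta's own definition may
differ, and AlignmentStats is a hypothetical helper name.

```java
// Hypothetical helper: derives percent identity from a pairwise alignment
// instead of storing it alongside the alignment.
final class AlignmentStats {
    private AlignmentStats() {
    }

    // Assumes equal-length gapped strings with '-' marking gaps.
    // Identity = identical non-gap columns / total alignment columns.
    static double percentIdentity(String alignedQuery, String alignedSubject) {
        if (alignedQuery.length() != alignedSubject.length()) {
            throw new IllegalArgumentException(
                "Aligned sequences must have the same length");
        }
        if (alignedQuery.isEmpty()) {
            return 0.0;
        }
        int identical = 0;
        for (int i = 0; i < alignedQuery.length(); i++) {
            char q = alignedQuery.charAt(i);
            if (q != '-' && q == alignedSubject.charAt(i)) {
                identical++;
            }
        }
        return 100.0 * identical / alignedQuery.length();
    }
}
```

The trade-off is a little recomputation in exchange for a single source
of truth; if the alignment is discarded to save memory, the stored value
becomes the only copy and the duplication concern goes away.
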
> Also, not having written any Java before, I don't know what memory use
> will be like for storing big lists of hits. I've seen Perl hoover up
> rather a lot of memory dealing with unfiltered Blast output.
Well-designed Java is generally a lot more efficient than Perl in
this respect, mainly because OO perl has a large per-object overhead
because of the `object-is-a-hash' trick. Per-object overhead in
Java is typically 8-12 bytes in a modern VM -- much better.
From experience, the Java garbage collector is also rather more
trustworthy.
Happy hacking,
Thomas.
--
``If I was going to carry a large axe on my back to a diplomatic
function I think I'd want it glittery too.''
-- Terry Pratchett