[Biojava-l] Stop condition for blast parser
mark.schreiber at novartis.com
Thu Mar 12 03:49:54 UTC 2009
Hi Marcel -
One possible solution would be to customise the handler and the parser so
that they can talk to each other and the handler can make callbacks to the
parser.
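
With a plain SAX handler, a common way to get a stop signal from the handler
to the parser is to throw an exception from inside a callback and catch it
around the parse call. The following is only a minimal sketch of that idea.
It uses the JDK's own SAX parser rather than the BlastLikeSAXParser, the
names EarlyStopDemo, StopParsing, FirstHitHandler and firstHitFor are made up
for illustration, and the XML is a cut-down stand-in for real BLAST XML
output. Whether the BlastLikeSAXParser passes such an exception through
unchanged is an assumption that would need checking.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; none of these names come from BioJava.
class EarlyStopDemo {

    // Cut-down stand-in for BLAST XML: three queries, four hits.
    static final String SAMPLE =
        "<BlastOutput>"
        + "<Iteration><Iteration_query-def>contig_1</Iteration_query-def>"
        + "<Hit><Hit_id>gene_A</Hit_id></Hit></Iteration>"
        + "<Iteration><Iteration_query-def>contig_2</Iteration_query-def>"
        + "<Hit><Hit_id>gene_B</Hit_id></Hit><Hit><Hit_id>gene_C</Hit_id></Hit></Iteration>"
        + "<Iteration><Iteration_query-def>contig_3</Iteration_query-def>"
        + "<Hit><Hit_id>gene_D</Hit_id></Hit></Iteration>"
        + "</BlastOutput>";

    // Marker exception thrown on purpose to unwind out of the parser.
    static class StopParsing extends SAXException {
        StopParsing() { super("stop requested by handler"); }
    }

    // Remembers the first Hit_id reported for the wanted query, then stops.
    static class FirstHitHandler extends DefaultHandler {
        private final String wantedQuery;
        private final StringBuilder text = new StringBuilder();
        private boolean inWantedQuery = false;
        String firstHitId = null;

        FirstHitHandler(String wantedQuery) { this.wantedQuery = wantedQuery; }

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            text.setLength(0); // start collecting text for this element
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            text.append(ch, start, length);
        }

        @Override
        public void endElement(String uri, String local, String qName) throws SAXException {
            String value = text.toString().trim();
            text.setLength(0);
            if (qName.equals("Iteration_query-def")) {
                inWantedQuery = value.equals(wantedQuery);
            } else if (inWantedQuery && qName.equals("Hit_id")) {
                firstHitId = value;
                throw new StopParsing(); // stop condition met: abandon the rest
            }
        }
    }

    // Returns the first hit id for the query, or null if none was found.
    static String firstHitFor(String xml, String wantedQuery) {
        FirstHitHandler handler = new FirstHitHandler(wantedQuery);
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        } catch (StopParsing stop) {
            // Expected: the handler asked the parser to stop early.
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.firstHitId;
    }

    public static void main(String[] args) {
        System.out.println(firstHitFor(SAMPLE, "contig_2"));
    }
}
```

Because the exception unwinds back to the caller, the surrounding program
keeps running and can create a fresh parser for the next contig. It does not
help with rewinding: each new parse still starts from the top of the file.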
However, there is a fundamental problem with the BlastLikeSAXParser.
Because it is a SAX parser, and SAX parsing is event based, it is not at
all suited to jumping around the file it is parsing. I therefore think
you need a different paradigm. If you have lots of memory you could go
with something more like a DOM parser that reads the whole file into
memory (or uses Java NIO to pretend to) and then use something like XQuery
to find what you want. If you are using BLAST XML output you could also
build an object tree with JAXB and navigate that.
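
As a rough illustration of the in-memory approach, the sketch below loads a
document into a DOM tree once and then queries it repeatedly. It uses XPath,
which ships with the JDK, in place of XQuery, which does not. The class and
method names (BlastDomQuery, load, hitIdsFor) are hypothetical, and the XML
is a simplified stand-in for BLAST XML, so the element paths would need
adjusting for real output. A 13 GB file would not fit this way without
splitting it first.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; none of these names come from BioJava.
class BlastDomQuery {

    // Simplified stand-in for BLAST XML output.
    static final String SAMPLE =
        "<BlastOutput>"
        + "<Iteration><Iteration_query-def>contig_1</Iteration_query-def>"
        + "<Hit><Hit_id>gene_A</Hit_id></Hit></Iteration>"
        + "<Iteration><Iteration_query-def>contig_2</Iteration_query-def>"
        + "<Hit><Hit_id>gene_B</Hit_id></Hit><Hit><Hit_id>gene_C</Hit_id></Hit></Iteration>"
        + "</BlastOutput>";

    // Parses the whole document into memory; do this once, then query many times.
    static Document load(String xml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("could not load document", e);
        }
    }

    // Returns the Hit_id values for one query, in document order.
    static List<String> hitIdsFor(Document doc, String query) {
        XPath xpath = XPathFactory.newInstance().newXPath();
        // Bind the contig name as an XPath variable rather than pasting it
        // into the expression string.
        xpath.setXPathVariableResolver(
            name -> "query".equals(name.getLocalPart()) ? query : null);
        try {
            NodeList nodes = (NodeList) xpath.evaluate(
                "/BlastOutput/Iteration[Iteration_query-def = $query]/Hit/Hit_id",
                doc, XPathConstants.NODESET);
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(nodes.item(i).getTextContent());
            }
            return ids;
        } catch (XPathExpressionException e) {
            throw new IllegalStateException("bad XPath expression", e);
        }
    }

    public static void main(String[] args) {
        Document doc = load(SAMPLE);
        System.out.println(hitIdsFor(doc, "contig_1"));
        System.out.println(hitIdsFor(doc, "contig_2"));
    }
}
```

The point of the design is that the cost of reading the file is paid once in
load, after which each per-contig lookup is a query against the tree rather
than another pass over the file.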
You can also combine SAX and DOM to read memory-sized chunks in one go, but
this can be clunky.
Note, I am assuming you will use BLAST XML. If you are not, I would
strongly encourage it for the task you describe. It will also make your
parsers much more robust to BLAST version changes.
Sorry the standard BioJava model can't really help here, but please
consider posting your solution or adding it as a recipe in the cookbook,
as others are sure to have similar problems soon.
- Mark
biojava-l-bounces at lists.open-bio.org wrote on 03/12/2009 11:00:38 AM:
> Hi Mark!
>
> The blast etc. is parallelized. The contigs are split into groups of 1000
> and I also modified my program in the way that it works now with all those
> separate files. But nevertheless I also have a program that works on the
> concatenated blast output. The parser with my customized handler is always
> looking for the results of a certain contig and then compares these
> results to something else and also does some other stuff in-between to
> calculate some statistics and then creates a new parser again to get the
> results for the next contig. So a System.exit() is not an option, since it
> would stop my whole program (in which I am using the parser). I also don't
> wanna start working with threads here. I was just hoping that there would
> be a way to tell the handler that, when a certain condition is met, it
> should give the parser a signal to stop parsing (and maybe even to reset
> itself to the first line). But I guess there's no way to do it in the
> customized handler...
>
> Thanks,
> Marcel
>
>
> mark.schreiber at novartis.com wrote:
> >
> > Hi -
> >
> > There are many ways to stop the parsing but it really depends on how you
> > have set the program up. Notably there is no way for the Blast parsing
> > system of BioJava to shut itself down but control probably shouldn't
> > happen at that level.
> >
> > A crude but effective procedure is to write out the results when you
> > find the hit of interest and then simply call System.exit()
> >
> > Another approach would be to spawn Tasks to parse each record and then
> > have them signal to the main thread when they are complete to shut them
> > down. If you are using Java 1.4 or earlier then you would need to do
> > this with Threads. If you have a later version you can use the
> > concurrent packages which are much nicer to deal with.
> >
> > One thing I don't understand is why you don't blast each contig
> > separately, in that case the results would only contain your hit of
> > interest. That means 90K separate blasts but there are versions of
> > blast that run on clusters and the database (3 million genes) is not
> > huge so it should be an embarrassingly parallel problem?
> >
> > - Mark
> >
> > biojava-l-bounces at lists.open-bio.org wrote on 03/10/2009 03:00:36 AM:
> >
> >> Hi Mark!
> >>
> >> Mark Schreiber wrote:
> >> > You could just customize BlastEcho to pass on the events of interest,
> >> > ignore those that are not interesting.
> >> That's what I am doing right now. But I don't know, how to tell my
> >> customized BlastEcho to stop, when a certain condition is met during a
> >> particular event call. What's the command for stopping there?
> >>
> >> > It could also exit if a certain
> >> > event occurs.
> >> How?
> >>
> >> > Remember it cost almost nothing to read the file so you
> >> > save time by only sending interesting events for parsing.
> >> Hmm, I am not sure, if it's really almost nothing, when I've about 90,000
> >> contigs that were blasted against a database with about maybe 3,000,000
> >> genes. The blast output that I am parsing is about 13Gig big and every
> >> cycle I am looking for the results of one particular contig of these
> >> 90,000 contigs. So I definitely experienced that the time sums up a lot,
> >> when it's running in each of these 90,000 cycles over the whole file,
> >> although the contig I am looking for was already at the beginning
> >> of the file.
> >>
> >>
> >> Cheers,
> >> Marcel