[Bioperl-l] parsing blast output and obtaining seq info

Jason Stajich jason@chg.mc.duke.edu
Wed, 1 Aug 2001 13:53:28 -0400 (EDT)


I just wrote scripts/blast_fetch_local.pl
using BPlite which may be simplier to use.

You'll probably want to do some logic on the HSPs to make sure you're
getting what you want, I was just checking to see that all the HSPs had
the appropriate P value for a given Sbjct more as an example than
anything else.

BTW I've been fixing the percent_identity method in the BPlite::HSP
recently so you may want to try and get the latest version from the CVS
repository or wait until the developer's release comes out at the end of
next week.

This script works running off bioperl-live HEAD but I doubt there is
anything in there that won't work for 0.7.x branch.

http://cvs.bioperl.org/cgi-bin/viewcvs/viewcvs.cgi/bioperl-live/scripts/blast_fetch_local.pl?rev=1.1&content-type=text/vnd.viewcvs-markup&cvsroot=bioperl

-Jason
On Tue, 31 Jul 2001, Gopinath Ganji wrote:

> Thanks a ton, Jason! That works :-) Thought web retrieval would be perfect
> :-)  Yeah, I am sure the network entrez option would be the way to go.
> Here's another petty question : Once I do download and index the sequences,
> how do I match a seq id(from my indexed database) to, say, a certain hit from
> the blast output for further sequence manipulations (extract subseq, revcomp,
> etc.).
> 
> Thanks again,
> Gopi
> [PS: I am using the Bio::Tools::Blast for parsing.]
> Jason Stajich wrote:
> 
> > Which blast paring module are you using?  Bio::Tools::Blast or
> > Bio::Tools::BPlite?
> >
> > DB::GenBank is not going to work if you are retrieving one sequence at a
> > time over the course of a huge blast file - NCBI will shut off access to
> > you because they'll think you are spamming their webserver.   Also the web
> > retrieval of a sequence is notoriously problematic - it dies on everyone
> > at intermittent points w/o good reason.
> >
> > We are working on ways to use ncbi's network entrez in perl to alleviate
> > this problem, but might take a little while.  I'm also not sure that the
> > latency here associated with fetching a sequence remotely is worth the
> > 'up-to-dateness' implied when retrieving sequences from their server.  If
> > you did a blast search on a db, you really want the actual sequence that
> > was in that db anyways.
> >
> > You might be better served downloading the blast db which you are
> > searching against, building an index using Bio::Index::Fasta and doing all
> > your sequence retrieval locally - assuming you have enough disk space for
> > this.
> >
> > I've suggested this to other people and haven't gotten major complaints
> > about implementing it, perhaps a setup script for doing this should
> > make its way into the scripts directory some time?
> >
> > -jason
> > On Mon, 30 Jul 2001, Gopinath Ganji wrote:
> >
> > > Hey everyone,
> > > Been teaching myself some bioperl to automate a few tasks in our project
> > > for the past couple of weeks. The tasks at hand may be summarized as
> > > follows:
> > > - parse blast output for frac_aligned_query which is within the filter()
> > >
> > > - for each filtered hit, obtain seq (by using the
> > > get_by_acc($hit->name()) , obtain subseq and for - (anti) sense strands,
> > > revcom() the string obtained from subseq
> > >
> > > The blast output is in a multi-report format. For testing purposes, a
> > > shorter version has been used which works problem free. However, upon
> > > applying the script to the original file, I have frequently encountered
> > > error messages like "subseq() not defined" or "revcom() not defined"
> > > each time halting at a different line in the script. So when I actually
> > > looked at the output generated by the script, the script seems to halt
> > > at a different hit each time. Appears that the program crashes for some
> > > reason - runs out of memory, perhaps? I do know that parse method is
> > > known for memory leaks. However, the same scenario is duplicated for
> > > merely obtaining seq using the Blat::DB::GenBank module (using the
> > > get_by_acc method (see above)). Is this module known to suffer from
> > > leaks or crashes?  Please clarify. Any help/suggestions/pointers would
> > > be much appreciated.
> > >
> > > Thanks in advance,
> > > Gopi
> > >
> > > [PS: Kindly let me know if I am not making any sense :-)]
> > >
> > >
> > >
> > > _______________________________________________
> > > Bioperl-l mailing list
> > > Bioperl-l@bioperl.org
> > > http://bioperl.org/mailman/listinfo/bioperl-l
> > >
> >
> > Jason Stajich
> > jason@chg.mc.duke.edu
> > Center for Human Genetics
> > Duke University Medical Center
> > http://www.chg.duke.edu/
> 
> 
> 
> 

Jason Stajich
jason@chg.mc.duke.edu
Center for Human Genetics
Duke University Medical Center 
http://www.chg.duke.edu/