[Bioperl-l] getting database hit sequences

Ewan Birney birney@ebi.ac.uk
Mon, 21 Oct 2002 07:56:41 +0100 (BST)


On Mon, 21 Oct 2002, Tobias Thierer wrote:

> Hi,
>

> What is the best (or at least some) way to get the entire sequence? The
> annotation of the database hits ($hsp->query->name or so) contains
> substrings of the form "/gi=some_gi_number", so I could perhaps extract
> the GI with a regexp. But how do I get the right sequence for a specific
> GI number efficiently? Parsing the entire database for every hit is
> O(num_sequences^2) and therefore much too slow to be feasible.
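
For the GI extraction step itself, a plain regexp is enough. This is only a
minimal sketch: extract_gi is a hypothetical helper name, and it assumes the
GI appears either as "/gi=12345" (as you describe) or in the NCBI-style
"gi|12345|" form. Check it against your real hit names before relying on it.

```perl
use strict;
use warnings;

# Return the GI number found in a hit name or description, or nothing
# if no GI is present.
# Assumed formats: "/gi=12345" or "gi|12345|" (adjust for your database).
sub extract_gi {
    my ($text) = @_;
    return $1 if defined $text && $text =~ m{(?:/gi=|\bgi\|)(\d+)};
    return;
}
```

Called as extract_gi($name), it hands back the digits only, which can then be
used as the lookup key in whichever database class you pick below.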

Use one of the Bio::DB or Bio::Index classes. Here are your options:

   - If you have your own local database, use Bio::Index::Fasta to build
a local index of it (the Bio::Index::Fasta docs explain how to create an
index). Store the index on a disk that all your clients can see (i.e., it
is often good to NFS-mount it).


   - If you are working against EMBL, Swiss-Prot or GenBank, use
Bio::DB::EMBL, Bio::DB::GenBank or Bio::DB::SwissProt. These work across
the network and so can be pretty darn slow. Make sure you point
Bio::DB::SwissProt to the nearest ExPASy mirror if you are using it.

      - Use Bio::DB::FileCache and Bio::DB::InMemoryCache to improve
performance of the clients and cut down on the number of trips to the
network.
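
The local-index option could look roughly like the sketch below. The helper
names (gi_from_header, build_index, fetch_by_gi) and the file paths passed in
are hypothetical, and the id parser assumes NCBI-style "gi|12345|" headers in
the FASTA file; change the regexp if your headers look different.

```perl
use strict;
use warnings;
use Bio::Index::Fasta;

# Key each record by its GI number.
# Assumption: header lines contain "gi|12345|"; headers without a GI
# return an empty list and so are not indexed.
sub gi_from_header {
    my ($header) = @_;
    return $1 if $header =~ /\bgi\|(\d+)/;
    return;
}

# Build the index once, up front (one pass over the FASTA files).
sub build_index {
    my ($index_file, @fasta_files) = @_;
    my $inx = Bio::Index::Fasta->new(
        -filename   => $index_file,
        -write_flag => 1,
    );
    $inx->id_parser(\&gi_from_header);
    $inx->make_index(@fasta_files);
    return;
}

# Look up a single sequence by GI; fetch() returns a Bio::Seq object.
sub fetch_by_gi {
    my ($index_file, $gi) = @_;
    my $inx = Bio::Index::Fasta->new(-filename => $index_file);
    return $inx->fetch($gi);
}
```

After a single build_index call, each hit costs one indexed lookup via
fetch_by_gi instead of a scan of the whole database, which removes the
quadratic behaviour you were worried about.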


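For the remote option with a cache in front, something along these lines
should work. Again only a sketch: cached_genbank is a hypothetical helper
name and the default of 1000 cached entries is an arbitrary choice.

```perl
use strict;
use warnings;
use Bio::DB::GenBank;
use Bio::DB::InMemoryCache;

# Wrap the remote GenBank client in an in-memory cache so that repeated
# lookups of the same entry do not go back over the network.
sub cached_genbank {
    my ($max_entries) = @_;
    my $remote = Bio::DB::GenBank->new;
    return Bio::DB::InMemoryCache->new(
        -seqdb  => $remote,
        -number => $max_entries || 1000,   # how many sequences to keep
    );
}
```

Use the returned object like the underlying database, e.g.
my $seq = cached_genbank()->get_Seq_by_id($id), where $id is whatever
identifier the remote database accepts. Bio::DB::FileCache wraps a database
in the same way (it also takes -seqdb) but keeps the cached entries in a
file on disk rather than in memory.
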
>
> Is there any possibility to easily get the entire sequences that formed
> HSPs with my query sequence, preferable with Bioperl?
>
> Any help would be greatly appreciated!
>
> Regards,
>
> 	Tobias