[Bioperl-l] Bio::DB::Fasta

Ewan Birney birney@ebi.ac.uk
Sun, 20 May 2001 15:21:25 +0100 (BST)


On Sat, 19 May 2001, Lincoln Stein wrote:

> Hi all,
> 
> Inspired by Jim Kent's indexed "NIB" file format for primary sequence
> data, I recently cooked up a module that allows fast indexed access to
> a directory of Fasta files.  The neat thing about it is that you can
> extract a little bit of subsequence from the middle of an immense
> sequence (like a human chromosome) very quickly, and without worrying
> much about which file it's in.  On my 300 MHz laptop, the module can
> extract any 200 bp subsequence from the C. elegans genome plus EST set
> in less than 0.02 sec.
> 
> Here's a synopsis to give you an idea of what it does:
> 
> SYNOPSIS
>       use Bio::DB::Fasta;
> 
>       # create database from directory of fasta files
>       my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
> 
>       # simple access (for those without Bioperl)
>       my $seq     = $db->seq('CHROMOSOME_I',4_000_000 => 4_100_000);
>       my $revseq  = $db->seq('CHROMOSOME_I',4_100_000 => 4_000_000);
>       my @ids     = $db->ids;
>       my $length  = $db->length('CHROMOSOME_I');
>       my $moltype = $db->moltype('CHROMOSOME_I');
>       my $header  = $db->header('CHROMOSOME_I');
> 
>       # Bioperl-style access
>       my $db      = Bio::DB::Fasta->new('/path/to/fasta/files');
> 
>       my $obj     = $db->get_Seq_by_id('CHROMOSOME_I');
>       my $seq     = $obj->seq;
>       my $subseq  = $obj->subseq(4_000_000 => 4_100_000);
>       my $length  = $obj->length;
>       # (etc)
> 
>       # BioSeqI-style access
>       my $stream  = Bio::DB::Fasta->new('/path/to/fasta/files')->get_PrimarySeq_stream;
>       while (my $seq = $stream->next_seq) {
>         # Bio::PrimarySeqI stuff
>       }
> 
>       my $fh = Bio::DB::Fasta->newFh('/path/to/fasta/files');
>       while (my $seq = <$fh>) {
>         # Bio::PrimarySeqI stuff
>       }
> 
>       # tied hash access
>       tie %sequences,'Bio::DB::Fasta','/path/to/fasta/files';
>       print $sequences{'CHROMOSOME_I:1,20000'};
>       while (my $id = each %sequences) {
>          print $id," => ",tied(%sequences)->length($id),"\n";
>       }
> 
> I've tried to make the thing compatible with the Bio::PrimarySeqI,
> Bio::SeqIO, and Bio::DB:RandomSeqI interfaces.
> 
> I'm hoping that this doesn't reinvent something that's already in
> Bioperl.  (I looked and didn't find it, but I've been burnt before.)
> If it's not already there, let me know and I will add the module to
> bioperl-live.  For those who want to look it over, I've attached the
> module to this letter.


Looks very cool. Some of this is similar Bio::Seq::LargePrimarySeq but the
integration of whole Fasta file sets is novel - and nice to make this
Bio::DB::RandomSeqI/SeqIO/PrimarySeq interface compliant

What is this 4_100_000 syntax - is this standard perl for long numbers?


I reckon you can just check it in straight away into bioperl-live





> 
> Working with Jim Kent, I've also written a C-based module that
> accesses Jim's NIB files.  It looks like this is going to be part of
> Jim's FTPable code library, but if people think appropriate, I can add 
> that to the bioperl C-code repository.
> 
> Lincoln
> 
> 

-----------------------------------------------------------------
Ewan Birney. Mobile: +44 (0)7970 151230, Work: +44 1223 494420
<birney@ebi.ac.uk>. 
-----------------------------------------------------------------