[Bioperl-l] Species retrieval from NCBI nr protein database.

Aaron J. Mackey <amackey@virginia.edu>
Mon, 29 Jul 2002 10:33:23 -0400 (EDT)


Note that the [species name] trailer on nr description lines appears only
on entries that come from EMBL (emb), GenBank (gb), RefSeq (ref) or the DNA
Data Bank of Japan (dbj).  The other description lines don't carry that
info.  In general, these description lines also have a lot packed into
the id (gi number, db source, db xref, etc.), none of which has anything
to do with generic FASTA format, which is what SeqIO/fasta.pm provides.
All of it ends up in the id or the description, since that's all that
FASTA format knows about.
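
In the meantime, the trailer can be pulled back out of the description by
hand.  Here is a minimal sketch, assuming the description string is whatever
$seq->desc() returned; split_species_trailer is a hypothetical helper name,
not something in Bioperl:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl): split a trailing
# "[species name]" off a description string.
# Returns (description, species); species is undef when no trailer exists.
sub split_species_trailer {
    my ($desc) = @_;
    if ($desc =~ /^(.*?)\s*\[([^\[\]]+)\]\s*$/) {
        return ($1, $2);
    }
    return ($desc, undef);
}

# Usage, with a description taken from the output quoted below.
my ($desc, $species) =
    split_species_trailer('(X60065) beta-2-glycoprotein  I [Bos taurus]');
print "SPECIES: ", (defined $species ? $species : ''), "\n";
print "DESCRIPTION: $desc\n";
```

Entries without a bracketed trailer (sp or pir entries, for example) come
back with an undefined species, so the caller has to check for that.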

I could envision a fasta/nr.pm parsing module (in fact, we use one locally
here to do nearly what you describe) that plugs into the SeqIO system, but
it's not yet in Bioperl.  Any takers?
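
To give a rough idea of what such a parser would have to do, here is a
standalone sketch.  It is not our local module and not a SeqIO plugin.  It
assumes the raw nr header separates merged entries with Ctrl-A (\x01), and
that each id looks like gi|number|db|accession|name.  parse_nr_defline and
the hash keys are names made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: split one raw nr header line into per-entry records.
# Assumes merged entries are separated by Ctrl-A (\x01).
sub parse_nr_defline {
    my ($line) = @_;
    $line =~ s/^>//;                      # drop the FASTA '>' if present
    my @records;
    for my $entry (split /\x01/, $line) {
        next unless length $entry;
        # First whitespace-delimited token is the id, the rest is description.
        my ($id, $desc) = split /\s+/, $entry, 2;
        $desc = '' unless defined $desc;
        # Assumed id layout: gi|<gi number>|<db>|<accession>|<name>
        my (undef, $gi, $db, $acc, $name) = split /\|/, $id;
        # Strip a trailing [species] if there is one; many entries have none.
        my $species;
        if ($desc =~ s/\s*\[([^\[\]]+)\]\s*$//) {
            $species = $1;
        }
        push @records, {
            gi          => $gi,
            db          => $db,
            accession   => $acc,
            name        => $name,
            description => $desc,
            species     => $species,
        };
    }
    return @records;
}

# Usage: two merged entries joined by Ctrl-A, modeled on the output below.
my $header = "gi|129249|sp|P02820|OSTC_BOVIN OSTEOCALCIN PRECURSOR (BGP)"
           . "\x01"
           . "gi|8|emb|CAA35997.1| (X51700) bone Gla precursor (100 AA) [Bos taurus]";
for my $rec (parse_nr_defline($header)) {
    my $sp = defined $rec->{species} ? $rec->{species} : '(none)';
    print "$rec->{db} $rec->{accession}: $sp\n";
}
```

A real module would still need to decide which of the merged entries
supplies the species for the sequence object as a whole.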

-Aaron

On Mon, 29 Jul 2002, Navdeep Jaitly wrote:

> Hi!
> I was using SeqIO to get proteins in NCBI nr database. Unfortunately it
> seems that the parsing of the species field is not quite working, and it
> gets lumped in with the description field (usually the species is the last
> element in the header of the nr database and is surrounded by []). Is this
> to be expected or am I doing something wrong ? Is the parsing of the fields
> specifiable in declaring a SeqIO instance ?
> Thanks!
> Deep
>
> ps: Code, and results attached.
>
>
>
> use Bio::SeqIO;
> use strict ;
> my $in  = Bio::SeqIO->new('-file' => "c:\\Databases\\nr.fas",
>                             '-format' => 'Fasta');
> my $TO_PRINT = 3 ;
> my $numProteins = 0 ;
> my $seq ;
> while ( ($seq = $in->next_seq()) && $numProteins < $TO_PRINT)
> {
> 	my $sequence = $seq->seq() ;
> 	my $name = $seq->display_id() ;
> 	my $species = $seq->species() ;
> 	my $description = $seq->desc() ;
> 	print "NAME: $name\n" ;
> 	print "SPECIES: $species\n" ;
> 	print "DESCRIPTION: $description\n" ;
> 	print "SEQUENCE: $sequence\n\n" ;
> 	$numProteins++ ;
> }
>
>
> NAME: gi|6|emb|CAA42669.1|
> SPECIES:
> DESCRIPTION: (X60065) beta-2-glycoprotein  I [Bos taurus]
> SEQUENCE:
> PALVLLLGFLCHVAIAGRTCPKPDELPFSTVVPLKRTYEPGEQIVFSCQPGYVSRGGIRRFTCPLTGLWPINTLKCMPRVCPFAGILENGTVRYTTFEYPNTISFSCHTGFYLKGASSAKCTEEGKWSPDLPVCAPITCPPPPIPKFASLSVYKPLAGNNSFYGSKAVFKCLPHHAMFGNDTVTCTEHGNWTQLPECREVRCPFPSRPDNGFVNHPANPVLYYKDTATFGCHETYSLDGPEEVECSKFGNWSAQPSCKASCKLSIKRATVIYEGERVAIQNKFKNGMLHGQKVSFFCKHKEKKCSYTEDAQCIDGTIEIPKCFKEHSSLAFWKTDASDVKPC
>
> NAME: gi|129249|sp|P02820|OSTC_BOVIN
> SPECIES:
> DESCRIPTION: OSTEOCALCIN PRECURSOR (GAMMA-CARBOXYGLUTAMIC ACID-CONTAINING
> PROTEIN) (BONE GLA-PROTEIN) (BGP)gi|538590|pir||GEBO osteocalcin precursor
> - bovinegi|8|emb|CAA35997.1| (X51700) bone Gla precursor (100 AA) [Bos
> taurus]gi|720|emb|CAA37737.1| (X53699) Gla protein precusor [Bos taurus]
> SEQUENCE:
> MRTPMLLALLALATLCLAGRADAKPGDAESGKGAAFVSKQEGSEVVKRLRRYLDHWLGAPAPYPDPLEPKREVCELNPDCDELADHIGFQEAYRRFYGPV
>
> NAME: gi|231734|sp|P30274|CGA2_BOVIN
> SPECIES:
> DESCRIPTION: CYCLIN A2 (CYCLIN A)gi|284597|pir||S24788 cyclin A -
> bovinegi|10|emb|CAA48398.1| (X68321) Cyclin A-3 [Bos taurus]
> SEQUENCE:
> EFQEDQENVNPEKAAPAQQPRTRAGLAVLRAGNSRGPAPQRPKTRRVAPLKDLPINDEYVPVPPWKANNKQPAFTIHVDEAEEIQKRPTESKKSESEDVLAFNSAVTLPGPRKPLAPLDYPMDGSFESPHTMEMSVVLEDEKPVSVNEVPDYHEDIHTYLREMEVKCKPKVGYMKKQPDITNSMRAILVDWLVEVGEEYKLQNETLHLAVNYIDRFLSSMSVLRGKLQLVGTAAMLLASKFEEIYPPEVAEFVYITDDTYTKKQVLRMEHLVLKVLAFDLAAPTINQFLTQYFLHQQPANCKVESLAMFLGELSLIDADPYLKYLPSVIAAAAFHLALYTVTGQSWPESLVQKTGYTLETLKPCLLDLHQTYLRAPQHAQQSIREKYKNSKYHGVSLLNPPETLNV
>
>

-- 
 Aaron J Mackey
 Pearson Laboratory
 University of Virginia
 (434) 924-2821
 amackey@virginia.edu