[Bioperl-l] Species retrieval from NCBI nr protein database.

Jason Stajich jason@cgt.mc.duke.edu
Mon, 29 Jul 2002 11:03:45 -0400 (EDT)


The problem is that this only works for GenBank/EMBL/Swiss-Prot, where the
format describes a species slot.  FASTA/Pearson format does not describe
one.

For the nr db, the species can be inferred from the FASTA description.  We
don't attempt to do this in our code because that would assume that all []
comments in a FASTA description are in fact species names.

Navdeep, the simple answer is for you to do something like this in your
code:

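# Take the last [...] group in the description; a greedy .+ would span
# several groups when nr merges multiple entries into one header.
my ($species) = ( $seq->desc =~ /\[([^\[\]]+)\]\s*$/ );
$species ||= ''; # to avoid undefs in your output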
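
If you want something a bit more defensive, here is a minimal sketch of the
same idea as a standalone function.  It assumes the species is the last
bracketed item in the description, which Navdeep reports is usually the case
for nr but is not guaranteed.  The name species_from_desc is made up for
illustration and is not a Bioperl method; in real code you would pass it
$seq->desc.  Keep in mind that a greedy .+ between the brackets would run
from the first [ to the last ] when a description contains several [...]
groups, so the sketch uses a character class that cannot cross a bracket.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Bioperl): return the text inside the
# final [...] group of a FASTA description, or '' if there is none.
sub species_from_desc {
    my ($desc) = @_;
    return '' unless defined $desc;

    # [^\[\]]+ cannot cross a bracket, and the trailing \s*$ anchors the
    # match to the end of the description.
    my ($species) = $desc =~ /\[([^\[\]]+)\]\s*$/;
    return defined $species ? $species : '';
}

# The first two descriptions come from the output quoted in this thread;
# the third is a shortened one with no bracketed part.
my @descs = (
    '(X60065) beta-2-glycoprotein  I [Bos taurus]',
    'CYCLIN A2 (CYCLIN A)gi|284597|pir||S24788 cyclin A - '
        . 'bovinegi|10|emb|CAA48398.1| (X68321) Cyclin A-3 [Bos taurus]',
    'OSTEOCALCIN PRECURSOR (BGP)',
);

for my $desc (@descs) {
    my $species = species_from_desc($desc);
    print "SPECIES: $species\n";
}
```

This still treats any trailing [] comment as a species name, which is
exactly the assumption we avoid making inside Bioperl itself, so check the
results against your data.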

On Mon, 29 Jul 2002, Brian Osborne wrote:

> Navdeep,
>
> The key documentation for Bioperl, in my opinion, is in the Seq.pm module;
> take a look at it. $seq_obj->species returns a Bio::Species object, not a
> string, so the code should look like:
>
> $species_obj = $seq_obj->species;
> $species = $species_obj->binomial;
>
> And so on. See Bio::Species.pm for all the methods that you could use once
> you have the Bio::Species object. Mind you, the common_name method may not
> always return the common name, since the common name isn't provided in all
> public databases; use binomial() instead.
>
> Brian O.
>
>
>
> -----Original Message-----
> From: bioperl-l-admin@bioperl.org [mailto:bioperl-l-admin@bioperl.org]On
> Behalf Of Navdeep Jaitly
> Sent: Monday, July 29, 2002 10:23 AM
> To: bioperl-l@bioperl.org
> Subject: [Bioperl-l] Species retrieval from NCBI nr protein database.
>
> Hi!
> I was using SeqIO to get proteins in NCBI nr database. Unfortunately it
> seems that the parsing of the species field is not quite working, and it
> gets lumped in with the description field (usually the species is the last
> element in the header of the nr database and is surrounded by []). Is this
> to be expected, or am I doing something wrong? Can the parsing of the fields
> be specified when declaring a SeqIO instance?
> Thanks!
> Deep
>
> ps: Code, and results attached.
>
>
>
> use Bio::SeqIO;
> use strict ;
> my $in = Bio::SeqIO->new('-file' => "c:\\Databases\\nr.fas",
>                          '-format' => 'Fasta');
> my $TO_PRINT = 3 ;
> my $numProteins = 0 ;
> my $seq ;
> while ( ($seq = $in->next_seq()) && $numProteins < $TO_PRINT)
> {
>         my $sequence = $seq->seq() ;
>         my $name = $seq->display_id() ;
>         my $species = $seq->species() ;
>         my $description = $seq->desc() ;
>         print "NAME: $name\n" ;
>         print "SPECIES: $species\n" ;
>         print "DESCRIPTION: $description\n" ;
>         print "SEQUENCE: $sequence\n\n" ;
>         $numProteins++ ;
> }
>
>
> NAME: gi|6|emb|CAA42669.1|
> SPECIES:
> DESCRIPTION: (X60065) beta-2-glycoprotein  I [Bos taurus]
> SEQUENCE:
> PALVLLLGFLCHVAIAGRTCPKPDELPFSTVVPLKRTYEPGEQIVFSCQPGYVSRGGIRRFTCPLTGLWPINTLKC
> MPRVCPFAGILENGTVRYTTFEYPNTISFSCHTGFYLKGASSAKCTEEGKWSPDLPVCAPITCPPPPIPKFASLSV
> YKPLAGNNSFYGSKAVFKCLPHHAMFGNDTVTCTEHGNWTQLPECREVRCPFPSRPDNGFVNHPANPVLYYKDTAT
> FGCHETYSLDGPEEVECSKFGNWSAQPSCKASCKLSIKRATVIYEGERVAIQNKFKNGMLHGQKVSFFCKHKEKKC
> SYTEDAQCIDGTIEIPKCFKEHSSLAFWKTDASDVKPC
>
> NAME: gi|129249|sp|P02820|OSTC_BOVIN
> SPECIES:
> DESCRIPTION: OSTEOCALCIN PRECURSOR (GAMMA-CARBOXYGLUTAMIC ACID-CONTAINING
> PROTEIN) (BONE GLA-PROTEIN) (BGP)gi|538590|pir||GEBO osteocalcin precursor
> - bovinegi|8|emb|CAA35997.1| (X51700) bone Gla precursor (100 AA) [Bos
> taurus]gi|720|emb|CAA37737.1| (X53699) Gla protein precusor [Bos taurus]
> SEQUENCE:
> MRTPMLLALLALATLCLAGRADAKPGDAESGKGAAFVSKQEGSEVVKRLRRYLDHWLGAPAPYPDPLEPKREVCEL
> NPDCDELADHIGFQEAYRRFYGPV
>
> NAME: gi|231734|sp|P30274|CGA2_BOVIN
> SPECIES:
> DESCRIPTION: CYCLIN A2 (CYCLIN A)gi|284597|pir||S24788 cyclin A -
> bovinegi|10|emb|CAA48398.1| (X68321) Cyclin A-3 [Bos taurus]
> SEQUENCE:
> EFQEDQENVNPEKAAPAQQPRTRAGLAVLRAGNSRGPAPQRPKTRRVAPLKDLPINDEYVPVPPWKANNKQPAFTI
> HVDEAEEIQKRPTESKKSESEDVLAFNSAVTLPGPRKPLAPLDYPMDGSFESPHTMEMSVVLEDEKPVSVNEVPDY
> HEDIHTYLREMEVKCKPKVGYMKKQPDITNSMRAILVDWLVEVGEEYKLQNETLHLAVNYIDRFLSSMSVLRGKLQ
> LVGTAAMLLASKFEEIYPPEVAEFVYITDDTYTKKQVLRMEHLVLKVLAFDLAAPTINQFLTQYFLHQQPANCKVE
> SLAMFLGELSLIDADPYLKYLPSVIAAAAFHLALYTVTGQSWPESLVQKTGYTLETLKPCLLDLHQTYLRAPQHAQ
> QSIREKYKNSKYHGVSLLNPPETLNV
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l
>

-- 
Jason Stajich
Duke University
jason at cgt.mc.duke.edu