[Bioperl-l] Species retrieval from NCBI nr protein database.

Navdeep Jaitly ndjaitly@hotmail.com
Mon, 29 Jul 2002 15:24:28 -0400


ok, thanks!
Is it possible for me to modify SeqIO in-house to handle the nr
database, as someone suggested earlier?
Does SeqIO allow a user to specify parsing rules, so that someone could
develop a customized parser while keeping the existing interface to
the modules?
Thanks!
Deep
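
To make the "customized parser, same interface" idea concrete, here is a
minimal sketch of a thin wrapper rather than a change to SeqIO itself. The
class name SpeciesRuleStream and the last_species() method are hypothetical
and not part of Bioperl. The only assumptions are that the wrapped object
provides next_seq() and that the returned sequence objects provide desc().

```perl
use strict;
use warnings;

{
    # Hypothetical wrapper class; NOT part of Bioperl.
    package SpeciesRuleStream;

    # $stream: any object with a next_seq() method (e.g. a Bio::SeqIO instance).
    # $rule:   code ref that takes a description string and returns a
    #          species string (or undef if it cannot find one).
    sub new {
        my ($class, $stream, $rule) = @_;
        return bless { stream => $stream, rule => $rule, species => '' }, $class;
    }

    # Same calling convention as the wrapped stream: returns the next
    # sequence object, or undef at end of input.
    sub next_seq {
        my ($self) = @_;
        my $seq = $self->{stream}->next_seq();
        return unless defined $seq;

        my $desc = $seq->desc();
        $desc = '' unless defined $desc;

        # Apply the user-supplied parsing rule to the description.
        my $species = $self->{rule}->($desc);
        $self->{species} = defined $species ? $species : '';
        return $seq;
    }

    # Species guess for the sequence most recently returned by next_seq().
    sub last_species {
        my ($self) = @_;
        return $self->{species};
    }
}
```

An existing read loop would keep calling next_seq() as before and ask the
wrapper for last_species() instead of calling species() on the sequence.
The parsing rule stays in user code, so nothing here assumes that every []
comment in a FASTA description is a species name.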

>From: Jason Stajich <jason@cgt.mc.duke.edu>
>To: Brian Osborne <brian_osborne@cognia.com>
>CC: Navdeep Jaitly <ndjaitly@hotmail.com>, <bioperl-l@bioperl.org>
>Subject: RE: [Bioperl-l] Species retrieval from NCBI nr protein database.
>Date: Mon, 29 Jul 2002 11:03:45 -0400 (EDT)
>
>The problem is that this only works for GenBank/EMBL/Swiss-Prot, where the
>format defines a species slot.  FASTA/Pearson format does not define
>one.
>
>For the nr db, the species can be inferred from the FASTA description.  We
>don't attempt to do this in our code because that would assume that all []
>comments in a FASTA description are in fact species names.
>
>Navdeep, the simple answer is for you to do something like this in your
>code:
>
>my ($species) = ( $seq->desc =~ /\[(.+)\]/);
>$species ||= ''; # to avoid undefs in your output
>
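
One caveat with the pattern above: .+ is greedy, so on a description that
holds several [] groups (as the merged nr entries in the output further down
do) it captures everything from the first [ to the last ]. Below is a minimal
sketch of a stricter variant. The helper name species_from_desc is
hypothetical. It assumes that merged nr entries are separated by Ctrl-A
(\x01) in the description and that a trailing [...] in each entry is the
species, neither of which is guaranteed for every record.

```perl
use strict;
use warnings;

# Hypothetical helper: guess species names from an nr-style FASTA description.
# Assumption: merged entries are separated by Ctrl-A (\x01), and a trailing
# [...] in an entry is its species name.
sub species_from_desc {
    my ($desc) = @_;
    return () unless defined $desc;

    my (@species, %seen);
    for my $entry (split /\x01/, $desc) {
        # [^\[\]]+ cannot cross a bracket, so only the final [...] group
        # of the entry is captured.
        my ($name) = $entry =~ /\[([^\[\]]+)\]\s*$/;
        next unless defined $name;
        push @species, $name unless $seen{$name}++;   # keep first occurrence only
    }
    return @species;
}

my $desc = "CYCLIN A2 (CYCLIN A)\x01gi|284597|pir||S24788 cyclin A - bovine"
         . "\x01gi|10|emb|CAA48398.1| (X68321) Cyclin A-3 [Bos taurus]";
my @species = species_from_desc($desc);
print join(', ', @species), "\n";   # prints "Bos taurus"
```

If the description contains no Ctrl-A separators, the whole string is treated
as one entry and only a [] group at the very end is returned.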
>On Mon, 29 Jul 2002, Brian Osborne wrote:
>
> > Navdeep,
> >
> > The key documentation for Bioperl, in my opinion, is in the Seq.pm module;
> > take a look at it. $seq_obj->species returns a Bio::Species object, not a
> > string, so the code should look like:
> >
> > $species_obj = $seq_obj->species;
> > $species = $species_obj->binomial;
> >
> > And so on. See Bio::Species.pm for all the methods that you could use once
> > you have the Bio::Species object. Mind you, the common_name method may not
> > always return the common name, since the common name isn't provided in all
> > public databases; use binomial() instead.
> >
> > Brian O.
> >
> >
> >
> > -----Original Message-----
> > From: bioperl-l-admin@bioperl.org [mailto:bioperl-l-admin@bioperl.org]On
> > Behalf Of Navdeep Jaitly
> > Sent: Monday, July 29, 2002 10:23 AM
> > To: bioperl-l@bioperl.org
> > Subject: [Bioperl-l] Species retrieval from NCBI nr protein database.
> >
> > Hi!
> > I was using SeqIO to read proteins from the NCBI nr database. Unfortunately it
> > seems that the parsing of the species field is not quite working, and it
> > gets lumped in with the description field (usually the species is the last
> > element in the header of the nr database and is surrounded by []). Is this
> > to be expected, or am I doing something wrong? Is the parsing of the fields
> > specifiable when declaring a SeqIO instance?
> > Thanks!
> > Deep
> >
> > ps: Code, and results attached.
> >
> >
> >
> > use Bio::SeqIO;
> > use strict ;
> > my $in  = Bio::SeqIO->new('-file' => "c:\\Databases\\nr.fas",
> >                          '-format' => 'Fasta');
> > my $TO_PRINT = 3 ;
> > my $numProteins = 0 ;
> > my $seq ;
> > while ( ($seq = $in->next_seq()) && $numProteins < $TO_PRINT)
> > {
> >         my $sequence = $seq->seq() ;
> >         my $name = $seq->display_id() ;
> >         my $species = $seq->species() ;
> >         my $description = $seq->desc() ;
> >         print "NAME: $name\n" ;
> >         print "SPECIES: $species\n" ;
> >         print "DESCRIPTION: $description\n" ;
> >         print "SEQUENCE: $sequence\n\n" ;
> >         $numProteins++ ;
> > }
> >
> >
> > NAME: gi|6|emb|CAA42669.1|
> > SPECIES:
> > DESCRIPTION: (X60065) beta-2-glycoprotein  I [Bos taurus]
> > SEQUENCE:
> > PALVLLLGFLCHVAIAGRTCPKPDELPFSTVVPLKRTYEPGEQIVFSCQPGYVSRGGIRRFTCPLTGLWPINTLKC
> > MPRVCPFAGILENGTVRYTTFEYPNTISFSCHTGFYLKGASSAKCTEEGKWSPDLPVCAPITCPPPPIPKFASLSV
> > YKPLAGNNSFYGSKAVFKCLPHHAMFGNDTVTCTEHGNWTQLPECREVRCPFPSRPDNGFVNHPANPVLYYKDTAT
> > FGCHETYSLDGPEEVECSKFGNWSAQPSCKASCKLSIKRATVIYEGERVAIQNKFKNGMLHGQKVSFFCKHKEKKC
> > SYTEDAQCIDGTIEIPKCFKEHSSLAFWKTDASDVKPC
> >
> > NAME: gi|129249|sp|P02820|OSTC_BOVIN
> > SPECIES:
> > DESCRIPTION: OSTEOCALCIN PRECURSOR (GAMMA-CARBOXYGLUTAMIC ACID-CONTAINING
> > PROTEIN) (BONE GLA-PROTEIN) (BGP)gi|538590|pir||GEBO osteocalcin precursor
> > - bovinegi|8|emb|CAA35997.1| (X51700) bone Gla precursor (100 AA) [Bos
> > taurus]gi|720|emb|CAA37737.1| (X53699) Gla protein precusor [Bos taurus]
> > SEQUENCE:
> > MRTPMLLALLALATLCLAGRADAKPGDAESGKGAAFVSKQEGSEVVKRLRRYLDHWLGAPAPYPDPLEPKREVCEL
> > NPDCDELADHIGFQEAYRRFYGPV
> >
> > NAME: gi|231734|sp|P30274|CGA2_BOVIN
> > SPECIES:
> > DESCRIPTION: CYCLIN A2 (CYCLIN A)gi|284597|pir||S24788 cyclin A -
> > bovinegi|10|emb|CAA48398.1| (X68321) Cyclin A-3 [Bos taurus]
> > SEQUENCE:
> > EFQEDQENVNPEKAAPAQQPRTRAGLAVLRAGNSRGPAPQRPKTRRVAPLKDLPINDEYVPVPPWKANNKQPAFTI
> > HVDEAEEIQKRPTESKKSESEDVLAFNSAVTLPGPRKPLAPLDYPMDGSFESPHTMEMSVVLEDEKPVSVNEVPDY
> > HEDIHTYLREMEVKCKPKVGYMKKQPDITNSMRAILVDWLVEVGEEYKLQNETLHLAVNYIDRFLSSMSVLRGKLQ
> > LVGTAAMLLASKFEEIYPPEVAEFVYITDDTYTKKQVLRMEHLVLKVLAFDLAAPTINQFLTQYFLHQQPANCKVE
> > SLAMFLGELSLIDADPYLKYLPSVIAAAAFHLALYTVTGQSWPESLVQKTGYTLETLKPCLLDLHQTYLRAPQHAQ
> > QSIREKYKNSKYHGVSLLNPPETLNV
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l@bioperl.org
> > http://bioperl.org/mailman/listinfo/bioperl-l
> >
>
>--
>Jason Stajich
>Duke University
>jason at cgt.mc.duke.edu
>