[Bioperl-l] Re: [Bioperl-guts-l] problems with Bio::DB::GenBank

Jason Eric Stajich jason@cgt.mc.duke.edu
Thu, 20 Sep 2001 17:06:06 -0400 (EDT)


Ah ha!

You are retrieving an EST from dbEST.  I confess I didn't try your code
example, so I wasn't really much help, was I...

So of course this won't parse because it is dbEST format with HTML thrown
in.
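
As a stopgap, the sequence can be pulled out of a report like the one
quoted below with a few lines of plain Perl.  This is only a rough sketch:
extract_dbest_sequence is a made-up helper, not anything in bioperl, and it
assumes the report always marks the bases with a <b>SEQUENCE</b> header
followed by indented lines of letters, as in Brian's output.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of bioperl): return the bases found under
# the <b>SEQUENCE</b> header of a dbEST-style report, or undef if none.
sub extract_dbest_sequence {
    my ($text) = @_;
    my ($seq, $in_seq) = ('', 0);
    for my $line (split /\r?\n/, $text) {
        if (!$in_seq) {
            # Skip everything until the SEQUENCE header is seen.
            $in_seq = 1 if $line =~ m{<b>\s*SEQUENCE\s*</b>}i;
            next;
        }
        # A sequence line is nothing but letters, possibly indented.
        if ($line =~ /^\s*([A-Za-z]+)\s*$/) {
            $seq .= uc $1;
            next;
        }
        # Stop at the first other line once bases were collected, or at
        # any non-blank line that appears before the bases.
        last if $seq ne '' || $line =~ /\S/;
    }
    return $seq eq '' ? undef : $seq;
}
```

Called on the raw response text, it returns the concatenated sequence, or
undef when there is no SEQUENCE block (for example, a real GenBank record).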

Technically this is still really an NCBI problem, because there is no
seamless way to plug in a URL and an accession and get back that sequence
in a single query.  The old qmap.cgi script that we are using (which does
support this notion of accession -> sequence) obviously doesn't handle
ESTs well: by default it returns them in dbEST format rather than GenBank
format.  We may have to move towards supporting the new
htbin-post/Entrez/query CGI, except that retrievals from it are a two-step
process: it returns a list of potential matches in summary format, and you
then retrieve by GI number.

I am probably too busy to track this down for a couple of weeks; perhaps
another bioperl developer is interested in taking it on?  It should be an
interesting fix, and much praise and fame await those who tackle it.

Brian - for your immediate future, it depends on whether you want to
process the annotation for these ESTs or just want the sequence.  It also
depends on how many sequences you really plan to retrieve: if it is 4, I
would go to the website by hand; if it is > 100, I would download the db
files from NCBI or try the Bio::DB::EMBL module.

Your options w/ bioperl as I see them:

a) Try retrieving sequences with Bio::DB::EMBL
b) Fix this problem yourself and submit the fix back to bioperl for us to
   commit (amid much praise and fame)
c) Download the EST sequences from the NCBI BLAST db (huge) and use
   Bio::Index::Fasta to do the same lookups locally (if you only want
   sequence)
d) Download the GenBank release of dbEST & updates and use
   Bio::Index::GenBank to do lookups locally (if you want annotation as
   well as sequence)
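
Whichever option you pick, it is worth checking the returned object before
calling seq() on it, since that is what made the original script die.  A
minimal sketch, assuming the database object provides get_Seq_by_acc (as
the Bio::DB::* modules do); fetch_sequence_string is a made-up name for
illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return the sequence string for an accession, or
# undef if the lookup threw an exception or returned nothing, instead of
# dying with 'Can't call method "seq" on an undefined value'.
sub fetch_sequence_string {
    my ($db, $acc) = @_;
    my $seqobj = eval { $db->get_Seq_by_acc($acc) };
    warn "lookup of $acc failed: $@" if $@;
    return defined $seqobj ? $seqobj->seq : undef;
}

# Intended use (needs bioperl and network access, so left commented out):
#   use Bio::DB::EMBL;
#   my $seq = fetch_sequence_string(Bio::DB::EMBL->new, 'AI834759');
#   print defined $seq ? "$seq\n" : "no sequence returned\n";
```

The eval is there because a failed parse can throw rather than return
undef, as the fasta attempt quoted below shows.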

HTH
-jason
On Thu, 20 Sep 2001, Brian C. Thomas wrote:

> Hi
>
> I did as you suggested and used the "verbose" setting.  Now I think
> that it isn't an NCBI problem, but a bioperl parsing the response
> problem. I can see the sequence coming back.  Here's the output...
>
> Does this help you any?
>
> Brian
>
> --------------------------------------------------------------
> url is
> http://www.ncbi.nlm.nih.gov/entrez/utils/qmap.cgi?db=n&title=no&form=6&dopt=genbank&uid=AI834759
> str is
> <b>IDENTIFIERS</b>
>
> <b>dbEST Id:</b>       <b>2923079</b>
> EST name:       UI-M-AL0-abo-a-07-0-UI.s2
> GenBank Acc:    AI834759
> GenBank gi:     5468972
>
> <b>CLONE INFO</b>
> Clone Id:       UI-M-AL0-abo-a-07-0-UI (3')
> Source:         The NIH - University of Iowa Brain Molecular Anatomy
>                 Project: NIH-Iowa BMAP (Bento Soares, Thomas Casavant, and
>                 Val Sheffield)
> Id as DNA:      UI-M-AL0-abo-a-07-0-UI
> Id in host:     UI-M-AL0-abo-a-07.s2
> DNA type:       cDNA
>
> <b>PRIMERS</b>
> Sequencing:     M13 Forward
> PolyA Tail:     yes
>
> <b>SEQUENCE</b>
>                 TTTTTTTTTTTTTTTTTGGAGGGGGGGAAAACCCCCCCCAGGGAAACCCAAAATGGGTTT
>                 TTCAAAAAAAAAATTCCGGGGGTTTTTCAAAAAAAAAAAAAATTCCCCCAAAAAGGGG
>
> Entry Created:  Jul 14 1999
> Last Updated:   Jul 14 1999
>
> <b>COMMENTS</b>
>                 The sequence contained an oligo-dT track that was present in
>                 the oligonucleotide that was used to prime the synthesis of
>                 first strand cDNA and therefore this may represent a
>                 bonafide poly A tail. The sequence tag present in the cDNA
>                 between the NotI site and the oligo-dT track served to
>                 verify it as a clone from the non-normalized prefrontal
>                 cortex library cDNA Library Preparation: M.B. Soares Lab
>                 Clone distribution: NIH BMAP cDNA clones will be made
>                 available by the means that is soon to be determined. When
>                 NIH determines the means for distribution of the BMAP cDNA
>                 clones, this record will be updated accordingly when that
>                 means is determined.
>
> <b>LIBRARY</b>
> Lib Name:       NIH_BMAP_MCO
> Organism:       <a href=/htbin-post/Taxonomy/wgetorg?name=Mus+musculus>Mus musculus</a>
> Tag Lib:        NIH_BMAP_MCO
> Tag Tissue:     prefrontal-cortex
> Tag Seq:        GCACA
> Strain:         C57BL/6J
> Develop. stage: 27-32 days
> Lab host:       DH10B (Life Technologies)
> Vector:         pT7T3D-Pac (Pharmacia) with a modified polylinker
> R. Site 1:      Not I
> R. Site 2:      Eco RI
> Description:    The NIH_BMAP_MCO library is a non-normalized library
>                 constructed from mouse cortex. The tag is a string of 5
>                 nucleotides present between the Not I site and the oligo-dT
>                 track. The library was constructed as described by Bonaldo,
>                 Lennon and Soares, Genome Research 6: 791-806, 1996. Tissue
>                 provided by Ms. Annie Novakovich, Zivic-Miller Laboratories.
>
> <b>SUBMITTER</b>
> Name:           Chin, H
> Institution:    National Institute of Mental Health
> Address:        6001 Executive Blvd. Room 7N-7190, MSC 9643, Bethesda, MD
>                 20892-9643, USA
> Tel:            301 443 1706
> Fax:            301 443 9890
> E-mail:         mEST@mail.nih.gov
>
> <b>CITATIONS</b>
> Medline UID:    97044477
> Title:          Normalization and subtraction: two approaches to facilitate
>                 gene discovery
> Authors:        Bonaldo,M.F., Lennon,G., Soares,M.B.
> Citation:       Genome Res. 6 (9): 791-806 1996
>
>
> <b>MAP DATA</b>
> <br><hr><br>
>
> Can't call method "seq" on an undefined value at /tmp/test_webget.pl line 8.
>
> -----------------------------------------------------------------
>
>
> On Thu, Sep 20, 2001 at 03:59:31PM -0400, Jason Eric Stajich wrote:
> > (Bioperl guts is for CVS msgs and bioperl administration stuff bioperl-l
> > is better place to ask questions.)
> >
> > Bio::DB::GenBank is always going to be problematic because of the way we
> > connect to NCBI.  HTTP connections are frequently dropped and their site
> > frequently fails to return proper data.  I am at a loss for a solution at
> > this point other than building a netentrez wrapper/XS implementation
> > (Johnathan Epstein has mentioned he might be able to lead this effort) as
> > the problem is really on NCBI side dropping connections periodically. You
> > can use the Bio::DB::EMBL which will be slower from California, but has
> > typically been much more reliable.
> >
> > I would suggest building the GenBank object as
> >
> > my $db = new Bio::DB::GenBank(-verbose => 1);
> >
> > to get more verbose output.
> >
> > As an aside you will also get shutoff periodically from NCBI if you make a
> > lot of queries to their site (ie running this script repeatedly) - I think
> > selectively by IP address but I've not really seen documentation for this
> > just reported empirical evidence from other people playing around.
> >
> > -jason
> > On Thu, 20 Sep 2001, Brian C. Thomas wrote:
> >
> > > Hi
> > >
> > > I am having a bit of a problem with Bio::DB::GenBank.
> > > I have used this module numerous times in the past, and I don't know
> > > why I am getting this response now.
> > >
> > > Here's the code...
> > >
> > > ------------------------------------------
> > > #!/usr/bin/perl -w
> > > use Bio::DB::GenBank;
> > >
> > > $gb = new Bio::DB::GenBank;
> > > my($id) = "AI834759";
> > > $seqobj = $gb->get_Seq_by_id($id);
> > > print $seqobj->seq() . "\n";
> > > ------------------------------------------
> > >
> > > here's the output...
> > >
> > > ------------------------------------------
> > > > perl /tmp/test_webget.pl
> > > Can't call method "seq" on an undefined value at /tmp/test_webget.pl line 7.
> > > ------------------------------------------
> > >
> > > when I add in $gb->request_format('fasta'), I get this output...
> > > ------------------------------------------
> > > > perl /tmp/test_webget.pl
> > > -------------------- EXCEPTION --------------------
> > > MSG: Attempting to set the sequence to [<html] which does not look healthy
> > > STACK Bio::PrimarySeq::seq /usr/lib/perl5/site_perl/Bio/PrimarySeq.pm:243
> > > STACK Bio::PrimarySeq::new /usr/lib/perl5/site_perl/Bio/PrimarySeq.pm:218
> > > STACK Bio::Seq::new /usr/lib/perl5/site_perl/Bio/Seq.pm:132
> > > STACK Bio::SeqIO::fasta::next_primary_seq /usr/lib/perl5/site_perl/Bio/SeqIO/fasta.pm:130
> > > STACK Bio::SeqIO::fasta::next_seq /usr/lib/perl5/site_perl/Bio/SeqIO/fasta.pm:85
> > > STACK Bio::DB::WebDBSeqI::get_Seq_by_id /usr/lib/perl5/site_perl/Bio/DB/WebDBSeqI.pm:141
> > > STACK toplevel /tmp/test_webget.pl:7
> > > -------------------------------------------
> > > ------------------------------------------
> > >
> > > Any thoughts?
> > >
> > > Thanks,
> > >
> > > BCT
> > >
> >
> > --
> > Jason Stajich
> > Duke University
> > jason@cgt.mc.duke.edu
>

-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu