[Bioperl-l] RE: Bioperl-l Digest, Vol 3, Issue 45

Brian Osborne brian_osborne at cognia.com
Thu Mar 20 13:44:30 EST 2003


Kevin and Kerr,

> If you have access to EMBOSS, this also has an indexing facility that you can access via BioPerl's extensions.

You can also index multi-sequence FASTA or GenBank format files using Bioperl's Bio::Index::* modules, or FASTA files using Bio::DB::Fasta. The bptutorial file has some example code, and the bpindex.PLS and bpfetch.PLS scripts are in scripts/index as well.
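
To illustrate what indexing a flat file buys you, here is a minimal sketch in Python. It is not how Bio::Index or Bio::DB::Fasta are implemented; it only shows the general idea of recording a byte offset per record so that one sequence can be fetched without rescanning the file. The function names and the choice of the first token after '>' as the key are assumptions made for this example.

```python
import io


def build_fasta_index(handle):
    """Map each record ID to the byte offset of its '>' header line.

    `handle` must be a seekable file object opened in binary mode.
    """
    index = {}
    while True:
        offset = handle.tell()
        line = handle.readline()
        if not line:
            break
        if line.startswith(b">"):
            # Assumption: the ID is the first whitespace-delimited token.
            tokens = line[1:].split()
            if tokens:
                index[tokens[0].decode("utf-8", "replace")] = offset
    return index


def fetch_record(handle, index, seq_id):
    """Seek straight to one record and return (header, sequence)."""
    handle.seek(index[seq_id])
    header = handle.readline().decode("utf-8", "replace").rstrip("\r\n")[1:]
    parts = []
    for line in handle:
        if line.startswith(b">"):
            break  # next record starts here
        parts.append(line.strip().decode("utf-8", "replace"))
    return header, "".join(parts)


# Small in-memory stand-in for a multi-FASTA file.
demo = io.BytesIO(b">seq1 first\nACGT\nACGT\n>seq2 second\nMKV\n")
demo_index = build_fasta_index(demo)
print(fetch_record(demo, demo_index, "seq2"))  # ('seq2 second', 'MKV')
```

With a real file you would open it with open(path, "rb") and keep the index around (on disk, for a file the size of nr) instead of rebuilding it for every lookup.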

Brian O.
 
-----Original Message-----
From: bioperl-l-bounces at bioperl.org [mailto:bioperl-l-bounces at bioperl.org]On Behalf Of Clancy, Kevin
Sent: Thursday, March 20, 2003 1:34 PM
To: bioperl-l at bioperl.org; bioperl-l at bioperl.org
Subject: [Bioperl-l] RE: Bioperl-l Digest, Vol 3, Issue 45

Kerr,

If you look on NCBI's FTP site (ftp://ftp.ncbi.nih.gov/blast/db) you will see both the nt and nr sequence collections. You could simply download these and use them as sequence sources. nr should be proteins and nt should be nucleic acid sequences.

NCBI will probably not appreciate your hitting the Entrez server many thousands of times, and your sys admin would probably be a bit miffed as well, particularly if you are affecting other services while doing this.
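
For a sense of scale, the arithmetic for the job described in the original question (two requests for each of 1,363,299 proteins) can be sketched as below. The rate of one request per second is purely a hypothetical figure chosen for illustration, not a documented NCBI limit.

```python
import datetime


def estimate_duration(n_records, requests_per_record, requests_per_second):
    """Return (total requests, wall-clock time) for a sequential crawl."""
    total = n_records * requests_per_record
    return total, datetime.timedelta(seconds=total / requests_per_second)


# 1,363,299 proteins and 2 requests each come from the original question;
# 1.0 request/second is an assumed rate.
total, duration = estimate_duration(1_363_299, 2, 1.0)
print(total)          # 2726598
print(duration.days)  # 31
```

Even at that assumed rate the crawl would run for roughly a month, which is why the bulk downloads are the friendlier route.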

You could try a couple of approaches. NCBI has an Entrez service that takes lists of gi numbers and allows you to download them as a batch; you might try that. The other alternative is to simply use the nr and nt databases (which are in FASTA format) and, when you identify sequences that you are interested in, retrieve those via Entrez to get the fully annotated sequences. Both of these techniques are a bit more friendly than a mass query of NCBI.
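
The batching idea can be sketched as follows, in Python. The fetch_batch callback is hypothetical: it stands in for whatever client actually submits a list of gi numbers to the batch service, which is not shown here. The batch size and pause length are likewise assumed values, not recommendations from NCBI.

```python
import time


def chunk_ids(ids, batch_size):
    """Yield consecutive slices of `ids`, each at most `batch_size` long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(ids), batch_size):
        yield ids[start:start + batch_size]


def fetch_in_batches(ids, fetch_batch, batch_size=200,
                     pause_seconds=1.0, sleep=time.sleep):
    """Call `fetch_batch` once per chunk, pausing between calls.

    `fetch_batch` is a hypothetical callable that takes a list of IDs
    and returns a list of records.
    """
    results = []
    batches = list(chunk_ids(ids, batch_size))
    for i, batch in enumerate(batches):
        results.extend(fetch_batch(batch))
        if i < len(batches) - 1:
            sleep(pause_seconds)  # be polite between requests
    return results
```

Passing the sleep function as a parameter is only a convenience so the pacing logic can be exercised without actually waiting.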

A final approach is to download GenBank (this will take a while) and then query it locally using fastacmd or some other home-grown tool. For instance, the Bioperl FAQ covers querying and retrieving sequences from an indexed database. If you have access to EMBOSS, this also has an indexing facility that you can access via BioPerl's extensions.

Hope this helps.
kevin clancy

        -----Original Message-----
        From: bioperl-l-request at bioperl.org [mailto:bioperl-l-request at bioperl.org]
        Sent: Thu 3/20/2003 12:02 PM
        To: bioperl-l at bioperl.org
        Cc:
        Subject: Bioperl-l Digest, Vol 3, Issue 45
       
       

        Today's Topics:
       
           1. Question regarding NR database (Kerr Wall)
       
       
        ----------------------------------------------------------------------
       
        Message: 1
        Date: Thu, 20 Mar 2003 11:21:49 -0500
        From: Kerr Wall <pkerrwall at psu.edu>
        Subject: [Bioperl-l] Question regarding NR database
        To: <bioperl-l at bioperl.org>
        Message-ID: <BA9F54CD.8523%pkerrwall at psu.edu>
        Content-Type: text/plain; charset="US-ASCII"
       
        Hi,
       
        I am somewhat new to Bioperl and have checked the mailing list archive with
        no luck.  I am trying to come up with a way to get all of the nucleotide cds
        sequences that are in the NR protein database.  There are currently
        1,363,299 protein sequences in NCBI's NR database file.  I would like to get
        a nucleotide sequence for each of these protein sequences.
       
        I have devised a way to use Entrez to get the sequences but I am wondering
        if there is an easier way to do this.  I can retrieve the HTML file for each
        protein sequence in NR using Entrez, then parse out the CDS HTML link for
        each protein, then find the nucleotide sequence file in Entrez, and finally
        parse out the coding region nucleotide sequence.  This would require
        1,363,299 x 2 requests to Entrez for such a job.  Is it ok to hammer the
        Entrez server this many times?
       
        I've downloaded the NT database as well but am not sure how to link the two
        files.  Hopefully someone has already had to do this and has thought about
        the logic to accomplish such a job.
       
        Thanks,
       
        Kerr
       
       
        ------------------------------
       
        _______________________________________________
        Bioperl-l mailing list
        Bioperl-l at bioperl.org
        http://bioperl.org/mailman/listinfo/bioperl-l
       
       
        End of Bioperl-l Digest, Vol 3, Issue 45
        ****************************************
       

