[Bioperl-l] RE: Bioperl-l Digest, Vol 3, Issue 45

Thu Mar 20 13:34:10 EST 2003

Kerr,

If you look on NCBI's ftp site - ftp://ftp.ncbi.nih.gov/blast/db - you will see both the nt and nr sequence collections. You could simply download these and use them as sequence sources. nr contains protein sequences; nt contains nucleic acid sequences.

NCBI will probably not appreciate your hitting the Entrez server many thousands of times. Your sys admin would probably be a bit miffed as well, particularly if you are affecting other services while doing this.

You could try a couple of approaches. NCBI has an Entrez service that takes lists of gi numbers and lets you download the records as a batch, so you might try that first. The other alternative is to simply use the nr and nt databases (which are in fasta format) and, once you have identified the sequences you are interested in, retrieve just those via Entrez to get the fully annotated records. Both of these techniques are a bit more friendly than a mass query of NCBI.
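The batch idea can be sketched roughly as follows, in plain Python rather than BioPerl. This is only an illustration: it assumes the nr deflines carry identifiers of the form gi|12345|..., and the E-utilities efetch address, the batch size and the function names are my own assumptions rather than anything stated in this thread. It builds the request URLs but does not send them.

```python
import re
from urllib.parse import urlencode

# Assumes deflines of the form ">gi|12345|ref|NP_000001.1| description".
GI_PATTERN = re.compile(r"gi\|(\d+)")

# Assumed endpoint (NCBI E-utilities efetch); not named in the original message.
EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def gi_numbers(fasta_lines):
    """Yield every gi number found on FASTA header lines, in file order."""
    for line in fasta_lines:
        if line.startswith(">"):
            # One nr header can list several merged entries, so collect all matches.
            for match in GI_PATTERN.finditer(line):
                yield match.group(1)


def batches(items, size):
    """Group an iterable into lists of at most `size` items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def efetch_url(gis, db="protein", rettype="gb"):
    """Build one request URL covering a whole batch of gi numbers."""
    query = urlencode(
        {"db": db, "id": ",".join(gis), "rettype": rettype, "retmode": "text"}
    )
    return f"{EFETCH_URL}?{query}"
```

With something like this, a loop over batches(gi_numbers(open("nr")), 200) would give one URL per 200 identifiers instead of one request per sequence. You would still want to pause between requests and check NCBI's current usage guidelines first.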

A final approach is to download GenBank (this will take a while) and then query it locally using fastacmd or some other home-grown tool. For instance, the BioPerl FAQ covers indexing a sequence database and retrieving sequences from it. If you have access to EMBOSS, it also has an indexing facility that you can access via BioPerl's extensions.
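To give an idea of what a home-grown local lookup might look like, here is a minimal sketch in plain Python. It is not fastacmd, the BioPerl indexing modules or EMBOSS. It assumes a FASTA file whose first header token is a unique ID, and the function names are hypothetical. It records the byte offset of each record once, then seeks straight to a record on request.

```python
def build_index(path):
    """Map each record's first header token to the byte offset of its '>' line."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()  # position of the line about to be read
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                if fields:
                    index[fields[0].decode("ascii")] = offset
    return index


def fetch_record(path, index, seq_id):
    """Return one FASTA record (header plus sequence lines) as text."""
    with open(path, "rb") as handle:
        handle.seek(index[seq_id])
        lines = [handle.readline()]  # the header line itself
        for line in handle:
            if line.startswith(b">"):  # the next record starts here
                break
            lines.append(line)
    return b"".join(lines).decode("ascii")
```

For a file the size of nr or nt, the index would not fit comfortably in an in-memory dictionary, so in practice you would store it on disk (for example in a DBM file), which is roughly what the existing indexing tools do for you.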

Hope this helps.
kevin clancy

	-----Original Message----- 
	From: bioperl-l-request at bioperl.org [mailto:bioperl-l-request at bioperl.org] 
	Sent: Thu 3/20/2003 12:02 PM 
	To: bioperl-l at bioperl.org 
	Cc: 
	Subject: Bioperl-l Digest, Vol 3, Issue 45


	Today's Topics:

	   1. Question regarding NR database (Kerr Wall)

	----------------------------------------------------------------------

	Message: 1
	Date: Thu, 20 Mar 2003 11:21:49 -0500
	From: Kerr Wall <pkerrwall at psu.edu>
	Subject: [Bioperl-l] Question regarding NR database
	To: <bioperl-l at bioperl.org>
	Message-ID: <BA9F54CD.8523%pkerrwall at psu.edu>
	Content-Type: text/plain; charset="US-ASCII"

	Hi,

	I am somewhat new to Bioperl and have checked the mailing list archive with
	no luck.  I am trying to come up with a way to get all of the nucleotide cds
	sequences that are in the NR protein database.  There are currently
	1,363,299 protein sequences in NCBI's NR database file.  I would like to get
	a nucleotide sequence for each of these protein sequences.

	I have devised a way to use Entrez to get the sequences but I am wondering
	if there is an easier way to do this.  I can retrieve the html file for each
	protein sequence in NR using Entrez, then parse out the CDS html link for
	each protein, then find the nucleotide sequence file in Entrez, and finally
	parse out the coding region nucleotide sequence.  This would require
	1,363,299 x 2 requests to Entrez for such a job.  Is it ok to hammer the
	Entrez server this many times?

	I've downloaded the NT database as well but not sure how to link the two
	files.  Hopefully someone has already had to do this and has thought about
	the logic to accomplish such a job.

	Thanks,

	Kerr

	------------------------------
