[Bioperl-l] Question regarding NR database

Thu Mar 20 11:21:49 EST 2003

Hi,

I am somewhat new to Bioperl and have searched the mailing list archive with
no luck.  I am trying to come up with a way to get the nucleotide CDS
sequences for all of the entries in the NR protein database.  There are
currently 1,363,299 protein sequences in NCBI's NR database file, and I would
like to get the corresponding nucleotide sequence for each of them.

I have devised a way to use Entrez to get the sequences, but I am wondering
if there is an easier way to do this.  I can retrieve the HTML file for each
protein sequence in NR using Entrez, then parse out the CDS HTML link for
each protein, then find the nucleotide sequence file in Entrez, and finally
parse out the coding region nucleotide sequence.  This would require
1,363,299 x 2 = 2,726,598 requests to Entrez for such a job.  Is it OK to
hammer the Entrez server this many times?
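
To put a rough number on that, here is a back-of-the-envelope sketch in
Python.  The 3-second pause between requests is purely an assumption for
illustration, not a limit I have confirmed with NCBI, and the function names
are made up for this example.

```python
# Rough estimate of how long the two-requests-per-protein approach would take.
PROTEIN_COUNT = 1_363_299      # protein sequences in NR, as stated above
REQUESTS_PER_PROTEIN = 2       # one protein record + one nucleotide record
SECONDS_PER_REQUEST = 3.0      # ASSUMPTION: polite delay between requests
SECONDS_PER_DAY = 86_400


def total_requests(n_proteins, per_protein=REQUESTS_PER_PROTEIN):
    """Number of Entrez requests needed for n_proteins proteins."""
    return n_proteins * per_protein


def estimated_days(n_requests, seconds_per_request=SECONDS_PER_REQUEST):
    """Wall-clock days if requests are issued strictly one after another."""
    return n_requests * seconds_per_request / SECONDS_PER_DAY


if __name__ == "__main__":
    n = total_requests(PROTEIN_COUNT)
    print(f"requests: {n:,}")
    print(f"days at {SECONDS_PER_REQUEST:g} s/request: {estimated_days(n):.1f}")
```

Under that assumed delay the job would run for roughly three months of
continuous, serial requests, which is part of why I am hoping there is a
better way.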

I've downloaded the NT database as well, but I am not sure how to link the
two files.  Hopefully someone has already had to do this and has worked out
the logic for such a job.

Thanks,

Kerr