[BioPython] Writing a biopython script to download all Genbank records from Nucleotide database

Sun Nov 11 19:37:02 UTC 2007

Matthew Abravanel wrote:
> Hi everyone,
> 
>         Please ignore my last messages, I am still getting the hang of this
> e-mail list and everything. I am trying to write a bio-python script to
> download all Genbank records in the Nucleotide database and I know what I
> want to do just not how to go about doing it. I am using a Unix based system
> with bio-python 2.4 and I am using emacs editor,  if someone could help me
> out I would really appreciate it with some sample code or something.  I
> just started learning python  and have tried to follow the documentation and
> cookbook without much success,  my programming experience is  virtually
> non-existent.  Thanks.
> 
> Matthew

Hi Matthew,

I used the code below to retrieve some entries from the Nucleotide database. 
Since two entries already take a few seconds, it is probably a bad idea to 
download _all_ entries in that way.

You might be better off downloading the data first:
ftp://ftp.ncbi.nih.gov/genbank/

HTH,
cheers,
Christof

from Bio import GenBank

featureParser = GenBank.FeatureParser()
ncbiDict = GenBank.NCBIDictionary("nucleotide", "genbank", parser=featureParser)

accessionNumbers = ["BC063166", "NM_028459"]

for accessionNo in accessionNumbers:
     giList = GenBank.search_for(accessionNo)
     for gi in giList:
         record = ncbiDict[gi]   # parsing happens here
         for feature in record.features:
             # extract sequences
             if feature.type == "CDS":
                 codingStart = feature.location._start.position
                 codingEnd = feature.location._end.position
                 completeSequence = record.seq.tostring()
                 fiveUTRSequence = completeSequence[:codingStart]
                 codingSequence = completeSequence[codingStart:codingEnd]
                 threeUTRSequence = completeSequence[codingEnd:]
             # extract gene name
             if feature.type == "gene":
                 geneName = feature.qualifiers['gene'][0]
         print "Found", gi, geneName, len(completeSequence)