[BioPython] Writing a biopython script to download all Genbank records from Nucleotide database
Christof Winter
winter at biotec.tu-dresden.de
Sun Nov 11 19:37:02 UTC 2007
Matthew Abravanel wrote:
> Hi everyone,
>
> Please ignore my last messages, I am still getting the hang of this
> e-mail list and everything. I am trying to write a bio-python script to
> download all Genbank records in the Nucleotide database and I know what I
> want to do just not how to go about doing it. I am using a Unix based system
> with bio-python 2.4 and I am using emacs editor, if someone could help me
> out I would really appreciate it with some sample code or something. I
> just started learning python and have tried to follow the documentation and
> cookbook without much success, my programming experience is virtually
> non-existent. Thanks.
>
> Matthew
Hi Matthew,
I used the code below to retrieve some entries from the Nucleotide database.
Since two entries already take a few seconds, it is probably a bad idea to
download _all_ entries in that way.
You might be better off downloading the data first:
ftp://ftp.ncbi.nih.gov/genbank/
HTH,
cheers,
Christof
from Bio import GenBank
featureParser = GenBank.FeatureParser()
ncbiDict = GenBank.NCBIDictionary("nucleotide", "genbank", parser=featureParser)
accessionNumbers = ["BC063166", "NM_028459"]
for accessionNo in accessionNumbers:
giList = GenBank.search_for(accessionNo)
for gi in giList:
record = ncbiDict[gi] # parsing happens here
for feature in record.features:
# extract sequences
if feature.type == "CDS":
codingStart = feature.location._start.position
codingEnd = feature.location._end.position
completeSequence = record.seq.tostring()
fiveUTRSequence = completeSequence[:codingStart]
codingSequence = completeSequence[codingStart:codingEnd]
threeUTRSequence = completeSequence[codingEnd:]
# extract gene name
if feature.type == "gene":
geneName = feature.qualifiers['gene'][0]
print "Found", gi, geneName, len(completeSequence)
More information about the Biopython
mailing list