[BioPython] How can I retrieve FASTA sequences from NCBI

Peter (BioPython) biopython at maubp.freeserve.co.uk
Sat Apr 1 19:59:46 UTC 2006


Srinivas Iyyer wrote:
> Hi,
> I have 151,204 GenBank Accession IDs.
> I want to retrieve FASTA sequences from NCBI and
> compile them for my local blast.
>
> I am unable to get fasta sequences. I do not
> understand.
>
> Could anyone please help me?

This should help.  The first identifier in your example, AA035383, is a
nucleotide sequence available from the NCBI.  Searching the Entrez
database for it takes you here:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=1507107

Note that the accession AA035383 maps to the GI number 1507107
(AA035383 --> gi:1507107).

Using the web interface, you can choose to view it as FASTA format 
rather than the default of GenBank format, and save to file.

You could make a note of that URL, and just change the GI number to 
download all the files you want - but you need a simple way to determine 
the GI number...
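
As a rough sketch of that idea, the URL above can be treated as a template
with the GI number filled in.  The helper name entrez_viewer_url is made up
for illustration, and the URL pattern is just the one shown above; that the
NCBI keeps serving it in this form is an assumption.

```python
# Hypothetical helper: rebuild the viewer URL shown above for any GI number.
VIEWER_URL = "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi"


def entrez_viewer_url(gi_number, database="nucleotide"):
    """Return the Entrez viewer URL for one GI number."""
    gi = str(gi_number).strip()
    # GI numbers are plain integers; reject accessions passed in by mistake.
    if not gi.isdigit():
        raise ValueError("expected a numeric GI number, got %r" % (gi_number,))
    return "%s?db=%s&val=%s" % (VIEWER_URL, database, gi)


print(entrez_viewer_url(1507107))
```

Passing an accession such as 'AA035383' raises a ValueError, which is a
cheap way to catch the two kinds of identifier getting mixed up.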

Now, BioPython can help you here:

 >>> from Bio import GenBank
 >>> gi_list = GenBank.search_for('AA035383', database='nucleotide')
 >>> print gi_list
['1507107']

You could use this code to get the GI numbers for each of your 151,204 
GenBank Accession IDs.  I would check in each case that only one GI 
number is returned.

 >>> assert len(gi_list)==1
 >>> gi_number = gi_list[0]
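
For the full list of accessions, the same check could be wrapped in a loop
that sets aside any accession with zero or several hits instead of stopping
on the first failed assert.  This is only a sketch: resolve_accessions is a
made-up name, and the lookup is passed in as a function so the example runs
here with a stand-in rather than a real NCBI query.

```python
def resolve_accessions(accessions, search):
    """Map each accession to its single GI number.

    `search` is any callable that takes an accession and returns a list
    of GI numbers.  Accessions that do not give exactly one hit are
    collected separately so they can be inspected by hand afterwards.
    """
    resolved = {}
    problems = {}
    for accession in accessions:
        gi_list = search(accession)
        if len(gi_list) == 1:
            resolved[accession] = gi_list[0]
        else:
            problems[accession] = list(gi_list)
    return resolved, problems


# Stand-in lookup table (illustrative data, no network access needed).
fake_hits = {"AA035383": ["1507107"], "BOGUS": []}
resolved, problems = resolve_accessions(
    ["AA035383", "BOGUS"], lambda acc: fake_hits.get(acc, []))
print(resolved)
print(problems)
```

With the real lookup you would pass something like
lambda acc: GenBank.search_for(acc, database='nucleotide') as the search
argument, and with 151,204 identifiers it would be polite to pause between
requests.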

Once you have the GI number, then you could just download the FASTA file 
yourself and then parse it in the normal way.  Or, get BioPython to do 
all this for you with its rather clever NCBIDictionary object...

 >>> from Bio import Fasta
 >>> from Bio import GenBank
 >>> ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'fasta',
 ...                                    parser=Fasta.RecordParser())
 >>> gi_number = '1507107'
 >>> fasta_rec = ncbi_dict[gi_number]
 >>> print fasta_rec
 >gi|1507107|gb|AA035383.1|AA035383 zk25e12.r1 Soares_pregnant_uterus_NbHPU Homo sapiens cDNA clone IMAGE:471598 5', mRNA sequence
CTTGAGCCTCAGGAACGAGATGGCGGTTCTCTGGAGGCTGAGTGCCGTTTGCGGTGCCCT
AGGAGGCCGAGCTCTGTTGCTTCGAACTCCAGTGGTCAGACCTGCTCATATCTCAGCATT
TCTTCAGGACCGACCTATCCCAGAATGGTGTGGAGTGCAGCACATACACTTGTCACCCGA
GCCACCATTCTGGCTCCAAGGCTGCATCTCTCCACTGGACTAGCGAGANGGTTGTCANTG
TTTTGCTCCTGGGTCTGCTTCCCGGCTGCTTANTTGAANCCTTGCTCNGCGANGGACTAN
TCCCTGGC

You could use the Fasta.SequenceParser() if you prefer.  I would guess 
you would then want to save these FASTA records into one long FASTA file.
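
A minimal sketch of that last step, assuming you have pulled the title and
sequence out of each record as plain strings (the write_fasta name, the
file name and the two demo records are made up for illustration):

```python
def write_fasta(records, filename, width=60):
    """Write (title, sequence) pairs to one FASTA file.

    Each sequence is wrapped at `width` characters per line, matching
    the 60-column layout of the NCBI output above.
    """
    with open(filename, "w") as handle:
        for title, sequence in records:
            handle.write(">%s\n" % title)
            for start in range(0, len(sequence), width):
                handle.write(sequence[start:start + width] + "\n")


# Two made-up records, just to show the call.
example = [("demo_1 first record", "ACGT" * 20),
           ("demo_2 second record", "TTGACA")]
write_fasta(example, "combined.fasta")
```

The resulting file holds all the records one after another, which is the
layout needed for building a local BLAST database.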

Enjoy!

Peter



