[BioPython] How can I retrieve FASTA sequences from NCBI
Peter (BioPython)
biopython at maubp.freeserve.co.uk
Sat Apr 1 19:59:46 UTC 2006
Srinivas Iyyer wrote:
> Hi ,
> I have 151,204 GenBank Accession IDs.
> I want to retrieve FASTA sequences from NCBI and
> compile them for my local blast.
>
> I am unable to get fasta sequences. I do not
> understand.
>
> Could any one please help me.
This should help. The first identifier in your example, AA035383, is a
nucleotide sequence available from the NCBI. Searching the Entrez
database for it leads here:
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nucleotide&val=1507107
Note, AA035383 --> gi:1507107
Using the web interface, you can choose to view it as FASTA format
rather than the default of GenBank format, and save to file.
You could make a note of that URL, and just change the GI number to
download all the files you want - but you need a simple way to determine
the GI number...
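The URL-editing idea can be sketched as a small helper. This is only a
minimal sketch, assuming the viewer URL pattern quoted above stays the
same for every record; the name viewer_url is made up for illustration.
The URL it builds opens the default GenBank view, so the FASTA choice
would still be made in the web interface.

```python
# Base of the viewer URL quoted earlier in this message.
VIEWER_BASE = "http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi"


def viewer_url(gi_number, database="nucleotide"):
    """Build the Entrez viewer URL for one GI number (hypothetical helper)."""
    gi_text = str(gi_number).strip()
    # GI numbers are plain integers; reject accessions passed by mistake.
    if not gi_text.isdigit():
        raise ValueError("expected a numeric GI number, got %r" % (gi_number,))
    return "%s?db=%s&val=%s" % (VIEWER_BASE, database, gi_text)
```

Calling viewer_url('1507107') gives back the same URL as shown above.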
Now, BioPython can help you here:
>>> from Bio import GenBank
>>> gi_list = GenBank.search_for('AA035383', database='nucleotide')
>>> print gi_list
['1507107']
You could use this code to get the GI numbers for each of your 151,204
GenBank Accession IDs. I would check in each case that only one GI
number is returned.
>>> assert len(gi_list)==1
>>> gi_number = gi_list[0]
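With 151,204 identifiers, a bare assert would stop the whole run at the
first odd result. One possible alternative, sketched here under the
assumption that you have some lookup function returning a list of GI
strings for one accession (for example a thin wrapper around
GenBank.search_for as used above), is to set the problem cases aside
and carry on. The names map_accessions and lookup are hypothetical.

```python
def map_accessions(accessions, lookup):
    """Map each accession to its single GI number.

    `lookup` is assumed to take one accession and return a list of GI
    strings. Accessions with zero or several hits are collected in a
    separate dictionary rather than guessed at.
    """
    gi_by_accession = {}
    problems = {}
    for accession in accessions:
        gi_list = lookup(accession)
        if len(gi_list) == 1:
            gi_by_accession[accession] = gi_list[0]
        else:
            # Keep the raw result so it can be checked by hand later.
            problems[accession] = list(gi_list)
    return gi_by_accession, problems
```

The second returned dictionary then lists every accession that needs a
manual look, together with whatever the search returned for it.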
Once you have the GI number, then you could just download the FASTA file
yourself and then parse it in the normal way. Or, get BioPython to do
all this for you with its rather clever NCBIDictionary object...
>>> from Bio import Fasta
>>> from Bio import GenBank
>>> ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'fasta',
...                                    parser=Fasta.RecordParser())
>>> gi_number = '1507107'
>>> fasta_rec = ncbi_dict[gi_number]
>>> print fasta_rec
>gi|1507107|gb|AA035383.1|AA035383 zk25e12.r1 Soares_pregnant_uterus_NbHPU Homo sapiens cDNA clone IMAGE:471598 5', mRNA sequence
CTTGAGCCTCAGGAACGAGATGGCGGTTCTCTGGAGGCTGAGTGCCGTTTGCGGTGCCCT
AGGAGGCCGAGCTCTGTTGCTTCGAACTCCAGTGGTCAGACCTGCTCATATCTCAGCATT
TCTTCAGGACCGACCTATCCCAGAATGGTGTGGAGTGCAGCACATACACTTGTCACCCGA
GCCACCATTCTGGCTCCAAGGCTGCATCTCTCCACTGGACTAGCGAGANGGTTGTCANTG
TTTTGCTCCTGGGTCTGCTTCCCGGCTGCTTANTTGAANCCTTGCTCNGCGANGGACTAN
TCCCTGGC
You could use Fasta.SequenceParser() instead if you prefer. I would guess
you then want to save these FASTA records into one long FASTA file.
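Saving everything into one file could look roughly like this. It is a
minimal sketch, assuming each record is available as FASTA-formatted
text (the print output above suggests str(fasta_rec) would do, but
treat that as an assumption); the name write_fasta_file is made up.

```python
def write_fasta_file(fasta_texts, filename):
    """Write FASTA-formatted strings to one file, one record after another.

    Returns the number of records written.
    """
    count = 0
    with open(filename, "w") as handle:
        for text in fasta_texts:
            text = text.strip()
            # Every FASTA record must begin with a '>' title line.
            if not text.startswith(">"):
                raise ValueError("not a FASTA record: %r" % text[:30])
            # Exactly one newline between consecutive records.
            handle.write(text + "\n")
            count += 1
    return count
```

The resulting file can then be formatted as a local BLAST database in
the usual way.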
Enjoy!
Peter