[BioPython] FASTA parsing errors

Brad Chapman chapmanb at uga.edu
Wed Aug 4 09:26:26 EDT 2004


Hi Aaron;

[Retrieval from NCBI]
> I seem to still have a problem with the results I'm getting, I need a 
> protein sequence in order to do a BLAST search with the data from my 
> genbank lookup, however the FASTA file created now just contains the 
> nucleotide. I tried the following line:
> 
> ncbi_dict = GenBank.NCBIDictionary("protein", "fasta", 
> Fasta.RecordParser())
> 
> thinking that possibly changing "nucleotide" to "protein" in your 
> original recommendation would help things but I still get the following 
> results which are not in protein sequence form:
> 
> >gi|6273291|gb|AF191665.1|AF191665 Opuntia marenae rpl16 gene; 
> chloroplast gene for chloroplast product, partial intron sequence
> TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT
[...]

You are are still retrieving with a nucleotide GenBank ID. If you
want protein sequence, you need to use a protein GI number. This is
just a handy interface to what NCBI provides, so no magic is going
on at all -- you just get what you ask for.

If there is available protein sequence for the specific similarity
searches you are trying to do, retrieving with their protein GI
numbers will work. If protein is not available, your best bet is to
use one of the translated BLAST search programs (blastx, tblastx).

[Parsing GenBank records]
> That parsed just fine and I'm getting output now, however I'm now  
> having trouble figuring out how to extract the protein sequence from  
> this record. Looking at the API for the Record class, there isn't a  
> nice clear 'protein' attribute the way that 'sequence' is.  Do you have  
> any recommendations on how I would access that part of the Record?

'sequence' is the protein sequence, provided that you are dealing
with a GenBank protein record. If you are talking about doing
translations, then this is something that isn't done automatically.
Biopython can't know whether this is coding sequence, if it is in
the right frame, and so on. If you know all this, then you can
translate the DNA sequence yourself:

>>> from Bio import Translate
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna = Seq("GACTAGCCC", IUPAC.unambiguous_dna)
>>> translator = Translate.unambiguous_dna_by_name["Standard"]
>>> prot = translator.translate(dna)
>>> prot
Seq('D*P', HasStopCodon(IUPACProtein(), '*'))

Hope this helps.
Brad


More information about the BioPython mailing list