[BioPython] FASTA parsing errors

Aaron Zschau aaron at ocelot-atroxen.dyndns.org
Tue Aug 3 19:15:19 EDT 2004


I still have a problem with the results I'm getting. I need a 
protein sequence in order to do a BLAST search with the data from my 
GenBank lookup, but the FASTA file created now contains only the 
nucleotide sequence. I tried the following line:

ncbi_dict = GenBank.NCBIDictionary("protein", "fasta", 
Fasta.RecordParser())

thinking that changing "nucleotide" to "protein" in your 
original recommendation might help, but I still get the following 
result, which is not a protein sequence:

>gi|6273291|gb|AF191665.1|AF191665 Opuntia marenae rpl16 gene; chloroplast gene for chloroplast product, partial intron sequence
TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT
CTAAATGATATAGGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTA
ATAAAGCATGAATACAGATTCACACATAATTATCTGATATGAATCTATTCATAGAAAAAA
GAAAAAAGTAAGAGCCTCCGGCCAATAAAGACTAAGAGGGTTGGCTCAAGAACAAAGTTC
ATTAAGAGCTCCATTGTAGAATTCAGACCTAATCATTAATCAAGAAGCGATGGGAACGAT
GTAATCCATGAATACAGAAGATTCAATTGAAAAAGATCCTATGNTCATTGGAAGGATGGC
GGAACGAACCAGAGACCAATTCATCTATTCTGAAAAGTGATAAACTAATCCTATAAAACT
AAAATAGATATTGAAAGAGTAAATATTCGCCCGCGAAAATTCCTTTTTTATTAAATTGCT
CATATTTTCTTTTAGCAATGCAATCTAATAAAATATATCTATACAAAAAAACATAGACAA
ACTATATATATATATATATATAATATATTTCAAATTCCCTTATATATCCAAATATAAAAA
TATCTAATAAATTAGATGAATATCAAAGAATCTATTGATTTAGTGTATTATTAAATGTAT
ATATTAATTCAATATTATTATTCTATTCATTTTTATTCATTTTCAAATTTATAATATATT
AATCTATATATTAATTTAGAATTCTATTCTAATTCGAATTCAATTTTTAAATATTCATAT
TCAATTAAAATTGAAATTTTTTCATTCGCGAGGAGCCGGATGAGAAGAAACTCTCATGTC
CGGTTCTGTAGTAGAGATGGAATTAAGAAAAAACCATCAACTATAACCCCAAAAGAACCA
GA
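
For reference, this is roughly how I am checking what came back, without relying on Biopython at all. It is only a minimal sketch: the function names `read_fasta` and `looks_like_nucleotide` are made up for illustration, the 90% threshold is an arbitrary assumption, and a short peptide made mostly of A, C, G and T would fool it.

```python
def read_fasta(text):
    """Return (header, sequence) for the first record in FASTA text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break  # a second record begins; stop after the first
            header = line[1:]
        elif header is not None:
            chunks.append(line)
    if header is None:
        raise ValueError("no FASTA header line starting with '>' found")
    return header, "".join(chunks)


def looks_like_nucleotide(sequence, threshold=0.9):
    """Heuristic: True if at least `threshold` of the letters are A, C, G, T, U or N."""
    if not sequence:
        raise ValueError("empty sequence")
    nucleotide_letters = set("ACGTUN")
    hits = sum(1 for residue in sequence.upper() if residue in nucleotide_letters)
    return hits >= threshold * len(sequence)


if __name__ == "__main__":
    example = (
        ">gi|6273291|gb|AF191665.1|AF191665 Opuntia marenae rpl16 gene\n"
        "TATACATTAAAGGAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT\n"
        "CTAAATGATATAGGATTCCACTATGTAAGGTCTTTGAATCATATCATAAAAGACAATGTA\n"
    )
    header, sequence = read_fasta(example)
    # Report the GI number from the header and the alphabet guess.
    print(header.split("|")[1], looks_like_nucleotide(sequence))
```

Run on the output above, the check reports a nucleotide alphabet, which matches what I see by eye.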

thanks,

Aaron Zschau

On Aug 3, 2004, at 6:26 PM, Brad Chapman wrote:

> Hi Aaron;
>
> Aaron:
>>> This is the file that is being read. I know it worked in 1.24 just
>>> fine, but maybe something changed between versions so that it no
>>> longer likes this format.
>>>
>>> LOCUS       XM_414447               2107 bp    mRNA    linear   VRT
> [....]
>
> Jon:
>> I don't think that file conforms to the fasta format:
>> see http://ngfnblast.gbf.de/docs/fasta.html
>> I could be wrong though.
>
> Right. That's a GenBank file, which is why the Fasta parser is
> choking on it (the error message should be a lot nicer, for sure).
>
> You have two options:
>
> 1. Use a GenBank parser.
>
> 2. Retrieve Fasta sequences. Going from the code you posted
> previously, you could retrieve your search in FASTA format with the
> following:
>
> from Bio import GenBank
> from Bio import Fasta
>
> ncbi_dict = GenBank.NCBIDictionary("nucleotide", "fasta",
>              Fasta.RecordParser())
>
> seqrecord = ncbi_dict["6273291"]
>
> genbank_file = open(data_path_prefix + file_unique_id + 'fasta', 'w')
> # convert the record to its FASTA text first; it cannot be added to a string directly
> genbank_file.write(str(seqrecord) + "\n")
> genbank_file.close()
>
> This may have changed with the most recent release because the
> default for GenBank retrieval used to be fasta. Because of changes
> at NCBI this had to be updated, and I believe now defaults to
> GenBank. So, if you didn't specify "fasta" as the second argument,
> that's probably why you are now getting GenBank data. Hopefully this
> small change in your code will fix everything.
>
> Hope this helps.
> Brad
> _______________________________________________
> BioPython mailing list  -  BioPython at biopython.org
> http://biopython.org/mailman/listinfo/biopython
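
On the point quoted above about the Fasta parser choking on GenBank input with an unhelpful error: a check along these lines before parsing would at least give a clearer message. This is only a sketch; `sniff_sequence_format` is a hypothetical helper, not part of Biopython, and it looks at nothing but the first non-blank line.

```python
def sniff_sequence_format(text):
    """Guess 'fasta' or 'genbank' from the first non-blank line, or raise ValueError."""
    for line in text.splitlines():
        if not line.strip():
            continue
        if line.startswith(">"):
            return "fasta"    # FASTA records begin with a '>' title line
        if line.startswith("LOCUS"):
            return "genbank"  # GenBank flat files begin with a LOCUS line
        raise ValueError("unrecognized format; first line starts with %r" % line[:20])
    raise ValueError("input is empty")


if __name__ == "__main__":
    print(sniff_sequence_format("LOCUS       XM_414447               2107 bp    mRNA"))
    print(sniff_sequence_format(">gi|6273291|gb|AF191665.1|AF191665\nTATACATT"))
```

Calling it on the retrieved text before handing that text to a parser makes a GenBank/FASTA mix-up obvious right away.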


