[BioPython] extraction from genbank/embl files

Mon Apr 6 10:50:15 UTC 2009

On 4/6/09, Liam Thompson <dejmail at gmail.com> wrote:
> Hi Peter & Sean
>
> I am looking for a nucleotide sequence for these three genes and I have
> downloaded the entire genomic sequences so that I can compare the same 3
> genes from all the same isolates. I downloaded the full GenBank and FASTA
> version of the same set of accession numbers, for as you said FASTA will be
> easier to work with once I can identify the location information from the
> info of the GB file.

The NCBI, at least, provides three flavours of FASTA file for each genome:
*.fna - FASTA Nucleic Acids - entire DNA nucleotide sequence as one record
*.faa - FASTA Amino Acids - amino acid sequences for each gene
*.ffn - FASTA Feature Nucleotides - nucleotide sequences for each gene
This is easiest to see on the FTP site.
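
To get a feel for what an ffn file contains, here is a minimal sketch that
reads FASTA records and lists each header with its sequence length. It uses
only the standard library; the read_fasta helper and the toy records are made
up for illustration and are not real NCBI data. With Biopython installed,
SeqIO.parse(handle, "fasta") from Bio.SeqIO does the same job and returns
SeqRecord objects instead of tuples.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from an open FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new record starts; emit the previous one first.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    # Emit the final record, if any.
    if header is not None:
        yield header, "".join(chunks)


# Toy stand-in for the contents of an ffn file (hypothetical records).
toy_ffn = (
    ">gene_one toy record\n"
    "ATGAAA\n"
    "CCCTAA\n"
    ">gene_two toy record\n"
    "ATGGGGTTTTAA\n"
)

records = list(read_fasta(io.StringIO(toy_ffn)))
for header, seq in records:
    print(header, len(seq))
```

For a real file, replace the StringIO object with open("your_genome.ffn"),
where the filename is whatever you downloaded.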

In your case, using the ffn files might be simplest - assuming you can
recognise the genes from their sequences (e.g. using pairwise
alignments to known references).
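
As a rough illustration of recognising a gene by pairwise alignment, the
sketch below scores each candidate against a known reference with a plain
Needleman-Wunsch global alignment and keeps the best one. The scoring values
(match 1, mismatch -1, gap -2), the helper names and the toy sequences are all
assumptions chosen for the example. Pure Python like this is slow for long
sequences, so for real data a proper alignment tool would be the better choice.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the global (Needleman-Wunsch) alignment score of a and b.

    Uses a linear gap penalty and keeps only two rows of the score matrix.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_match(reference, candidates):
    """Return the (name, sequence) pair that scores highest against reference."""
    return max(candidates, key=lambda item: nw_score(reference, item[1]))


# Toy data (hypothetical): one reference gene and two candidate records,
# e.g. as they might be read from an ffn file.
reference = "ATGAAACCG"
candidates = [
    ("gene_a", "ATGAAACCC"),
    ("gene_b", "ATGGGGTTTTAA"),
]

name, seq = best_match(reference, candidates)
print(name, nw_score(reference, seq))
```

Repeating this for each of the three reference genes against every record in
an isolate's ffn file would give one best candidate per gene per isolate; it
is still worth checking the scores by eye, since the top hit is not
necessarily a true match.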

Peter