[BioPython] extraction from genbank/embl files

Sun Apr 5 15:29:40 EDT 2009

On 4/5/09, Liam Thompson <dejmail at gmail.com> wrote:
> Hi everyone
>
>  I have a list of accession numbers, which I've used to download the entire
>  genomic sequences of several hundred hepatitis B virus isolates. What I am
>  trying to do is extract 3 gene sequences from each genomic sequence, and
>  place each sequence in one of 3 files depending on the gene for further
>  analysis.

Are you looking for the CDS sequence of these three genes (i.e. a
nucleotide sequence)?

>  The question is whether there is a shorter way to extract from Genbank files
>  using the Genbank parser, specific gene sequences, or whether I would need
>  to identify the gene of each genomic isolate individually (as they are
>  called a variety of names, despite being the same gene which makes it
>  trickier), copy the coordinates of the gene sequence, and then proceed
>  further down the file and actually perform the copying of the gene.

I see two main options for you (regardless of what programming
language you want to use):

(1) Compile a list of all the gene names by hand.
(2) Compile a few examples by hand, and then use pairwise alignments
(e.g. BLAST, or FASTA, or needle from EMBOSS) to find the matching
gene in each virus.  You could do this with the protein or the
nucleotide sequence.
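
For option (1), a hand-made lookup table may be all that is needed. The sketch below is only an illustration: the synonyms are made-up examples rather than a curated list for hepatitis B, and GENE_SYNONYMS and canonical_gene are hypothetical names. You would fill the table in from whatever annotation your own files actually use.

```python
# Hand-compiled table mapping annotation variants to one canonical gene name.
# The entries are illustrative assumptions; replace them with the names that
# actually appear in your GenBank/EMBL files.
GENE_SYNONYMS = {
    "s": "S",
    "hbsag": "S",
    "surface antigen": "S",
    "c": "C",
    "core protein": "C",
    "p": "P",
    "pol": "P",
    "polymerase": "P",
}


def canonical_gene(name):
    """Return the canonical gene name, or None if the name is not in the table."""
    return GENE_SYNONYMS.get(name.strip().lower())


print(canonical_gene("HBsAg"))
print(canonical_gene("Polymerase"))
print(canonical_gene("something else"))
```

Names that come back as None are the ones you still need to look at by hand (or resolve by alignment, as in option 2).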

With Biopython's Bio.SeqIO EMBL/GenBank parser, each gene/CDS in the
EMBL/GenBank file is represented as a SeqFeature object, which
includes the location information.  If you can identify which features
you want from their annotation, then that tells you where to cut the
parent sequence.  See this page for some related discussion:
http://www.warwick.ac.uk/go/peter_cock/python/genbank/
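
As a rough sketch of the SeqFeature approach, something like the following could work. It assumes a reasonably recent Biopython in which SeqFeature has an extract() method, and it assumes the genes can be recognised from the /gene or /product qualifiers. The function name cds_by_gene and the toy record are made up for illustration; with real data the records would come from SeqIO.parse(filename, "genbank") instead.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord


def cds_by_gene(record, wanted_names):
    """Yield (name, nucleotide Seq) for each CDS feature whose name is wanted."""
    for feature in record.features:
        if feature.type != "CDS":
            continue
        # Qualifier values are lists of strings; check /gene, then /product.
        names = feature.qualifiers.get("gene", []) + feature.qualifiers.get("product", [])
        for name in names:
            if name in wanted_names:
                # extract() cuts the parent sequence at the feature location
                # (and reverse-complements minus-strand features).
                yield name, feature.extract(record.seq)
                break


# Toy record built by hand so the example runs without any input file.
record = SeqRecord(Seq("AAAATGGCCTAATTT"), id="toy1")
record.features.append(
    SeqFeature(FeatureLocation(3, 12, strand=1), type="CDS", qualifiers={"gene": ["S"]})
)
record.features.append(
    SeqFeature(FeatureLocation(0, 3, strand=1), type="CDS", qualifiers={"gene": ["X"]})
)

hits = list(cds_by_gene(record, {"S"}))
for name, seq in hits:
    print(name, seq)
```

If the same gene goes by several names across isolates, wanted_names would need to contain all of the variants.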

As an alternative approach, rather than starting with the EMBL/GenBank
files, can you just download the CDS sequences as a FASTA file?  For
example, the NCBI FTP site has files called *.ffn.  You might also want
to download the genes' protein sequences; the NCBI uses *.faa for these
(FASTA amino acids).

Having FASTA files would make the sequence comparison approach easiest,
since most of these tools expect FASTA input files.
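
If you end up extracting the genes yourself, writing one FASTA file per gene is straightforward with Bio.SeqIO. The sketch below is only one possible layout: fasta_per_gene and the isolate identifiers are hypothetical, and it writes to in-memory handles so it runs without touching the disk. In practice you could pass SeqIO.write a filename per gene instead.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord


def fasta_per_gene(hits):
    """Group (gene, isolate_id, Seq) tuples by gene and return {gene: FASTA text}."""
    grouped = {}
    for gene, isolate_id, seq in hits:
        record = SeqRecord(seq, id=isolate_id, description="")
        grouped.setdefault(gene, []).append(record)

    fasta_text = {}
    for gene, records in grouped.items():
        handle = StringIO()
        # SeqIO.write accepts a handle or a filename.
        SeqIO.write(records, handle, "fasta")
        fasta_text[gene] = handle.getvalue()
    return fasta_text


# Made-up isolates and sequences, purely for illustration.
example_hits = [
    ("S", "isolate1", Seq("ATGGCCTAA")),
    ("S", "isolate2", Seq("ATGGCTTAA")),
    ("C", "isolate1", Seq("ATGGACTAA")),
]
result = fasta_per_gene(example_hits)
print(result["S"])
```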

Peter