[BioPython] extraction from genbank/embl files

Liam Thompson dejmail at gmail.com
Mon Apr 6 00:21:05 EDT 2009


Hi Peter & Sean

Yes, I am looking for the nucleotide sequences of these three genes. I have
downloaded the entire genomic sequences so that I can compare the same 3
genes across all of the isolates. I downloaded both the full GenBank and the
FASTA versions for the same set of accession numbers since, as you said,
FASTA will be easier to work with once I have identified the gene locations
from the GenBank file.

I'll give SeqFeature a bash, and possibly the seqret program from EMBOSS as
well.
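
For the SeqFeature route, this is roughly what I have in mind. It is only a
sketch under several assumptions: the synonym table, the file name
hbv_isolates.gb and the helper names are all hypothetical, the gene labels
are placeholders that would need replacing with whatever names actually
appear in the records, and it assumes a Biopython version that provides
SeqFeature.extract(). Each CDS feature is matched against the synonym table
through its gene/product/note qualifiers, and its nucleotide sequence is cut
from the parent record and grouped per gene.

```python
from collections import defaultdict

# Hypothetical synonym table: these labels are illustrative placeholders and
# must be replaced with the names actually used in the downloaded records.
GENE_SYNONYMS = {
    "S": {"s", "hbsag", "surface antigen"},
    "C": {"c", "hbcag", "core protein"},
    "X": {"x", "hbx", "x protein"},
}


def classify_feature(qualifiers):
    """Return the canonical gene key for a feature's qualifiers, or None."""
    for key in ("gene", "product", "note"):
        for value in qualifiers.get(key, []):
            label = value.strip().lower()
            for gene, names in GENE_SYNONYMS.items():
                if label in names:
                    return gene
    return None


def extract_genes(records):
    """Group the nucleotide sequence of each matching CDS by canonical gene."""
    by_gene = defaultdict(list)
    for record in records:
        for feature in record.features:
            if feature.type != "CDS":
                continue
            gene = classify_feature(feature.qualifiers)
            if gene is None:
                continue  # unrecognised name; extend GENE_SYNONYMS if needed
            # extract() cuts the parent sequence using the feature location
            sub_sequence = feature.extract(record.seq)
            by_gene[gene].append((record.id, str(sub_sequence)))
    return by_gene


def main(genbank_path="hbv_isolates.gb"):
    """Write one FASTA file per gene (file names here are assumptions)."""
    # Imported here so the helpers above can be used without Biopython.
    from Bio import SeqIO

    by_gene = extract_genes(SeqIO.parse(genbank_path, "genbank"))
    for gene, entries in by_gene.items():
        with open("gene_%s.fasta" % gene, "w") as handle:
            for record_id, sequence in entries:
                handle.write(">%s\n%s\n" % (record_id, sequence))
```

Calling main() with the path to the multi-record GenBank file would then
write one FASTA file per gene. Any CDS whose name is not in the table is
skipped, so the table would have to grow as new spellings turn up.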

Thanks
Liam



On Sun, Apr 5, 2009 at 9:29 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:

> On 4/5/09, Liam Thompson <dejmail at gmail.com> wrote:
> > Hi everyone
> >
> >  I have a list of accession numbers, which I've used to download the
> >  entire genomic sequences of several hundred hepatitis B virus isolates.
> >  What I am trying to do is extract 3 gene sequences from each genomic
> >  sequence, and place each sequence in one of 3 files depending on the
> >  gene for further analysis.
>
> Are you looking for the CDS sequence of these three genes (i.e. a
> nucleotide sequence)?
>
> >  The question is whether there is a shorter way to extract specific gene
> >  sequences from GenBank files using the GenBank parser, or whether I
> >  would need to identify the gene of each genomic isolate individually
> >  (they are called a variety of names despite being the same gene, which
> >  makes it trickier), copy the coordinates of the gene sequence, and then
> >  proceed further down the file and actually perform the copying of the
> >  gene.
>
> I see two main options for you (regardless of what programming
> language you want to use):
>
> (1) Compile a list of all the gene names by hand.
> (2) Compile a few examples by hand, and then use pairwise alignments
> (e.g. BLAST, or FASTA, or needle from EMBOSS) to find the matching
> gene in each virus.  You could do this with the protein or the
> nucleotide sequence.
>
> Using Biopython's Bio.SeqIO EMBL/GenBank parser, each gene/CDS in the
> EMBL/GenBank file will be represented as a SeqFeature object, which
> includes the location information.  If you can identify which features
> you want from their annotation, then that tells you where to cut the
> parent sequence.  See this page for some related discussion:
> http://www.warwick.ac.uk/go/peter_cock/python/genbank/
>
> As an alternative approach, rather than starting with the EMBL/GenBank
> files, can you just download the CDS sequences as a FASTA file?  e.g.
> files called *.ffn from the NCBI ftp site.  You might also want to
> download the genes' protein sequences; the NCBI uses *.faa for these
> (FASTA amino acids).
>
> Having FASTA files would make the sequence comparison approach easiest
> - most of these tools will expect FASTA input files.
>
> Peter
>



-- 
-----------------------------------------------------------
Antiviral Gene Therapy Research Unit
University of the Witwatersrand
Faculty of Health Sciences, Room 7Q07
7 York Road, Parktown
2193

Tel: 2711 717 2465/7
Fax: 2711 717 2395
Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com

