[BioPython] extraction from genbank/embl files
Liam Thompson
dejmail at gmail.com
Mon Apr 6 04:21:05 UTC 2009
Hi Peter & Sean
Yes, I am looking for the nucleotide sequences of these three genes. I have
downloaded the entire genomic sequences so that I can compare the same three
genes across all the isolates. I downloaded both the full GenBank and the FASTA
versions for the same set of accession numbers because, as you said, FASTA will
be easier to work with once I have identified the gene locations from the
GenBank file.
I'll give SeqFeature a bash, and possibly the seqret feature of EMBOSS as
well.
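
A minimal sketch of the SeqFeature route, assuming a recent Biopython where
SeqFeature objects provide an extract() method. The synonym table and the
helper names (GENE_SYNONYMS, classify_feature, extract_genes, write_fasta)
are hypothetical and only for illustration. The thread does not name the
three genes, so the aliases below are guesses that would have to be replaced
with a hand-compiled list.

```python
from collections import defaultdict

# Hypothetical synonym table: the three genes are not named in the thread,
# so these canonical names and aliases are illustrative guesses only.
GENE_SYNONYMS = {
    "surface": {"s", "hbsag", "surface"},
    "core": {"c", "hbcag", "core"},
    "polymerase": {"p", "pol", "polymerase"},
}


def classify_feature(qualifiers, synonyms=GENE_SYNONYMS):
    """Return the canonical gene name for a feature's qualifiers, or None."""
    # Qualifier values are lists of strings in Biopython.
    labels = list(qualifiers.get("gene", [])) + list(qualifiers.get("product", []))
    for label in labels:
        key = label.strip().lower()
        for canonical, aliases in synonyms.items():
            if key in aliases:
                return canonical
    return None


def extract_genes(genbank_path):
    """Collect (record id, nucleotide sequence) pairs per gene from a GenBank file."""
    # Imported here so classify_feature can be tried without Biopython installed.
    from Bio import SeqIO

    found = defaultdict(list)
    for record in SeqIO.parse(genbank_path, "genbank"):
        for feature in record.features:
            if feature.type != "CDS":
                continue
            gene = classify_feature(feature.qualifiers)
            if gene is not None:
                # extract() cuts the feature's location out of the parent sequence.
                found[gene].append((record.id, str(feature.extract(record.seq))))
    return found


def write_fasta(found, prefix="genes"):
    """Write one FASTA file per gene, e.g. genes_core.fasta."""
    for gene, entries in found.items():
        with open(f"{prefix}_{gene}.fasta", "w") as handle:
            for record_id, seq in entries:
                handle.write(f">{record_id}_{gene}\n{seq}\n")
```

Calling write_fasta(extract_genes("isolates.gb")) would then give one file per
gene, where "isolates.gb" is a placeholder filename. Any CDS whose name is not
in the table is silently skipped, so it is worth counting the hits per record
to spot isolates that use a name I have not listed yet.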
Thanks
Liam
On Sun, Apr 5, 2009 at 9:29 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On 4/5/09, Liam Thompson <dejmail at gmail.com> wrote:
> > Hi everyone
> >
> > I have a list of accession numbers, which I've used to download the entire
> > genomic sequences of several hundred hepatitis B virus isolates. What I am
> > trying to do is extract 3 gene sequences from each genomic sequence, and
> > place each sequence in one of 3 files depending on the gene for further
> > analysis.
>
> Are you looking for the CDS sequence of these three genes (i.e. a
> nucleotide sequence)?
>
> > The question is whether there is a shorter way to extract specific gene
> > sequences from GenBank files using the GenBank parser, or whether I would
> > need to identify the gene of each genomic isolate individually (they are
> > called a variety of names despite being the same gene, which makes it
> > trickier), copy the coordinates of the gene sequence, and then proceed
> > further down the file and actually perform the copying of the gene.
>
> I see two main options for you (regardless of what programming
> language you want to use):
>
> (1) Compile a list of all the gene names by hand.
> (2) Compile a few examples by hand, and then use pairwise alignments
> (e.g. BLAST, or FASTA, or needle from EMBOSS) to find the matching
> gene in each virus. You could do this with the protein or the
> nucleotide sequence.
>
> Using Biopython's Bio.SeqIO EMBL/GenBank parser each gene/CDS in the
> EMBL/GenBank file will be represented as a SeqFeature object, which
> includes the location information. If you can identify which features
> you want from their annotation, then that tells you where to cut the
> parent sequence. See this page for some related discussion:
> http://www.warwick.ac.uk/go/peter_cock/python/genbank/
>
> As an alternative approach, rather than starting with the EMBL/GenBank
> files, can you just download the CDS sequences as a FASTA file? e.g.
> files called *.ffn from the NCBI ftp site. You might also want to
> download the genes' protein sequences; the NCBI uses *.faa for these
> (FASTA amino acids).
>
> Having FASTA files would make the sequence comparison approach easiest
> - most of these tools will expect FASTA input files.
>
> Peter
>
--
-----------------------------------------------------------
Antiviral Gene Therapy Research Unit
University of the Witwatersrand
Faculty of Health Sciences, Room 7Q07
7 York Road, Parktown
2193
Tel: 2711 717 2465/7
Fax: 2711 717 2395
Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com