[Biopython-dev] Bio.GFF and Brad's code
Peter
biopython at maubp.freeserve.co.uk
Wed Apr 22 12:24:36 EDT 2009
On Mon, Apr 13, 2009 at 1:16 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> I don't think the GFF parser should only return SeqRecord objects, but
> I do see a use for this (via Bio.SeqIO). GFF files could be
> represented as a list of SeqFeature objects, and using a SeqRecord to
> hold this seems very natural to me. It also means we could use
> Bio.SeqIO to load a GFF file into SeqRecord objects for storage in a
> BioSQL database.
>
> If you look at the NCBI FTP site, they often provide genome sequences
> in a range of file formats including GenBank and GFF.
>
> e.g.
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/
>
> The GenBank files contain the features plus the sequence,
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gbk
>
> Their GFF3 file only contains the features:
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.gff
>
> Some GFF files include the sequence too. This one does not, but we can
> fetch the sequence separately in FASTA format:
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.fna
>
> In principle, you could parse this FASTA file and the GFF3 file and
> put together a GenBank file - or vice versa.
>
> As an aside, I would also consider adding protein table support along
> the same lines. Look at this file:
> ftp://ftp.ncbi.nih.gov/genomes/Bacteria/Escherichia_coli_K_12_substr__MG1655/NC_000913.ptt
> The header information gives us the genome size, so Bio.SeqIO could
> return a SeqRecord with lots of SeqFeature objects and for the
> SeqRecord's seq property use a Bio.Seq.UnknownSeq of length 4639675bp.
> This is something I might look at implementing myself after Biopython
> 1.50 is out. We should be able to read in a GenBank file and output a
> PTT file, and verify it matches the NCBI provided version of the PTT
> file.
There is a working NCBI protein table ("ptt") format parser for Bio.SeqIO
on Bug 2819, including unit tests:
http://bugzilla.open-bio.org/show_bug.cgi?id=2819
Hopefully this will be useful in integrating the GFF/GFF3 parser into
Bio.SeqIO, as well as being worthwhile in its own right. This "ptt"
parser should work fine with BioSQL and GenomeDiagram, offering
a lightweight alternative to parsing the GenBank or GFF3 file when
all you care about is the locations of the proteins (CDS features).
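
To give a rough idea of what reading a protein table involves, here is a
minimal standalone sketch using only the standard library. It is not the
code attached to Bug 2819. The function name parse_ptt is made up for
illustration, and the layout it assumes (a title line ending in
"1..LENGTH", a protein count line, a tab separated column header, then one
row per protein) is my understanding of the NCBI files rather than a
formal specification. The two data rows are invented sample values; only
the genome length comes from the discussion above.

```python
import re


def parse_ptt(lines):
    """Return (genome_length, features) from the lines of a PTT file.

    Hypothetical helper for illustration only; assumes the layout
    described in the accompanying text.
    """
    rows = iter(lines)
    title = next(rows).rstrip("\n")
    # Assumed: the title line ends with the sequence range, "1..LENGTH".
    match = re.search(r"(\d+)\.\.(\d+)\s*$", title)
    if match is None:
        raise ValueError("No sequence range found in title line")
    genome_length = int(match.group(2))
    next(rows)  # skip the "N proteins" count line
    columns = next(rows).rstrip("\n").split("\t")
    features = []
    for line in rows:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        row = dict(zip(columns, line.split("\t")))
        start, end = row["Location"].split("..")
        features.append({
            "start": int(start) - 1,  # convert to zero-based, Python style
            "end": int(end),
            "strand": 1 if row["Strand"] == "+" else -1,
            "gene": row["Gene"],
            "locus": row["Synonym"],
            "product": row["Product"],
        })
    return genome_length, features


# Invented sample rows; only the length 4639675 is taken from the thread.
SAMPLE = (
    "Example organism, complete genome - 1..4639675\n"
    "2 proteins\n"
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct\n"
    "190..255\t+\t21\t-\tgeneA\tx0001\t-\t-\thypothetical protein A\n"
    "337..2799\t-\t820\t-\tgeneB\tx0002\t-\t-\thypothetical protein B\n"
)

genome_length, features = parse_ptt(SAMPLE.splitlines())
```

The genome length from the title line is what would let a parser build a
placeholder sequence of the right size, and each feature dictionary holds
what would be needed for a CDS SeqFeature (location, strand, and a few
qualifiers).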
Peter