[Biopython-dev] Parsing NCBI Protein Tables (PTT) files in Bio.SeqIO
Peter
biopython-dev at maubp.freeserve.co.uk
Tue Feb 27 01:24:48 UTC 2007
I have a rough NCBI Protein Table (*.ptt) file parser prepared for
Bio.SeqIO (see below).
This file format was discussed in August 2006, and is unusual in that it
does not actually contain any sequences (!). This means the parser
returns SeqRecord objects with empty sequences, but with populated
annotation fields.
I believe Leighton Pritchard was interested in parsing PTT files from
the NCBI. Does something like this (below) look useful? Does anyone
know of a link to any "offical" documentation on this file format?
Leighton, you also mentioned parsing the NCBI's GFF files, which seem to
be a tab separated variable dump of the information found in a GenBank
file's features table (link to documentation welcome).
An entire GFF file could be turned into a single SeqRecord with no
sequence, but with many sub features as SeqFeatures (akin to the results
of the existing "genbank" parser). The location information would be
simplified for GFF.
Also, it looks like parsing just the CDS entries from a GFF file into
"sequence free" SeqRecords would also be sensible... (akin to the
existing "genbank-cds" parser).
Peter
--
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
def NcbiProteinTableIterator(handle) :
"""Returns a SeqRecord for each entry in an NCBI Protein Table (PTT
file)
Note that the SeqRecord object's sequence will be zero length (emtpy).
"""
line = handle.readline()
line = handle.readline()
parts = line.rstrip().split()
if len(parts) <> 2 or parts[1].lower() <> "proteins" :
raise SyntaxError("Second line not recognised as an NCBI
Protein Table (PTT file)")
line = handle.readline().strip()
if line.rstrip() <>
"Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct" :
raise SyntaxError("Third line not recognised as an NCBI Protein
Table (PTT file)")
LOCATION = 0
STRAND = 1
LENGTH = 2
PID = 3
GENE = 4
SYNONYM = 5
CODE = 6
COG = 7
PRODUCT = 8
mapping = {
LOCATION : "location",
STRAND : "strand",
PID : "PID",
GENE : "gene",
SYNONYM : "synonym",
CODE : "locus_tag", # Is this always correct?
COG : "COG",
PRODUCT : "product",
}
count = int(parts[0])
for line in handle :
parts = line.rstrip().split("\t")
record = SeqRecord(seq=Seq(""),
id=parts[PID],
name=parts[GENE],
description=parts[PRODUCT])
if parts[LENGTH] <> "-" :
record.annotations["length"] = int(parts[LENGTH])
for field,key in mapping.iteritems() :
if parts[field] <> "-" :
record.annotations[key] = parts[field]
#TODO - Make sure STRAND is treated same as for GenBank
yield record
count -= 1
if count <> 0 :
raise SyntaxError("Record header number of records wrong?")
More information about the Biopython-dev
mailing list