[Biopython-dev] Parsing NCBI Protein Tables (PTT) files in Bio.SeqIO

Tue Feb 27 01:24:48 UTC 2007

I have a rough NCBI Protein Table (*.ptt) file parser prepared for 
Bio.SeqIO (see below).

This file format was discussed in August 2006, and is unusual in that it 
does not actually contain any sequences (!).  This means the parser 
returns SeqRecord objects with empty sequences, but with populated 
annotation fields.

I believe Leighton Pritchard was interested in parsing PTT files from 
the NCBI.  Does something like this (below) look useful?  Does anyone 
know of a link to any "offical" documentation on this file format?

Leighton, you also mentioned parsing the NCBI's GFF files, which seem to 
be a tab separated variable dump of the information found in a GenBank 
file's features table (link to documentation welcome).

An entire GFF file could be turned into a single SeqRecord with no 
sequence, but with many sub features as SeqFeatures (akin to the results 
of the existing "genbank" parser).  The location information would be 
simplified for GFF.

Also, it looks like parsing just the CDS entries from a GFF file into 
"sequence free" SeqRecords would also be sensible... (akin to the 
existing "genbank-cds" parser).

Peter

-- 

from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq
def NcbiProteinTableIterator(handle) :
     """Returns a SeqRecord for each entry in an NCBI Protein Table (PTT 
file)

     Note that the SeqRecord object's sequence will be zero length (emtpy).
     """
     line = handle.readline()

     line = handle.readline()
     parts = line.rstrip().split()
     if len(parts) <> 2 or parts[1].lower() <> "proteins" :
         raise SyntaxError("Second line not recognised as an NCBI 
Protein Table (PTT file)")

     line = handle.readline().strip()
     if line.rstrip() <> 
"Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct" :
         raise SyntaxError("Third line not recognised as an NCBI Protein 
Table (PTT file)")

     LOCATION = 0
     STRAND = 1
     LENGTH = 2
     PID = 3
     GENE = 4
     SYNONYM = 5
     CODE = 6
     COG = 7
     PRODUCT = 8

     mapping = {
         LOCATION : "location",
         STRAND : "strand",
         PID : "PID",
         GENE : "gene",
         SYNONYM : "synonym",
         CODE : "locus_tag", # Is this always correct?
         COG : "COG",
         PRODUCT : "product",
     }

     count = int(parts[0])
     for line in handle :
         parts = line.rstrip().split("\t")
         record = SeqRecord(seq=Seq(""),
                            id=parts[PID],
                            name=parts[GENE],
                            description=parts[PRODUCT])
         if parts[LENGTH] <> "-" :
             record.annotations["length"] = int(parts[LENGTH])
         for field,key in mapping.iteritems() :
             if parts[field] <> "-" :
                 record.annotations[key] = parts[field]
         #TODO - Make sure STRAND is treated same as for GenBank
         yield record
         count -= 1
     if count <> 0 :
         raise SyntaxError("Record header number of records wrong?")