[Biopython-dev] sequence format readers ?

Thu Sep 6 05:22:01 EDT 2001

Brad Chapman <chapmanb at arches.uga.edu> writes:

> Hi Thomas!
> 
> > To follow up one of the discussions and questions at ISMB in Copenhagen,
> > - how are we going to proceed with the sequence format reader (the
> > biopython variant of readseq ...)
> 
> It's great that you're going to work on this! It's definitely much
> desired by a lot o' people (in fact I was just having a conversation
> today about format conversion).
> 
> > Currently we only have parsers for Fasta, Embl and GenBank.  What we
> > need is an internal format and functions/modules which can read/write:
> [...impressive list o' formats...]
> > ??? - more suggestions ?
> 
> I think supporting this many would be an *excellent* start :-).
> 
> > I can write most of the rules, but I guess we have to define a smart base
> > class/parser - where plugging in a new format should only take 5 seconds ...
> > If we brainstorm on the design of the reader/writer, I could volunteer to
> > implement the format rules ...
> > 
> > Some things to consider:
> > * some formats are alignment based (e.g. clustal, phylip, nexus)
> > * some formats have loads of information which is lost when converted to a
> >   lower info-rich format( e.g. Embl -> Fasta). But Embl -> GenBank should
> >   not lose any information 
> > * some formats allow multiple entries, some not
> 
> Just as a way of getting things started (I haven't done a lot of
> thinking about this), my opinion is that the best way to do this is
> to have a SeqIO system kinda like Bioperl. The inputs into the SeqIO
> system would be the standard SeqRecord object that we currently
> have. The advantage of this is that existing parsers (ie Fasta,
> GenBank), already parse into this, so all that would need to be done
> is to define a mapping that converts a generic SeqRecord object to
> and from the formats "native" Record based representation. So to
> convert from GenBank to Fasta you could do:
> 
> GenBank Record Format --> SeqRecord --> Fasta Record Format 
> 
> Since the Record formats already provide writing capabilities (and
> we have the parsers to parse into them) we would already get writing
> and parsing "for free." Also, we would make good use of our existing
> "generic" Sequence representations.
> 
> The advantage of this is that it would help us avoid having to make
> a billion GenBank -> Fasta, GenBank -> EMBL, GenBank -> whatever
> specific converters. The disadvantage of this is that we may lose
> some information in the conversion process (but then again, what
> converters don't :-).

I think subclassing the Seq object as a SeqIOSeq object is enough.
We just need to add a single dictionary (features) where all the
extra Swiss/EMBL/GenBank annotations can be stored.

e.g.
class SeqIOSeq(Seq):
    def __init__(self, data=""):
        # Seq needs the sequence data; default to an empty sequence
        Seq.__init__(self, data)
        # dictionary for extra annotations (e.g. Embl, GenBank)
        self.features = {}

In the case of
GenBank Record Format --> SeqIOSeq --> Fasta Record Format
we pick only the name and sequence ...

but for
GenBank Record Format --> SeqIOSeq --> EMBL Record Format
the writer function should check whether there are any additional features
(self.features.keys()) and write them out as well.
That way we shouldn't lose any information.
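
To make the difference between the two writers concrete, here is a rough
sketch. Everything in it is an assumption for illustration: the SeqIOSeq
stand-in is a plain class (so it runs without Biopython), write_annotated
is a hypothetical name, and its NAME/NOTE/SEQ layout is made up and is
NOT real EMBL syntax.

```python
# Stand-in for the proposed SeqIOSeq; plain Python so the sketch runs
# without Biopython installed.
class SeqIOSeq:
    def __init__(self, name, data):
        self.name = name
        self.data = data
        self.features = {}  # extra annotations (e.g. from Embl, GenBank)


def write_fasta(seq):
    """Low-info writer: keeps only name and sequence, ignores seq.features."""
    return ">%s\n%s\n" % (seq.name, seq.data)


def write_annotated(seq):
    """Hypothetical rich-format writer: also emits every extra annotation.

    The line layout is invented for this sketch, not a real format.
    """
    lines = ["NAME %s" % seq.name]
    for key in sorted(seq.features.keys()):
        lines.append("NOTE %s=%s" % (key, seq.features[key]))
    lines.append("SEQ %s" % seq.data)
    return "\n".join(lines) + "\n"


record = SeqIOSeq("X01234", "ATGGCC")
record.features["organism"] = "Homo sapiens"

fasta_text = write_fasta(record)     # annotation is dropped here
rich_text = write_annotated(record)  # annotation survives here
```

The point is only that the decision about what to keep lives in each
writer, while the intermediate object carries everything it was given.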

It would be nice if a new format can be added by simply adding functions
for reading, writing and recognizing the format.
I am not completely sure how to define these functions - any ideas?

example code ...

import sys
from Bio.Seq import Seq
NO, YES = 0,1

class SeqIOSeq(Seq):
    def __init__(self, data=""):
        # Seq needs the sequence data; default to an empty sequence
        Seq.__init__(self, data)
        # dictionary for extra annotations (e.g. Embl, GenBank)
        self.features = {}

class SeqIO:
    # dictionary to store functions for
    # recognizing, reading and writing of different sequence formats
    recognizers = {}
    readers = {}
    writers = {}

    def __init__(self, **kwds):
        self.name = None
        self.format = None
        self.sequence = SeqIOSeq()
        self.is_an_alignment = NO
        self.allow_multiple_entries = YES
        # iterate over (key, value) pairs, not just the keys
        for k, v in kwds.items():
            setattr(self, k, v)

    def AddFormat(self, name, recognizeF, readF, writeF):
        self.recognizers[name] = recognizeF
        self.readers[name] = readF
        self.writers[name] = writeF
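
One possible shape for the three per-format functions, as a sketch only.
Nothing below is existing Biopython API: FormatRegistry, add_format,
guess_format and convert are hypothetical names, the record is a plain
dictionary standing in for SeqIOSeq, and the "tab" format (name, a tab,
then the sequence) is a toy format invented so there is something to
convert to.

```python
class FormatRegistry:
    """Hypothetical registry holding one recognizer/reader/writer per format."""

    def __init__(self):
        self.recognizers = {}
        self.readers = {}
        self.writers = {}

    def add_format(self, name, recognize, read, write):
        self.recognizers[name] = recognize
        self.readers[name] = read
        self.writers[name] = write

    def guess_format(self, text):
        # Ask each recognizer in turn; the first one that says yes wins.
        for name in sorted(self.recognizers):
            if self.recognizers[name](text):
                return name
        return None

    def convert(self, text, to_format):
        from_format = self.guess_format(text)
        if from_format is None:
            raise ValueError("unrecognized input format")
        if to_format not in self.writers:
            raise ValueError("no writer for format %r" % to_format)
        record = self.readers[from_format](text)  # native -> generic record
        return self.writers[to_format](record)    # generic record -> native


# --- Fasta (single entry only, for brevity) ---
def is_fasta(text):
    return text.startswith(">")

def read_fasta(text):
    lines = text.strip().split("\n")
    header = lines[0][1:].split()
    name = header[0] if header else ""
    return {"name": name, "sequence": "".join(lines[1:]), "features": {}}

def write_fasta(record):
    return ">%s\n%s\n" % (record["name"], record["sequence"])


# --- "tab": toy format invented for this sketch ---
def is_tab(text):
    return not text.startswith(">") and "\t" in text.split("\n")[0]

def read_tab(text):
    name, sequence = text.strip().split("\t", 1)
    return {"name": name, "sequence": sequence, "features": {}}

def write_tab(record):
    return "%s\t%s\n" % (record["name"], record["sequence"])


registry = FormatRegistry()
registry.add_format("fasta", is_fasta, read_fasta, write_fasta)
registry.add_format("tab", is_tab, read_tab, write_tab)

as_tab = registry.convert(">seq1 demo\nATG\nCCC\n", "tab")
as_fasta = registry.convert(as_tab, "fasta")
```

With this layout, adding a format means writing three small functions
and one add_format call; no format needs to know about any other.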

needing-a-machete-for-the-sequence-format-jungle'ly yr's
-thomas

-- 
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas at biopython.org           The Technical University of Denmark
CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...