[Biopython-dev] sequence format readers ?
Thomas Sicheritz-Ponten
thomas at cbs.dtu.dk
Thu Sep 6 05:22:01 EDT 2001
Brad Chapman <chapmanb at arches.uga.edu> writes:
> Hi Thomas!
>
> > To follow up one of the discussions and questions at ISMB in Copenhagen,
> > - how are we going to proceed with the sequence format reader (the
> > biopython variant of readseq ...)
>
> It's great that you're going to work on this! It's definitely much
> desired by a lot o' people (in fact I was just having a conversation
> today about format conversion).
>
> > Currently we can only have parsers for Fasta, Embl and GenBank. What we
> > need is an internal format and functions/modules which can read/write:
> [...impressive list o' formats...]
> > ??? - more suggestions ?
>
> I think supporting this many would be an *excellent* start :-).
>
> > I can write most of the rules, but I guess we have to define a smart base
> > class/parser - where plugging in a new format should only take 5 seconds ...
> > If we brain storm on the design of the reader/writer, I could volunteer to
> > implement the format rules ...
> >
> > Some things to consider:
> > * some formats are alignment based (e.g. clustal, phylip, nexus)
> > * some formats have loads of information which is lost when converted to a
> > less info-rich format (e.g. Embl -> Fasta). But Embl -> GenBank should
> > not lose any information
> > * some formats allow multiple entries, some not
>
> Just as a way of getting things started (I haven't done a lot of
> thinking about this), my opinion is that the best way to do this is
> to have a SeqIO system kinda like Bioperl. The inputs into the SeqIO
> system would be the standard SeqRecord object that we currently
> have. The advantage of this is that existing parsers (ie Fasta,
> GenBank), already parse into this, so all that would need to be done
> is to define a mapping that converts a generic SeqRecord object to
> and from the format's "native" Record based representation. So to
> convert from GenBank to Fasta you could do:
>
> GenBank Record Format --> SeqRecord --> Fasta Record Format
>
> Since the Record formats already provide writing capabilities (and
> we have the parsers to parse into them) we would already get writing
> and parsing "for free." Also, we would make good use of our existing
> "generic" Sequence representations.
>
> The advantage of this is that it would help us avoid having to make
> a billion GenBank -> Fasta, GenBank -> EMBL, GenBank -> whatever
> specific converters. The disadvantage of this is that we may lose
> some information in the conversion process (but then again, what
> converters don't :-).
I think subclassing Seq as a SeqIOSeq object is enough.
We just need to add a single dictionary (features) where all the
extra Swiss/EMBL/GenBank annotations can be stored.
e.g.
    class SeqIOSeq(Seq):
        def __init__(self, data=""):
            # Seq needs the sequence data, so default to an empty sequence
            Seq.__init__(self, data)
            # dictionary for extra annotations (e.g. Embl, GenBank)
            self.features = {}
In the case of
GenBank Record Format --> SeqIOSeq --> Fasta Record Format
we pick only the name and sequence ...
but for
GenBank Record Format --> SeqIOSeq --> EMBL Record Format
the writer function should check if there are any additional features
(self.features.keys()).
That way we shouldn't lose any information.
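To make the difference between the two writers concrete, here is a rough sketch. It deliberately avoids Biopython: Record is a hypothetical stand-in for SeqIOSeq, write_fasta and write_embl_like are made-up names, and the "EMBL" output is a simplified illustration, not the real flat-file layout.

```python
class Record:
    """Hypothetical stand-in for SeqIOSeq: name, sequence, extra features."""

    def __init__(self, name, sequence, features=None):
        self.name = name
        self.sequence = sequence
        self.features = features or {}


def write_fasta(record):
    # Fasta has no place for annotations, so the features are dropped here.
    return ">%s\n%s\n" % (record.name, record.sequence)


def write_embl_like(record):
    # Simplified, EMBL-flavoured layout for illustration only;
    # this is NOT the real EMBL flat-file specification.
    lines = ["ID   %s" % record.name]
    # The richer writer checks the features dictionary and emits every entry.
    for key in sorted(record.features):
        lines.append("FT   %s=%s" % (key, record.features[key]))
    lines.append("SQ   %s" % record.sequence)
    lines.append("//")
    return "\n".join(lines) + "\n"


rec = Record("X01234", "ATGGCC", {"organism": "E. coli"})
fasta_text = write_fasta(rec)
embl_text = write_embl_like(rec)
```

The Fasta writer never looks at rec.features, while the richer writer walks the whole dictionary, so nothing stored there is silently discarded.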
It would be nice if a new format could be added by simply adding functions
for reading, writing and recognizing the format.
I'm not completely sure how to define these functions - any ideas ?
example code ...
    import sys
    from Bio.Seq import Seq

    NO, YES = 0, 1

    class SeqIOSeq(Seq):
        def __init__(self, data=""):
            # Seq needs the sequence data, so default to an empty sequence
            Seq.__init__(self, data)
            # dictionary for extra annotations (e.g. Embl, GenBank)
            self.features = {}

    class SeqIO:
        # class-level dictionaries (shared by all instances) to store functions
        # for recognizing, reading and writing of different sequence formats
        recognizers = {}
        readers = {}
        writers = {}

        def __init__(self, **kwds):
            self.name = None
            self.format = None
            self.sequence = SeqIOSeq()
            self.is_an_alignment = NO
            self.allow_multiple_entries = YES
            # keyword arguments override the defaults above
            for k, v in kwds.items(): setattr(self, k, v)

        def AddFormat(self, name, recognizeF, readF, writeF):
            self.recognizers[name] = recognizeF
            self.readers[name] = readF
            self.writers[name] = writeF
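
One possible shape for the recognize/read/write functions, as a self-contained sketch rather than a settled design. Everything here is an assumption for illustration: FormatRegistry, guess_format and convert are hypothetical names, records are plain (name, sequence) tuples instead of SeqIOSeq objects, and the "tab" format (name, tab, sequence on one line) is invented just to have a second format to convert to.

```python
class FormatRegistry:
    """Hypothetical registry mirroring the three dictionaries in SeqIO."""

    def __init__(self):
        self.recognizers = {}
        self.readers = {}
        self.writers = {}

    def add_format(self, name, recognize, read, write):
        self.recognizers[name] = recognize
        self.readers[name] = read
        self.writers[name] = write

    def guess_format(self, text):
        # Ask each recognizer in turn; the first one that says yes wins.
        for name, recognize in self.recognizers.items():
            if recognize(text):
                return name
        return None

    def convert(self, text, to_format):
        from_format = self.guess_format(text)
        if from_format is None:
            raise ValueError("unrecognized input format")
        records = self.readers[from_format](text)
        return self.writers[to_format](records)


def recognize_fasta(text):
    return text.lstrip().startswith(">")


def read_fasta(text):
    # Returns a list of (name, sequence) tuples.
    records = []
    for chunk in text.split(">")[1:]:
        header, _, body = chunk.partition("\n")
        words = header.split()
        name = words[0] if words else ""
        records.append((name, "".join(body.split())))
    return records


def write_fasta(records):
    return "".join(">%s\n%s\n" % (name, seq) for name, seq in records)


def recognize_tab(text):
    first = text.lstrip().split("\n", 1)[0]
    return "\t" in first and not first.startswith(">")


def read_tab(text):
    records = []
    for line in text.splitlines():
        if line.strip():
            name, seq = line.split("\t", 1)
            records.append((name, seq.strip()))
    return records


def write_tab(records):
    return "".join("%s\t%s\n" % (name, seq) for name, seq in records)


registry = FormatRegistry()
registry.add_format("fasta", recognize_fasta, read_fasta, write_fasta)
registry.add_format("tab", recognize_tab, read_tab, write_tab)

tab_text = registry.convert(">seq1 demo\nACGT\nTTGA\n>seq2\nGGCC\n", "tab")
```

With this layout a new format costs three small functions plus one add_format call, and convert never needs to know which formats exist.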
needing-a-machete-for-the-sequence-format-jungle'ly yr's
-thomas
--
Sicheritz-Ponten Thomas, Ph.D CBS, Department of Biotechnology
thomas at biopython.org The Technical University of Denmark
CBS: +45 45 252489 Building 208, DK-2800 Lyngby
Fax +45 45 931585 http://www.cbs.dtu.dk/thomas
De Chelonian Mobile ... The Turtle Moves ...