[Biopython-dev] sequence format readers ?

Thu Sep 6 23:39:05 EDT 2001

At 11:22 AM +0200 9/6/01, Thomas Sicheritz-Ponten wrote:
>Brad Chapman <chapmanb at arches.uga.edu> writes:

[Thomas]
>>  > I can write most of the rules, but I guess we have to define a smart base
>>  > class/parser - where plugging in a new format should only take 5 seconds ...
>>  > If we brainstorm on the design of the reader/writer, I could volunteer to
>>  > implement the format rules ...
>>  >
>>  > Some things to consider:
>>  > * some formats are alignment based (e.g. clustal, phylip, nexus)
>>  > * some formats have loads of information which is lost when converted to a
>>  >   less information-rich format (e.g. Embl -> Fasta). But Embl -> GenBank
>>  >   should not lose any information
>>  > * some formats allow multiple entries, some not
>>
>>  Just as a way of getting things started (I haven't done a lot of
>>  thinking about this), my opinion is that the best way to do this is
>>  to have a SeqIO system kinda like Bioperl. The inputs into the SeqIO
>>  system would be the standard SeqRecord object that we currently
>>  have. The advantage of this is that existing parsers (ie Fasta,
>>  GenBank), already parse into this, so all that would need to be done
>>  is to define a mapping that converts a generic SeqRecord object to
>>  and from the formats "native" Record based representation. So to
>>  convert from GenBank to Fasta you could do:
>>
>>  GenBank Record Format --> SeqRecord --> Fasta Record Format
>>
>>  Since the Record formats already provide writing capabilities (and
>>  we have the parsers to parse into them) we would already get writing
>>  and parsing "for free." Also, we would make good use of our existing
>>  "generic" Sequence representations.
>>
>>  The advantage of this is that it would help us avoid having to make
>>  a billion GenBank -> Fasta, GenBank -> EMBL, GenBank -> whatever
>>  specific converters. The disadvantage of this is that we may lose
>>  some information in the conversion process (but then again, what
>>  converters don't :-).

Yes.  It would be nice to have a design where any conversion can be 
done via an intermediate data structure.  However, it should also be 
possible to plug in your own converter if you want.  For example, if 
you really need to have a good GenBank -> EMBL translator, you can 
code one up that bypasses the intermediate, and Biopython should use 
it.  That is, Biopython should have two methods for translation: 1) 
general, but possibly lossy, translation via an intermediate, and 2) 
direct translation if we happen to have a translator for those two 
types; and the methods should work together as seamlessly as possible.
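
A minimal sketch of how the two translation paths could fit together. 
Everything here is hypothetical: the Intermediate class, the registry 
functions and the toy "tab" format are assumptions for illustration, 
not existing Biopython code.

```python
# Hypothetical sketch; none of these names exist in Biopython.
from dataclasses import dataclass, field


@dataclass
class Intermediate:
    """Stand-in for the generic intermediate (e.g. a SeqRecord)."""
    id: str
    seq: str
    annotations: dict = field(default_factory=dict)


_readers = {}  # format name -> function(text) -> Intermediate
_writers = {}  # format name -> function(Intermediate) -> text
_direct = {}   # (source, target) -> function(text) -> text


def register_format(name, reader, writer):
    """Plug in a format by supplying its reader and writer."""
    _readers[name] = reader
    _writers[name] = writer


def register_direct(source, target, func):
    """Plug in a converter that bypasses the intermediate."""
    _direct[(source, target)] = func


def convert(text, source, target):
    """Use a direct converter if one is registered, else go via Intermediate."""
    direct = _direct.get((source, target))
    if direct is not None:
        return direct(text)
    return _writers[target](_readers[source](text))


# Two toy single-entry formats: FASTA and a made-up "id<TAB>sequence" format.
def read_fasta(text):
    header, *lines = text.strip().splitlines()
    return Intermediate(id=header[1:].split()[0], seq="".join(lines))


def write_fasta(rec):
    return ">%s\n%s\n" % (rec.id, rec.seq)


def read_tab(text):
    ident, seq = text.strip().split("\t")
    return Intermediate(id=ident, seq=seq)


def write_tab(rec):
    return "%s\t%s\n" % (rec.id, rec.seq)


register_format("fasta", read_fasta, write_fasta)
register_format("tab", read_tab, write_tab)
```

With this layout, convert(text, "fasta", "tab") takes the general 
path until someone calls register_direct("fasta", "tab", ...), after 
which the same call silently picks up the direct converter.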

>I think inheriting the Seq object to a SeqIOSeq object is enough.
>We just need to add a single dictionary (features) where all
>Swiss/EMBL/GenBank extra annotations can be added.
>
>e.g.
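>from Bio.Seq import Seq
>
>class SeqIOSeq(Seq):
>     def __init__(self, data):
>         # Seq needs the sequence data passed through to its constructor
>         Seq.__init__(self, data)
>         # dictionary for extra annotations (e.g. Embl, GenBank)
>         self.features = {}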
>
>
>In the case of
>GenBank Record Format --> SeqIOSeq --> Fasta Record Format
>we pick only the name and sequence ...
>
>but for
>GenBank Record Format --> SeqIOSeq --> EMBL Record Format
>the writer function should check if there are any additional features
>(self.features.keys())
>That way we shouldn't lose any information.

This seems like the same solution as the one that Brad suggested, 
except that SeqRecord is replaced by SeqIOSeq.  The SeqIOSeq is a 
much simpler format, so may be easier to use.  However, it leaves 
unspecified how the features should be stored, which may be 
problematic.  For example, the converter from SeqIOSeq to 
Fasta.Record will have to know what to use as the Fasta description. 
For GenBank, it might be the accession and comments.  For a 
SProt.Record, it might be the entry_name and description.  Thus, 
unless the SeqIOSeq.features elements are specified better, I'm 
afraid the SeqIOSeq -> X converter will have to know about all the 
other formats.

SeqRecord gets around this by defining (theoretically) all the 
information people would care about from a record, with a consistent 
interface.  Thus, a SeqRecord -> Fasta.Record converter will always 
use the SeqRecord.id and SeqRecord.description (or some other 
combination of attributes).
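
As a rough illustration of that point, here is a sketch of a writer 
that only ever looks at a fixed set of attributes.  The Record class 
below is a hypothetical stand-in with just id, description and seq, 
not the real SeqRecord, and to_fasta and its width parameter are 
names made up for this example.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Minimal stand-in for SeqRecord; only the fields used here."""
    id: str
    description: str
    seq: str


def to_fasta(record, width=60):
    """Format one record as FASTA text using only id, description and seq."""
    title = record.id
    if record.description:
        title += " " + record.description
    lines = [">" + title]
    # Wrap the sequence at a fixed line width.
    for start in range(0, len(record.seq), width):
        lines.append(record.seq[start:start + width])
    return "\n".join(lines) + "\n"
```

Because the writer never needs to know which format the record came 
from, each new input format only has to fill in the same attributes.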

>It would be nice if a new format can be added by simply adding functions
>for reading, writing and recognizing the format.
>I'm not completely sure of how to define these functions - any ideas ?

Not exactly, but it would be nice if those functions were exposed. 
For example, there should be a function somewhere called 
"whichformat" (similar to the whichdb package in Python's standard 
library) that returns a best guess at the format.
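
A very rough sketch of what such a function might look like, assuming 
it is handed the text of a file and only inspects the first non-blank 
line.  The function body and the format names it returns are 
assumptions for illustration; a real version would need more than 
one line of evidence.

```python
def whichformat(text):
    """Return a best guess at the format name, or None if nothing matches."""
    for line in text.splitlines():
        if not line.strip():
            continue  # skip leading blank lines
        if line.startswith(">"):
            return "fasta"
        if line.startswith("LOCUS"):
            return "genbank"
        if line.startswith("ID   "):
            # Ambiguous: Swiss-Prot entries also start with an ID line.
            return "embl"
        if line.startswith("CLUSTAL"):
            return "clustal"
        if line.upper().startswith("#NEXUS"):
            return "nexus"
        return None  # first non-blank line matched nothing we know
    return None  # empty input
```

The EMBL/Swiss-Prot case shows why this can only be a best guess: 
both start with an ID line, so telling them apart means reading 
further into the entry.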

In the past, Andrew's talked about building this kind of 
functionality into Martel...

Jeff