[Biopython-dev] sequence format readers ?

Wed Sep 5 17:46:00 EDT 2001

Hi Thomas!

> To follow up one of the discussions and questions at ISMB in Copenhagen,
> - how are we going to proceed with the sequence format reader (the
> biopython variant of readseq ...)

It's great that you're going to work on this! It's definitely much
desired by a lot o' people (in fact I was just having a conversation
today about format conversion).

> Currently we can only have parsers for Fasta, Embl and GenBank.  What we
> need is an internal format and functions/modules which can read/write:
[...impressive list o' formats...]
> ??? - more suggestions ?

I think supporting this many would be an *excellent* start :-).

> I can write most of the rules, but I guess we have to define a smart base
> class/parser - where plugging in a new format should only take 5 seconds ...
> If we brain storm on the design of the reader/writer, I could volunteer to
> implement the format rules ...
> 
> Some things to consider:
> * some formats are alignment based (e.g. clustal, phylip, nexus)
> * some formats have loads of information which is lost when converted to a
>   lower info-rich format( e.g. Embl -> Fasta). But Embl -> GenBank should
>   not lose any information 
> * some formats allow multiple entries, some not

Just as a way of getting things started (I haven't done a lot of
thinking about this), my opinion is that the best way to do this is
to have a SeqIO system kinda like Bioperl. The inputs into the SeqIO
system would be the standard SeqRecord object that we currently
have. The advantage of this is that existing parsers (i.e. Fasta,
GenBank) already parse into this, so all that would need to be done
is to define a mapping that converts a generic SeqRecord object to
and from the format's "native" Record based representation. So to
convert from GenBank to Fasta you could do:

GenBank Record Format --> SeqRecord --> Fasta Record Format 

Since the Record formats already provide writing capabilities (and
we have the parsers to parse into them) we would already get writing
and parsing "for free." Also, we would make good use of our existing
"generic" Sequence representations.
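
As a rough sketch of that pipeline, something like the following could
work. All the class and function names here are made up for
illustration (they are simplified stand-ins, not the actual Biopython
Record or SeqRecord classes), and the fields carried over are an
assumption about what a minimal mapping would keep:

```python
from dataclasses import dataclass, field


# Hypothetical, simplified stand-ins for illustration only.
@dataclass
class SeqRecord:
    """Generic record that sits in the middle of every conversion."""
    id: str
    seq: str
    description: str = ""
    annotations: dict = field(default_factory=dict)


@dataclass
class GenBankRecord:
    """Toy 'native' GenBank-style record."""
    locus: str
    definition: str
    sequence: str
    organism: str = ""


@dataclass
class FastaRecord:
    """Toy 'native' Fasta-style record that knows how to write itself."""
    title: str
    sequence: str

    def __str__(self):
        return f">{self.title}\n{self.sequence}\n"


def genbank_to_seqrecord(rec):
    # Format-specific fields with no generic slot go into annotations.
    return SeqRecord(
        id=rec.locus,
        seq=rec.sequence,
        description=rec.definition,
        annotations={"organism": rec.organism},
    )


def seqrecord_to_fasta(rec):
    # Fasta has no place for annotations, so they are dropped here.
    title = f"{rec.id} {rec.description}".strip()
    return FastaRecord(title=title, sequence=rec.seq)


def convert(record, to_generic, from_generic):
    """Native record --> SeqRecord --> other native record."""
    return from_generic(to_generic(record))


gb = GenBankRecord(
    locus="DEMO0001",
    definition="Example sequence",
    sequence="ATGC",
    organism="Hypothetical organism",
)
fasta_text = str(convert(gb, genbank_to_seqrecord, seqrecord_to_fasta))
print(fasta_text)
```

Note that the organism annotation makes it into the SeqRecord but not
into the Fasta output, which is the kind of information loss you
mention above.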

The advantage of this is that it would help us avoid having to make
a billion GenBank -> Fasta, GenBank -> EMBL, GenBank -> whatever
specific converters. The disadvantage of this is that we may lose
some information in the conversion process (but then again, what
converters don't :-).
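
To put rough numbers on the "billion converters" point: with N formats,
direct converters between every ordered pair come to N * (N - 1), while
going through SeqRecord needs only a "to" and a "from" mapping per
format, so 2 * N. A quick back-of-the-envelope check (the function
names are just for illustration):

```python
def direct_converters(n_formats):
    # One converter per ordered pair of distinct formats.
    return n_formats * (n_formats - 1)


def hub_mappings(n_formats):
    # One Record --> SeqRecord and one SeqRecord --> Record per format.
    return 2 * n_formats


for n in (3, 5, 10):
    print(n, direct_converters(n), hub_mappings(n))
```

With only three formats the two approaches cost the same, but the gap
grows quickly as more formats get added.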

The tricky part of doing it this way is that we would then need to
define the Record --> SeqRecord mapping, which, as you mention,
may take some thinking for alignment formats and other
complications.

Hopefully-rambling-on-and-on-about-this-helps-a-little-bit-ly yr's,

Brad