[Biopython-dev] sequence format readers ?

Wed Sep 5 14:45:40 EDT 2001

Hej,

To follow up on one of the discussions and questions at ISMB in Copenhagen:
how are we going to proceed with the sequence format reader (the
biopython variant of readseq ...)?

Currently we only have parsers for Fasta, Embl and GenBank.  What we
need is an internal format and functions/modules which can read/write:
Fasta
Embl
GenBank
GCG
Phylip
PIR
MSF
Nexus
Clustal
Mase
??? - more suggestions ?

I can write most of the rules, but I guess we have to define a smart base
class/parser - where plugging in a new format should only take 5 seconds ...
If we brainstorm on the design of the reader/writer, I could volunteer to
implement the format rules ...
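
To make the "plug in a new format" idea concrete, here is a minimal sketch
of one possible design, assuming a modern Python 3. Every name in it
(Record, SeqFormat, FORMATS, read_records, write_records) is hypothetical
and is not existing Biopython API; the Fasta rules are deliberately naive.

```python
import io
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical internal format: one sequence plus free-form annotations."""
    id: str
    seq: str
    annotations: dict = field(default_factory=dict)


FORMATS = {}  # format name -> handler class (hypothetical registry)


class SeqFormat:
    """Base class: a subclass plugs itself in simply by declaring a name."""
    name = None

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        if cls.name:
            FORMATS[cls.name] = cls

    def read(self, handle):
        raise NotImplementedError

    def write(self, records, handle):
        raise NotImplementedError


class Fasta(SeqFormat):
    name = "fasta"

    def read(self, handle):
        rec_id, desc, chunks = None, "", []
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new header closes the previous record, if there was one.
                if rec_id is not None:
                    yield Record(rec_id, "".join(chunks), {"description": desc})
                parts = line[1:].split(None, 1)
                rec_id = parts[0] if parts else ""
                desc = parts[1] if len(parts) > 1 else ""
                chunks = []
            elif line and rec_id is not None:
                chunks.append(line)
        if rec_id is not None:
            yield Record(rec_id, "".join(chunks), {"description": desc})

    def write(self, records, handle):
        for rec in records:
            desc = rec.annotations.get("description", "")
            handle.write(">" + rec.id + (" " + desc if desc else "") + "\n")
            # Wrap sequence lines at 60 characters.
            for i in range(0, len(rec.seq), 60):
                handle.write(rec.seq[i:i + 60] + "\n")


def read_records(handle, fmt):
    """Look up the handler by name and return an iterator of Records."""
    return FORMATS[fmt]().read(handle)


def write_records(records, handle, fmt):
    FORMATS[fmt]().write(records, handle)
```

With something like this, adding a format would mean writing one subclass
with a name, a read() and a write(); the dispatch functions stay untouched.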

Some things to consider:
* some formats are alignment based (e.g. clustal, phylip, nexus)
* some formats have loads of information which is lost when converted to a
  less info-rich format (e.g. Embl -> Fasta), but Embl -> GenBank should
  not lose any information
* some formats allow multiple entries, some not
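
The three points above could be captured as per-format traits that a
converter checks before writing. This is only a sketch: FormatTraits,
TRAITS and conversion_warnings are hypothetical names, the "richness"
ranks are made-up illustrative numbers, and the single-entry flag on GCG
is an assumption about the single-sequence GCG file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatTraits:
    name: str
    alignment: bool    # entries are expected to be aligned
    multi_entry: bool  # a file may hold more than one entry
    richness: int      # illustrative rank of how much annotation it keeps


# Illustrative values only, not a survey of the real formats.
TRAITS = {
    "fasta": FormatTraits("fasta", False, True, 1),
    "embl": FormatTraits("embl", False, True, 3),
    "genbank": FormatTraits("genbank", False, True, 3),
    "clustal": FormatTraits("clustal", True, True, 1),
    "gcg": FormatTraits("gcg", False, False, 1),  # assumption: one entry per file
}


def conversion_warnings(src, dst, n_entries):
    """Return human-readable warnings for converting src -> dst."""
    s, d = TRAITS[src], TRAITS[dst]
    warnings = []
    if d.richness < s.richness:
        warnings.append(f"{src} -> {dst}: annotation will be lost")
    if n_entries > 1 and not d.multi_entry:
        warnings.append(f"{dst} holds one entry, got {n_entries}")
    if d.alignment and not s.alignment:
        warnings.append(f"{dst} expects aligned sequences")
    return warnings
```

For example, conversion_warnings("embl", "genbank", 1) comes back empty,
while conversion_warnings("embl", "fasta", 1) reports the annotation loss.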

back-in-the-sequence-format-jungle'ly yr's
-thomas

-- 
Sicheritz-Ponten Thomas, Ph.D  CBS, Department of Biotechnology
thomas at biopython.org           The Technical University of Denmark
CBS:  +45 45 252489            Building 208, DK-2800 Lyngby
Fax   +45 45 931585            http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...