[Biopython-dev] IMGT parser (modified EMBL format)

Peter biopython at maubp.freeserve.co.uk
Tue Aug 24 11:56:57 UTC 2010


Hi all,

IMGT is the international ImMunoGeneTics information system, a global
reference in immunogenetics and immunoinformatics. It provides sequence
databases, a genome database, a structure database, and a monoclonal
antibodies database.

IMGT uses a variant of the EMBL flat file format with longer feature
indents:
http://imgt.cines.fr/download/LIGM-DB/userman_doc.html
http://imgt.cines.fr/download/LIGM-DB/ftable_doc.html
http://www.ebi.ac.uk/imgt/hla/docs/manual.html
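
As a rough illustration of what "longer feature indents" means for a
parser, here is a minimal sketch (not the actual Biopython code). It
assumes the standard EMBL feature table layout, where locations start
after an indent of 21 characters, and assumes IMGT uses an indent of 25.
The helper names and the sample feature are made up for this example.

```python
# Sketch only: hypothetical helpers, not Biopython's parser.
EMBL_INDENT = 21  # EMBL feature table: location starts after 21 characters
IMGT_INDENT = 25  # assumed width for IMGT's longer indent


def make_feature_line(key, location, indent):
    """Build an 'FT' line with the feature key padded out to the indent."""
    return ("FT   " + key).ljust(indent) + location


def split_feature_line(line, indent):
    """Split an 'FT' line into (key, location) using a fixed indent."""
    if not line.startswith("FT"):
        raise ValueError("Not a feature table line: %r" % line)
    return line[5:indent].rstrip(), line[indent:].rstrip()


# Made-up feature, written once in each layout.
embl_line = make_feature_line("V-REGION", "1..339", EMBL_INDENT)
imgt_line = make_feature_line("V-REGION", "1..339", IMGT_INDENT)

# With the matching indent, both variants give the same key and location.
embl_result = split_feature_line(embl_line, EMBL_INDENT)
imgt_result = split_feature_line(imgt_line, IMGT_INDENT)

# Reading an IMGT line with the EMBL indent leaves stray leading spaces
# in the location, which a strict parser would reject.
mismatched = split_feature_line(imgt_line, EMBL_INDENT)
```

Keeping the indent as a single constant is what would let one parser
serve both formats with only that value changed.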

Uri and I have been working on extending the SeqIO EMBL/GenBank parser
and writer to support IMGT files too. This uncovered a number of data
formatting issues (e.g. wrong sequence length in ID line, partial feature
locations) and Uri has been liaising with the IMGT curators to address
these. With their latest (Aug 2010) release, we can now parse the whole
file without errors:
http://imgt.cines.fr/download/LIGM-DB/imgt.dat.Z
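
To make the "wrong sequence length in ID line" kind of issue concrete,
here is a small sketch of such a consistency check. It is not how
Biopython does it; it simply assumes the ID line ends in "<n> BP." and
that the sequence sits between the SQ line and the // terminator. The
record below is an invented toy entry, not real IMGT data, and the
function names are hypothetical.

```python
import re

# Invented toy record for illustration only; not real IMGT data.
RECORD = """\
ID   TOY0001; hypothetical entry; 12 BP.
SQ   Sequence 12 BP;
     acgtacgtac gt                                                    12
//
"""


def declared_length(record_text):
    """Return the length stated at the end of the ID line."""
    id_line = record_text.splitlines()[0]
    match = re.search(r"(\d+) BP\.\s*$", id_line)
    if not id_line.startswith("ID") or match is None:
        raise ValueError("No length found in ID line: %r" % id_line)
    return int(match.group(1))


def actual_length(record_text):
    """Count the residues between the SQ line and the // terminator."""
    count = 0
    in_sequence = False
    for line in record_text.splitlines():
        if line.startswith("SQ"):
            in_sequence = True
        elif line.startswith("//"):
            break
        elif in_sequence:
            # Only letters are residues; skip spaces and the line counter.
            count += sum(1 for char in line if char.isalpha())
    return count


def length_matches(record_text):
    """True if the ID line length agrees with the sequence itself."""
    return declared_length(record_text) == actual_length(record_text)
```

Changing the toy ID line to claim "15 BP." would make length_matches()
return False, which is the sort of mismatch described above.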

I think this code is now ready to merge - comments welcome:
http://github.com/peterjc/biopython/commits/seqio-imgt

Potentially we could even include this in Biopython 1.55, although it would
be more cautious not to add any new features between the beta and the
final release...

Peter


