[BioPython] [Biopython-dev] Martel-based parsing of Unigene flat files

Wed Oct 25 09:27:45 UTC 2006

Sean,

I did have a little look at Unigene, but wasn't sure which files exactly 
you wanted to parse.  Like the NCBI, they seem to offer lots of 
different file formats.

Chris Lasher wrote:
> Hi Sean,
> 
> FWIW this should probably have been posted to BioPython-dev, but I
> don't think that would improve your chances of getting a response. I
> am cross-posting it there, anyways. Unfortunately for you, I do not
> have an answer for you. :-(

The dev list would probably have been a better idea.

I had seen Sean's email and was meaning to write something in the 
absence of any other takers.

> I, myself, would be interested in a response to this question from the
> Devs, as I would like to write a parser for PTT files. Last I saw
> there was a lot of chatter about the Martel parsers being incredibly
> slow compared to straightforward solutions. It seems that standard
> format parsers would be one of the easiest ways for BioPython newbies
> to contribute to developing the BioPython project, however, there
> isn't very much in the way of documentation on the BioPython way to do
> so, let alone developer documentation at all. I would like to know
> what can be done to get some dev docs going on the wiki.

I'm one of the more recent contributors - for example, I changed the 
GenBank parser from Martel to just Python.  This was done on the grounds 
that the old parser (when it worked) was exceedingly slow on large 
files.  There is still room for improvement, but I can now load whole 
chromosomes/genomes.

If the individual records (repeating units) in your file format are 
over 10MB in size, then I would begin to worry about Martel's 
performance.  Otherwise it might be OK...
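
As a rough illustration of what "just Python" can look like (this is a 
sketch, not the actual Bio.GenBank code), the generator below walks a 
file one record at a time.  It assumes each record ends with a line 
containing only "//", as GenBank flat files do; the name iter_records 
and the sample text are made up for the example.

```python
import io


def iter_records(handle, terminator="//"):
    """Yield each record as a list of lines (hypothetical helper).

    Assumes a record ends with a line holding only the terminator.
    """
    lines = []
    for line in handle:
        line = line.rstrip("\n")
        if line.strip() == terminator:
            if lines:
                yield lines
            lines = []
        else:
            lines.append(line)
    if lines:
        # Tolerate a final record with no terminator line
        yield lines


# Invented two-record sample, only to exercise the generator
sample = io.StringIO(
    "LOCUS       EXAMPLE1\nORIGIN\n//\n"
    "LOCUS       EXAMPLE2\nORIGIN\n//\n"
)
records = list(iter_records(sample))
```

Only one record is held in memory at a time, so the cost per step 
depends on the size of the largest record rather than the whole file.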

In the process of this work I did eventually get a feel for how Martel 
works, and how to define file formats etc.  It's rather a clever design 
but it is daunting for newcomers.

Also, when someone manages to find a file formatted sufficiently 
differently from what a Martel parser expects, working out what exactly 
needs to be fixed is sometimes tricky.

Over on the developers list we have had some talk about where to go in 
the future, and at the moment I am working on a SeqIO system a little 
like BioPerl's:

http://bugzilla.open-bio.org/show_bug.cgi?id=2059
http://biopython.org/wiki/SeqIO

This is a work in progress... I've been planning to actually check 
something into CVS in the near future.  I would also need to lay down 
guidelines on how annotations are stored so that file format conversion 
is as smooth as possible.

Chris mentioned PTT files (protein table files), available from the NCBI 
(and probably other databases too).  I think PTT files had been 
mentioned on the dev list in the context of SeqIO (sequence 
input/output), and one suggestion was to load them as annotated 
SeqRecord objects with an empty Sequence.  Depending on what people want 
to do with a PTT file, this may not suit everyone.
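
To make the "annotated record with an empty sequence" idea concrete, 
here is a minimal sketch.  It does not use Biopython: AnnotatedRecord 
is a hypothetical stand-in for a SeqRecord, and parse_ptt is a made-up 
name.  It assumes a tab-separated PTT layout with a column header line 
starting with "Location" and locations written like 190..255; the 
sample row is invented.

```python
import io
from dataclasses import dataclass, field


@dataclass
class AnnotatedRecord:
    """Hypothetical stand-in for a SeqRecord with an empty sequence."""
    id: str
    description: str
    seq: str = ""
    annotations: dict = field(default_factory=dict)


def parse_ptt(handle):
    """Yield one AnnotatedRecord per table row (assumed PTT layout)."""
    # Skip the leading title lines until the column header row
    for line in handle:
        if line.startswith("Location"):
            columns = line.rstrip("\n").split("\t")
            break
    else:
        raise ValueError("no column header line found")
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        values = dict(zip(columns, line.split("\t")))
        start, end = values["Location"].split("..")
        yield AnnotatedRecord(
            id=values["PID"],
            description=values["Product"],
            annotations={
                "start": int(start),
                "end": int(end),
                "strand": values["Strand"],
                "gene": values["Gene"],
            },
        )


# Invented sample data, not taken from a real genome
SAMPLE = (
    "Example genome - 1..1000\n"
    "1 proteins\n"
    "Location\tStrand\tLength\tPID\tGene\tSynonym\tCode\tCOG\tProduct\n"
    "190..255\t+\t21\t12345\tabcA\tx0001\t-\t-\texample protein\n"
)
ptt_records = list(parse_ptt(io.StringIO(SAMPLE)))
```

Whether the coordinates belong in annotations or somewhere more 
structured is exactly the sort of guideline that would need settling.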

Peter