[BioPython] [Biopython-dev] Martel-based parsing of Unigene flat files

Wed Oct 25 13:17:10 UTC 2006

On Wednesday 25 October 2006 05:27, Peter wrote:
> Sean,
>
> I did have a little look at Unigene, but wasn't sure which files exactly
> you wanted to parse.  Like the NCBI, they seem to offer lots of
> different file formats.

I was thinking about files like Hs.data.  These files have a very simple
format and can be parsed VERY quickly using a few regexes and if statements.
I have written such a parser in Perl (the BioPerl one creates objects that I
did not need in my case, which made it slow).  I simply wanted to do the same
in Python, but wanted to "do it right".
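
To make that concrete, here is a minimal sketch of the quick-and-dirty
approach.  It assumes each record is a run of "TAG   value" lines terminated
by a "//" line; the function name iter_unigene_records and the regex are my
own illustration and not part of Biopython.

```python
import re
from collections import defaultdict

# Assumed line layout: an upper-case tag, some whitespace, then the value.
LINE_RE = re.compile(r"^([A-Z_]+)\s+(.*)$")


def iter_unigene_records(lines):
    """Yield one plain dict per record, mapping each tag to a list of raw strings.

    Hypothetical helper: no objects are built, values are left unparsed.
    """
    record = defaultdict(list)
    for line in lines:
        line = line.rstrip()
        if line.startswith("//"):
            # Assumed record terminator: emit what we have and start afresh.
            if record:
                yield dict(record)
            record = defaultdict(list)
            continue
        match = LINE_RE.match(line)
        if match:
            tag, value = match.groups()
            record[tag].append(value)
    # Emit a trailing record that was not followed by a terminator.
    if record:
        yield dict(record)
```

Because it is a generator, it can be fed an open file handle directly and
only ever holds one record in memory.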

> Chris Lasher wrote:
> > Hi Sean,
> >
> > FWIW this should probably have been posted to BioPython-dev, but I
> > don't think that would improve your chances of getting a response. I
> > am cross-posting it there anyway. Unfortunately for you, I do not
> > have an answer for you. :-(
>
> The dev list would probably have been a better idea.

I will join the dev list and certainly use it in the future for questions
along these lines.  It always takes a while to get a feel for the culture of
a new set of lists.

> I'm one of the more recent contributors - for example, I changed the
> GenBank parser from Martel to just Python.  This was done on the pretext
> that the old parser (when it worked) was exceedingly slow on large
> files.  There is still room for improvement, but I can now load whole
> chromosomes/genomes.

Good to know.  I'll take a look at this code.  

> If for your file format, the individual records (repeating units) are
> over 10MB in size, then I would begin to worry about the performance
> using Martel.  Otherwise it might be OK...
>
> In the process of this work I did eventually get a feel for how Martel
> works, and how to define file formats etc.  It's rather a clever design
> but it is daunting for newcomers.
>
> Also, when someone manages to find a file formatted sufficiently
> differently from what a Martel parser expects, working out what exactly
> needs to be fixed is sometimes tricky.
>
> Over on the developers list we have had some talk about where to go in
> future, and at the moment I have been working on a SeqIO system a little
> like BioPerl's,

Just keep in mind that on the BioPerl side, as annotations have gotten richer
and file size has become a non-issue for storage, some of those parsers are
not keeping up in terms of speed.  SeqIO is faring quite well, but the BLAST
parser isn't, just as an example.  There is a fine line between creating 
objects for everything and speedy parsing into raw data structures.  In fact, 
having a couple of parsers (not fully deprecating a fast but trivial parser) 
is probably the best general way to go.

In short, the parser/consumer model is relatively new to me and I think that 
is where I need to spend a bit of time learning the lay of the land.  Thanks 
for the hints and pointers.  I'll look a bit more at code and then try to ask 
more specific questions as they arise.
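
For what it is worth, my current understanding of the parser/consumer idea
is roughly the sketch below: a scanner walks the file and fires events, and
whichever consumer is plugged in decides how much to build.  All names here
(scan, DictConsumer, CountingConsumer) are hypothetical and are not
Biopython's actual classes, and the record layout (tag, whitespace, value,
with "//" ending a record) is an assumption.

```python
class DictConsumer:
    """Hypothetical consumer that collects each record into a plain dict."""

    def __init__(self):
        self.records = []
        self._current = None

    def start_record(self):
        self._current = {}

    def field(self, tag, value):
        self._current.setdefault(tag, []).append(value)

    def end_record(self):
        self.records.append(self._current)
        self._current = None


class CountingConsumer:
    """Hypothetical consumer that only counts records and builds nothing."""

    def __init__(self):
        self.count = 0

    def start_record(self):
        pass

    def field(self, tag, value):
        pass

    def end_record(self):
        self.count += 1


def scan(lines, consumer):
    """Walk the lines once and report events to the given consumer."""
    in_record = False
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        if line.startswith("//"):  # assumed record terminator
            if in_record:
                consumer.end_record()
                in_record = False
            continue
        if not in_record:
            consumer.start_record()
            in_record = True
        tag, _, value = line.partition(" ")
        consumer.field(tag, value.strip())
    if in_record:
        consumer.end_record()
```

The same scan() call can then drive a cheap consumer when speed matters and a
richer one when full objects are wanted, which is the "couple of parsers"
trade-off without duplicating the scanning code.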

Sean