[BioPython] [Biopython-dev] Martel-based parsing of Unigene flat files
Sean Davis
sdavis2 at mail.nih.gov
Wed Oct 25 13:17:10 UTC 2006
On Wednesday 25 October 2006 05:27, Peter wrote:
> Sean,
>
> I did have a little look at Unigene, but wasn't sure which files exactly
> you wanted to parse. Like the NCBI, they seem to offer lots of
> different file formats.
I was thinking about files like Hs.data. The format is very simple, so these
files can be parsed VERY quickly using simple regexes and if statements. I
have written such a parser in Perl (the BioPerl one creates objects that I
did not need in my case, and so was slow). I simply wanted to do the same in
Python, but wanted to "do it right".
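To make the "simple regexes and if statements" idea concrete, here is a minimal sketch in Python. It assumes each line of the file is a tag followed by whitespace and a value (e.g. ID, TITLE, SEQUENCE), and that a line containing only "//" ends a record. The function name iter_unigene_records is made up for illustration and is not part of Biopython.

```python
import re

# Assumed layout: "TAG<whitespace>value" per line, "//" closes a record.
LINE_RE = re.compile(r"^(\S+)\s+(.*)$")


def iter_unigene_records(handle):
    """Yield one dict per record, mapping each tag to a list of raw values.

    Hypothetical helper: no objects are built, values stay plain strings.
    """
    record = {}
    for line in handle:
        line = line.rstrip()
        if line == "//":
            # End of record: hand it over and start a fresh one.
            if record:
                yield record
            record = {}
            continue
        match = LINE_RE.match(line)
        if match:
            tag, value = match.groups()
            # Tags such as SEQUENCE may repeat, so keep a list per tag.
            record.setdefault(tag, []).append(value)
    # Tolerate a final record without a trailing "//".
    if record:
        yield record
```

Because it is a generator, only one record is held in memory at a time; something like `for rec in iter_unigene_records(open("Hs.data")): ...` would stream through the file.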
> Chris Lasher wrote:
> > Hi Sean,
> >
> > FWIW this should probably have been posted to BioPython-dev, but I
> > don't think that would improve your chances of getting a response. I
> > am cross-posting it there, anyways. Unfortunately for you, I do not
> > have an answer for you. :-(
>
> The dev list would probably have been a better idea.
I will join and certainly use the dev list in the future for questions along
these lines. It always takes a while to get a feel for the culture of a new
set of lists.
> I'm one of the more recent contributors - for example, I changed the
> GenBank parser from Martel to just Python. This was done on the pretext
> that the old parser (when it worked) was exceedingly slow on large
> files. There is still room for improvement, but I can now load whole
> chromosomes/genomes.
Good to know. I'll take a look at this code.
> If for your file format, the individual records (repeating units) are
> over 10MB in size, then I would begin to worry about the performance
> using Martel. Otherwise it might be OK...
>
> In the process of this work I did eventually get a feel for how Martel
> works, and how to define file formats etc. It's rather a clever design
> but it is daunting for newcomers.
>
> Also, when someone manages to find a file formatted sufficiently
> differently from what a Martel parser expects, working out what exactly
> needs to be fixed is sometimes tricky.
>
> Over on the developers list we have had some talk about where to go in
> future, and at the moment I have been working on a SeqIO system a little
> like BioPerl's,
Just keep in mind that on the BioPerl side, as annotations have gotten richer
and file size has become a non-issue for storage, some of those parsers are
not keeping up in terms of speed. SeqIO is faring quite well, but the BLAST
parser isn't, just as an example. There is a fine line between creating
objects for everything and speedy parsing into raw data structures. In fact,
having a couple of parsers (not fully deprecating a fast but trivial parser)
is probably the best general way to go.
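One way to picture keeping both a fast and a rich parser is to share a single scanner and swap the consumer. The sketch below is only an illustration of that idea: the names scan, DictConsumer, ObjectConsumer and Cluster are invented here and are not Biopython's actual classes, and the record layout (tag, whitespace, value, with "//" closing a record) is an assumption.

```python
class DictConsumer:
    """Fast path: keep raw strings in a plain dict, no objects per field."""

    def start_record(self):
        self.data = {}

    def field(self, tag, value):
        self.data.setdefault(tag, []).append(value)

    def end_record(self):
        return self.data


class Cluster:
    """Hypothetical rich record object."""

    def __init__(self):
        self.cluster_id = ""
        self.title = ""


class ObjectConsumer:
    """Rich path: build a Cluster object from the same events."""

    def start_record(self):
        self.cluster = Cluster()

    def field(self, tag, value):
        if tag == "ID":
            self.cluster.cluster_id = value
        elif tag == "TITLE":
            self.cluster.title = value

    def end_record(self):
        return self.cluster


def scan(lines, consumer):
    """Emit start/field/end events to the consumer; yield what it builds."""
    in_record = False
    for line in lines:
        line = line.rstrip()
        if line == "//":
            if in_record:
                yield consumer.end_record()
                in_record = False
            continue
        if not line:
            continue
        if not in_record:
            consumer.start_record()
            in_record = True
        tag, _, value = line.partition(" ")
        consumer.field(tag, value.strip())
    # Tolerate a final record without a trailing "//".
    if in_record:
        yield consumer.end_record()
```

The scanner knows only the file layout, so the cheap dict consumer can stay around for bulk work while the object-building one serves richer use cases.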
In short, the parser/consumer model is relatively new to me and I think that
is where I need to spend a bit of time learning the lay of the land. Thanks
for the hints and pointers. I'll look a bit more at code and then try to ask
more specific questions as they arise.
Sean