[BioPython] [Biopython-dev] Martel-based parsing of Unigene flat files
Peter
biopython at maubp.freeserve.co.uk
Wed Oct 25 09:27:45 UTC 2006
Sean,
I did have a little look at Unigene, but wasn't sure which files exactly
you wanted to parse. Like the NCBI, they seem to offer lots of
different file formats.
Chris Lasher wrote:
> Hi Sean,
>
> FWIW this should probably have been posted to BioPython-dev, but I
> don't think that would improve your chances of getting a response. I
> am cross-posting it there, anyways. Unfortunately for you, I do not
> have an answer for you. :-(
The dev list would probably have been a better idea.
I had seen Sean's email and was meaning to write something in the absence
of any other takers.
> I, myself, would be interested in a response to this question from the
> Devs, as I would like to write a parser for PTT files. Last I saw
> there was a lot of chatter about the Martel parsers being incredibly
> slow compared to straightforward solutions. It seems that standard
> format parsers would be one of the easiest ways for BioPython newbies
> to contribute to developing the BioPython project, however, there
> isn't very much in the way of documentation on the BioPython way to do
> so, let alone developer documentation at all. I would like to know
> what can be done to get some dev docs going on the wiki.
I'm one of the more recent contributors - for example, I changed the
GenBank parser from Martel to just Python. This was done on the grounds
that the old parser (when it worked) was exceedingly slow on large
files. There is still room for improvement, but I can now load whole
chromosomes/genomes.
If, for your file format, the individual records (repeating units) are
over 10MB in size, then I would begin to worry about the performance
of a Martel-based parser. Otherwise it might be OK...
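To give a rough idea of what "just Python" can look like (this is only a sketch, not the actual GenBank parser code): read the file line by line and hand back one record at a time, so memory use depends on the largest record rather than the whole file. The "//" record terminator, the field names in the sample text and the function name iter_records are all assumptions made up for illustration.

```python
import io


def iter_records(handle, terminator="//"):
    """Yield each record as a list of lines, excluding the terminator line."""
    lines = []
    for line in handle:
        line = line.rstrip("\n")
        if line.strip() == terminator:
            if lines:
                yield lines
            lines = []
        else:
            lines.append(line)
    if lines:
        # Tolerate a final record that has no terminator line.
        yield lines


# Made-up two-record example, not real database content.
example = io.StringIO("ID  rec1\nTITLE  first\n//\nID  rec2\n//\n")
records = list(iter_records(example))
```

Because iter_records is a generator, a caller can loop over a very large file and process or discard each record in turn instead of holding everything in memory.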
In the process of this work I did eventually get a feel for how Martel
works, and how to define file formats etc. It's rather a clever design,
but it is daunting for newcomers.
Also, when someone manages to find a file formatted sufficiently
differently from what a Martel parser expects, working out what exactly
needs to be fixed is sometimes tricky.
Over on the developers list we have had some talk about where to go in
future, and at the moment I have been working on a SeqIO system a little
like BioPerl's:
http://bugzilla.open-bio.org/show_bug.cgi?id=2059
http://biopython.org/wiki/SeqIO
This is a work in progress... I've been planning to actually check
something into CVS in the near future. I would also need to lay down
guidelines on how annotations are stored so that file format conversion
is as smooth as possible.
Chris mentioned PTT files (protein table files), available from the NCBI
(and probably other databases too). I think PTT files had been
mentioned on the dev list in the context of SeqIO (sequence
input/output), and one suggestion was to load them as annotated
SeqRecord objects with an empty Sequence. Depending on what people want
to do with a PTT file, this may not suit everyone.
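A minimal sketch of that suggestion, assuming a tab-separated PTT row with the column names listed below (check them against a real file before relying on this). AnnotatedRecord is a hypothetical stand-in for a SeqRecord, not the Biopython class, and parse_ptt_row and the sample row are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedRecord:
    """Hypothetical stand-in for a SeqRecord: annotation only, no sequence."""
    id: str
    seq: str = ""  # deliberately empty, since a PTT file carries no sequence
    annotations: dict = field(default_factory=dict)


def parse_ptt_row(header, row):
    """Turn one tab-separated PTT row into an AnnotatedRecord."""
    values = dict(zip(header, row.rstrip("\n").split("\t")))
    start, end = values["Location"].split("..")
    return AnnotatedRecord(
        id=values["PID"],
        annotations={
            "start": int(start),
            "end": int(end),
            "strand": values["Strand"],
            "gene": values["Gene"],
            "product": values["Product"],
        },
    )


# Assumed column layout and a made-up row, not taken from a real genome.
header = ["Location", "Strand", "Length", "PID", "Gene",
          "Synonym", "Code", "COG", "Product"]
row = "100..402\t+\t100\t12345\tgeneA\tloc0001\t-\t-\thypothetical protein"
record = parse_ptt_row(header, row)
```

Whether an empty sequence is acceptable depends on what the downstream code expects, which is the reason this approach may not suit everyone.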
Peter