[Bioperl-l] Bio::SeqIO::tigr

Fri Oct 31 12:54:11 EST 2003

I've written a SeqIO parser for the TIGR XML data format, and would like
to contribute it to BioPerl. However, there are a couple of things I
don't really like about it but don't have the time to fix right now.
Could I get some feedback from the list regarding each?

First, some background. Since each XML file is roughly 60MB, using the
XML parsers provided by TIGR (using XML::Simple and XML::Sax, IIRC)
takes around 7-10 minutes to parse (not including BioPerl object
creation) and occasionally uses more than ~2.5GB of memory, which an
x86 machine can't handle.

To get around this, I took advantage of the fact that these files are
machine generated and parsed the entire file using regular expressions,
only storing what is "relevant" to retrieving a sequence. This means
the ~75 lines of code TIGR used have grown to around 1280. However, it
uses around 250MB of memory and (converting from TIGR to GenBank) runs
in around two to three and a half minutes, 30-60% slower than a
GenBank -> GenBank conversion.
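
To give a rough idea of the approach, here is a minimal sketch, not the
actual parser. It assumes the generator writes one complete element per
line; the tag names and the parse_lines() helper are illustrative
assumptions, not taken from the real module.

```perl
use strict;
use warnings;

# Hypothetical, simplified input: one element per line, as a machine
# generator might emit it. Tag names are illustrative only.
my $xml = <<'END';
<ASSEMBLY>
<TU>
<FEAT_NAME>At1g03870</FEAT_NAME>
<COM_NAME>example protein</COM_NAME>
<END5>100</END5>
<END3>400</END3>
</TU>
<ASSEMBLY_SEQUENCE>ACGTACGT</ASSEMBLY_SEQUENCE>
</ASSEMBLY>
END

# Only these tags are stored; everything else is skipped.
my %wanted = map { $_ => 1 } qw(FEAT_NAME END5 END3 ASSEMBLY_SEQUENCE);

sub parse_lines {
    my ($fh) = @_;
    my %record;
    while (my $line = <$fh>) {
        # Match a complete <TAG>text</TAG> element on a single line.
        if ($line =~ m{^\s*<(\w+)>([^<]*)</\1>\s*$}) {
            $record{$1} = $2 if $wanted{$1};
        }
    }
    return \%record;
}

# Read from an in-memory filehandle so the sketch is self-contained.
open my $fh, '<', \$xml or die "cannot open in-memory file: $!";
my $rec = parse_lines($fh);
close $fh;

print "$_ = $rec->{$_}\n" for sort keys %$rec;
```

This only works because the input layout is predictable; a general XML
document (attributes, nesting on one line, entities) would break it.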

1) The code is pretty ugly. It was one of my first "large" Perl projects
   and reflects that. The ugliness is partially due to my inexperience
   at the time, and partially due to the ugliness of the problem.

2) It's not very well commented; OK, it's not commented at all. This
   isn't too big a problem, as everything acts basically the same way,
   and once someone understands that, the rest is easy. (It's really
   just the same thing over and over.) It's just fairly bad form.

3) The memory usage (and runtime) could be improved by one or more of:
   a) Storing everything directly into objects rather than a tree.
   b) Using arrays to store everything rather than hashes.
   c) Ignoring any tags that aren't actually used.
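
As a rough illustration of option (b), here is a sketch of an
array-based record with named index constants. The field names and the
new_feature() and feature_length() helpers are hypothetical, and the
memory savings over a hash would need to be measured on real data.

```perl
use strict;
use warnings;

# Field positions for an array-based record (names are illustrative).
use constant {
    FEAT_NAME => 0,
    END5      => 1,
    END3      => 2,
};

# Build a feature as a plain array instead of a hash; tags that are
# never used are simply never stored.
sub new_feature {
    my ($name, $end5, $end3) = @_;
    my @feature;
    $feature[FEAT_NAME] = $name;
    $feature[END5]      = $end5;
    $feature[END3]      = $end3;
    return \@feature;
}

# Length in bases, independent of strand orientation.
sub feature_length {
    my ($feature) = @_;
    return abs($feature->[END3] - $feature->[END5]) + 1;
}

my $feature = new_feature('At1g03870', 100, 400);
printf "%s spans %d bases\n", $feature->[FEAT_NAME], feature_length($feature);
```

The constants keep the code readable while avoiding a per-record hash.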

4) The coding style is nothing like the rest of BioPerl's, mainly
   because I prefer this style (PERSONAL preference, no flames,
   everyone gets their own opinion). This is bad for a project,
   but in all honesty, if I need to drastically change my coding
   style I will probably never get around to fixing up this code.

5) There is quite a long delay before anything is actually accessible,
   because the nucleotide data is given at the end of the files
   (actually, at the end of an ASSEMBLY tag), so everything before it
   needs to be parsed first. This leads to the first ->next_seq() call
   taking a significant amount of time.
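
The delay in point 5 can be sketched as follows. This is a simplified
illustration, not the real code: the tag names, the make_reader()
helper, and the returned hash layout are all assumptions. Features have
to be buffered until the closing ASSEMBLY tag, because the sequence
they belong to only arrives at the end.

```perl
use strict;
use warnings;

# Hypothetical input: two assemblies, each with its sequence at the end.
my $xml = <<'END';
<ASSEMBLY>
<FEAT_NAME>gene_a</FEAT_NAME>
<FEAT_NAME>gene_b</FEAT_NAME>
<ASSEMBLY_SEQUENCE>ACGTACGT</ASSEMBLY_SEQUENCE>
</ASSEMBLY>
<ASSEMBLY>
<FEAT_NAME>gene_c</FEAT_NAME>
<ASSEMBLY_SEQUENCE>TTGACA</ASSEMBLY_SEQUENCE>
</ASSEMBLY>
END

# Returns a closure that yields one assembly per call, or undef once
# the input is exhausted.
sub make_reader {
    my ($fh) = @_;
    return sub {
        my (@pending, $sequence);
        while (my $line = <$fh>) {
            if ($line =~ m{<FEAT_NAME>([^<]*)</FEAT_NAME>}) {
                push @pending, $1;    # buffered: no sequence seen yet
            }
            elsif ($line =~ m{<ASSEMBLY_SEQUENCE>([^<]*)</ASSEMBLY_SEQUENCE>}) {
                $sequence = $1;
            }
            elsif ($line =~ m{</ASSEMBLY>}) {
                # Only now can anything be handed to the caller.
                return { seq => $sequence, features => [@pending] };
            }
        }
        return;    # end of input
    };
}

open my $fh, '<', \$xml or die "cannot open in-memory file: $!";
my $next_seq = make_reader($fh);

my @assemblies;
while (my $asm = $next_seq->()) {
    push @assemblies, $asm;
    printf "%s: %s\n", $asm->{seq}, join(',', @{ $asm->{features} });
}
close $fh;
```

With a 60MB assembly, the first call has to read nearly the whole file
before it can return, which is where the wait comes from.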

Since I can't show you what the object looks like, I'll show you what
the GenBank output looks like. An example GenBank file is at:

http://bioinfo.ucr.edu/cgi-bin/seqfetch.pl?database=all&accession=At1g03870

Thanks for your time,

-- 

----------------------------
| Josh Lauricha            |
| laurichj at bioinfo.ucr.edu |
| Bioinformatics, UCR      |
|--------------------------|