[Bioperl-l] Bio::SeqIO::tigr

Jason Stajich jason at cgt.duhs.duke.edu
Mon Nov 3 16:04:40 EST 2003


> attached files before. The different style is more or less just with
> indents, so stuff I've seen like:
>
> if() {
> foreach () {
> if() {
> do something
> }
> }
> }
>
> is:
>
> if() {
>     foreach () {
>         if() {
>             do something
>         }
>     }
> }
>
So it SHOULD be like the second part ^^^^^^^.  We did get called fascist
by Chad (in jest) for asking for this...  We actually prefer 4 spaces
and no tabs.  I just let my emacs take over and frequently reformat
people's code when I can't read it... That probably makes me a tyrant, but
such is life... =)

> Basically, just because I can't read it.
>
>
> > The requirements we have for
> > something that would be a SeqIO module is they have to follow the
> > structure of SeqIO drivers, mainly they implement next_seq and write_seq
> > and inherit from Bio::SeqIO and use the inherited _readline or _print for
> > IO rather than <$fh> and print $fh.
>
> My module uses the _readline interface. write_seq isn't implemented
> because it doesn't make any sense to do so.
>
sure - sounds good.
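
For anyone following along, the driver requirements quoted above can be sketched roughly as below. This is only an illustration, assuming BioPerl is installed: the "toy" format (one tab-separated "id<TAB>sequence" record per line) and the package name Bio::SeqIO::toy are made up for the example and are not part of BioPerl or of Josh's module.

```perl
use strict;
use warnings;

{
    # Hypothetical driver for a made-up "toy" format: id<TAB>sequence per line.
    package Bio::SeqIO::toy;
    use base qw(Bio::SeqIO);
    use Bio::Seq;

    # Return the next sequence object, or undef at end of input.
    sub next_seq {
        my ($self) = @_;
        # Use the inherited _readline rather than reading <$fh> directly.
        while (defined(my $line = $self->_readline)) {
            chomp $line;
            next unless length $line;          # skip blank lines
            my ($id, $residues) = split /\t/, $line, 2;
            return Bio::Seq->new(-id => $id, -seq => $residues);
        }
        return;
    }

    # Writing makes no sense for this format, so say so explicitly.
    sub write_seq {
        my ($self) = @_;
        $self->throw_not_implemented;
    }
}

package main;

# Read from an in-memory filehandle so the sketch needs no external file.
open my $fh, '<', \"seq1\tACGTACGT\nseq2\tGGCC\n"
    or die "cannot open in-memory handle: $!";
my $in = Bio::SeqIO::toy->new(-fh => $fh);

my @seqs;
while (my $seq = $in->next_seq) {
    push @seqs, $seq;
    printf "%s\t%d\n", $seq->display_id, $seq->length;
}
```

A read-only driver like this would still need POD and tests before going into CVS; the point here is just the next_seq / write_seq / _readline shape.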

> > You can contribute it by posting it to the list, asking nicely for a CVS
> > r/w account, or submitting it as an enhancement to bugzilla.open-bio.org.
> > Looking forward to it.
>
> I'll post another e-mail with it attached.
>
> >
> > -jason
> >
> > On Fri, 31 Oct 2003, Josh Lauricha wrote:
> >
> > > I've written a SeqIO parser for the tigr xml data format, and would like
> > > to contribute it to BioPerl. However, there are a couple things I don't
> > > really like about it but don't have the time to fix right now. Could I
> > > > get some feedback from the list regarding each?
> > >
> > > First, some background. Since each XML file is roughly 60MB, using the
> > > XML parsers provided by TIGR (using XML::Simple and XML::Sax, IIRC)
> > > > takes around 7-10 minutes to parse (not including BioPerl object
> > > > creation) and occasionally uses more than ~2.5GB of memory, which an x86
> > > > can't handle.
> > >
> > > To get around this, I took advantage of the fact that these are machine
> > > generated and parsed the entire file using regexp, only storing what is
> > > > "relevant" to retrieve a sequence. This means the ~75 lines of code
> > > > TIGR used became around 1280. However, it uses around 250MB of memory and
> > > > (converting from TIGR to GenBank) runs in around two to three and a half
> > > > minutes, 30-60% slower than GenBank -> GenBank conversion.
> > >
> > > 1) The code is pretty ugly. It was one of my first "large" perl projects
> > > >    and reflects that. The ugliness is partially due to my inexperience
> > > >    at the time, and partially due to the ugliness of the problem.
> > >
> > > > 2) It's not very well commented; OK, it's not commented. This isn't too big
> > > >    a problem, as everything acts basically the same way, and once
> > > >    someone understands that the rest is easy. (It's really just the same
> > > >    thing over and over). It's just fairly bad form.
> > >
> > > 3) The memory usage (and runtime) could be improved by one or more of:
> > >    a) Storing everything directly into objects rather than a tree
> > >    b) Using arrays to store everything rather than hashes
> > >    c) Ignoring any tags that aren't actually used.
> > >
> > > > 4) The coding style is nothing like the rest of BioPerl's. Mainly
> > > >    because I prefer this style (PERSONAL preference, no flames,
> > > >    everyone gets their own opinion). This is bad for a project,
> > >    but in all honesty if I need to drastically change my coding
> > >    style I will probably never get around to fixing up this code.
> > >
> > > 5) There is quite a long delay before anything is actually accessible
> > >    because the nucleotide data is given at the end of the files
> > >    (actually, at the end of an ASSEMBLY tag) so everything before it
> > >    needs to be parsed. This leads to the first ->next_seq() call taking
> > >    a significant time.
> > >
> > > Since I can't show you what the object looks like, I'll show you what
> > > the GenBank file looks like. An example of the genbank file is at:
> > >
> > > http://bioinfo.ucr.edu/cgi-bin/seqfetch.pl?database=all&accession=At1g03870
> > >
> > > Thanks for your time,
> > >
> > >
> >
> > --
> > Jason Stajich
> > Duke University
> > jason at cgt.mc.duke.edu
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >
>
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu

