[Bioperl-l] Re: [BioXML-dev] Bio::SeqIO::game

Bradley Marshall bradmars@yahoo.com
Thu, 30 Nov 2000 11:28:15 -0800 (PST)


--- Jason Stajich <jason@chg.mc.duke.edu> wrote:
> Brad - Thanks for the game parser updates and test files.
>
> I have some comments.  One thing we've been kicking around with the latest
> bioperl release and the Bio::DB rewrites is that the expected behavior
> when a class is reading a data file of sequences is for it to only read
> one sequence at a time.  The game code actually reads everything in at one
> time and does this multiple times depending on whether or not it is
> adding features or just reading in a primary seq.

Well, right.  The problem is that SAX always reads the
whole document (well, there are some newer parsers that
do it in chunks, but...).  In addition, the game format
gives no guarantee that all of the features that go with
one seq will appear between that sequence definition and
the next one.  So it can be:

Seq 1
features on seq 1
Seq 2
features on Seq 1

So you can't just read until the next seq and call it
a day.  The alternative is, as you suggest, to just
read it in as a string, but I didn't think this was a
good solution due to memory issues.  At any rate, it's
still pretty fast and I think it conforms to the
interface (correct me if I'm wrong).
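
To make the ordering problem concrete, here is a minimal sketch. It is
written in Python with the standard-library xml.sax module, not the
Bioperl code, and the element and attribute names (seq, feature, id,
name) are a simplified, hypothetical stand-in for real game markup.
It compares what a "stop at the next seq" reader would know about the
first sequence with what is known after the whole document is read:

```python
import xml.sax
from collections import defaultdict

# Hypothetical, simplified stand-in for game XML (not the real schema).
DOC = b"""<game>
  <seq id="seq1"/>
  <feature seq="seq1" name="exon1"/>
  <seq id="seq2"/>
  <feature seq="seq1" name="exon2"/>
</game>"""


class FeatureCollector(xml.sax.ContentHandler):
    """Group feature names by the sequence id they refer to."""

    def __init__(self):
        super().__init__()
        self.seq_ids = []                  # sequence ids in document order
        self.features = defaultdict(list)  # seq id -> feature names
        self.seen_before_next_seq = {}     # view of a "stop at next seq" reader

    def startElement(self, name, attrs):
        if name == "seq":
            # When a new seq starts, record what had been seen so far
            # for the previous one.
            if self.seq_ids:
                prev = self.seq_ids[-1]
                self.seen_before_next_seq[prev] = list(self.features[prev])
            self.seq_ids.append(attrs["id"])
        elif name == "feature":
            self.features[attrs["seq"]].append(attrs["name"])


handler = FeatureCollector()
xml.sax.parseString(DOC, handler)

complete = dict(handler.features)        # known after the whole document
partial = handler.seen_before_next_seq   # known when seq2 begins
```

In this sketch the partial view of seq1 is missing the feature that
shows up after seq2, which is the case described above.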
 
> 
> It also expects only file names to be passed in, but I think file handles
> should be supported as well and that it should be using the
> $self->_filehandle method that all SeqIO classes subscribe to.  This
> will not work with the current way of multiple passes on the document.

Hmm... unfortunately I'm not a Perl wizard.  I guess I
didn't realize this would be a problem.  I agree that
it should certainly accept filehandles.  I'll have to
look into this more.
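
For reference, the clash between filehandles and multiple passes can
be sketched like this. It is Python rather than Perl, purely as an
illustration; count_elements and Counter are hypothetical names, and
the markup is a simplified stand-in for game XML. A named file can be
reopened for each pass, but an open handle is used up by the first
pass:

```python
import io
import xml.sax


class Counter(xml.sax.ContentHandler):
    """Count start tags so we can tell whether a pass saw any data."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElement(self, name, attrs):
        self.count += 1


def count_elements(handle):
    # Hypothetical helper: one pass over an already-open handle.
    data = handle.read()      # reads whatever is left on the handle
    if not data:
        return 0              # nothing left: an earlier pass consumed it
    handler = Counter()
    xml.sax.parseString(data, handler)
    return handler.count


handle = io.BytesIO(b"<game><seq id='seq1'/><seq id='seq2'/></game>")

first_pass = count_elements(handle)    # sees game, seq, seq
second_pass = count_elements(handle)   # handle is already at EOF

# Rewinding helps only when the handle is seekable; a pipe is not.
handle.seek(0)
after_rewind = count_elements(handle)
```

So a reader that takes a handle has to either collect everything it
needs in a single pass or buffer the input itself.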

> 
> There are some simple ways around this, one is to read everything from the
> stream/file and store it as one giant string and then re-pass this string
> as input to the SAX parser. This will use up a large amount of memory and
> break on very large files.  The problem is that the SAX parser is
> expecting to read to the end of the document not to the end of a
> <seq></seq> block.  Anyone else had a chance to look over this and think
> about it?

See my previous comment.  Keep in mind that SAX itself
has very little memory overhead, since it reports events
as it reads rather than building the whole document in
memory.
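
As a rough illustration of the chunked style, the sketch below feeds
a parser fixed-size blocks so the whole document never has to sit in
one string. It uses Python's standard-library expat-based SAX reader,
not the Perl parser used by the game code; parse_in_chunks and
SeqIdCollector are hypothetical names, and the seq/id markup is again
a simplified stand-in for game XML.

```python
import io
import xml.sax


class SeqIdCollector(xml.sax.ContentHandler):
    """Keep only the sequence ids; everything else is discarded."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def startElement(self, name, attrs):
        if name == "seq":
            self.ids.append(attrs["id"])


def parse_in_chunks(stream, handler, chunk_size=16):
    # Hypothetical helper: feed the parser one small block at a time.
    parser = xml.sax.make_parser()
    parser.setContentHandler(handler)
    chunks = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        parser.feed(chunk)    # events fire as soon as tags are complete
        chunks += 1
    parser.close()            # signals end of document to the parser
    return chunks


stream = io.BytesIO(b"<game><seq id='seq1'/><seq id='seq2'/></game>")
handler = SeqIdCollector()
chunks_fed = parse_in_chunks(stream, handler)
```

The tiny chunk size is only there to force several feed calls on a
short document. Memory use then depends on what the handler chooses
to keep, not on the size of the file.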

> 
> I checked in some changes to try and keep this SeqIO compliant --
> specifically expecting -file=>filename instead of just reading the 2nd
> argument.

Thanks.  This is the type of thing I need the most
help on.

Brad

> 
> 113,114c113,114
> <   $self->{file}=@args[1];
> ---
> >   ($self->{file} ) = $self->_rearrange( [ qw(FILE) ], @args);
> >   $self->throw("did not specify a file to read, Filehandle suport is not implemented currently") if( !defined $self->{file});
> 140,141c140
> 
> Jason Stajich
> jason@chg.mc.duke.edu
> Center for Human Genetics
> Duke University Medical Center 
> http://www.chg.mc.duke.edu/ 
> 
> 

