[BioRuby] FlatFile GFF
Naohisa GOTO
ngoto at gen-info.osaka-u.ac.jp
Thu Apr 1 13:41:27 UTC 2010
Hi,
On Thu, 1 Apr 2010 11:33:27 +1100
Ben Woodcroft <donttrustben at gmail.com> wrote:
> Hi,
>
> I have a conceptual question for the list. When I open a gff2 file using
> Bio::FlatFile, the next_entry method gives me all of the lines at once (in
> the form of a Bio::GFF::GFF2 object).
>
> f = Bio::FlatFile.open(Bio::GFF::GFF2,"some.gff2") => Bio::FlatFile
> g = f.next_entry => Bio::GFF::GFF2 object
> g.records => array of GFF2 records
>
> To me, this seems a little counter-intuitive. I expected to get info for a
> single line of the GFF file from FlatFile#next_entry
The design of the Bio::GFF classes was determined by their original
authors. I don't know exactly what they had in mind, but I suppose that,
because GFF can have header lines, sequences in Fasta format, and
relations that span two or more lines, they thought it would be
easier to gather all the information in a file into a single object.
Because Bio::FlatFile supports many file formats, format-specific
details may sometimes be omitted or "normalized".
> The other problem is that the whole file must be parsed at the beginning,
> and this can cause memory problems when using large GFF files (e.g. the
> current WormBase gff2 is 2.6GB).
To overcome the problem, the Bio::GFF classes may need to be reorganized.
Bio::FlatFile is only a controller with an input buffer, and format-specific
things should be implemented in the format's parser and splitter classes.
For now, as a workaround, use Bio::GFF::GFF2::Record directly without
using Bio::FlatFile.
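A minimal sketch of that workaround, reading the file line by line so that
only one record is in memory at a time. The helper name each_gff_record and
its record_class parameter are hypothetical and not part of BioRuby; the
assumption is that record_class.new accepts a single GFF line as a String,
which is how Bio::GFF::GFF2::Record would be used here.

```ruby
# Stream a GFF file one line at a time instead of loading it whole.
# each_gff_record is a hypothetical helper, not a BioRuby method.
# record_class is any class whose .new takes one GFF line as a String;
# with BioRuby loaded, that would be Bio::GFF::GFF2::Record (assumption).
def each_gff_record(path, record_class)
  # Without a block, return an Enumerator so callers can chain lazily.
  return enum_for(:each_gff_record, path, record_class) unless block_given?

  File.foreach(path) do |line|
    line = line.chomp
    # Skip blank lines and "#" comment/meta lines; they are not feature rows.
    next if line.empty? || line.start_with?('#')

    yield record_class.new(line)
  end
end

# Usage with BioRuby (assumes the bio gem is installed):
#   require 'bio'
#   each_gff_record('some.gff2', Bio::GFF::GFF2::Record) do |rec|
#     puts [rec.seqname, rec.feature, rec.start, rec.end].join("\t")
#   end
```

Note that this treats every line independently, so header lines and
relations that span several lines are not collected as they are in a
Bio::GFF::GFF2 object.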
> To get around the problem I can use File.foreach('some.gff2') and then parse
> each line using Bio::GFF::GFF2. I'm not sure what the situation is with
> other file formats.
>
> So, my question is, could we introduce a foreach method into FlatFile that
> iterates (without parsing all at once so it is light on memory) over the
> GFF/etc entries in the file? Ideally we could change next_entry, but that
> wouldn't be backwards compatible I don't think.
I'm against this, because it is basically not a Bio::FlatFile issue
but a Bio::GFF design problem, and modifying only Bio::FlatFile
would not solve the problem.
Besides, the method name would be confusing, because we already have
Bio::FlatFile.foreach and Bio::FlatFile#each.
http://bioruby.org/rdoc/classes/Bio/FlatFile.html#M002156 (foreach)
http://bioruby.org/rdoc/classes/Bio/FlatFile.html#M002168 (each)
I'm thinking of implementing another GFF parser front-end class that
can be specified as a file format.
ff = Bio::FlatFile.open(Bio::GFF::AltParser, "xxx.gff")
Alternatively, optional parameters could be added to Bio::FlatFile
that change the parameters passed to the parser and splitter
classes for the format.
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org