[Bioperl-l] sets of sequences - how to read?

Fields, Christopher J cjfields at illinois.edu
Fri May 17 17:37:53 UTC 2013


On May 17, 2013, at 12:12 AM, Carnë Draug <carandraug+dev at gmail.com>
 wrote:

> On 17 May 2013 05:08, Fields, Christopher J <cjfields at illinois.edu> wrote:
>> This doesn't surprise me too much; I know there have been some changes brewing, but didn't know when they would land.  I guess that would be... <looks at watch>... now.
>> 
>> My feeling is this will require writing some code for a higher-level layer of abstraction, say a Bio::DB::* (which would allow some internal indexing of the files, maybe using a Bio::Index::*, lookups for specific gene IDs, etc.).  How hard that would be to implement is another thing; I have no idea without seeing what the data look like, beyond the fact that they're in ASN.1.
> 
> :s I'm not sure I understood your suggestion. I think the problem is
> just the introduction of a new concept, a "set" of stuff (genes in
> this case), and how SeqIO should handle multiple sets.
> 
> Carnë


(note: the critical point here is whether Bio::ASN1::Entrezgene would allow this; I'm not sure it would.  Otherwise this is all really hand-wavy.)

To me, a 'set of stuff', particularly when the 'stuff' is stored sequentially in a flat file, is a simple 'database' or 'store' of similar items: the class lets one look up particular members of the set, but it could also hold higher-level information about the set as a whole if needed.  If it were me, I would implement a method particular to Bio::SeqIO::entrezgene that creates and returns this set ( next_geneset(), for instance ); next_seq() could then iterate through the items in that database/store.
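To make the shape of that interface concrete, here is a plain-Perl sketch with no BioPerl dependency.  The method name next_seq() follows the proposal above; the Toy::GeneSet class and all of its plumbing are invented purely for illustration, not anything that exists in Bio::SeqIO::entrezgene today:

```perl
#!/usr/bin/env perl
# Toy sketch: a "gene set" object that holds set-level metadata and
# hands members out one at a time via next_seq(), as proposed above.
# Everything here is hypothetical plumbing for illustration only.
use strict;
use warnings;

package Toy::GeneSet;

sub new {
    my ($class, %args) = @_;
    my $self = {
        description => $args{description} // '',
        members     => $args{members}     // [],  # would be parsed on demand in real life
        cursor      => 0,
    };
    return bless $self, $class;
}

# higher-level information about the set as a whole
sub description { $_[0]->{description} }
sub size        { scalar @{ $_[0]->{members} } }

# iterate through the items in the set
sub next_seq {
    my $self = shift;
    return undef if $self->{cursor} >= $self->size;
    return $self->{members}[ $self->{cursor}++ ];
}

package main;

my $set = Toy::GeneSet->new(
    description => 'toy set of gene IDs',
    members     => [qw(BRCA2 TP53 EGFR)],
);

while ( defined( my $gene = $set->next_seq ) ) {
    print "$gene\n";
}
```

In real code the members array would be replaced by lazy lookups into the underlying flat file, but the caller-facing iteration would look the same.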

Two useful things come out of this.  First, if the data for the Entrez Gene file/chunk are parsed to store byte offsets per ID, one would only need to parse out the chunks needed (from an ID's offset to the next offset), passing each chunk into the parser and creating objects on the fly.  This would probably be as fast as or faster than (for instance) the greedy approach of parsing the entire file and storing everything in objects up front, then iterating through those objects one at a time, which I believe is the current behavior.
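The offset idea needs nothing beyond core Perl's tell/seek.  Here is a sketch against a toy flat-file format (one ">ID" header line per record; the format and IDs are made up, and a real version would hand each chunk to the actual parser rather than print it):

```perl
#!/usr/bin/env perl
# Sketch of offset-based lazy parsing: one cheap pass records the byte
# offset of each record; fetching then seeks straight to that offset and
# reads only up to the next record, so nothing else is parsed.
use strict;
use warnings;
use File::Temp qw(tempfile);

# write a toy flat file (stand-in for the Entrez Gene chunk)
my ($out, $file) = tempfile();
print $out ">gene1\ndata for gene1\n>gene2\ndata for gene2\n>gene3\ndata for gene3\n";
close $out;

# pass 1: index byte offsets per ID (the cheap part)
open my $in, '<', $file or die $!;
my %offset;
while (1) {
    my $pos  = tell $in;
    my $line = <$in>;
    last unless defined $line;
    $offset{$1} = $pos if $line =~ /^>(\S+)/;
}

# fetch one record lazily: seek to its offset, read until the next header
sub fetch {
    my ($id) = @_;
    seek $in, $offset{$id}, 0 or die $!;
    my $chunk = '';
    while ( defined( my $line = <$in> ) ) {
        last if $chunk && $line =~ /^>/;   # stop at the next record
        $chunk .= $line;
    }
    return $chunk;   # this is what you'd hand to the real parser
}

print fetch('gene2');
```

Only the requested chunk is ever read after the indexing pass, which is where the speed win over the parse-everything-up-front approach would come from.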

Second: if an index is created, the upfront cost is paid only once; you could reuse the same index when parsing the same data again.

An analogous example might be storing all FASTQ data in a sequencing run; I don't want to expend the effort to parse all the FASTQ data, but I may want to run operations on individual items in the set, as well as store additional information about the data as a whole (barcodes per run, lanes, overall quality stats, etc.).
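The FASTQ case can be sketched the same way: one cheap pass collects the per-read offsets plus some set-level summary, and full records are only read on demand.  The "BC:" barcode tag in the headers below is an invented convention for illustration, not a FASTQ standard:

```perl
#!/usr/bin/env perl
# Sketch of the FASTQ analogy: index record offsets by read ID in one
# pass, gather run-level metadata (read count, barcodes seen) along the
# way, and fully read only the records actually requested.
use strict;
use warnings;
use File::Temp qw(tempfile);

my ($out, $file) = tempfile();
print $out "\@read1 BC:ACGT\nAAAA\n+\nIIII\n\@read2 BC:TTGG\nCCCC\n+\nJJJJ\n";
close $out;

open my $in, '<', $file or die $!;
my (%offset, %barcodes, $count);
while (1) {
    my $pos    = tell $in;
    my $header = <$in>;
    last unless defined $header;
    if ( $header =~ /^\@(\S+)(?:\s+BC:(\S+))?/ ) {
        $offset{$1} = $pos;          # where this read's record starts
        $barcodes{$2}++ if defined $2;
        $count++;
    }
    <$in> for 1 .. 3;                # skip sequence, '+', and quality lines
}

# fetch a single read without touching the rest of the run
sub fetch_read {
    my ($id) = @_;
    seek $in, $offset{$id}, 0 or die $!;
    my @rec = map { scalar <$in> } 1 .. 4;   # a FASTQ record is 4 lines
    return join '', @rec;
}

print "reads: $count; barcodes: ", join( ',', sort keys %barcodes ), "\n";
print fetch_read('read2');
```

The set-level summary (%barcodes, $count) is exactly the kind of "information about the data as a whole" that the store object could expose alongside the per-item lookups.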

Does that make sense?  The pieces for this are already lying around: Bio::Index::*, for instance, has methods for indexing flat files, and there are classes like Bio::DB::Fasta.

chris
