[Biopython-dev] Where should feature intersection code go?

Mon Feb 8 21:49:20 UTC 2010

I'm working on a project that's looking for alternative splicing using 
Solexa data instead of microarray data.  Basically we've got a GFF file 
containing all the genes, introns and exons, plus 35M reads that have 
each been aligned to one of the chromosomes by the excellent bowtie 
application out of Maryland.

Bowtie output is documented here:
http://bowtie-bio.sourceforge.net/manual.shtml#default-bowtie-output

In summary it's roughly a cross between fastq and GFF.  Each line has the 
read name, strand, name of the reference sequence the read aligned to, 
position, read sequence, quality, and a few others.  It seems like it 
could rather easily be coerced into a SeqRecord 
(http://biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html).  
The SeqRecord might not get filled in completely, but it'd be better than 
handling things in a one-off way.
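
To make the idea concrete, here is a rough sketch of reading one line of 
the default bowtie output.  It uses only the standard library, so the 
BowtieHit container and parse_bowtie_line function are hypothetical names 
of mine, not anything in Biopython or bowtie.  It assumes tab-separated 
columns in the order listed above, with a 0-based position as the manual 
describes.  A real parser would presumably build a SeqRecord from these 
fields instead.

```python
from typing import NamedTuple, Tuple


class BowtieHit(NamedTuple):
    """One alignment line from bowtie's default output (hypothetical container)."""
    read_name: str
    strand: str              # "+" or "-"
    reference: str           # name of the reference sequence, e.g. a chromosome
    offset: int              # 0-based offset into the forward reference strand
    sequence: str            # read sequence as reported by bowtie
    qualities: str           # ASCII-encoded qualities, left undecoded here
    extra: Tuple[str, ...]   # any remaining columns, kept verbatim


def parse_bowtie_line(line: str) -> BowtieHit:
    """Split one tab-separated line into a BowtieHit."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 6:
        raise ValueError("expected at least 6 tab-separated columns, got %d"
                         % len(fields))
    name, strand, reference, offset, sequence, qualities = fields[:6]
    return BowtieHit(name, strand, reference, int(offset),
                     sequence, qualities, tuple(fields[6:]))
```

The read name would map naturally onto SeqRecord.id, the sequence onto 
SeqRecord.seq, and the qualities onto per-letter annotations, with the 
strand, reference name and position left over for a feature or the 
annotations dictionary.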

The FeatureLocation class provides for approximate and exact locations 
(both start and stop positions).  It seems like the correct location to 
put code that determines if two FeatureLocations overlap, or if one 
contains another, or is contained by another. 
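
As a sketch of the comparisons I have in mind, here is the logic for 
exact positions only.  The Interval class is a stand-in I made up so the 
example runs on its own; it is not FeatureLocation, and it assumes 
0-based, end-exclusive coordinates.  How approximate (fuzzy) positions 
should be compared is an open question that this does not address.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical stand-in for an exact location: 0-based, end-exclusive."""
    start: int
    end: int


def overlaps(a: Interval, b: Interval) -> bool:
    """True if a and b share at least one position."""
    return a.start < b.end and b.start < a.end


def contains(outer: Interval, inner: Interval) -> bool:
    """True if every position of inner also lies within outer."""
    return outer.start <= inner.start and inner.end <= outer.end


exon = Interval(100, 200)
read_inside = Interval(120, 155)
read_spanning_edge = Interval(190, 225)

print(contains(exon, read_inside))         # read falls wholly within the exon
print(overlaps(exon, read_spanning_edge))  # read crosses the exon boundary
print(contains(exon, read_spanning_edge))  # so it is not contained
```

"Is contained by" is just contains() with the arguments swapped, so two 
functions cover all three cases.  For splicing work, a read that overlaps 
an exon without being contained in it is the interesting case, since it 
crosses an exon boundary.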

Overall I'm talking about writing a bowtie .map parser and the 
comparison code for FeatureLocation.  Would these be welcome features?

Thanks,
Mike