[Biopython-dev] Where should feature intersection code go?

Tue Feb 9 01:04:25 UTC 2010

Mike;

> I'm working on a project that's looking for alternative splicing
> using solexa data instead of microarray data.  Basically we've got a
> GFF file containing all the genes, introns and exons and 35M reads
> that have been placed into one of the various chromosomes via the
> excellent bowtie application out of Maryland.

[...]

> Overall I'm talking about writing a bowtie .map parser and the
> comparison code for FeatureLocation.  Would these be welcome
> features?

A .map parser would definitely be useful. Another suggestion is to
get Bowtie to produce SAM format and use Pysam for parsing:

http://code.google.com/p/pysam/

The advantage of SAM is that it's an emerging standard and a lot of
downstream applications can use it. This way you can switch aligners
in your workflow without much disruption.

For doing feature overlaps, IntervalTree in bx-python is excellent:

http://bitbucket.org/james_taylor/bx-python/wiki/Home
http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/intervals/intersection.pyx

See the doc string of the IntervalTree class for how to use it. My
normal workflow is to build an IntervalTree with the GFF features of
your genome, and then loop through the alignment file finding
features that each alignment intersects.

For alternative splicing, are you using the raw genome or a built
transcriptome for all possible combinations of exons? One practical
thing to consider if that a read will not be aligned to the genome
if it splits an exon/exon junction.

Hope this helps,
Brad