[Biopython-dev] Sequential SFF IO

Fri Jan 28 13:54:47 UTC 2011

On Fri, Jan 28, 2011 at 7:34 AM, Brad Chapman <chapmanb at 50mail.com> wrote:

> Kevin and Peter;
> I'm really enjoying this discussion -- thanks for talking this
> through here.
>
> > For just 5' barcode detection, I am using a memoized scheme that computes
> > anchored alignments and then stores the result in a hash table
> > (match/mismatch, edit distance).  This approach allows me to reject
> > barcodes with too small an edit distance to the next best candidate.  It is
> > reasonably fast for our fairly long 454 barcode set (10-mers), though I do
> > have an optional Cython version of the edit distance routine.  The
> > pure-Python version is pretty zippy and can decode a 454 run in a minute or
> > two.
>
> This sounds like a nice approach. Do you have code available or is
> it not packaged up yet?
>

It is still under development: some of the refinements I mentioned live in a
non-public branch and have not yet percolated out to my Google Code version.
However, a previous version is available from:

http://code.google.com/p/glu-genetics/source/browse/glu/modules/seq/unbarcode.py#
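
For a rough idea of the scheme, here is a minimal sketch. It is not the code
in unbarcode.py; the names (anchored_edit_distance, make_decoder) and the
parameters (max_distance, min_margin, pad) are made up for illustration, and
functools.lru_cache stands in for the hash table:

```python
from functools import lru_cache


def anchored_edit_distance(barcode, read):
    """Edit distance between barcode and its best-matching prefix of read.

    Both sequences are anchored at the 5' end; the 3' end of the read is free.
    """
    # prev[j] = cost of aligning the barcode consumed so far to read[:j]
    prev = list(range(len(read) + 1))
    for i, b in enumerate(barcode, 1):
        cur = [i]
        for j, r in enumerate(read, 1):
            cur.append(min(prev[j] + 1,              # barcode base missing from read
                           cur[j - 1] + 1,           # extra base in read
                           prev[j - 1] + (b != r)))  # match or mismatch
        prev = cur
    # Free end: take the cheapest read prefix for the full barcode.
    return min(prev)


def make_decoder(barcodes, max_distance=2, min_margin=1, pad=2):
    """Return decode(read) -> (barcode or None, best edit distance).

    All thresholds here are illustrative assumptions, not tuned values.
    """
    barcodes = tuple(barcodes)
    # Only the first few bases of the read matter, so they form the cache key.
    window = max(len(b) for b in barcodes) + pad

    @lru_cache(maxsize=None)
    def decode_prefix(prefix):
        scored = sorted((anchored_edit_distance(b, prefix), b) for b in barcodes)
        best_dist, best = scored[0]
        next_dist = scored[1][0] if len(scored) > 1 else float("inf")
        # Reject weak hits and hits too close to the runner-up.
        if best_dist > max_distance or next_dist - best_dist < min_margin:
            return None, best_dist
        return best, best_dist

    def decode(read):
        return decode_prefix(read[:window])

    return decode
```

Since many reads share the same leading bases, most calls to decode() become
cache lookups rather than fresh alignments, which is where the speed comes
from.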

> I wrote up a barcode detector, remover and sorter for our Illumina
> reads. There is nothing especially tricky in the implementation: it
> looks for exact matches and then checks for approximate matches,
> with gaps, using pairwise2:
>
>
> https://github.com/chapmanb/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py
>
> The "best_match" function could be replaced with different
> implementations, using the rest of the script as scaffolding to do
> all of the other sorting, trimming and output.
>
>
Nice! I didn't know about pairwise2, though I figured Biopython would have
something along those lines.

-Kevin