[Biopython-dev] Sequential SFF IO

Mon Feb 7 12:23:56 UTC 2011

Peter;

> The computationally interesting part is matching the primer/adapter/
> barcode to the read (both of which may contain IUPAC ambiguity codes),
> which as you point out can be replaced once you have a working
> framework for the input, output, trimming, etc.

Absolutely. I'd be very happy if you wanted to take the framework in
the script and generalize it to support different matching approaches.
Let me know what I can do to help.

> Currently I'm using regular expressions, which is fast enough for my
> own needs - and this task could easily be parallelised by breaking
> up the input reads. Beyond that perhaps something based on
> Hamming distances (edit distance - number of mismatches) or
> Levenshtein searches might be quicker. I guess speed is more of
> an issue with Illumina than with 454 due to the number of reads?
> 
> Brad - you mentioned using approximate matches with gaps. Did you
> find gapped matches made a big difference to the number of matches
> found? i.e. is it worthwhile on your data?

The large majority of barcodes are found by exact matching via a
dictionary lookup, so the gapped/mismatch alignments are only
needed for barcodes with sequencing errors. For Illumina reads,
gaps aren't as common, so the mismatch alignments are the more
useful of the two, but I tried to keep it general so as to catch
as many cases as possible.
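
As a rough illustration of that two-step idea (this is not the code from
the script; the names find_barcode, hamming and max_mismatches are made
up for the example), a minimal Python sketch might look like the
following. It assumes all barcodes have the same length and sit at the
start of the read, and it only covers mismatches, not gaps:

```python
def hamming(a, b):
    """Count mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x != y for x, y in zip(a, b))


def find_barcode(read, barcodes, max_mismatches=1):
    """Return (sample_name, mismatches), or (None, None) if nothing fits.

    barcodes is a dict mapping barcode sequence -> sample name.
    Assumption: every barcode has the same length and is expected at
    the 5' end of the read.
    """
    size = len(next(iter(barcodes)))
    prefix = read[:size]
    if len(prefix) < size:
        return None, None
    # Fast path: exact match via dictionary lookup.
    if prefix in barcodes:
        return barcodes[prefix], 0
    # Slow path: only reached for reads with sequencing errors.
    hits = sorted(
        (hamming(prefix, bc), name)
        for bc, name in barcodes.items()
        if hamming(prefix, bc) <= max_mismatches
    )
    if not hits:
        return None, None
    # Reject ties, since the barcode assignment would be ambiguous.
    if len(hits) > 1 and hits[0][0] == hits[1][0]:
        return None, None
    distance, name = hits[0]
    return name, distance
```

Since most reads take the dictionary path, the slower comparison loop
only runs on the small fraction with errors. Handling gaps as well would
need a real alignment step in place of hamming().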

Brad