[Biopython-dev] Sequential SFF IO

Thu Feb 10 15:10:19 UTC 2011

On Mon, Feb 7, 2011 at 12:23 PM, Brad Chapman <chapmanb at 50mail.com> wrote:
> Peter;
>
>> The computationally interesting part is matching the primer/adapter/
>> barcode to the read (both of which may contain IUPAC ambiguity codes),
>> which as you point out can be replaced once you have a working
>> framework for the input, output, trimming, etc.
>
> Absolutely. I'd be very happy if you wanted to take the framework in
> the script and generalize it for different matching. Let me know
> what I can do to help.

Do you have (or can you point me at) any good sample data with
barcodes, custom adapters or primer sequences? For example, some
SRA accession numbers you've been using.

>> Currently I'm using regular expressions, which is fast enough for my
>> own needs - and this task could easily be parallelised by breaking
>> up the input reads. Beyond that perhaps something based on
>> Hamming distances (number of mismatching positions) or
>> Levenshtein searches might be quicker. I guess speed is more of
>> an issue with Illumina than with 454 due to the number of reads?
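
To make the matching problem concrete, here is a minimal sketch of the
two approaches mentioned above: a regular expression built from a
primer containing IUPAC codes, and a sliding Hamming-style mismatch
count. This is illustrative only and is not the code in
seq_primer_clip.py. The function names iupac_to_regex and
find_with_mismatches are made up for this example, and it assumes two
letters "match" whenever the sets of bases they stand for overlap
(so an N in the read matches anything).

```python
import re

# IUPAC nucleotide codes mapped to the bases they can represent.
# U is treated as T here (an assumption made for this sketch).
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def compatible(a, b):
    """True if two IUPAC letters share at least one possible base."""
    return bool(set(IUPAC[a]) & set(IUPAC[b]))


def iupac_to_regex(primer):
    """Compile a regex that matches the primer against an uppercase read.

    Each primer letter becomes a character class listing every IUPAC
    letter compatible with it, so ambiguity codes on either side match.
    Letters outside the IUPAC table (e.g. gaps) raise KeyError.
    """
    parts = []
    for letter in primer.upper():
        chars = sorted(code for code in IUPAC if compatible(letter, code))
        parts.append("[" + "".join(chars) + "]")
    return re.compile("".join(parts))


def find_with_mismatches(primer, read, max_mismatches):
    """Return the first offset where the primer fits the read with at
    most max_mismatches incompatible positions, or -1 if there is none.

    This is a plain sliding-window Hamming count (substitutions only,
    no insertions or deletions).
    """
    primer = primer.upper()
    read = read.upper()
    for start in range(len(read) - len(primer) + 1):
        window = read[start:start + len(primer)]
        mismatches = sum(
            1 for p, r in zip(primer, window) if not compatible(p, r)
        )
        if mismatches <= max_mismatches:
            return start
    return -1
```

For example, iupac_to_regex("GAR").search("TTGAATT") matches at
offset 2, and find_with_mismatches("ACGT", "GGACTTGG", 1) returns 2
(one mismatch against ACTT). The sliding window is O(read length x
primer length) per read, so the regex is likely the faster choice when
no mismatches are allowed.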

I originally had three separate tools (with shared code) for working
with FASTA, FASTQ and SFF reads, which I have recently combined
into a single tool that handles all three formats. The code is here if
anyone wants to look at it:

https://bitbucket.org/peterjc/galaxy-central/src/filter_fasta/tools/primers/

seq_primer_clip.py - Python script
seq_primer_clip.xml - Galaxy wrapper
seq_primer_clip.txt - readme file

This is still a work in progress...

Peter