[Biopython] Trimming FASTQ reads, was: [Velvet-users] Read length as a parameter?

Fri Sep 25 08:29:22 EDT 2009

Hi all,

I meant to forward this earlier, but it looks like I didn't. I've also just
posted a related blog post on the topic:
http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

Peter

---------- Forwarded message ----------
From: Peter <peter at maubp.freeserve.co.uk>
Date: Fri, Sep 25, 2009 at 11:46 AM
Subject: Re: [Velvet-users] Read length as a parameter?
To: Daniel Zerbino <zerbino at ebi.ac.uk>
Cc: Dan Bolser <dan.bolser at gmail.com>, velvet-users at ebi.ac.uk

Hi Velvet uses & Biopython fans,

I've CC'd this to the Biopython list as the examples may be of
interest there too. We are talking about scripts to pre-filter
sequencing reads before analysis with another tool (in this
case, the assembler velvet). The original thread is here:
http://listserver.ebi.ac.uk/pipermail/velvet-users/2009-September/000578.html

On Fri, Sep 25, 2009 at 10:45 AM, Daniel Zerbino <zerbino at ebi.ac.uk> wrote:
>
> Hello Yunchen and Dan,
>
> I'm afraid Velvet does not offer either length or k-mer frequency
> pre-filtering, although the cov_cutoff is a k-mer frequency post-filtering.
>
> Given practical considerations, I don't think I can implement this in the
> next months.
>
> However, what can be done is to have a simple script which does the
> filtering then pipes them to velvet:
>
> my_filtering_script.xx my_reads.fa | velveth directory 21 -fasta - ...
>
> In the  case of k-mer frequency filtering, you could imagine preparing a
> pseudo-fastq file which assigns a score to each nucleotide based on the
> frequency of the k-mer ending at that position, then scripting a score
> filter which pipes into Velvet.
>
> As usual, if anyone is willing to put forward such scripts for the other
> users, I will be happy to put it in the package.
>
> Best regards,
>
> Daniel

Was that a challenge? ;)  It probably won't be the fastest solution,
but it is very easy to do this with Biopython's SeqIO library.

#!/usr/bin/python
# This is a simple python script using Biopython 1.50+
# to read in FASTA records from stdin, trim to 21 letters,
# and write them to stdout.
import sys
from Bio import SeqIO
records = (rec[:21] for rec in SeqIO.parse(sys.stdin, "fasta"))
SeqIO.write(records, sys.stdout, "fasta")

This isn't (yet) a full script with command line arguments etc.
You could also do this with filenames, but to keep the examples
short I'm using stdin and stdout (not a problem for those happy
at the command line).

Because FASTA files are so simple, it would be fairly trivial to
to write a plain Python script (without using Biopython) which
runs faster than this.

However (and this is a sales pitch), just by changing the format
name the above script would also work on other file formats.
For example, with Biopython 1.51+ this would work fine for
FASTQ files too. However, if speed is an issue (often the case
with large next gen sequencing files), then a lower level python
script is also possible, e.g.:

#!/usr/bin/python
# This is a fairly simple python script using Biopython 1.51+
# to read in FASTQ records from stdin, trim to 21 letters,
# and write them to stdout. It does not check the quality
# strings at all, and should therefore work on Sanger,
# Solexa or Illumina 1.3+ FASTQ files equally well.
import sys
from Bio.SeqIO.QualityIO import FastqGeneralIterator
for title, seq, qual in FastqGeneralIterator(sys.stdin) :
   print "@%s\n%s\n+\n%s" % (title, seq[:21], qual[:21])
   #The print statement will include a trailing newline

Both these examples are just four lines of code (two of
which are imports), pretty neat if I do say so myself ;)

Peter