[Biopython-dev] 4/14 BioStar - Biopython Questions

Wed Apr 14 06:12:50 UTC 2010

==================================================
1. extracting a subset of sequences from a FASTQ file (BioPython speed)
==================================================
April 13, 2010 at 8:09 AM

Initially my problem was to extract all entries from a FASTQ file with names not present in a FASTA file. Using biopython I wrote:

from Bio.SeqIO.QualityIO import FastqGeneralIterator

corrected_fn   = "my_input_fasta.fas"
uncorrected_fn = "my_input_fastq.ftq"
output_fn      = "differences_fastq.ftq"

corrected_names = []
for line in open(corrected_fn):
        if line[0] == ">":
                read_name = line.split()[0][1:] 
                corrected_names.append(read_name)

input_fastq_fn = uncorrected_fn
corrected_names.sort()

handle = open(output_fn, "w")
for title, seq, qual in FastqGeneralIterator(open(input_fastq_fn)) :
        if title not in corrected_names:
                handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
handle.close()

Problem is, it is very slow. On 2Ghz workstation starting from a local disc it can take two days per pair of files:

4870868 seqs in FASTQ 
4299464 seqs in FASTA

Removing title from corrected_names speeds up things a bit (this version I used for running). 

Am I doing something obviously silly or simply FastqGeneralIterator is not a best construct to use here? While I like Python best, I am open to answers in Perl/Ruby. 

Slicing and dicing FASTQ files based on lists seems to be fairly common task.

Edit: Python 2.6.4, biopython 1.53, Linux Fedora 8. 

Edit 2: 

corrected one line of code, see comment to giovanni
code snippet taken from: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

http://biostar.stackexchange.com/questions/671/extracting-a-subset-of-sequences-from-a-fastq-file-biopython-speed

--------------------------------------------------

===========================================================
Source: http://biostar.stackexchange.com/questions/tagged/biopython
This email was sent to biopython-dev at lists.open-bio.org.

Account Login: 
https://www.feedmyinbox.com/members/login/
Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311791/6ca55937c6ac7ef56420a858404addee7b17d3e7/
-----------------------------------------------------------
This email was carefully delivered by FeedMyInbox.com. 
230 Franklin Road Suite 814 Franklin, TN 37064