[Biopython-dev] 4/14 BioStar - Biopython Questions
Feed My Inbox
updates at feedmyinbox.com
Wed Apr 14 06:12:50 UTC 2010
==================================================
1. extracting a subset of sequences from a FASTQ file (BioPython speed)
==================================================
April 13, 2010 at 8:09 AM
Initially my problem was to extract all entries from a FASTQ file with names not present in a FASTA file. Using biopython I wrote:
from Bio.SeqIO.QualityIO import FastqGeneralIterator
corrected_fn = "my_input_fasta.fas"
uncorrected_fn = "my_input_fastq.ftq"
output_fn = "differences_fastq.ftq"
corrected_names = []
for line in open(corrected_fn):
if line[0] == ">":
read_name = line.split()[0][1:]
corrected_names.append(read_name)
input_fastq_fn = uncorrected_fn
corrected_names.sort()
handle = open(output_fn, "w")
for title, seq, qual in FastqGeneralIterator(open(input_fastq_fn)) :
if title not in corrected_names:
handle.write("@%s\n%s\n+\n%s\n" % (title, seq, qual))
handle.close()
Problem is, it is very slow. On 2Ghz workstation starting from a local disc it can take two days per pair of files:
4870868 seqs in FASTQ
4299464 seqs in FASTA
Removing title from corrected_names speeds up things a bit (this version I used for running).
Am I doing something obviously silly or simply FastqGeneralIterator is not a best construct to use here? While I like Python best, I am open to answers in Perl/Ruby.
Slicing and dicing FASTQ files based on lists seems to be fairly common task.
Edit: Python 2.6.4, biopython 1.53, Linux Fedora 8.
Edit 2:
corrected one line of code, see comment to giovanni
code snippet taken from: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
http://biostar.stackexchange.com/questions/671/extracting-a-subset-of-sequences-from-a-fastq-file-biopython-speed
--------------------------------------------------
===========================================================
Source: http://biostar.stackexchange.com/questions/tagged/biopython
This email was sent to biopython-dev at lists.open-bio.org.
Account Login:
https://www.feedmyinbox.com/members/login/
Don't want to receive this feed any longer? Unsubscribe here: http://www.feedmyinbox.com/feeds/unsubscribe/311791/6ca55937c6ac7ef56420a858404addee7b17d3e7/
-----------------------------------------------------------
This email was carefully delivered by FeedMyInbox.com.
230 Franklin Road Suite 814 Franklin, TN 37064
More information about the Biopython-dev
mailing list