[Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp

Tue Sep 15 12:43:04 EDT 2009

On Tue, Sep 15, 2009 at 4:14 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Tue, Sep 15, 2009 at 4:02 PM, natassa <natassa_g_2000 at yahoo.com> wrote:
>>
>>> If you prefer to work in Python, it should be easy to recreate
>>> a Biopython version of the same script. Which script are we
>>> talking about? Is it publicly available?
>>
>> It is called shuffleSequences_fasta.pl and goes along with the
>> (free) distribution of velvet (Zerbino, EBI). The script is really
>> simple.
>
> Oh right - you can see the scripts on Daniel's github repository,
> http://github.com/dzerbino/velvet
>
> Both scripts are very very simple minded, which means fixing
> the bug will actually be a big change:
>
> shuffleSequences_fasta.pl appears to assume every FASTA
> entry is exactly two lines (a safe assumption for short reads
> like 36bp from early Solexa/Illumina), but not a safe choice
> in general as wrapping in FASTA is normal.
>
> shuffleSequences_fastq.pl appears to assume every FASTQ
> entry is exactly four lines (a reasonable assumption, especially
> for short reads like 36bp reads from early Solexa/Illumina),
> but not a safe choice in general as FASTQ files can also be
> wrapped (even if it is discouraged).
>
> We should be able to mimic these in Biopython using SeqIO...

How about this? I'm using itertools.izip_longest  from Python 2.6+
which should make sure the two input files have the same number
of reads. Using itertools.izip would also work, but will silently ignore
any extra records in one file.

import itertools
from Bio import SeqIO
#Setup variables (could parse command line args here)
fileA = "SRR001666_1.fastq"
fileB = "SRR001666_2.fastq
fileOut = "SRR001666_interleaved.fastq"
format = "fastq"
#Prepare the input (using iterators for memory efficiency)
recordsA = SeqIO.parse(open(fileA,"rU"), format)
recordsB = SeqIO.parse(open(fileB,"rU"), format)
records = itertools.izip_longest(recordsA, recordsB)
#Finally do the interleaved output:
handle = open(fileOut, "w")
count = SeqIO.write(records, handle, format)
handle.close()
print "%i records written to %s" % (count, fileOut)

Rather than using itertools, you could also write a simple generator
function to do the pairing explicitly. Assuming you are dealing with
paired end reads, it would make sense to explicitly check the IDs
match up as expected.

Peter

P.S. This uses the default output settings, for if you did
this for FASTA it would use line wrapping.