[Biopython] SeqIO for fasta conversion of Illumina files with > 60 bp

Peter biopython at maubp.freeserve.co.uk
Tue Sep 15 08:19:25 EDT 2009


On Tue, Sep 15, 2009 at 12:47 PM, natassa wrote:
>
> Hallo Peter,

Hi again,

Note I CC'd the mailing list again.

>>
>> Are you talking about Illumina FASTQ files? i.e. fastq-illumina in SeqIO?
>>
>
> I suppose so, I am quite confused with the names:
> The format is (ex from a file):
> @HWI-EAS293:8:1:0:1311#0/1
> CGCCACTGTTTTTGAGGGACCGCGGGCAGCCGCGGATCCCCAACGCAAGCAGAGCTNNNNGGTTGAAATGACGCTC
> +HWI-EAS293:8:1:0:1311#0/1
> `W_a^a`T``aaa]YIRW^`_X_]XT_``a]U]VWP\^a````_Y_BBBBBBBBBBBBBBBBBBBBBBBBBBBBBB
> So this is fastq-illumina in SeqIO, right?

That does look like a FASTQ file, and you probably know that it
came from a Solexa/Illumina machine. However, it could be an early
Solexa/Illumina file using Solexa scores ("fastq-solexa" in SeqIO),
or a more recent Illumina GA pipeline 1.3+ FASTQ file with PHRED
scores ("fastq-illumina" in SeqIO). From the read length (76bp) I
would guess this probably is an "fastq-illumina" file, but you
should double check this, as it does matter for poor quality reads.

>> Bio.SeqIO defaults to writing FASTA files with 60bp line wrapping.
>> You want to output 75bp FASTA files without line wrapping?
>> As an aside, why?
>
> Yes, that is what i want. I have paired-end reads in 2 separate files
>  that need to combine in one single file for subsequent assembly by
> velvet program. There is a 3rd party perl script in velvet to do this,
> and if I input to this program files converted to Fasta using the
> default argument wrap=60, it does not behave correct. [...]
>
>> Line wrapping is common and normal in
>> FASTA files and in fact is more widely used than non-wrapping.
>> If another software tool can't read line wrapped FASTA it has
>> a bug in my opinion.
>
> You are most probably right, I can report the bug to the person who
> wrote the script, but until now I thought a one-line format would me
> the more common and convenient way.

Please do report this bug in the perl-script, as it will help other
people in future.

If you prefer to work in Python, it should be easy to recreate
a Biopython version of the same script. Which script are we
talking about? Is it publicly available?

>> If you don't like the Bio.SeqIO.write(...) defaults, you can use the
>> underlying writer which may offer some options. In the case of
>> FASTA output, Bio.SeqIO.FastaIO allows you to set the wrapping.
>>
>> e.g.
>>
>> from Bio import SeqIO
>> from Bio.SeqIO.FastaIO import FastaWriter
>> records = SeqIO.parse(open("illumina.fastq"), "fastq-illumina")
>> handle = open("example.fasta", "w")
>> count = FastaWriter(handle, wrap=80).write_file(records)
>> handle.close()
>> print "Converted %i records" % count
>>
>
> Thanks, I will try this out, it should give the same result as directly
> changing the FastaWriter arguments, but surely is a cleaner option!

Yes, it should :)

Regards,

Peter



More information about the Biopython mailing list