[Biopython] replace header

Wed May 30 08:56:57 UTC 2012

On Wed, May 30, 2012 at 4:54 AM, Dilara Ally <dilara.ally at gmail.com> wrote:
> Thanks!  That worked.
>
> On May 29, 2012, at 8:44 PM, Lenna Peterson wrote:
>
>> Hi c,
>>
>> Opening a file for append with 'a' allows successive writes to go to
>> the end of the file.
>>
>> Before the loop:
>>
>>    out_handle = open("newsolid_1.fastq", 'a')
>>
>> In the loop:
>>
>>    SeqIO.write(newfastqrecord, out_handle, "fastq")
>>
>> After the loop:
>>
>>    out_handle.close()

Hi Dilara & Lenna,

I would use append mode with caution - it will have side effects
like if you run this script twice, the output file will double in size
(the first run plus the second run). Wouldn't opening in write
mode work just as well here?
i.e. Open the handle, do the loop, close the handle.

There are some other further changes I would suggest.
First, you don't need to create a new SeqRecord, you can
modify the old record in situ. This will be faster as it avoids
extra object creation:

...
newfastqrecord=SeqRecord(seq_record.seq, id=new_header,
letter_annotations=seq_record.letter_annotations)
...

becomes just:

...
seq_record.id = new_header
...

Next, it is better to call SeqIO.write(...) once to do the whole
file. On simple file formats like FASTA, FASTQ, GenBank,
there is no header/footer structure so you can write each
record independently. In general this is not possible, e.g.
SFF files. Moreover, multiple calls to SeqIO.write(...) is
slower than one single call.

The key point about using SeqIO.write(...) once to do a whole
file is this requires an iterator based approach. For example,
using a generator expression and a function acting on a single
record:

def modify_record(record):
    #Do something sensible to the headers here:
    record.id = "modified"
    return record
#This is a generator expression:
modified = (modify_record(r) for r in SeqIO.parse("solid_1.fastq", "fastq"))
count = SeqIO.write(modified, "newsolid_1.fastq", "fastq")
print "Modified %i records" % count

Equivalently using a generator function which does the
looping itself:

def modify_records(records):
    for record in records:
        #Do something sensible to the headers here:
        record.id = "modified"
        yield record
count = SeqIO.write(modify_records(SeqIO.parse("solid_1.fastq",
"fastq")), "newsolid_1.fastq", "fastq")
print "Modified %i records" % count

Getting to gripes with iterators and thinking this way takes
a while - but it is extremely useful for dealing with large
datasets efficiently (without running out of memory).

Now, For FASTQ in particular, the files are usually very large,
and using SeqIO and SeqRecord objects can be too slow.
You might find this useful:
http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

Peter