[Bioperl-l] dealing with large files

Chris Fields cjfields at uiuc.edu
Wed Dec 19 17:17:28 UTC 2007


On Dec 19, 2007, at 10:45 AM, Stefano Ghignone wrote:

>> Not exactly clear why you aren't using Bio::SeqIO to write the
>> sequence back out in FASTA format and why you are re-opening the file
>> each time?
> It was to avoid keeping the output file open the whole time...
>
>> Did you look at the examples that show how to convert file formats?
>> http://bioperl.org/wiki/HOWTO:SeqIO
> Yes, I did...but I didn't realize how to set a customized
> description...
>
>> You can set the description with
>> $seq->description($newdescription);
>> and the ID with
>> $seq->display_id($newid);
>> before writing.
> Thanks for the hint. Anyway, just using the simple code reported to
> convert EMBL to FASTA format, the results are the same...I remind
> you that I'm using a huge input file,
> uniprot_trembl_bacteria.dat.gz: it contains 13,101,418 sequences!
>
>> It isn't clear to me from your code why it would be leaking memory
>> and causing a problem - is it possible that you have a huge sequence
>> in the EMBL file?
>> -jason
>
> At the end, I succeeded in the format conversion using this command:
>
> gunzip -c uniprot_trembl_bacteria.dat.gz | perl -ne 'print ">$1 " if
> (/^AC\s+(\S+);/); print " $1" if (/^DE\s+(.*)/);print " [$1]\n" if
> (/^OS\s+(.*)/); if (($a)=/^\s+(.*)/){$a=~s/ //g; print "$a\n"};'
>
> (Thanks to Riccardo Percudani). It's not bioperl...but it works!
>
> My best wishes,
> Stefano


As this shows, BioPerl isn't always the best answer (I know,  
blasphemy...).  As Jason suggested, it's quite likely that large  
sequence records are causing your problems when using BioPerl.  The  
one-liner works because it doesn't retain data (sequence, annotation,  
etc.) in memory as Bio::Seq objects; it's a direct line-by-line  
conversion.
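
For reference, the streaming Bio::SeqIO equivalent would look roughly  
like the sketch below.  It is untested against the full TrEMBL file,  
and the exact header layout (accession as the ID, organism appended in  
brackets, mimicking the one-liner's output) is only my guess at what  
Stefano wanted; the 'swiss' format name and the display_id/description  
setters are the pieces Jason mentioned above.

use strict;
use warnings;
use Bio::SeqIO;

# read the uncompressed stream on STDIN (e.g. from gunzip -c) and
# write FASTA on STDOUT -- only one record is held in memory at a time
my $in  = Bio::SeqIO->new(-fh => \*STDIN,  -format => 'swiss');
my $out = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'fasta');

while (my $seq = $in->next_seq) {
    # customize the header before writing, per Jason's hint
    $seq->display_id($seq->accession_number);
    my $org = $seq->species ? ' [' . $seq->species->binomial . ']' : '';
    $seq->description($seq->description . $org);
    $out->write_seq($seq);
}

Run it as: gunzip -c uniprot_trembl_bacteria.dat.gz | perl convert.pl  
> out.fasta ('convert.pl' being whatever you name the script).  Whether  
it gets through all 13 million records without the memory trouble you  
saw is another question; the per-record object overhead is exactly what  
the one-liner avoids.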

It would be nice to code up a lazy sequence object and related  
parsers; maybe for the next dev release.
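
To sketch what I mean (hypothetical code only -- LazySwissIndex is not  
an existing BioPerl module): record the byte offset of each entry in  
one cheap pass, then seek and parse individual records only when they  
are actually requested, so nothing is built in memory up front.  
Bio::Index::Swissprot already does the offset-indexing half of this; a  
truly lazy Bio::Seq would defer the per-record parsing as well.

package LazySwissIndex;
use strict;
use warnings;
use Bio::SeqIO;

# one pass over the flat file: remember where each record starts
sub new {
    my ($class, $file) = @_;
    my (%offset, $start);
    open my $fh, '<', $file or die "can't open $file: $!";
    my $pos = 0;
    while (my $line = <$fh>) {
        $start = $pos        if $line =~ /^ID\s/;        # record begins here
        $offset{$1} = $start if $line =~ /^AC\s+(\S+);/; # key it by accession
        $pos = tell $fh;
    }
    return bless { fh => $fh, offset => \%offset }, $class;
}

# parse a single record, and only on demand
sub get_seq {
    my ($self, $acc) = @_;
    defined(my $off = $self->{offset}{$acc}) or return;
    seek $self->{fh}, $off, 0;
    my $in = Bio::SeqIO->new(-fh => $self->{fh}, -format => 'swiss');
    return $in->next_seq;
}

1;

Usage would be along the lines of my $idx = LazySwissIndex->new($file);  
my $seq = $idx->get_seq($some_accession); -- the index costs one scan,  
and every sequence after that is parsed only when you ask for it.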

chris
