[Bioperl-l] dealing with large files

Thu Dec 20 20:39:48 UTC 2007

On Dec 20, 2007, at 12:52 PM, Amir Karger wrote:

>> Amir Karger wrote:
>>>> It would be nice to code up a lazy sequence object and related
>>>> parsers; maybe for the next dev release.
>>>
>>> Also, BLAST parsing. Blasting the proteome against the genome makes for
>>> rather large result files.
>>
>> This has already been done. Use Bio::SearchIO::blast_pull. In a
>> situation like yours I dropped run time from 20223s to 951s (~20x
>> faster) and memory usage from over 8GB to less than 5GB (~40% less).
>
> Not in 1.5.1. Is it in 1.5.2 or just in cvs? Is there a single file I
> can put in my own perl lib for this, or does it require large bunches of
> new code? (I'm guessing the latter.) We're about to upgrade to 1.5.2
> here, but I don't see our whole center using CVS Bioperl.
>
> -Amir

It's in CVS.

Just to note: there have been a lot of changes between 1.5.1 and
1.5.2, and probably as many from 1.5.2 to now. We are cleaning up
some code introduced prior to the 1.5 release and working on other
fixes and code docs, with the final aim of a new 1.6 release; I'm
hoping that release will be followed by routine point releases for bug
fixes. Of course that'll have to wait until after the SVN migration!

There are a few discussions on the list about speeding up parsing using
lightweight/featherweight objects or even straight hashes (for
instance, Jason has a lightweight seqfeature implementation committed
on a branch which is quite fast, and Sendu has the Bio::SearchIO
PullParser implementations). My feeling is that these will be part of
the next dev release, along with GFF3 integration and code cleanup.
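
To give a rough idea of the "straight hashes" approach, here is a
minimal sketch in core Perl only. It is not BioPerl code and not what
Jason or Sendu committed; fasta_iterator is a made-up name, and FASTA is
used only because it is the simplest format to show. The point is that
each call reads just one record from the filehandle and hands it back as
a plain hash reference, so memory use stays flat however large the file
is:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): returns a closure that
# yields one FASTA record per call as a plain hash reference, and
# undef once the input is exhausted.
sub fasta_iterator {
    my ($fh) = @_;
    my $next_header;    # header already read while finishing the previous record

    return sub {
        # Locate the header of the record to return, skipping any
        # lines that precede the first '>' line.
        until (defined $next_header) {
            my $line = <$fh>;
            return unless defined $line;    # end of input
            $line =~ s/\s+\z//;
            $next_header = $1 if $line =~ /^>(.*)/;
        }

        my ($id, $desc) = split ' ', $next_header, 2;
        undef $next_header;

        # Accumulate sequence lines until the next header or end of input.
        my $seq = '';
        while (defined(my $line = <$fh>)) {
            $line =~ s/\s+\z//;
            if ($line =~ /^>(.*)/) {
                $next_header = $1;
                last;
            }
            $seq .= $line;
        }

        return { id => $id // '', desc => $desc // '', seq => $seq };
    };
}

# Usage: an in-memory filehandle stands in for a large file on disk.
my $data = ">seq1 first test\nACGT\nAC\n>seq2\nGG\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";

my $next = fasta_iterator($fh);
while (my $rec = $next->()) {
    printf "%s\t%d\n", $rec->{id}, length $rec->{seq};
}
```

The trade-off is that you get none of the methods or validation of a
real sequence object, which is why this is usually done only in the hot
loop of a parser.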

chris