[Bioperl-l] Bio::SeqIO::FTHelper

Chris Fields cjfields at uiuc.edu
Fri Mar 2 14:35:34 UTC 2007


The handler-based parsers, as they currently stand, are slightly
faster, but not enough to make a huge difference unless you're parsing
thousands of sequences.  However, the comparison does demonstrate that
a good deal of the performance cost stems from object creation rather
than parsing, which is already a known issue.  For instance, if you do
everything up to (but not including) instantiation of an object such
as a SeqFeature, Annotation, or Species, parsing speeds up
dramatically, depending on the number of objects that would otherwise
be created.  I also saw significant increases in speed when using
FTHelper (instead of SeqFeatures) or Bio::Taxon (instead of
Bio::Species), so lighter objects definitely help.

I basically just separate the two key steps, parsing and object
creation, into two distinct tasks (driver and handler).  I haven't
thought much about validation, though I would probably separate that
into a third task.  Regardless, the current drivers are flexible
enough to deal with the occasional oddity and not die.  This design is
much easier to maintain and extend: for instance, if you wanted to
develop lightweight objects, it's now easier to accomplish (i.e.
rewrite/override a handler vs. rewrite next_seq()), and you can
separately develop a faster driver via next_seq() as long as it emits
the same data structure.
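
To make the driver/handler split above concrete, here is a minimal
sketch in Python rather than Perl.  It is not the BioPerl
implementation: the record format, the chunk layout, and every name
(driver, DictHandler, ObjectHandler, Feature) are hypothetical and
chosen only for illustration.  The driver tokenizes and never builds
result objects; swapping the handler changes how heavy the results are
without touching the driver.

```python
from dataclasses import dataclass

# Made-up flat-file text for illustration; not a real GenBank/EMBL entry.
RECORDS = """\
ID   seq1
DE   example sequence
FT   gene 1..10
FT   CDS 3..9
//
ID   seq2
DE   another example
FT   gene 5..20
//
"""


def driver(text, handler):
    """Tokenize lines into chunks and hand them to the handler.

    The driver knows the file layout but builds no result objects.
    """
    handler.start_record()
    for line in text.splitlines():
        if line == "//":  # end-of-record marker
            yield handler.end_record()
            handler.start_record()
            continue
        tag, _, value = line.partition(" ")
        handler.handle({"tag": tag, "value": value.strip()})


class DictHandler:
    """Lightweight handler: stores everything as plain dicts and strings."""

    def start_record(self):
        self.rec = {"id": None, "desc": None, "features": []}

    def handle(self, chunk):
        if chunk["tag"] == "ID":
            self.rec["id"] = chunk["value"]
        elif chunk["tag"] == "DE":
            self.rec["desc"] = chunk["value"]
        elif chunk["tag"] == "FT":
            self.rec["features"].append(chunk["value"])

    def end_record(self):
        return self.rec


@dataclass
class Feature:
    kind: str
    start: int
    end: int


class ObjectHandler(DictHandler):
    """Heavier handler: overrides only feature handling to build objects."""

    def handle(self, chunk):
        if chunk["tag"] == "FT":
            kind, span = chunk["value"].split()
            start, end = span.split("..")
            self.rec["features"].append(Feature(kind, int(start), int(end)))
        else:
            super().handle(chunk)


light = list(driver(RECORDS, DictHandler()))
heavy = list(driver(RECORDS, ObjectHandler()))
```

Here the same driver produces plain strings for features with
DictHandler and Feature objects with ObjectHandler, which is the kind
of swap the paragraph above has in mind.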

Multiple parsers can also use the same handler.  I currently have  
GenBank/EMBL/SwissProt all sharing the same handler and passing all  
tests.
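
As a rough illustration of several parsers sharing one handler, the
Python sketch below has two drivers for two made-up text formats feed
the same handler.  Nothing here is BioPerl code: the formats, the
(key, value) chunk shape, and the names RecordHandler,
flatfile_driver, and keyvalue_driver are assumptions for the example.
The only contract between the drivers and the handler is the shape of
the chunks.

```python
class RecordHandler:
    """Shared handler: collects (key, value) chunks into one record."""

    def __init__(self):
        self.record = {}

    def handle(self, key, value):
        self.record[key] = value

    def finish(self):
        # Return the finished record and reset for the next one.
        done, self.record = self.record, {}
        return done


# Maps the made-up flat-file tags onto the keys the handler expects.
FLAT_TAGS = {"ID": "id", "DE": "description"}


def flatfile_driver(text, handler):
    """Driver for a made-up 'TAG   value' format; '//' ends a record."""
    for line in text.splitlines():
        if line == "//":
            yield handler.finish()
        else:
            tag, value = line.split(None, 1)
            handler.handle(FLAT_TAGS[tag], value)


def keyvalue_driver(text, handler):
    """Driver for a made-up 'key=value' format; blank lines split records."""
    for block in text.strip().split("\n\n"):
        for line in block.splitlines():
            key, value = line.split("=", 1)
            handler.handle(key, value)
        yield handler.finish()


FLAT = "ID   seq1\nDE   example sequence\n//\n"
KEYVALUE = "id=seq1\ndescription=example sequence\n"

handler = RecordHandler()
from_flat = list(flatfile_driver(FLAT, handler))
from_keyvalue = list(keyvalue_driver(KEYVALUE, handler))
```

Both drivers yield the same record for equivalent input, so tests
written against the handler's output apply to either format.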

chris

On Mar 2, 2007, at 12:08 AM, Heikki Lehvaslaiho wrote:

> This sounds great. Is the speed increase noticeable?
>
> 	-Heikki
>
>
> On Thursday 01 March 2007 17:24:03 Chris Fields wrote:
>> I do have a rough outline of what I think could be done:
>>
>> http://www.bioperl.org/wiki/Handler-based_SeqIO_parsers
>>
>> where you could switch out handlers to deal with incoming data
>> chunks.  Any suggestions there are welcome.
>>
>> I'll probably commit examples of the above in the next week or two
>> (GenBank, EMBL, Swiss parsers using the same handlers) which don't
>> use FTHelper.  So far I have all three passing tests based on
>> genbank/embl/swiss.t but they need a few more tweaks before I commit.
>>
>> chris
>>
>> On Mar 1, 2007, at 5:02 AM, Heikki Lehvaslaiho wrote:
>>> Chris,
>>>
>>> It was meant to collect code that was common to all three main
>>> databases using similar feature tables.
>>>
>>> Now might be the time to optimise the parsing speed by removing it.
>>> Do you have a plan how to do it?
>>>
>>> 	-Heikki
>>>
>>> On Tuesday 27 February 2007 22:57:40 Chris Fields wrote:
>>>> Could anyone tell me what FTHelper is used for?  From what I gather
>>>> it rolls up seqfeature data into a lightweight object but then
>>>> creates a SeqFeature::Generic anyway (at least for GenBank/EMBL/
>>>> Swiss), which seems to be a waste of memory and time.  Is there
>>>> something I'm missing (besides my sanity of course)?
>>>>
>>>> chris
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>> --
>>> ______ _/      _/_____________________________________________________
>>>       _/      _/
>>>      _/  _/  _/  Heikki Lehvaslaiho    heikki at_sanbi _ac _za
>>>     _/_/_/_/_/  Associate Professor    skype: heikki_lehvaslaiho
>>>    _/  _/  _/  SANBI, South African National Bioinformatics Institute
>>>   _/  _/  _/  University of Western Cape, South Africa
>>>      _/      Phone: +27 21 959 2096   FAX: +27 21 959 2512
>>> ___ _/_/_/_/_/________________________________________________________
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign





