[Bioperl-l] bp_bulk_load_gff.pl speed

Aaron J. Mackey amackey at pcbi.upenn.edu
Thu Jul 15 17:11:02 EDT 2004


I've benchmarked it a bit: the slowdown is happening in both of these  
lines:

   $FH{ FGROUP() }->print( join("\t",$gid,$group_class,$group_name),"\n" )
       unless $DONE{"fgroup$;$gid"}++;
   $FH{ FTYPE()  }->print( join("\t",$ftypeid,$method,$source),"\n" )
       unless $DONE{"ftype$;$ftypeid"}++;

What I need to do is break this up to see whether it's the $DONE{}
lookup that's slowing things down (a Perl problem) or the return from
the print() (because the pipe is blocked by MySQL being slow on the
insert).  This hasn't percolated up my TODO list quite yet, so I'd be
happy for someone else to chime in ...
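
Something along these lines is the kind of instrumentation I have in
mind; it's a rough, untested sketch, with made-up group ids and
/dev/null standing in for the loader's real output handle, just to show
how the two halves could be timed separately:

   #!/usr/bin/perl
   use strict;
   use warnings;
   use IO::File;
   use Time::HiRes qw(gettimeofday tv_interval);

   # Hypothetical instrumentation, not taken from the released script:
   # accumulate the time spent in the %DONE check versus the print()
   # so we can see which side is eating the clock.
   my %DONE;
   my ($t_lookup, $t_print) = (0, 0);
   my $fh = IO::File->new('/dev/null', 'w') or die "open: $!";

   for my $gid (1_000_000 .. 1_100_000) {          # synthetic group ids
       my $t0   = [gettimeofday];
       my $seen = $DONE{"fgroup$;$gid"}++;         # the hash-lookup half
       $t_lookup += tv_interval($t0);
       next if $seen;

       $t0 = [gettimeofday];
       $fh->print(join("\t", $gid, 'transcript', "grp$gid"), "\n");   # the print half
       $t_print += tv_interval($t0);
   }

   printf "lookup: %.3fs  print: %.3fs\n", $t_lookup, $t_print;

If the lookup column dominates even with /dev/null on the other end,
it's a Perl problem; if not, the time is going into the write.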

My guess is that there's something diabolical about the %DONE hash
(I need to take a look at its bucket structure) such that keys like
fgroup1010101 and fgroup1010102 are colliding.  Also, since the print()
should only happen once per feature type, %DONE is the likely suspect.
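
A quick way to eyeball that (again an untested sketch, with made-up ids
in the same "fgroup" . $; . $gid shape the loader builds) is to look at
the bucket usage a hash reports in scalar context, plus a Devel::Peek
dump of its internals:

   #!/usr/bin/perl
   use strict;
   use warnings;
   use Devel::Peek;

   # Fill a hash with keys shaped like the loader's and see how Perl
   # spreads them across buckets.
   my %DONE;
   $DONE{"fgroup$;$_"}++ for 1_000_000 .. 1_050_000;   # synthetic group ids

   # On the Perls we're running, a non-empty hash in scalar context
   # reports "used-buckets/total-buckets"; a very small first number
   # would mean nearly everything is piling into a few chains.
   print "bucket usage: ", scalar(%DONE), "\n";

   # Devel::Peek's dump of the HV shows FILL, MAX, KEYS and, on the
   # ARRAY line, a summary of how many buckets hold 0, 1, 2, ... entries.
   Dump(\%DONE);

If the used/total ratio or the chain lengths look badly skewed for the
non-transcript ids, that points at the hash; if they look normal, the
blame shifts back to the print().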

-Aaron

On Jul 15, 2004, at 4:36 PM, Scott Cain wrote:

> Dustin,
>
> Besides Aaron, a few other people have complained about this, and yes, I
> had written them off as crazy :-)
>
> Since I can't reproduce this problem, I'll have to ask you: is the
> problem that the files are not being written to /usr/tmp (or wherever)
> as quickly as before, or is it that, after the files are done being
> written, they aren't loaded into mysql as quickly?  Not that I have a
> solution to either problem, but the first is presumably a perl problem
> and the second a mysql problem.  If it were the latter (which I kind of
> doubt), you could get around it by using a real database, like
> PostgreSQL.
>
> Scott
>
>
> On Thu, 2004-07-15 at 13:45, bioperl-l-request at portal.open-bio.org
> wrote:
>>
>> I recently started using Bio::DB::GFF, beginning by using
>> bp_bulk_load_gff.pl to load a simple but large gff2 file.  This file
>> consisted only of transcripts and their subfeatures, so the group
>> class of all features was "transcript".  The files loaded with no
>> problem and I was able to write a few successful test scripts.
>>
>> Now I have added  new features (genes) to the gff file, and I
>> attempted to load the new file exactly as before with
>> bp_bulk_load_gff.pl, but now it takes _much_ longer to load, and takes
>> more time the more features are added (the first 5K features take
>> about 30 seconds, the next 5K features take nearly 2 minutes, and so
>> on).  It took over an hour to reach 50K features, at which point I
>> stopped it.
>>
>> I've played around with the gff file a bit and found that anything
>> that doesn't have a group class of "transcript" has this problem; for
>> example, if I run 'sed s/transcript/foo/g' on the original file it's
>> slow, and if I run 'sed s/gene/transcript/g' on the new file it's
>> fast.  I have
>> manually verified that the MySQL database is empty before each attempt
>> and even wiped the tmp directory before each attempt.
>>
>> Any ideas why non-transcript features take so long?
>>
>> Thanks,
>>
>> Dustin Cram
>
> --
> ------------------------------------------------------------------------
> Scott Cain, Ph. D.                                      cain at cshl.org
> GMOD Coordinator (http://www.gmod.org/)                 216-392-3087
> Cold Spring Harbor Laboratory
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
>
--
Aaron J. Mackey, Ph.D.
Dept. of Biology, Goddard 212
University of Pennsylvania       email:  amackey at pcbi.upenn.edu
415 S. University Avenue         office: 215-898-1205
Philadelphia, PA  19104-6017     fax:    215-746-6697


