[Bioperl-l] Error reporting/Validation implemented

Mingyi Liu mingyi.liu at gpc-biotech.com
Tue Mar 15 17:01:36 EST 2005


Stefan Kirov wrote:

> Mingyi,
> Few things:
> I used your parser to produce Bioperl objects based on some of the
> high level features and compared it to what I have. Your parser is
> considerably faster (about twice), but it is still hard to tell as I
> am descending further in the hierarchy with mine. At the same time I
> don't think the difference will vanish, so I will start building over
> your parser to produce bioperl objects. I am not sure exactly how I am
> going to deal with the relationships that are necessary, but I'll deal
> with it when I finish everything else.

Hi, Stefan,

Thanks for the comparison result!  That was fast!  Please let me know if
you need some help using the data structure of my parser.  I'll try to
provide some skeleton code tonight for you (or maybe in the next couple of
days since you're away anyway), taken from my code that extracts all
data (as far as I can tell) from Entrez Gene.  Although it
still does not construct objects for you, it should at least make it
easier to find the data you want for object construction, which is
definitely the toughest step of creating a bioperl parser for Entrez Gene.

BTW, I just released version 1.04 with some simple improvements, such as
an attempt (only on *NIX) to open files over 2 GB even if the perl version
used does not support it (so that the file 'All_Data' works for me
without recompiling my Perl), a 'file' option in the 'new' method, etc.  It's
more convenient to use (check "regex_parser_test.pl" in V1.04 for a
usage example), somewhat like SeqIO's usage (pass 'file' to new() and
call next_seq to get the next record).
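
For anyone unfamiliar with that calling style, here is a minimal sketch of
the pattern in Python. It is not the parser's actual Perl code: the class
name RecordReader, its 'delimiter' argument, and the assumption that every
record begins with a line starting with "Entrezgene ::= {" are illustrative
only. Just the 'file' option and the next_seq name come from the
description above.

```python
import os
import tempfile


class RecordReader:
    """Hypothetical streaming reader: pass 'file' in, call next_seq() repeatedly."""

    def __init__(self, file, delimiter="Entrezgene ::= {"):
        # The delimiter is an assumption about how records start.
        self._handle = open(file, "r")
        self._delimiter = delimiter
        self._pending = None  # first line of the next record, once seen

    def next_seq(self):
        """Return the next record as one string, or None at end of file."""
        lines = [self._pending] if self._pending is not None else []
        self._pending = None
        for line in self._handle:
            is_start = line.startswith(self._delimiter)
            if is_start and lines:
                # A new record begins: keep its first line for the next call.
                self._pending = line
                return "".join(lines)
            if is_start or lines:
                lines.append(line)
        return "".join(lines) if lines else None

    def close(self):
        self._handle.close()


if __name__ == "__main__":
    # Tiny made-up input, just to show the calling pattern.
    with tempfile.NamedTemporaryFile("w", suffix=".asn", delete=False) as tmp:
        tmp.write("Entrezgene ::= {\n  a 1\n}\nEntrezgene ::= {\n  b 2\n}\n")
    reader = RecordReader(file=tmp.name)
    count = 0
    while reader.next_seq() is not None:
        count += 1
    reader.close()
    os.remove(tmp.name)
    print(count, "records read")
```

Only one record is held in memory at a time, which is the point of this
style of interface when the input file is several GB.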

>
> By the way, it took 9 minutes on a 64-bit Xeon 3.4 GHz even with
> Bioperl object construction on the whole Homo_sapiens ASN file.

Thanks for sharing the benchmark! It's definitely faster than my Xeon
2.4 GHz.  I just ran my parser V1.04 on the file All_Data, which contains
all Entrez Gene genomes (about 7.4 GB), and it took the parser 98 minutes
to finish with no errors found.

> The data that went inside the objects was: general desc of the genes
> (symbol, name, summary, etc.), organism descr. but none of the truly
> big parts. Unfortunately, I am leaving tomorrow for a conference, so I
> will have some more next week at the earliest. Thanks for sharing the code!
> Stefan

Glad to be of help!

Best,

Mingyi


