[Bioperl-l] Entrez Gene ASN parsers

Sat Mar 12 18:50:41 EST 2005

Hi, Stefan,

Yes, both the advantage and the disadvantage of my approach are that my parsers do not take the underlying data into account.  By ignoring the data content entirely and focusing only on format, this approach ensures that no data is left behind in parsing, that the parsers could be developed very quickly, and that they perform very well.  In addition, even if NCBI changes the data content, the parsers will most likely work fine without any modifications.

However, this does result in a data structure that is not consolidated into, for example, the two-level type you'd want.  The data structure generated merely reflects however NCBI chose to structure their Entrez Gene ASN files.  Building Bioperl objects based on my parser would take some serious effort (1-2 weeks).  It is definitely doable though, and the performance of the parsing itself should not slow down much.  The benchmark I gave included not just the time for parsing and data structure construction, but also data structure trimming, which traverses almost the entire data structure and makes changes.  But the instantiation of Bioperl objects may slow the whole process down a few fold.

Regardless, I totally agree that it would be best if you did a comparison and chose the most suitable approach.

BTW, can you send me example entries for which my parser produces dead entries or 0-sized arrays?  I wonder whether the problem lies in the Entrez Gene file or in my parser, since I simply let the data structure mirror the file.  If it does not come from the file, I would want to check whether it is a bug.  I did process the full human genome into XML files and did not see any empty elements or attributes, and the parser runs on the entire mouse and rat genomes without problems, which is expected.
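
Something along these lines could help locate such entries in a parsed record.  This is only a Python sketch and not the Perl parser itself: the function name find_empty and the sample record layout are made up for illustration, and "dead entry" is assumed here to mean an undefined value or an empty hash/array.

```python
def find_empty(node, path="root"):
    """Yield the path of every None value or empty dict/list under node."""
    if node is None or (isinstance(node, (dict, list)) and len(node) == 0):
        yield path
    elif isinstance(node, dict):
        for key, value in node.items():
            yield from find_empty(value, f"{path}.{key}")
    elif isinstance(node, list):
        for index, value in enumerate(node):
            yield from find_empty(value, f"{path}[{index}]")


# Hypothetical parsed record; a real Entrez Gene structure is far larger.
record = {
    "geneid": 1,
    "comments": [],
    "locus": [{"label": "chr19", "seqs": []}, {"label": None}],
}

empty_paths = list(find_empty(record))
for p in empty_paths:
    print(p)
```

Running the walk on the offending entries and reporting the paths would show whether the empty values correspond to empty blocks in the source file or are introduced during parsing.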

Thanks,

Mingyi

> -----Original Message-----
> From: Stefan Kirov [mailto:skirov at utk.edu]
> Sent: Saturday, March 12, 2005 5:59 PM
> To: Liu, Mingyi
> Cc: bioperl-l at portal.open-bio.org
> Subject: Re: [Bioperl-l] Entrez Gene ASN parsers
> 
> 
> Mingyi,
> I looked at the code (EntrezGene) and so far it seems to me it gives,
> as you claim, a pretty accurate and easy to understand data structure
> (few dead entries and some 0 size arrays, but nothing major).
> The only concern I have is the data structure. If you want to
> achieve a better structure (non-redundant, two level where possible
> or a collection of Bioperl objects) this will slow things down. I
> guess I will compare how the code I wrote compares to yours and
> choose the faster one. I think this makes sense.
> Stefan
>