[Bioperl-l] Entrez Gene ASN parsers

Sat Mar 12 19:33:24 EST 2005

I kind of like this approach, i.e., have a general-purpose low-level 
parser that you are reasonably confident will never be the bottleneck, 
and then build a bioperl parser on top of it that can focus its code on 
assembling the desired data structure rather than on the file format 
itself.
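
To make the layering concrete, here is a minimal sketch. It is written in 
Python purely for illustration and is not the code of either parser 
discussed in this thread. The brace-delimited input is a toy stand-in and 
not real Entrez Gene ASN.1, and the names parse_generic, build_gene and 
Gene, as well as the fields locus, desc and syn, are all hypothetical.

```python
import re
from dataclasses import dataclass, field

# Tokens: quoted strings, structural characters, or bare words.
TOKEN = re.compile(r'"[^"]*"|[{},]|[^\s{},"]+')


def parse_generic(text):
    """Low-level layer: mirror the nesting of the input, ignoring its meaning."""
    tokens = TOKEN.findall(text)
    pos = 0

    def parse_value():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "{":
            return tok.strip('"')
        items = []
        while tokens[pos] != "}":
            name = None
            # A bare word followed by a value (not ',' or '}') is a field name.
            if tokens[pos][0] not in '{"' and tokens[pos + 1] not in (",", "}"):
                name = tokens[pos]
                pos += 1
            items.append((name, parse_value()))
            if tokens[pos] == ",":
                pos += 1
        pos += 1  # consume the closing '}'
        # Fully named blocks become dicts; anything else becomes a list.
        if items and all(n is not None for n, _ in items):
            return dict(items)
        return [value for _, value in items]

    top_name = tokens[pos]
    pos += 1
    return {top_name: parse_value()}


@dataclass
class Gene:
    """Hypothetical stand-in for an object built by the top-level parser."""
    symbol: str
    description: str = ""
    synonyms: list = field(default_factory=list)


def build_gene(tree):
    """Top layer: knows the field names, but nothing about the file syntax."""
    rec = tree["gene"]
    return Gene(rec["locus"], rec.get("desc", ""), list(rec.get("syn", [])))


# Usage with a made-up record:
record = 'gene { locus "ABC1", desc "made-up example", syn { "XYZ", "QRS" } }'
gene = build_gene(parse_generic(record))
print(gene)
```

In this sketch only build_gene would need to change if the desired object 
model changes, and only parse_generic if the file syntax changes.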

And of course assembling that data structure will slow things down a 
lot, but hey, either you want an object hierarchy in (bio-)perl or you 
don't.

Also, given this thread and previous ones, that ominous bioperl data 
structure may be very fluid initially, or may even result in different 
top-level parsers, depending on how compatible the different visions 
are for what to get out of that parser.

	-hilmar

On Saturday, March 12, 2005, at 03:50  PM, Liu, Mingyi wrote:

> Hi, Stefan,
>
> Yes, both the advantage and the disadvantage of my approach is that 
> my parsers do not take the underlying data into account.  By totally 
> ignoring the data content and focusing just on format, this approach 
> ensures that no data will be left behind in parsing, that development 
> of the parsers is very fast, and that the parsers perform very well.  
> In addition, even if NCBI changes the data content, the parser will 
> most likely work just fine without any modifications.
>
> However, this does result in a data structure that is not consolidated 
> into, for example, the two-level type you'd want.  The data structure 
> generated merely reflects however NCBI chose to structure their Entrez 
> Gene ASN files.  Building Bioperl objects based on my parser would 
> take some serious effort (1-2 weeks).  It is definitely doable 
> though, and the performance should not slow down much.  The benchmark 
> I gave included not just the time for parsing and data structure 
> construction, but also data structure trimming, which traverses almost 
> the entire data structure and makes changes.  But the instantiation of 
> Bioperl objects may slow the whole process down a few fold.
>
> Regardless, I totally agree that it would be best if you did a 
> comparison and chose the most suitable approach.
>
> BTW, can you send me example entries for which there are dead entries 
> or 0-sized arrays in my parser's output?  I wonder if it's a problem 
> with the Entrez Gene file or with my parser, since I simply let the 
> data structure mirror the file.  If the file is not the cause, then I 
> would want to check whether it's a bug.  I did process the full human 
> genome into XML files and did not see any empty elements or 
> attributes, and the parser runs on the entire mouse and rat genomes 
> without problems, as expected.
>
> Thanks,
>
> Mingyi
>
>> -----Original Message-----
>> From: Stefan Kirov [mailto:skirov at utk.edu]
>> Sent: Saturday, March 12, 2005 5:59 PM
>> To: Liu, Mingyi
>> Cc: bioperl-l at portal.open-bio.org
>> Subject: Re: [Bioperl-l] Entrez Gene ASN parsers
>>
>>
>> Mingyi,
>> I looked at the code (EntrezGene) and so far it seems to me that it
>> gives, as you claim, a pretty accurate and easy-to-understand data
>> structure (a few dead entries and some 0-size arrays, but nothing
>> major).
>> The only concern I have is the data structure. If you want to
>> achieve a better structure (non-redundant, two-level where possible,
>> or a collection of Bioperl objects) this will slow things down. I
>> guess I will compare the code I wrote with yours and choose the
>> faster one. I think this makes sense.
>> Stefan
>>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------