[Bioperl-l] bioperl-db improvements

Elia Stupka elia@ebi.ac.uk
Sun, 22 Apr 2001 12:17:09 +0100 (BST)


Dear all,

I have been doing some work on bioperl-db. I've been testing it with
large EMBL datasets, and noticed a few fields that were not being stored,
plus some little bugs. I am glad to say that in my test cases the EMBL
entry is now read into the MySQL database and comes out almost identical.
There is no loss of fields; the "almost" applies to some minor formatting
differences and some fuzzy locations.

This is a summary of what has been added:

ID line: was missing molecule and division, now in
SV line: was not working, now goes in/out
DT lines: dates were not stored, now go in/out
References: were not stored, now go in/out

References are unique on the basis of Medline ID, so if one reference is
shared by many bioentries, it is stored only once.
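
To illustrate the idea of storing a shared reference only once, here is a
minimal sketch. It is not the bioperl-db code: it uses Python with SQLite
rather than Perl with MySQL, and the table names, column names and the
store_reference function are hypothetical, chosen only for this example. It
assumes every reference carries a Medline ID; references without one would
need separate handling.

```python
import sqlite3


def create_schema(conn):
    """Create two illustrative tables (names are hypothetical, not bioperl-db's)."""
    conn.executescript("""
        CREATE TABLE reference (
            reference_id INTEGER PRIMARY KEY,
            medline_id   INTEGER UNIQUE,   -- uniqueness key for references
            title        TEXT
        );
        CREATE TABLE bioentry_reference (
            bioentry_id  INTEGER,
            reference_id INTEGER,
            PRIMARY KEY (bioentry_id, reference_id)
        );
    """)


def store_reference(conn, bioentry_id, medline_id, title):
    """Link a bioentry to a reference, inserting the reference only if it is new."""
    row = conn.execute(
        "SELECT reference_id FROM reference WHERE medline_id = ?",
        (medline_id,),
    ).fetchone()
    if row is None:
        # First time this Medline ID is seen: store the reference itself.
        cur = conn.execute(
            "INSERT INTO reference (medline_id, title) VALUES (?, ?)",
            (medline_id, title),
        )
        reference_id = cur.lastrowid
    else:
        # Already stored for another bioentry: reuse the existing row.
        reference_id = row[0]
    # Record the association; a repeated link for the same pair is ignored.
    conn.execute(
        "INSERT OR IGNORE INTO bioentry_reference (bioentry_id, reference_id) "
        "VALUES (?, ?)",
        (bioentry_id, reference_id),
    )
    return reference_id
```

Calling store_reference for two different bioentries with the same Medline ID
returns the same reference_id both times, leaving one row in the reference
table and two rows in the link table.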

I am now testing with the whole human section of EMBL, which should
definitely expose some more bugs.

Obviously, as we all know, EMBL data is not perfect and the parser cannot
cope with absolutely every madness, so occasionally a location/feature
might be dropped, but as far as I can see it's extremely robust,
considering the number of entries I am parsing in.

Elia Stupka - EnsEMBL

**************************
tel:    +44 1223 49 44 31
mobile: +44 7971 59 03 69
fax:    +44 1223 49 44 68
**************************