[Bioperl-l] genpept/swiss

Hilmar Lapp hlapp@gmx.net
Mon, 04 Sep 2000 03:35:14 +0200


Jason recently reported problems with a certain sequence record fetched
from Entrez using Bio::DB::GenPept (see bugs 838 and 839 in fixed-bugs).

I've fixed both. First, qualifiers not satisfying the /tag=value syntax
are now reported through warn() instead of throw() in the genbank parser.
Some of you may object to this, but I'm tired of being thrown out of
loops over chunks of entries because of a single formatting error. This
again raises the issue of a user-enabled switch that would make such
cases throw() again. As I'm not sure about the (error) notification
levels available in the Bio/Root classes, could someone who does know
comment on this? That is, is it possible to set the reporting level such
that warn() actually becomes equivalent to throw()?
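
To make the question concrete, here is a minimal sketch of the kind of
switch I mean. It is written in Python purely for illustration and is
not BioPerl code: parse_qualifiers(), its strict argument and
QualifierFormatError are hypothetical names, and I make no claim about
how the Bio/Root classes handle this.

```python
import warnings


class QualifierFormatError(ValueError):
    """Raised in strict mode for a qualifier that is not /tag=value."""


def parse_qualifiers(lines, strict=False):
    """Parse feature qualifier lines of the form /tag=value.

    Hypothetical helper, not part of BioPerl. With strict=False a
    malformed line is reported through warnings.warn() and skipped, so
    a loop over many entries keeps going. With strict=True the same
    line raises, which corresponds to the old throw() behaviour.
    """
    qualifiers = []
    for line in lines:
        text = line.strip()
        if text.startswith("/") and "=" in text:
            tag, value = text[1:].split("=", 1)
            qualifiers.append((tag, value.strip('"')))
            continue
        message = "qualifier does not match /tag=value: %r" % text
        if strict:
            raise QualifierFormatError(message)
        warnings.warn(message)
    return qualifiers
```

In Python the caller can also get the strict behaviour without any
extra argument, because warnings.simplefilter("error") turns every
warning into an exception. That is the equivalent of the reporting
level I am asking about.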

For the second fix I replaced primary_id() with display_id() in the
swiss.pm ID-line generation. This should be the safest alternative since
1) the swiss.pm seq-parser sets display_id() itself, and 2) _every_
sequence object is supposed to have a display_id(), as far as I
understand.

The whole subject was raised by entry O18919, which illustrates a general
problem one should be aware of when interconverting between rich sequence
formats. The entry fetched from GenPept and written out in SwissProt
format differs in several notable ways from the entry one can obtain
directly from SwissProt (www.expasy.ch). E.g., the feature table loses
almost all of its information content:

     Site            270
                     /site_type="active"
                     /note="BY SIMILARITY."
     Site            272
                     /site_type="metal-binding"
                     /note="MAGNESIUM (POTENTIAL)."

in GenPept becomes, when written out in SwissProt format,

FT   Site        270    270
FT   Site        272    272

instead of

FT   ACT_SITE    269    269       BY SIMILARITY.
FT   METAL       271    271       MAGNESIUM (POTENTIAL).

in the SwissProt original. (In addition you may notice that the
coordinates are offset by 1, which is due to the methionine being
omitted in SwissProt.)
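
To show what a format-aware conversion of just these two features would
take, here is a minimal sketch, again in Python and again not BioPerl
code. The mapping table contains only the two site types from the
example above; site_to_swiss_ft() and its offset argument are
hypothetical, and the column layout is simply copied from the FT lines
quoted above.

```python
# Mapping taken from the two features of entry O18919 quoted above.
# Any other site_type is not covered and falls back to the generic key.
SITE_TYPE_TO_SWISS_KEY = {
    "active": "ACT_SITE",
    "metal-binding": "METAL",
}


def site_to_swiss_ft(position, qualifiers, offset=0):
    """Render one GenPept 'Site' feature as a SwissProt FT line.

    Hypothetical helper. `qualifiers` is a dict of tag -> value, and
    `offset` is added to the position, e.g. -1 when the target
    sequence lacks the leading methionine.
    """
    key = SITE_TYPE_TO_SWISS_KEY.get(qualifiers.get("site_type"), "Site")
    pos = position + offset
    description = qualifiers.get("note", "")
    line = "FT   %-8s %6d %6d       %s" % (key, pos, pos, description)
    # Drop trailing blanks when there is no description.
    return line.rstrip()


print(site_to_swiss_ft(270, {"site_type": "active",
                             "note": "BY SIMILARITY."}, offset=-1))
print(site_to_swiss_ft(272, {}))
```

The first call reproduces the ACT_SITE line of the SwissProt original;
the second, with no qualifiers, gives the bare Site line. Every further
feature key would need its own hand-written rule of this kind.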

There are other differences, some of which can be fixed (like SwissProt
now using CRC64 instead of CRC32), while others probably cannot (like
comments getting garbled).

The point I'd like to make may be best illustrated by comparison with
the automated language translators that are around (like babelfish;
babelfish.altavista.com). Try translating an even slightly complicated
sentence from one language into another, which already garbles it
half-way, and then translate the result into a third. I think it is
pointless for BioPerl to aim at clean and complete conversion from any
rich sequence format into another rich sequence format.

The only way this could be achieved with reasonable effort is by mapping
each format to a common meta-representation, like XML or ASN.1 (and
anything the meta-format doesn't cover will still be lost).

So, you should be aware of this whenever you convert between sequence
formats using BioPerl.

If people disagree, please post.

	Hilmar

-- 
-----------------------------------------------------------------
Hilmar Lapp                                email: hlapp@gmx.net
NFI Vienna, IFD/Bioinformatics             phone: +43 1 86634 631
A-1235 Vienna                                fax: +43 1 86634 727
-----------------------------------------------------------------