[Bioperl-l] genpept/swiss

Andrew Dalke dalke@acm.org
Mon, 4 Sep 2000 04:20:08 -0600


 Hilmar Lapp <hlapp@gmx.net>:
>I'm not at all certain, and that's why it is still reported as something
>you can turn into an exception programmatically.

A-ha!  I think I see the mapping between our viewpoints.  I say
it should be easy to build new parsers, so as to handle variations and
wrongly formatted records.  You say you want a single parser which
handles the different cases, likely controlled by some state or global
variable.  The final result is the same - a way to get a parser for
your data that has the right treatment of errors.
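
To make the single-parser side of that concrete, here is a minimal
sketch in Python. Everything in it is made up for illustration
(parse_record, FormatError, the on_error switch and the two-letter
"TAG   value" line layout); it is not how BioPerl or Martel do it.
The caller picks the error treatment per call rather than through a
global.

```python
class FormatError(Exception):
    """Raised when a record line does not match the expected syntax."""


def parse_record(lines, on_error="raise"):
    """Parse 'TAG   value' lines into a dict of tag -> list of values.

    on_error: "raise" aborts on the first malformed line,
              "collect" skips it and reports it to the caller.
    (Hypothetical interface, for illustration only.)
    """
    fields = {}
    problems = []
    for lineno, line in enumerate(lines, 1):
        tag, sep, value = line.partition("   ")
        # Assumed syntax: a two-letter upper-case tag, three spaces, a value.
        if not sep or len(tag) != 2 or not tag.isupper():
            message = "line %d: malformed line %r" % (lineno, line)
            if on_error == "raise":
                raise FormatError(message)
            problems.append(message)
            continue
        fields.setdefault(tag, []).append(value.strip())
    return fields, problems
```

Calling parse_record(lines, on_error="collect") gives the tolerant
reader; the default gives the strict one. The other route, building a
new parser per variation, would replace the switch with a different
syntax definition.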

>The main objective is I think to be
>able to read sequence entries produced by someone you believe 'does it
>right'.

There's also the famous advice to write tolerant readers and strict
writers.

>Most people will not bother about possibly missing important information
>for one of 1e5 sequences because that one has a misformatted tag.

Do you happen to know the error rate?  I know that something like 20%
of the PDB files, at least a couple of years ago, were not in the PDB
format.  When I worked with Prosite, around the same time, there were
two records with patterns not in the right format (they used lower case
characters.)  EMBL, again, two years ago, contained several improperly
formatted database cross-reference identifiers.  In SWISS-PROT 38, one
record was not in the right format, although I've not coded in all of
the syntax details yet.

There are also semantic errors.  For example, I just started working on
a PIR parser (I've only tested pir3).  One of the dates is from 1299
or so.

About 4 years ago there was a statement in Nature by Hooft et al.
(see http://www.cmbi.kun.nl/gv/articles/ref5.html) pointing out that
the PDB contained over a million errors and outliers.  Would
such an analysis and report of the error rates in existing sequence
databases be worthwhile?

I can see at least four levels of analysis:
  0. is there a format document? (more of a prerequisite, hence "0")
  1. how many records are in the correct syntax/is the format spec complete?
  2. what sort of internal errors are there?  That is, what errors can
       be identified just by looking at the record, like improper ranges
       or a publication date around the time Cambridge was founded.
  3. cross reference errors - inconsistencies between two different
       databases, as Hilmar pointed out with GenPept.

0. is pretty easy to check
1. I've yet to find a format document which properly described all of
  the records, so you end up having to decide what is right and wrong
  yourself.  With Martel it takes about two good days to get the syntax
  down and corrected for all the "undocumented but correct" cases.  I'm
  not being really precise with the syntax, so it would probably take
  another day or two to really pin it down.  So each format takes about
  a week to validate.  There's about a dozen major formats, so under
  3 months.  (Though I've got a couple formats already worked out, and
  the family of swissprot-like formats should be easy to support.)
2. This doesn't feel as complicated as #1, although that's probably
  because I haven't tried to do it yet :)  Let's say 2 months for this,
  and we're at 5 months.  It should be simplified by using consistent
  data structures for the different databases, if appropriate, so the
  tests don't need to be modified for each one.
3. Could go on forever, but a major part of the work (parsing the other
  databases) would already have been done.  Call it 2 more months.
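
As a rough idea of what the level 2 checks could look like once the
different databases share a data structure, here is a small Python
sketch. The Record fields, the function name and the year bounds are
all assumptions of mine, not taken from any existing parser; the
point is only that one set of tests can run unchanged over records
from any source.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Record:
    """Hypothetical database-neutral view of one entry."""
    identifier: str
    sequence_length: int
    year: int
    feature_ranges: List[Tuple[int, int]] = field(default_factory=list)


def internal_errors(record, earliest_year=1965, latest_year=2000):
    """Return problems that are visible from the record alone.

    The year bounds are placeholder assumptions; pick whatever is
    sensible for the database being checked.
    """
    problems = []
    if not earliest_year <= record.year <= latest_year:
        problems.append("implausible year %d" % record.year)
    for start, end in record.feature_ranges:
        # Ranges are taken to be 1-based and inclusive.
        if start < 1 or end > record.sequence_length or start > end:
            problems.append("improper range %d..%d" % (start, end))
    return problems
```

A record dated 1299 or with a feature running past the end of the
sequence gets reported; a clean record returns an empty list.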

So about 7 months of work, with around a factor of 2 on the error bars, but
with natural partitionings to limit the work which needs to be done.
(Fewer formats, less depth.)
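
For anyone who wants to redo the arithmetic with their own numbers,
it is spelled out below in Python. The variable names are mine, and
reading "factor of 2" as "between half and double" is an assumption.

```python
WEEKS_PER_MONTH = 52 / 12   # average number of weeks in a month

formats = 12                # "about a dozen major formats"
weeks_per_format = 1        # "each format takes about a week"

# Level 1: syntax validation across all formats, in months.
syntax_months = formats * weeks_per_format / WEEKS_PER_MONTH

# Rounded per-level figures from the estimate: levels 1, 2 and 3.
budget_months = 3 + 2 + 2

# Assumed meaning of "factor of 2 on the error bars".
low, high = budget_months / 2, budget_months * 2
```

Here syntax_months comes out just under 3, and the range runs from
3.5 to 14 months around the 7 month figure.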

BTW, since the flat files are machine generated, you wouldn't think
there would be all that many problems, would you?  You would also think
the major database providers could test their outputs against something
like bioperl to see if there are any problems.

>In turn, together with what you said, this means that (most of?) the
>BioPerl parsers in their current state are not suitable if you want your
>program to cover any single piece of information covered in the database
>(e.g. if you wished to convert GenBank into a relational format).
>
>Am I missing something?

I didn't follow what you said, so I don't know.  Part of the problem
may be that I don't know much about how complete the bioperl parsers are.

>as may be the abuse of structured comments.

Just about any use of "structured comments" is an abuse, IMO. :(

>With lots of additional code it may indeed be possible, but I think it's
>beyond the scope of BioPerl, and I'm not sure that there is a big enough
>need in the user community that the scope of BioPerl be extended that
>way. At least, there are so many things that deserve attention and code
>before.

On the other hand, if you don't know the error rate, how do you know
the science you're doing is right?  And it's hard to judge the scope of
BioPerl since if one person wants to make the effort, that work can be
part of BioPerl.  But you are right, 1/2 year of development time can
do a lot, especially if like me you can only work in your free time.

                    Andrew
                    dalke@acm.org