[Biopython-dev] Unigene parsers

Wed Dec 13 03:10:08 EST 2000

  To write a UniGene parser, several issues need to be resolved.  The
UniGene page is structured with major keys and subkeys.  Each major key is
on a line by itself and is in all caps, but several subkeys can be placed on
a single line.  Each subkey is separated from its value by a colon.

  One problem is that the records vary in which keys they contain.  I ran
into this with Gobase.  It required calls to routines with tests like

        start = text.find(field)    # -1 means the field is absent
        if start == -1:
            return ''

  Calling such a routine for every possible key, when many of the lookups
come back empty, could waste a lot of CPU time.

  Would it be cleaner to read the major keys into a temporary dictionary,
then consume the ones that are present and check that all the necessary keys
are there?
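
  Something like the following is what I have in mind.  It is only a rough
sketch: the sample record is made up rather than real UniGene output, the
names read_major_keys and missing_keys are hypothetical, and it assumes a
major key is any line with no lowercase letters and no colon.

```python
def read_major_keys(text):
    """Group the lines of a record under the all-caps key line above them."""
    sections = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        # Assumption: a major key sits alone on its line, in all caps,
        # and contains no colon (colons mark subkey/value pairs).
        if stripped.isupper() and ":" not in stripped:
            current = stripped
            sections[current] = []
        elif current is not None:
            sections[current].append(stripped)
    return sections


def missing_keys(sections, required):
    """Return the required major keys that the record does not contain."""
    return [key for key in required if key not in sections]


# Hypothetical record text, for illustration only.
sample = """\
TITLE
Example gene cluster
EXPRESSION
Tissue: liver
"""

sections = read_major_keys(sample)
print(missing_keys(sections, ["TITLE", "SEQUENCES"]))
```

  The record is scanned once, and after that each lookup is a dictionary
test instead of another search through the text.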

  A second problem is that since there can be several subkeys on a line,
with only white space separating a value from the next key, multiword
keys or values can be ambiguous.  You can make guesses but there's no
guaranteed way to disambiguate the subkey/value pairs.
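
  To make the ambiguity concrete, here is a small sketch of one possible
guess: treat the single word in front of each colon as the key.  The sample
line and the name split_subkeys are hypothetical, and the one-word-key rule
is an assumption, not something the page format guarantees.

```python
import re

# Heuristic (an assumption): a key is one run of non-space characters
# immediately followed by a colon.
KEY = re.compile(r"(\S+):\s*")


def split_subkeys(line):
    """Split a line into (key, value) pairs using the one-word-key guess."""
    matches = list(KEY.finditer(line))
    pairs = []
    for i, match in enumerate(matches):
        # A value runs from the end of its key to the start of the next key.
        if i + 1 < len(matches):
            end = matches[i + 1].start()
        else:
            end = len(line)
        pairs.append((match.group(1), line[match.end():end].strip()))
    return pairs


# Hypothetical line with a two-word key and a two-word value.
line = "Gene name: ABC1 Tissue: fetal liver"
print(split_subkeys(line))
```

  With this guess the two-word value "fetal liver" survives, but the key
"Gene name" is cut down to "name" and the word "Gene" is silently dropped.
Allowing multiword keys instead would make the split between "ABC1" and
"Tissue" a guess.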

  A third issue is that the record only displays the first ten sequences of
the cluster.  How do we deal with information that is spread over several
web pages?

                                                Cayte