[Biopython-dev] ANN: mindy-0.1

Mon Mar 26 03:16:50 EST 2001

Thomas:
>Hmm. I don't think I understand what you are actually storing -
>how is the indexing done ? Are you preparsing all entries
>during the indexing part, or are you storing the positions
>of the entries via seek and get ? 

RTS,L?  :)

I'm parsing all entries through Martel.  Record boundaries
have tag events (eg, 'beginElement("swissprot38_record", ...)'),
which can be used to tell where the record is located - so
long as the characters() are also counted.  This is used
to store start/end positions.  I do not save the text in
the database, although there's no reason not to do so, other
than the space duplication.  (There's no option to compress
the BSDDB.)
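
As a rough illustration of the offset bookkeeping, here is a
minimal sketch.  It is not the mindy code: RecordOffsetHandler
is a made-up name, it uses the standard SAX method names
(startElement/endElement), the events are fed in by hand rather
than by Martel, and it assumes one byte per character so that
character counts equal file offsets.

```python
from xml.sax.handler import ContentHandler


class RecordOffsetHandler(ContentHandler):
    """Collect (start, end) offsets for each record element."""

    def __init__(self, record_tag):
        super().__init__()
        self.record_tag = record_tag
        self.position = 0   # number of characters seen so far
        self.start = None   # offset where the open record began
        self.offsets = []   # one (start, end) pair per record

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self.start = self.position

    def characters(self, content):
        # All input text passes through here, so the running total
        # is the current offset into the file.
        self.position += len(content)

    def endElement(self, name):
        if name == self.record_tag:
            self.offsets.append((self.start, self.position))
            self.start = None


# Simulate the events a parser would send for two small records.
handler = RecordOffsetHandler("swissprot38_record")
for text in ("ID   A_HUMAN\n//\n", "ID   B_MOUSE\n//\n"):
    handler.startElement("swissprot38_record", {})
    handler.characters(text)
    handler.endElement("swissprot38_record")

print(handler.offsets)
```

Text that falls outside any record is still counted, which is
what keeps the offsets right.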

Inside of a record I look for text contained in other elements.
For example, text inside of 'entry_name' elements is used
for the primary key, and 'acession_number' is used to get a
list of aliases.  These are used to make lookup tables to get
back to the offsets, which are used to read the record from disk.
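
The lookup path might be sketched like this.  Plain dicts stand
in for the BSDDB tables, and the function names, entry names
and accession numbers are all made up for the example.

```python
import os
import tempfile

# Hypothetical in-memory stand-ins for the on-disk lookup tables.
offsets_by_id = {}   # entry name -> (start, end)
id_by_alias = {}     # accession number -> entry name


def add_record(entry_name, accessions, start, end):
    offsets_by_id[entry_name] = (start, end)
    for acc in accessions:
        id_by_alias[acc] = entry_name


def fetch(path, key):
    """Return the raw record for a primary key or an alias."""
    entry_name = id_by_alias.get(key, key)
    start, end = offsets_by_id[entry_name]
    with open(path, "rb") as infile:
        infile.seek(start)
        return infile.read(end - start)


# Made-up sample records: (entry name, accessions, raw text).
records = [
    ("A_HUMAN", ["P00001"], b"ID   A_HUMAN\nAC   P00001;\n//\n"),
    ("B_MOUSE", ["P00002", "Q00002"],
     b"ID   B_MOUSE\nAC   P00002; Q00002;\n//\n"),
]

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    position = 0
    for entry_name, accessions, text in records:
        tmp.write(text)
        add_record(entry_name, accessions, position, position + len(text))
        position += len(text)
    path = tmp.name

record_by_alias = fetch(path, "Q00002")
record_by_id = fetch(path, "B_MOUSE")
os.remove(path)
```

Only the offsets are stored; the record text itself is read
back from the original file.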

>(for a simple position indexing tool ala TIGR's yank see
> getgene.py in biopython) 

I hadn't realized that code was there.  Its ability to
index is at some level the same, but there are quite
a few differences.  The biggest is that mindy is based on
Martel, so potentially anything expressed in a Martel
grammar can be indexed.  getgene is hard coded to only
work with SWISS-PROT.  That's actually a very important
difference because standardizing the lowest level
parsing (identification of interesting regions) makes
everything else much easier.

There are a lot of other differences between the two
approaches.  For example, I put the ID and AC fields
in different effective namespaces just in case there
are AC and ID fields which are identical but apply to
different records.  This isn't a problem in SWISS-PROT,
but I remember a few years ago I did some tests on
GenBank and there were a few dozen records repeated in
those fields.  Even what I did is incomplete for the
case of a record with multiple aliases which are from
different naming schemes when someone wants to know
XYZ's name for a record as compared to ABC's.
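
One way to picture the separate namespaces is to key the table
on (namespace, identifier) pairs.  This is only a sketch with
made-up names and identifiers, not the actual table layout.

```python
from collections import defaultdict

# (namespace, identifier) -> set of primary keys
index = defaultdict(set)


def add(namespace, identifier, primary_key):
    index[(namespace, identifier)].add(primary_key)


def lookup(namespace, identifier):
    # .get() avoids creating empty entries for unknown keys.
    return sorted(index.get((namespace, identifier), ()))


add("ID", "ABC1", "rec1")
add("AC", "ABC1", "rec2")   # same string, different namespace
add("AC", "X999", "rec1")
add("AC", "X999", "rec3")   # one accession repeated in two records

print(lookup("ID", "ABC1"))
print(lookup("AC", "ABC1"))
print(lookup("AC", "X999"))
```

Keeping a set of primary keys per entry lets a repeated
identifier return every matching record instead of silently
overwriting the earlier one.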

> (that would also answer the alias question)

I'm not sure what that question is.  Also, it looks
like that code only reads a single accession number,
specifically, the first number on the last AC line
of a record.

            elif line[:3] == 'AC ':
                acc = string.split(line)[1]
                if acc[-1] ==';': acc = acc[:-1]

There can be multiple accession numbers.
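
A sketch of collecting all of them, assuming the usual
SWISS-PROT layout where AC lines hold ';'-terminated
accession numbers and a record may have more than one AC
line.  The function name and the sample record are made up.

```python
def accession_numbers(record_lines):
    """Collect every accession number from all AC lines of a record."""
    accessions = []
    for line in record_lines:
        if line.startswith("AC "):
            # Drop the two-letter line code, then split on ';'.
            for field in line[2:].split(";"):
                field = field.strip()
                if field:
                    accessions.append(field)
    return accessions


sample = [
    "ID   EXAMPLE_HUMAN\n",
    "AC   P00001; Q00002;\n",
    "AC   Q00003;\n",
    "DE   Made-up record for illustration.\n",
]
result = accession_numbers(sample)
print(result)
```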

>> >> Would working with compressed files be useful?
>Always !!! - Does anybody know how to seek/tell in a gzipped file ?

That would depend on how the file is laid out, and I don't
know enough about the details of gzip'ed files.  As an
example, I know that after some number of characters the
compression table is reset, partially in case there is
a skew in the distribution of character frequencies in the
input stream.  If the number of characters is based on
the output size rather than input, then it should be
possible to jump to the next block and see if it's too
far or not.

All theoretical.  Real life may vary, and I bet it does.

>from gzip import open ???

Right.  Is there something for bzip2?
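
For what it's worth, a minimal sketch assuming a present-day
Python standard library: both gzip.open and bz2.open return
file objects whose seek() and tell() work on offsets into the
uncompressed data.  The seeking is emulated by decompressing,
so it can be slow for large files.  The file names and data
are made up.

```python
import bz2
import gzip
import os
import tempfile

data = b"".join(b"record %d\n" % i for i in range(1000))
offset = data.index(b"record 500\n")

workdir = tempfile.mkdtemp()
results = {}
for name, opener in (("gz", gzip.open), ("bz2", bz2.open)):
    path = os.path.join(workdir, "records." + name)
    with opener(path, "wb") as out:
        out.write(data)
    with opener(path, "rb") as infile:
        infile.seek(offset)   # offset into the uncompressed data
        line = infile.readline()
        results[name] = (line, infile.tell())
    os.remove(path)
os.rmdir(workdir)
```

That means offsets computed on the uncompressed text can be
reused as-is, at the cost of decompressing up to the record.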

>> Brad:
>> >Another addition which I think would be nice is storing the size of
>> >the indexed files.
>> >This would allow you to potentially skip an
>> >indexing when index is called on a file. 

>Uhuh ... I don't think so, especially not if just accession
>numbers or ID's are changed ...
>Better to use checksum's or the
>indexed accession numbers/id's (best solution, but takes more time)

The requested functionality is a way to detect quickly if a
file has changed and not do an update if it hasn't changed.
There are lots of ways to do it:
  - file size
  - modify timestamp
  - some hash value of the whole file

File size isn't perfect, as Thomas pointed out.  The timestamp
isn't perfect because a file can be copied without any change
to its contents and still end up with a new timestamp.  A hash
value of the full file calls for reading the full file.

Basically, there's no perfect solution so it's going to
be a tradeoff.
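
To make the tradeoff concrete, here is a sketch that gathers
all three.  The function name is made up and SHA-256 is just
one possible choice of hash.

```python
import hashlib
import os
import tempfile


def fingerprint(path, use_hash=False):
    """Return cheap change indicators, plus a full hash on request."""
    info = os.stat(path)
    result = {"size": info.st_size, "mtime": info.st_mtime}
    if use_hash:
        # Costs a full read of the file.
        digest = hashlib.sha256()
        with open(path, "rb") as infile:
            for chunk in iter(lambda: infile.read(1 << 20), b""):
                digest.update(chunk)
        result["sha256"] = digest.hexdigest()
    return result


# An accession number changes but the file size does not.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"AC   P00001;\n")
    path = tmp.name
before = fingerprint(path, use_hash=True)
with open(path, "wb") as out:
    out.write(b"AC   P00002;\n")
after = fingerprint(path, use_hash=True)
os.remove(path)

print(before["size"] == after["size"])
print(before["sha256"] == after["sha256"])
```

The size check misses this edit and the hash catches it.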

The other solution is to push the decision of when to
update to a different program, which is responsible for
telling if a file has changed and for calling the updater.
I prefer this one because that's my usual solution for
trade-off problems - let something else figure out what
to do.

There's nothing to say this controller program
couldn't also use bsddb.
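
A sketch of what such a controller could look like.  It uses
dbm.dumb from the standard library as a stand-in for bsddb,
checks only the file size, and update_if_changed and the
reindex callback are made-up names.

```python
import dbm.dumb
import os
import tempfile


def update_if_changed(state, path, reindex):
    """Call reindex(path) only if the stored size for path differs."""
    current = str(os.path.getsize(path)).encode()
    key = path.encode()
    if key in state and state[key] == current:
        return False
    reindex(path)
    state[key] = current
    return True


calls = []   # stands in for the real updater
with tempfile.TemporaryDirectory() as workdir:
    data_path = os.path.join(workdir, "sprot.dat")
    with open(data_path, "w") as out:
        out.write("ID   A_HUMAN\n//\n")
    with dbm.dumb.open(os.path.join(workdir, "state"), "c") as state:
        first = update_if_changed(state, data_path, calls.append)
        second = update_if_changed(state, data_path, calls.append)
        with open(data_path, "a") as out:
            out.write("ID   B_MOUSE\n//\n")
        third = update_if_changed(state, data_path, calls.append)

print(first, second, third)
```

Swapping the size check for a timestamp or a hash only touches
the controller, not the indexer.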

                    Andrew
                    dalke at acm.org