[Biopython-dev] ANN: mindy-0.1
Thomas Sicheritz-Ponten
thomas at cbs.dtu.dk
Mon Mar 26 01:53:46 EST 2001
"Andrew Dalke" <dalke at acm.org> writes:
>
> >Just curious -- why'd you decide to use Berkeley DB?
>
> I considered the following choices:
> - Berkeley DB
> - MySQL
> - PostgreSQL
> - Oracle
>
> The last three require knowledge of SQL, of which I
> have very little, and I wanted to get things up very
> quickly. In addition, all I wanted to do was lookups,
> and BSDDB does that very well. Plus, I liked that
> BSDDB works in the local process rather than talking
> to a server.
Hmm, I don't think I understand what you are actually storing - how is the
indexing done? Are you preparsing all entries during the indexing step, or
are you storing the positions of the entries via seek/tell? (For a simple
position-indexing tool à la TIGR's yank, see getgene.py in biopython - that
would also answer the alias question.)
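(For concreteness, here is a minimal sketch of the seek/tell style of
indexing I mean: record each entry's byte offset in a dbm-style table,
then seek straight back to it on lookup. The FASTA parsing, function
names and index filename are illustrative assumptions on my part, not
mindy's actual layout.)

    # Sketch only: offsets in a dbm table, FASTA assumed for the example.
    import dbm

    def build_index(data_path, index_path):
        index = dbm.open(index_path, "c")
        with open(data_path, "rb") as handle:
            offset = handle.tell()
            line = handle.readline()
            while line:
                if line.startswith(b">"):          # a new entry starts here
                    entry_id = line[1:].split()[0]
                    index[entry_id] = str(offset)  # remember where it began
                offset = handle.tell()
                line = handle.readline()
        index.close()

    def fetch(data_path, index_path, entry_id):
        index = dbm.open(index_path, "r")
        offset = int(index[entry_id.encode()])
        index.close()
        chunks = []
        with open(data_path, "rb") as handle:
            handle.seek(offset)               # jump straight to the entry
            chunks.append(handle.readline())  # the '>' header line
            for line in handle:
                if line.startswith(b">"):     # stop at the next entry
                    break
                chunks.append(line)
        return b"".join(chunks)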
>
> I can envision interfaces to the other databases. Perhaps
> for the future.
>
> >> Would working with compressed files be useful?
Always!!! - Does anybody know how to seek/tell in a gzipped file?
> Easy enough I think to stick a bit of code on the beginning
> of the read to tell if the file is compressed or not. I
> think Python now includes some built-in modules for reading
> compressed files, else popen'ing through zcat or bzcat is
> pretty easy.
from gzip import open ???
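(To my own question: Python's gzip.GzipFile does support tell() and
seek(), but the offsets refer to the *uncompressed* stream, and a
backward seek is emulated by rewinding and re-decompressing from the
start - so an offset index into a gzipped file works, it is just slow.
A rough sketch of the magic-byte sniffing Andrew describes, with a
hypothetical filename:)

    # The gzip magic is \x1f\x8b; bzip2 files start with 'BZh'.
    import gzip

    def open_maybe_compressed(path):
        with open(path, "rb") as handle:
            magic = handle.read(3)
        if magic[:2] == b"\x1f\x8b":
            return gzip.open(path, "rb")  # offsets below are uncompressed
        # the bz2 module could be handled the same way
        return open(path, "rb")

    handle = open_maybe_compressed("example.dat")
    handle.readline()
    pos = handle.tell()  # position in the uncompressed stream
    handle.seek(pos)     # legal, but backward seeks re-read from the start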
> No, it wouldn't. But I think when you start getting into
> "real" databases (meaning ones with SQL) then people want
> the ability to set up their own schemas, so the queries
> they have go quickly. Should the database created be
> fully normalized (in which case queries can be very
> complex and require a lot of joins) or denormalized (which
> makes for easier queries but is easier to accidentally
> leave in an invalid state)?
Be careful, you are heading from a "simple" indexing scheme to a pySRS :-)
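(Andrew's normalization tradeoff, made concrete - purely illustrative,
using sqlite3 as a stand-in and invented table names:)

    import sqlite3
    conn = sqlite3.connect(":memory:")

    # Normalized: one fact per table, consistent by construction, but
    # even a simple "accession -> description" lookup needs a join.
    conn.executescript("""
        CREATE TABLE entry   (entry_id INTEGER PRIMARY KEY, accession TEXT);
        CREATE TABLE feature (entry_id INTEGER REFERENCES entry(entry_id),
                              key TEXT, value TEXT);
    """)
    rows = conn.execute("""
        SELECT e.accession, f.value
        FROM entry AS e JOIN feature AS f ON f.entry_id = e.entry_id
        WHERE f.key = 'description'
    """)

    # Denormalized: one wide table, trivial queries, but duplicated
    # accessions can silently drift out of sync.
    conn.executescript(
        "CREATE TABLE flat (accession TEXT, key TEXT, value TEXT);")
    rows = conn.execute(
        "SELECT accession, value FROM flat WHERE key = 'description'")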
>
> I don't think there is a solution, so the best is to
> wait until someone has a need for it. Then pay me to
> write the interfaces :) My need for now is indexed
> searches, so I used a database system which is designed
> for that task. There is no danger of anyone confusing the
> result with something usable for larger-scale queries.
>
> >Another addition which I think would be nice is storing the size of
> >the indexed files.
> >This would allow you to potentially skip
> >re-indexing when index is called on a file.
Uhuh ... I don't think so, especially not if just accession numbers or IDs
are changed (e.g. from a TrEMBL ID to a SWISS-PROT ID), which could result
in a slightly changed db with the same size. Better to use checksums or the
indexed accession numbers/IDs (the best solution, but it takes more time).
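(A rough sketch of the checksum alternative - the function names and
the stored-checksum bookkeeping are made up for illustration:)

    import hashlib

    def file_checksum(path, chunk_size=1 << 20):
        # Hash the raw contents, so an edited ID changes the digest
        # even when the file size stays exactly the same.
        digest = hashlib.md5()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                digest.update(chunk)
        return digest.hexdigest()

    def index_is_current(path, stored_checksum):
        return file_checksum(path) == stored_checksum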
> By the end of the week I hope to start working on it.
> OTOH, my laptop started acting flaky in the last few days :(
> Have I mentioned that me and hardware don't get along?
What laptop or hardware combination is causing you nightmares?
seek-and-indexingly-y'rs
-thomas
--
Sicheritz-Ponten Thomas, Ph.D CBS, Department of Biotechnology
thomas at biopython.org The Technical University of Denmark
CBS: +45 45 252489 Building 208, DK-2800 Lyngby
Fax +45 45 931585 http://www.cbs.dtu.dk/thomas
De Chelonian Mobile ... The Turtle Moves ...