[BioPython] what should I do next?

Andrew Dalke dalke@acm.org
Wed, 12 Apr 2000 03:22:23 -0600


(Merged replies to Brad and Cayte in this message.)

Brad Chapman:
>I personally would like to see all of the biopython code integrated so 
>that Jeff's parsers tie in with your sequence classes. I don't know 
>how you and Jeff feel about this though?


I'm taking a look at that.  He and I have different ideas of
how to write parsers, so I'm trying to get a feel for his style.

I've also been looking into writing scanner generators.
It turns out Python 1.6 will(?) have an sre_parse module, which
parses regular expressions.  This could be very handy.
I also came up with a name for a scanner generator: either
Linebarger or Martel.  The reason is a test for science fiction
fans :)

>Right now I'm integrating the existing corba server with your sequence 
>code so that it actually does something :)

Cool!  I haven't really looked at Ewan's proposal for a while, so
I didn't know how it would work together.

>To support this idl, it 
>would be nice to see more support for parsing different file types 
>(besides just Fasta files)

That interface is likely to change.  I'm thinking about adapters
to make it work with "pure" FASTA records (description and sequence)
or hypothetical sequence records (where the format of the description
line means something).
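
To illustrate the adapter idea, here is a minimal sketch in present-day
Python.  The names FastaRecord, read_fasta and with_id are hypothetical
and are not part of the proposed code; the "first word of the description
is an identifier" convention is only one assumed example of a description
line that means something.

```python
import io
from typing import Iterator, NamedTuple, TextIO, Tuple


class FastaRecord(NamedTuple):
    """A "pure" FASTA record: just a description line and a sequence."""
    description: str
    sequence: str


def read_fasta(handle: TextIO) -> Iterator[FastaRecord]:
    """Yield one FastaRecord per '>' header found in the handle."""
    description = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if description is not None:
                yield FastaRecord(description, "".join(chunks))
            description = line[1:]
            chunks = []
        elif description is not None and line:
            chunks.append(line)
    if description is not None:
        yield FastaRecord(description, "".join(chunks))


def with_id(record: FastaRecord) -> Tuple[str, str, str]:
    """Adapter (assumed convention): the first whitespace-delimited word
    of the description is an identifier, the rest is free text."""
    parts = record.description.split(None, 1)
    ident = parts[0] if parts else ""
    rest = parts[1] if len(parts) > 1 else ""
    return ident, rest, record.sequence


if __name__ == "__main__":
    text = ">sp|P12345 test protein\nMKV\nLLA\n>second\nACGT\n"
    for rec in read_fasta(io.StringIO(text)):
        print(with_id(rec))
```

The reader stays ignorant of what the description means; each adapter
layers one interpretation on top, so formats can be added without
touching the parser.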

BTW, the most stable interfaces are the ones tested by the regression
suite.  The next most stable are the ones used by those tested modules.
Anything the regression code doesn't touch is the most likely to change.

>and to add code to work with sequence 
>features (so we can support the sequence feature and sequence feature 
>iterator interfaces).

I've been trying to figure out Bioperl's code, but haven't been
getting too far.  I think there need to be Range objects for
proteins and ones for nucleotides (where the latter support strand
information).  They also seem to have hierarchical regions, but
I see Ewan commented that those might not have been as useful
as they thought.

Can anyone give me a synopsis/overview of sequence features?  Examples
of different types would be nice.  Can the base class assume:

  list of regions
    region has a start and end (and for nucleotides, a strand)
    must all regions in a list have the same strand?
    must they be ordered and non-overlapping?
  some name/description

and additional data is stored on a per-feature basis?

Are things like parameters used, biblio reference, date of
feature determination, etc. useful enough to place in the base
class (even if only None)?  I've seen these fields put into some
database schemas.
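
To make the questions above concrete, here is a minimal sketch in
present-day Python of what such a base class might look like.  Every
name here (Region, Feature, same_strand, ordered_non_overlapping) is
hypothetical, and the half-open [start, end) coordinates and +1/-1
strand values are assumptions, not anything taken from Bioperl or the
proposed code.  The two open questions are written as checks rather
than enforced rules, since the answers aren't settled.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Region:
    """One stretch of a sequence, assumed half-open: [start, end)."""
    start: int
    end: int
    strand: Optional[int] = None  # +1 or -1 for nucleotides, None for proteins

    def __post_init__(self):
        if self.start > self.end:
            raise ValueError("start must not exceed end")
        if self.strand not in (None, 1, -1):
            raise ValueError("strand must be +1, -1 or None")


@dataclass
class Feature:
    """A named list of regions plus optional bookkeeping fields."""
    name: str
    regions: List[Region]
    description: str = ""
    reference: Optional[str] = None   # biblio reference, None if unknown
    date: Optional[str] = None        # date of determination, None if unknown
    extra: Dict[str, Any] = field(default_factory=dict)  # per-feature data

    def same_strand(self) -> bool:
        """True if every region carries the same strand value."""
        return len({r.strand for r in self.regions}) <= 1

    def ordered_non_overlapping(self) -> bool:
        """True if each region ends at or before the next one starts."""
        pairs = zip(self.regions, self.regions[1:])
        return all(a.end <= b.start for a, b in pairs)


if __name__ == "__main__":
    cds = Feature("CDS", [Region(10, 20, 1), Region(30, 40, 1)])
    print(cds.same_strand(), cds.ordered_non_overlapping())
```

Keeping the rarely used fields as None defaults costs little in the base
class, while anything format-specific can go in the per-feature dictionary.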


>I would be 
>interested in seeing more support for the corba interface and for 
>building this up.

You are using Fnorb, right?  As I recall, that's not for commercial
use without paying, yes?  Since I'm trying to do this commercially
(hopefully), I've been looking at ILU and pyomniorb.  I got stuck
in the latter trying to figure out memory management.  Keep meaning
to send email to the authors....


Cayte:
> If you tell me where your files are, I can write test cases.

I've placed the most recent copy of my proposed code at
ftp://starship.python.net/pub/crew/dalke/biopython/

There is some regression code under test/, but it is incomplete.
To run it, "make test".  The regression code is a modified
version of "regrtest" from the Python distribution.

There are hooks for a Python coverage utility, which is also
available from my Starship site.  I've not run the coverage yet -
I know there are modules which aren't even imported by any of my
tests and probably don't work since I rearranged the directory
structure somewhat.

Oh, and the SeqIO idea is likely to change as I figure out how
to make Jeff's work and mine fit together.

You mentioned you have some unit test code for Python.  Can
you point it out to me?


> From what I've seen of open source efforts, nobody "drives" them.
> People contribute in a bursty way, when they have the time and
> motivation.

Well, I've got both time and motivation (until someone pays me
to do otherwise).  I just don't have the calibrated judgement of
what's a good design or not.

I'm also more used to projects where there are a couple of people
working on the core, and a half dozen or so in the same place
(company, research group, whatever) using it.  The previous
open and semi-open source projects I've worked on have had only
sporadic non-local contributors to them.

So, I'm not used to bursty development from several different
places.

                    Andrew
                    dalke@acm.org