From dalke at acm.org  Fri Sep  1 03:16:10 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] XSLT and Martel output
Message-ID: <39AF57BA.A43BF7CB@acm.org>

Hello,

  With some pointers from Brad I managed to get an XSLT converter for
the Martel SWISS-PROT output into FASTA.  I would have tried an XML
one, but wasn't sure which to use.

The input was the example output file I have at
  http://www.biopython.org/~dalke/Martel/BOSC2000.poster/sample.xml.txt
This has 8 records and is about 60K long.

The XSLT engine I used is 4XSLT from Fourthought.  BTW, it was
entirely too complicated to install, esp. since there aren't any
instructions and there seems to be a missing file from one of
the distributions (but which is in the other). :(

The actual XSLT text I used is below.
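(The stylesheet markup did not survive archiving; only stray fragments
such as the literal ">sp|" text, the disable-output-escaping attribute,
and the select="sequence_block/SQ_data_block/SQ_data/sequence" path
remain.  What follows is a reconstructed sketch rather than the original
text: those surviving fragments are used as given, while element names
such as swissprot38_record, ac_number, entry_name and description are
assumptions about the Martel format definition and would need to be
checked against the real sample.xml.txt.)

    <?xml version="1.0"?>
    <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                    version="1.0">
      <xsl:output method="text"/>

      <!-- one FASTA entry per SWISS-PROT record -->
      <xsl:template match="swissprot38_record">
        <xsl:text disable-output-escaping="yes">&gt;sp|</xsl:text>
        <xsl:value-of select="normalize-space(ac_number)"/>
        <xsl:text disable-output-escaping="yes">|</xsl:text>
        <xsl:value-of select="entry_name"/>
        <xsl:text> </xsl:text>
        <xsl:value-of select="normalize-space(description)"/>
        <xsl:text>&#10;</xsl:text>
        <!-- each 60-character SWISS-PROT sequence line becomes one output line -->
        <xsl:for-each select="sequence_block/SQ_data_block/SQ_data/sequence">
          <xsl:value-of select="."/>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
        <xsl:text>&#10;</xsl:text>
      </xsl:template>

      <!-- suppress the default copying of all other text -->
      <xsl:template match="text()"/>
    </xsl:stylesheet>

Any XSLT 1.0 processor should accept something along these lines once
the element names are matched to the actual Martel output.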
Example output looks like:
====
>sp|Q43495|108_LYCES PROTEIN 108 PRECURSOR.
MASVKSSSSSSSSSFISLLLLILLVIVLQSQVIECQPQQSCTASLTGLNVCAPFLVPGSP
TASTECCNAVQSINHDCMCNTMRIAAQIPAQCNLPPLSCSAN

>sp|P18646|10KD_VIGUN 10 KD PROTEIN PRECURSOR (CLONE PSAS10).
MEKKSIAGLCFLFLVLFVAQEVVVQSEAKTCENLVDTYRGPCFTTGSCDDHCKNKEHLLS
GRCRDDVRCWCTRNC
====

It took about 3.5 seconds to load the file into the DOM and about 1.5
seconds to process it.  Since there are 80,000 records in sprot38, it
would take nearly 14 hours to convert everything.  It would take about
20 minutes to translate it using a SAX-based converter, so the XSLT
route is roughly a factor of 40 slower.

Of course, it would also require that I have enough memory, since the
DOM I'm using (4DOM, also from Fourthought) keeps everything in RAM.

There are some performance things you need to learn using XSLT (or at
least tricks specific to this engine).  For example, pulling the
sequence out with the explicit path

  select="sequence_block/SQ_data_block/SQ_data/sequence"

is a lot faster (20-fold or so!) than the alternative (the slower
expression was lost when the markup was stripped).

It's a good thing that FASTA doesn't mandate that all sequence lines
(excepting the last) must be 65 characters long.  The SWISS-PROT
sequence lines are 60 characters long, and I can't figure out how to
wrap them to different lengths.

On the other hand, it *does* work, and the performance of the engines
should go up over time (eg, there is usually about a factor of 5-10 to
be gained by translation into C).  Plus, in theory you should be able
to make it work with other XSLT tools.  Anyone want to try it with XT,
or one of the browsers (does Mozilla or Opera support XSLT?).

Better yet, want to start playing around with the BLAST output from
Martel? :)

                    Andrew
                    dalke@acm.org


From bradmars at yahoo.com  Fri Sep  1 13:09:47 2000
From: bradmars at yahoo.com (Bradley Marshall)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Re: [BioXML-dev] XSLT and Martel output
Message-ID: <20000901170947.23563.qmail@web208.mail.yahoo.com>

It looks great, Andrew.

I haven't crunched any numbers, but my gut feeling is that xt (from
jclark.com) is probably 5-10 fold faster than 4XSLT.  Unfortunately,
4XSLT is the only Python XSLT processor that I know of.  It's good,
but slow.

On the plus side, xt works quite nicely in JPython.

Brad

--- Andrew Dalke <dalke@acm.org> wrote:
> With some pointers from Brad I managed to get an XSLT converter for
> the Martel SWISS-PROT output into FASTA.
> [...]


From katel at worldpath.net  Tue Sep  5 03:10:17 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Gobase
Message-ID: <004201c01708$59752300$010a0a0a@0q6vm>

I just committed a Gobase parser.

                                              Cayte


From bradmars at yahoo.com  Tue Sep  5 14:41:09 2000
From: bradmars at yahoo.com (Bradley Marshall)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Re: [BioXML-dev] XSLT and Martel output
Message-ID: <20000905184109.20724.qmail@web208.mail.yahoo.com>

So I went back and checked the Python XML-SIG mailing list, and
Fourthought claims that 4XSLT 0.9.2 is up to 100 times faster than
0.8.2.  However, it wasn't available from their web site.  There was a
link to the rpms, though, and there I found 4XSLT 0.9.2.
So, if anybody wants it, it's at:

ftp://fourthought.com/pub/mirrors/python4linux/redhat/i386/4XSLT-0.9.2-1.i386.rpm

Brad

--- Bradley Marshall <bradmars@yahoo.com> wrote:
> I haven't crunched any numbers, but my gut feeling is that xt (from
> jclark.com) is probably 5-10 fold faster than 4XSLT.
> [...]
From jchang at SMI.Stanford.EDU  Wed Sep  6 00:42:43 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Gobase
In-Reply-To: <004201c01708$59752300$010a0a0a@0q6vm>
Message-ID: 

Great!  I noticed that you created a suite of regression tests.  Could
you also commit a hand-verified file for Tests/output/test_gobase?

If nothing catastrophic happens, I'd like to put together a new build
tomorrow afternoon (PST).  If the file doesn't get in before then,
that's ok too.

Thanks,
Jeff

On Tue, 5 Sep 2000, Cayte wrote:
> I just committed a Gobase parser.
>
> Cayte


From jchang at SMI.Stanford.EDU  Wed Sep  6 18:32:51 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Biopython 0.90d03 released
Message-ID: 

Hello everybody,

Biopython 0.90d03 is now available at:
http://www.biopython.org/Download/

Changes from the previous version are:

Blast updates:
  - bug fixes in NCBIStandalone, NCBIWWW
  - some __str__ methods in Record.py implemented (incomplete)

Tests
  - new BLAST regression tests
  - prosite tests fixed

New parsers for Rebase, Gobase
Pure python implementation of C-based tools
Thomas Sicheritz-Ponten's xbbtools
Can now generate documentation from docstrings using HappyDoc

The tests for prodoc and rebase are not working yet, so if you run the
regression tests, those two should fail, but the other 14 should work.

Enjoy, and keep those bug reports, feature requests, patches, and new
modules coming in!

Jeff


From dagdigian at ComputeFarm.com  Wed Sep 13 13:21:50 2000
From: dagdigian at ComputeFarm.com (Chris Dagdigian)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] bio*.org server bandwidth upgrade on the horizon
Message-ID: <4.3.2.7.0.20000913131410.00ad1b20@fedayi.sonsorol.org>

I just received word that the net connection for the bio*.org server(s)
is going to be upgraded from a T1 to a T3 line.

Given the insane lead time for telecommunication orders, the best
timeframe I have at this time is that the work will be completed
sometime before the end of the year.

I'll provide more info as I get it, especially if it involves
significant downtime for us.

-Chris


From katel at worldpath.net  Sun Sep 17 02:35:52 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Martel
Message-ID: <005801c02071$8721b3a0$010a0a0a@cadence.com>

My impression of Martel is that it will require extensive testing,
because it has so many paths.  The tests cover the basic expressions,
but I'd be surprised if there are no weird interactions.  The code may
lose its context on complicated paths.  I could help with adding unit
tests.

In a few cases, I think the names need to be more descriptive.
Variables like p, s or av don't give a lot of information.  Also, the
name "pattern" is used for too many things that have different
meanings.
The regular expression is sometimes "source" and sometimes "s".  At
least I need all the help I can get navigating the recursion. :)

An example of a construct that confused me is:

  x = sre_parse.parse(s, pattern = MultigroupPattern())
  return convert_list(x.pattern, x)

Finding self-documenting names can be hard, but sometimes the effort
to find the right metaphor clarifies your thinking.

                                              Cayte


From dalke at acm.org  Sun Sep 17 14:39:34 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Martel
Message-ID: <003201c020d6$a1ce0ea0$359c343f@josiah>

Cayte:
> My impression of Martel is that it will require extensive testing,
> because it has so many paths.  The tests cover the basic expressions,
> but I'd be surprised if there are no weird interactions.  The code
> may lose its context on complicated paths.  I could help with adding
> unit tests.

One of the things I found during development was that it was almost
impossible to write a parser without testing each of the components
against real text.  What you are seeing is the support framework
needed for that.

Concerning the number of paths: I'm not sure which paths you're
talking about.  There are two I can think of.  One is the generation
of the state table for mxTextTools and the other is the evaluation of
the text through that state table.

The first is somewhat straightforward, very much like unoptimized code
generation from a parse tree.  It does need documentation so others
can verify my work.

The second is indeed more complicated, but it should be almost
identically complicated to hand-written parser code of equivalent
abilities.

Debugging, btw, is also somewhat complicated, because failures are
reported at the last character where something matched rather than at
the last character that was actually tested.  I need to take a look at
the mxTextTools code to see if there's a way to give better position
information.

> In a few cases, I think the names need to be more descriptive.
> Variables like p, s or av don't give a lot of information.  Also,
> the name "pattern" is used for too many things that have different
> meanings.

You're missing a few other naming clashes in my code.  I agree, it
needs a full cleanup before it is of good enough quality that I would
foist it off on most people.  The names are confusing because I was
confused myself when writing the code.  I was working with a couple of
toolsets (sre_parse and mxTextTools) which I hadn't used before, and I
was changing my idea of how things should be done based on what I
learned using them.  (Not an excuse, just history, and they do need to
get fixed.)

There are two major reasons why I haven't fixed things.  One is, alas,
the lack of time.  The other is that there are a few changes I need to
make to support certain formats and needs.  I've added a "named group
repeat" where a named group can be used as the repeat count for later
groups.  (This is needed for MDL's CT format, which gives the atom and
bond counts and then "atom_count" lines of atom records and
"bond_count" lines of bond records.)  I also need to redo how it
handles files so I can feed it a record at a time rather than the
whole data file, but without changing the SAX events.  (Jeff first
suggested this one.)
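(The record-at-a-time idea is easy to sketch outside of Martel itself.
The reader below is not Martel code, just an illustration of the
approach under the assumption that SWISS-PROT records are delimited by
a "//" line; each record could then be handed to the format expression
separately, so only one record's worth of text is in memory at a time.)

    def read_records(infile, terminator="//"):
        """Yield one SWISS-PROT record, as a string, at a time."""
        lines = []
        for line in infile:
            lines.append(line)
            # a record ends with a line containing only the terminator
            if line.rstrip() == terminator:
                yield "".join(lines)
                lines = []
        if lines:                  # trailing partial record, if any
            yield "".join(lines)

    # Hypothetical use: hand each record to a Martel-built parser.
    # for record in read_records(open("sprot38.dat")):
    #     parser.parseString(record)   # assumed record-level parse call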
So I'm still in the experimental phase to see what other changes are
needed, and I'm hoping to get feedback from others about it.  Thus, I
haven't wanted to go through the code cleaning it up until I know more
about what to change.

> Finding self-documenting names can be hard, but sometimes the effort
> to find the right metaphor clarifies your thinking.

Yep, and yep.

                    Andrew
                    dalke@acm.org


From katel at worldpath.net  Sun Sep 17 23:45:48 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Martel
References: <003201c020d6$a1ce0ea0$359c343f@josiah>
Message-ID: <001f01c02122$efa75720$010a0a0a@cadence.com>

When I ran _test in Generate.py, I received this message:

Traceback (innermost last):
  File "test_generate.py", line 3, in ?
    Generate._test()
  File "Generate.py", line 471, in _test
    exp = _generate(convert_re.make_expression(re_pat))
TypeError: not enough arguments; expected 2, got 1

convert_re.make_expression returns the results from convert_list.
convert_list returns an Expression.Seq object that is passed to
_generate.

                                              Cayte


From dalke at acm.org  Tue Sep 19 19:52:44 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Martel
Message-ID: <003201c02294$b6059fe0$399c343f@josiah>

Cayte:
> When I ran _test in Generate.py, I received this message:
>
> Traceback (innermost last):
>   File "test_generate.py", line 3, in ?
>     Generate._test()
>   File "Generate.py", line 471, in _test
>     exp = _generate(convert_re.make_expression(re_pat))
> TypeError: not enough arguments; expected 2, got 1

Oops!  Yeah, if I don't make all of the tests accessible from one spot
I forget to run them.  I changed the API after I wrote that test.  It
can be fixed by passing {} as the second parameter:

  exp = _generate(convert_re.make_expression(re_pat), {})

The second parameter is a dictionary of names needed for group
references, like the \1 in r"(?P<name>...)\1".  (It's a dict instead
of a list because I like the O(1) lookup performance, and because the
parameter is not exposed as part of the API.)

I'm cleaning up the code now, including changing the names to be more
consistent.  I'll include moving all of the tests to the test
directory instead of keeping them in the modules.

                    Andrew
                    dalke@acm.org


From katel at worldpath.net  Thu Sep 21 04:53:08 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Martel
References: <003201c02294$b6059fe0$399c343f@josiah>
Message-ID: <007401c023a9$5e112140$010a0a0a@cadence.com>

----- Original Message -----
From: "Andrew Dalke"
To:
Sent: Tuesday, September 19, 2000 4:52 PM
Subject: Re: [Biopython-dev] Martel

> Oops!  Yeah, if I don't make all of the tests accessible from one
> spot I forget to run them.  I changed the API after I wrote that
> test.  It can be fixed by passing {} as the second parameter:
>
>   exp = _generate(convert_re.make_expression(re_pat), {})

It works with the patch.  But when I pasted in some regexps from the
perl EMBL.pm, it rejected these constructs.  I can't send the perl
expressions in this message, because the email software interprets the
backslashes as the prefix of a url.

                                              Cayte


From dalke at acm.org  Thu Sep 21 05:27:33 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Martel
Message-ID: <002901c023ae$2ce10540$0680343f@josiah>

Cayte:
> In a few cases, I think the names need to be more descriptive.
> Variables like p, s or av don't give a lot of information.
I've been cleaning up the code over the last couple of days, and
adding docstrings and comments.  I haven't gone through all my code
yet, but I think the case you're seeing is the sre_parse.py code.  I
grabbed that module and sre_constants.py from the 1.6 distribution.
It was written by Fredrik Lundh and I don't have much control over it.

I have made some changes to sre_parse.py, but I've tried to minimize
those changes to make it easier to stay in synch with future changes.
On the other hand, the sre code is being tested by a lot of people, so
there shouldn't need to be many tests for it except for the changes
I've added.  (Those changes are marked with an 'APD'.)

> Also, the name "pattern" is used for too many things that have
> different meanings.  The regular expression is sometimes "source"
> and sometimes "s".

Again, that appears to be the sre_parse code.  I have cleaned up my
code to distinguish between a regular expression in the abstract and
its representation as a "pattern" string and an "expression" tree.  My
next project is to clean up Generate.py, which has the regexp
represented as a tag table.

> An example of a construct that confused me is:
>
>   x = sre_parse.parse(s, pattern = MultigroupPattern())
>   return convert_list(x.pattern, x)

I changed the 'MultigroupPattern' class name to 'GroupName'.  There's
nothing I can really do about the "pattern =" and the "x.pattern"
code, since that's the way sre_parse wants it.  I did add some
documentation beforehand saying to basically ignore the names :)

I'm giving a presentation to people tomorrow (oops! today!) about
Martel.  They are chemistry people and will want to see support for
MDL's file formats.  That wasn't in the previous release, so I've made
a 0.25 release which contains that format and the cleanups I've done
to date.  It's at
  http://www.biopython.org/~dalke/Martel/Martel-0.25.tar.gz .
Also, all of the regression tests are now runnable from
test/__init__.py.

Version 0.3 will be the release with the complete code cleanup
(excluding the two sre_*.py files).

                    Andrew
                    dalke@acm.org


From dalke at acm.org  Mon Sep 25 22:42:06 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] sre bug(s)
Message-ID: <00c601c02763$7f909040$2980343f@josiah>

Sigh.  Looks like I'll update the sre_parse.py in Martel once 2.0b2 is
released tomorrow.

                    Andrew

-----Original Message-----
From: Fredrik Lundh
Newsgroups: comp.lang.python
Date: Monday, September 25, 2000 4:12 PM
Subject: Re: You can never go down the drain...

>Phlip wrote:
>> C> I'd search the bugbase for this, to see if 1.6 had it, but I have
>> no idea how to search for things like [] and \1 and sub from a web
>> page.
>
>it's bug 114660
>
>> 4> Just for curiosity, any re workaround that's obvious to all you
>> expression regulars?  (Besides a loop statement?)
>
>here's the patch (the line numbers might differ slightly from
>your copy)
>
>Index: sre_parse.py
>===================================================================
>RCS file: /cvsroot/python/python/dist/src/Lib/sre_parse.py,v
>retrieving revision 1.33
>retrieving revision 1.34
>diff -C2 -r1.33 -r1.34
>*** sre_parse.py	2000/09/02 11:03:33	1.33
>--- sre_parse.py	2000/09/24 14:46:19	1.34
>***************
>*** 635,639 ****
>      group = _group(this, pattern.groups+1)
>      if group:
>!         if (not s.next or
>              not _group(this + s.next, pattern.groups+1)):
>              code = MARK, int(group)
>--- 635,639 ----
>      group = _group(this, pattern.groups+1)
>      if group:
>!         if (s.next not in DIGITS or
>              not _group(this + s.next, pattern.groups+1)):
>              code = MARK, int(group)
>
>


From roybryant at SEVENtwentyfour.com  Tue Sep 26 10:59:10 2000
From: roybryant at SEVENtwentyfour.com (Roy Bryant)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Broken link in www.biopython.org
Message-ID: 

There appears to be a problem on this page of your site.  On your page

  http://www.biopython.org/wiki/html/BioPython/BioCorba.html

when you click on your link to

  http://www.biopython.org/Download/

you get the error: Not found

As recommended by the Robot Guidelines, this email is to explain our
robot's activities and to let you know about one of the broken links
we encountered.  LinkWalker does not store or publish the content of
your pages, but rather uses the link information to update our map of
the World Wide Web.

Are these reports helpful?  I'd love some feedback.  If you prefer not
to receive these occasional error notices, please let me know.

Roy Bryant

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Roy Bryant, roybryant@seventwentyfour.com
President, SEVENtwentyfour Inc.  ("Always watching the Web")
http://www.seventwentyfour.com
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


From dalke at acm.org  Thu Sep 28 02:43:00 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] lost my Jitterbug password
Message-ID: <02a701c02917$5928b300$ec7f343f@josiah>

What should I do?

BTW, I fixed bug 11, "tranlate by name", and added a test for it in
test_translate.py .

                    Andrew


From thomas at cbs.dtu.dk  Thu Sep 28 15:34:21 2000
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Background process handling
Message-ID: <14803.40253.441744.109557@bb1.home>

Hej Biopythoners,

I need to include a pure python solution for background database
searches in the 'xbbtools' program ... but I am not sure how to
implement that.

I want to start one or several Blast searches from the graphical
sequence editor.  The individual results should be continuously
updated in different windows (one per blast search).
The different windows should signal (maybe by changing background
color) when the blast search is finished.  The user can stop a blast
search simply by destroying the associated window.  Of course, all
windows shall be updated continuously and the user should not feel any
lag in the main editor window.

How should I solve this?
a) fork and exec*
b) popen
c) write to temporary file, start blast into new file, continuously
   read new file
d) use an expect module
e) threads
f) a combination with a LOT of updates?
g) ???

Any suggestions?

thx
-thomas

Sicheritz Ponten Thomas E.  CBS, Department of Biotechnology
thomas@bioinformatics.org   The Technical University of Denmark
CBS:  +45 45 252485         Building 208, DK-2800 Lyngby
Fax:  +45 45 931585         http://www.cbs.dtu.dk/thomas/index.html

	De Chelonian Mobile ... The Turtle Moves ...


From antoine at egenetics.com  Fri Sep 29 04:34:48 2000
From: antoine at egenetics.com (Antoine van Gelder)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Background process handling
References: <14803.40253.441744.109557@bb1.home>
Message-ID: <39D45428.86E54ED2@egenetics.com>

thomas@cbs.dtu.dk wrote:
> How should I solve this?
> a) fork and exec*
> b) popen
> c) write to temporary file, start blast into new file, continuously
>    read new file
> d) use an expect module
> e) threads
> f) a combination with a LOT of updates?
> g) ???

In the Stackpack EST clustering pipeline I use a thread wrapped around
popen to fire off jobs that are expected to take some time.

Main program updates can be handled either through polling the thread
(not so good) or a callback from the thread (much better) :>

 - antoine


From dalke at acm.org  Fri Sep 29 06:20:35 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Background process handling
Message-ID: <044401c029fe$ed8dd060$ec7f343f@josiah>

Thomas Sicheritz-Ponten:
> How should I solve this?
> a) fork and exec*
> b) popen
> c) write to temporary file, start blast into new file, continuously
>    read new file
> d) use an expect module
> e) threads
> f) a combination with a LOT of updates?
> g) ???

There are two usual approaches: select based and thread based.

"select" is a mechanism to tell if something happened on a file
handle.  Under unix, nearly everything is a file handle (files,
network I/O, X).  Under Windows it only works with sockets.  See the
select module.

Using selects in a command line application works something like
this.  Have a central list of "jobs", each of which is a select'able
object.  (Warning: you are limited to the number of file descriptors
on a machine, which also includes stdin, stdout and stderr.  On some
machines this may be 64 or lower, though lower is quite rare these
days.)  The outermost loop of your program does a select on the task
list to see which had changes.  From this it maps the activity
information to an action, which is most likely a callback for that
object.  The function can read text from the descriptor, remove the
task from the list of tasks, or whatever.

With a GUI things become a bit more complicated.  Some GUIs want to be
the main event loop, but realize that other people use select based
multitasking, so they provide a way to register file descriptors and
callbacks.  Other GUIs act more like a library, and give you a way to
get a (possible list of) file descriptor for the GUI, which you use
for your event loop.  I believe Tk is of the first form, but I've
never really looked into it.  The GUI documentation should go into the
details.
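(A minimal sketch of the select based loop described above, for the
command line case.  It assumes the BLAST jobs are started with
subprocess, the modern stand-in for os.popen/popen2, and the names
Job, on_data and on_done are made up for illustration; a real xbbtools
integration would register the same descriptors with the GUI's event
loop instead of calling select directly.)

    import select
    import subprocess

    class Job:
        """One running BLAST search plus the callbacks to notify."""
        def __init__(self, cmdline, on_data, on_done):
            self.proc = subprocess.Popen(cmdline, stdout=subprocess.PIPE)
            self.on_data = on_data    # called with each chunk of output
            self.on_done = on_done    # called when the process finishes

    def run_jobs(jobs):
        # map each job's output stream to the job itself
        fd_map = {job.proc.stdout: job for job in jobs}
        while fd_map:
            # wait until at least one job has produced output (or exited)
            readable, _, _ = select.select(list(fd_map), [], [])
            for stream in readable:
                job = fd_map[stream]
                chunk = stream.read1(4096)
                if chunk:                        # partial results arrived
                    job.on_data(chunk)
                else:                            # EOF: the search finished
                    job.on_done(job.proc.wait())
                    del fd_map[stream]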
The select approach can be used with a) os.popen, b) fork/exec (see
the popen2 module for one way) and c) reading a file using the regular
open.  Actually, b) is used as the basis for both a) and the system
call you need for c).  I've never used d) so cannot comment.

If you really want to get into select based systems, take a look at
Sam Rushing's Medusa, part of which is included in Python as asyncore
and asynchat.  The Design Pattern for this approach is, I believe,
called the "Reactor."

The other usual approach, and the one often considered more modern, is
to use threads.  This is what you almost must do if you want to run
under MS Windows.  Threads are to select as preemptive multitasking is
to non-preemptive.

The mechanism for threads is conceptually simpler than select: "start
this function and let it do whatever it needs to do while I work on
other things."  Likely you will want to create a thread task object
which takes the BLAST input parameters and runs blast.  The thread
will use the same methods as select (os.popen, fork/exec, etc.) but
instead of using select to tell if the status changed, it just sits
there waiting for input.  It can do this since the thread library will
run other threads to prevent the program from completely halting.

The downside of threads used to be that most application code, its
libraries and even POSIX calls weren't all thread safe.  POSIX added
some new functions (the "*_r" ones) to fix the problems, and many
libraries are thread safe.  Still, some aren't, and so things like Tk
must be dealt with specially to keep all the Tk calls in a single
thread.

That doesn't prevent you from writing non-thread safe code, or using
libraries (like biopython?) which aren't thread safe.  You start
having to worry about how to serialize library calls so that you don't
trigger problems.  Hint: use the higher level primitives for
threading, like Queue.

Debugging becomes more complicated because if there are timing
problems, like non-thread safe libraries, you can't always get a good
reproducible test case.  I tend to write my threaded objects with a
very state-machine-like behaviour so that I can make good guarantees
about when and how they should be used.  (This is a good programming
style in general.)

Also, Python's core is only thread safe at the coarse grained level.
There is a single, global interpreter lock which prevents two pieces
of Python code from running at the same time.  The lock is released
every so often to allow multiple threads to work.  However, this is
not a problem for you, since you aren't interested in threads as a way
to increase compute performance.  It used to be that there were a lot
of timing problems because the thread libraries were buggy, but those
problems have mostly been worked out.

Given all of this, I suggest using threads.  It's an easier
programming model (even given the possible non-thread safe parts),
works on Unix and MS Windows, and there are now more people with
thread development experience than with select.  It looks like Antoine
is one to ask :)

Here's a sketch of one way to write your code using threads.  It
assumes all GUI events are serialized in one thread, which is the main
one.
class BlastWindow:
    def __init__(self, gui_change):
        self.gui_change = gui_change
        self.result = None

    def set_results(self, result):
        # using the caller's thread, not the GUI thread, so set the
        # data but don't do anything using the GUI until called later
        self._result = result
        self.gui_change.put(self)

    def do_change(self):
        # the BLAST run is finished, so get the result data and use it
        # to update the window
        self.result = self._result
        del self._result
        # change GUI ...

class BlastTask(threading.Thread):
    def __init__(self, blast_params, window):
        threading.Thread.__init__(self)
        self.window = window
        ...

    def run(self):
        # set up the tmpdir and files like .ncbirc, etc.
        os.system("cd tmpdir; blast -i ..")  # no error checking for now
        self.window.set_results(blast_parse(open("tmpdir/blast.output")))

gui_change = Queue(-1)   # used to serialize GUI updates
app = App(gui_change)
window = app.createBlastWindow()
 ...
blast = BlastTask(blast_params, window)
 ...
while 1:
    change = gui_change.get()
    if change is None:   # or however you define an "exit"
        break
    change.do_change()

                    Andrew
                    dalke@acm.org


From chapmanb at arches.uga.edu  Sat Sep 30 12:55:57 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:42:51 2005
Subject: [Biopython-dev] Martel stuff
Message-ID: <200009301655.MAA46310@archa10.cc.uga.edu>

Hey all;
I have a few things relating to Martel:

1. I have a bit of a problem with something in the Clustalw parsing
that I just checked in (Bio/Clustalw/clustal_format.py).  The newer
clustalw formatted files have these annoying stars in the file (I call
them match_stars in the format file).  I just realized that these
stars aren't always there, so I made the following change to try and
make them optional:

--- clustal_format.py.orig	Thu Sep 28 19:49:37 2000
+++ clustal_format.py	Sat Sep 30 12:41:10 2000
@@ -59,7 +59,7 @@
 block_info = Martel.Group("block_info",
                           Martel.Rep(seq_line) +
-                          match_stars +
+                          Martel.MaxRepeat(match_stars, 0, 1) +
                           Martel.MaxRepeat(new_block, 0, 1))

I think this is right, but when I do this it makes the parse hang and
never finish.  Hmmm....  I'm not sure how to debug this, any ideas?

2. I just installed 2.0b2, and it looks like we'll need the PyXML
package :-<  Python 2.0 doesn't seem to come with saxlib, which we
need to implement handler classes for the XML produced by Martel.  The
standard xml library also doesn't have saxexts/sax2exts, and seems to
have some other differences from the PyXML package.  Once the next
version of PyXML (0.6.1, I think) which is supposed to work with b2
comes out, I guess I can see how well this works with what is in the
standard library.  Anyways, I think this is the situation with
python2.0.  I'm not sure what thoughts are about this...

3. What are people's thoughts about integrating Martel more tightly
with Biopython?  Do you think it would be worthwhile for me to try my
hand at implementing a Martel based Fasta parser that would work with
the code Jeff has already got in place?

Thanks for listening!
Brad


From dalke at acm.org  Sat Sep 30 13:50:44 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar 5 14:42:52 2005
Subject: [Biopython-dev] Martel stuff
Message-ID: <071401c02b06$f737c7c0$ec7f343f@josiah>

Brad:
> I think this is right, but when I do this it makes the parse hang and
> never finish.  Hmmm....  I'm not sure how to debug this, any ideas?

The code looks correct, except you should use "Opt(expr)" as a
shorthand for "MaxRepeat(expr, 0, 1)".

The hang you are seeing is likely a problem with Martel.
Suppose it needs to match 0 or more times, and one of the matches can
be of size 0.  Then it will sit on that spot forever, continuously
eating groups of size 0.

The best way to work around the problem is to make sure that all
repeat groups are guaranteed to be able to consume a character.
Another workaround is to put an upper limit on the repeat count.

Once I get this next release out, I'll see about generating tag tables
which check the size of any match.  There will be quite a bit of
overhead in doing that, so I'm thinking of having a debug version
which would handle this and be better able to pinpoint error
positions.

> it looks like we'll need the PyXML package :-<  Python 2.0 doesn't
> seem to come with saxlib, which we need to implement handler classes
> for the XML produced by Martel.

What about xml.sax.handler?  I haven't sat down with the new Python
distro to see what's changed.  Again, that will wait until after I get
this 0.3 release out.

> 3. What are people's thoughts about integrating Martel more tightly
> with Biopython?

Jeff says that he's for it.  I just need to (again :) get this release
out so people can start testing it.

> Do you think it would be worthwhile for me to try my hand at
> implementing a Martel based Fasta parser that would work with the
> code Jeff has already got in place?

Yes, and no.  The biggest change for 0.3 is support for hybrid
parsers, which use a simple reader to grab a record at a time, then
pass that to Martel for in-depth parsing.  This reduces the amount of
memory needed to parse a file.

So the "yes" part means: go ahead and write a parser for FASTA which
produces Biopython data structures.  However, it will likely change in
the future.  In fact, for FASTA I would probably have the regexp
available, so it can be merged with other expressions, but have it
create a scanner which is pure Python generating the SAX events,
rather than going through mxTextTools.

                    Andrew


From chapmanb at arches.uga.edu  Sat Sep 30 15:26:55 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar 5 14:42:52 2005
Subject: [Biopython-dev] Martel stuff
In-Reply-To: <071401c02b06$f737c7c0$ec7f343f@josiah>
Message-ID: <200009301926.PAA121030@archa11.cc.uga.edu>

> The hang you are seeing is likely a problem with Martel.  Suppose it
> needs to match 0 or more times, and one of the matches can be of
> size 0.  Then it will sit on that spot forever, continuously eating
> groups of size 0.

Aha!  Thanks!  The solution was to use Rep1 where I want to be
guaranteed to get a match (instead of Rep everywhere like I was doing
previously), and this stopped the hanging.  Thanks for the pointer on
that.

[XML in python2.0]
> What about xml.sax.handler?

Doh!  You're right, we can use handler and get things to work
properly.  Thanks!  In addition, there is a small change that needs to
happen in Generate.py to make things fully work (instead of using
xml.sax.saxlib.SAXException, use xml.sax._exceptions.SAXException).
But after that things seem to work!  Snazzy!
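(A minimal sketch of the xml.sax.handler route settled on above, using
today's standard library.  The element name "sequence" comes from the
sequence_block/SQ_data_block/SQ_data/sequence path earlier in the
thread; whether a Martel parser accepts a handler through the usual
SAX setContentHandler/parse calls is an assumption here, so the sketch
simply parses a plain XML file of Martel output, such as the
sample.xml.txt from the first message, to stay self-contained.)

    import xml.sax
    from xml.sax.handler import ContentHandler

    class SequenceHandler(ContentHandler):
        """Collect the text of every <sequence> element."""
        def __init__(self):
            ContentHandler.__init__(self)
            self.sequences = []
            self._inside = False
            self._parts = []

        def startElement(self, name, attrs):
            if name == "sequence":
                self._inside = True
                self._parts = []

        def characters(self, text):
            if self._inside:
                self._parts.append(text)

        def endElement(self, name):
            if name == "sequence":
                self.sequences.append("".join(self._parts).strip())
                self._inside = False

    handler = SequenceHandler()
    xml.sax.parse("sample.xml.txt", handler)  # the sample output file
    print(len(handler.sequences), "sequence lines found")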