[Biopython-dev] XSLT and Martel output

Andrew Dalke dalke at acm.org
Fri Sep 1 03:16:10 EDT 2000


Hello,

  With some pointers from Brad I managed to get an XSLT converter
working that turns the Martel SWISS-PROT output into FASTA.  I would
have tried an XML output format, but wasn't sure which to use.

  The input was the example output file I have at
http://www.biopython.org/~dalke/Martel/BOSC2000.poster/sample.xml.txt
This has 8 records and is about 60K long.

  The XSLT engine I used is 4XSLT from Fourthought.  BTW, it was
entirely too complicated to install, esp. since there aren't any
instructions and one of the distributions seems to be missing a file
(which is in the other).  :(

  The actual XSLT text I used is

<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">

  <xsl:template match="//swissprot38_record">
    <xsl:text disable-output-escaping="yes">&gt;sp|</xsl:text>
    <xsl:value-of select="*/ac_number"/>
    <xsl:text disable-output-escaping="yes">|</xsl:text>
    <xsl:value-of select="*/entry_name"/>
    <xsl:for-each select="DE_block/DE/description">
      <xsl:text> </xsl:text>
      <xsl:value-of select="."/>
    </xsl:for-each>
    <xsl:text>&#010;</xsl:text>
    <xsl:for-each
select="sequence_block/SQ_data_block/SQ_data/sequence">
      <xsl:value-of select="translate(., ' ', '')"/>
      <xsl:text>&#010;</xsl:text>
    </xsl:for-each>
    <xsl:if test="position()!=last()">
      <xsl:text>&#010;</xsl:text>
    </xsl:if>
  </xsl:template>

</xsl:stylesheet>

Example output looks like:
====
>sp|Q43495|108_LYCES PROTEIN 108 PRECURSOR.
MASVKSSSSSSSSSFISLLLLILLVIVLQSQVIECQPQQSCTASLTGLNVCAPFLVPGSP
TASTECCNAVQSINHDCMCNTMRIAAQIPAQCNLPPLSCSAN

>sp|P18646|10KD_VIGUN 10 KD PROTEIN PRECURSOR (CLONE PSAS10).
MEKKSIAGLCFLFLVLFVAQEVVVQSEAKTCENLVDTYRGPCFTTGSCDDHCKNKEHLLS
GRCRDDVRCWCTRNC
====

It took about 3.5 seconds to load the file into the DOM and about 1.5
seconds to process it.  Since there are 80,000 records in sprot38, it
would take nearly 14 hours to convert everything.  It would take about
20 minutes to translate it using a SAX-based converter, so XSLT is
slower by a factor of about 40.
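
The estimate can be reproduced with a few lines of arithmetic.  This is
only a back-of-the-envelope sketch: it assumes the load and processing
times scale linearly with the number of records, and the variable names
are mine.

```python
# Timings quoted above for the 8-record sample file.
records_in_sample = 8
load_seconds = 3.5       # reading the file into the DOM
process_seconds = 1.5    # running the stylesheet
total_records = 80_000   # approximate size of sprot38

# Assumption: cost grows linearly with the number of records.
seconds_per_record = (load_seconds + process_seconds) / records_in_sample
xslt_seconds = seconds_per_record * total_records
xslt_hours = xslt_seconds / 3600

# Quoted estimate for a SAX-based converter over the same data.
sax_seconds = 20 * 60
slowdown = xslt_seconds / sax_seconds

print("XSLT: %.1f hours, about %.0fx slower than SAX" % (xslt_hours, slowdown))
```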

Of course, it would also require that I have enough memory, since the
DOM I'm using (4DOM, also from Fourthought) keeps everything in
RAM.
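
For comparison, here is a minimal sketch of what a streaming SAX-based
converter could look like, using Python's standard xml.sax module.  It
is not the converter timed above.  The element names come from the
stylesheet; the nesting and the shortened sequences in SAMPLE are made
up for illustration, and the sketch assumes that "description" and
"sequence" elements only occur where the stylesheet expects them.

```python
import io
import xml.sax

# Hypothetical, cut-down stand-in for the Martel output (two records).
SAMPLE = b"""<swissprot38>
<swissprot38_record>
  <ID><entry_name>108_LYCES</entry_name></ID>
  <AC><ac_number>Q43495</ac_number></AC>
  <DE_block><DE><description>PROTEIN 108 PRECURSOR.</description></DE></DE_block>
  <sequence_block><SQ_data_block><SQ_data>
    <sequence>MASVKSSSSS SSSSFISLLL</sequence>
  </SQ_data></SQ_data_block></sequence_block>
</swissprot38_record>
<swissprot38_record>
  <ID><entry_name>10KD_VIGUN</entry_name></ID>
  <AC><ac_number>P18646</ac_number></AC>
  <DE_block><DE><description>10 KD PROTEIN PRECURSOR (CLONE PSAS10).</description></DE></DE_block>
  <sequence_block><SQ_data_block><SQ_data>
    <sequence>MEKKSIAGLC FLFLVLFVAQ</sequence>
  </SQ_data></SQ_data_block></sequence_block>
</swissprot38_record>
</swissprot38>"""


class FastaHandler(xml.sax.ContentHandler):
    """Write one FASTA entry per swissprot38_record as the parse proceeds."""

    WANTED = ("ac_number", "entry_name", "description", "sequence")

    def __init__(self, out):
        super().__init__()
        self.out = out
        self.fields = None    # per-record text, keyed by element name
        self.current = None   # element currently being captured, if any
        self.text = []
        self.records = 0

    def startElement(self, name, attrs):
        if name == "swissprot38_record":
            self.fields = {tag: [] for tag in self.WANTED}
        elif name in self.WANTED and self.fields is not None:
            self.current = name
            self.text = []

    def characters(self, content):
        # Text may arrive in several chunks, so collect and join later.
        if self.current is not None:
            self.text.append(content)

    def endElement(self, name):
        if name == self.current:
            self.fields[name].append("".join(self.text))
            self.current = None
        elif name == "swissprot38_record":
            self.write_record()
            self.fields = None  # only one record is held in memory

    def write_record(self):
        f = self.fields
        if self.records:
            self.out.write("\n")  # blank line between records
        self.records += 1
        # Like the stylesheet, use the first accession number and entry name.
        header = ">sp|%s|%s" % (f["ac_number"][0], f["entry_name"][0])
        for description in f["description"]:
            header += " " + description
        self.out.write(header + "\n")
        for line in f["sequence"]:
            self.out.write(line.replace(" ", "") + "\n")


def convert(xml_bytes):
    """Return the FASTA text for a Martel-style XML document."""
    out = io.StringIO()
    xml.sax.parseString(xml_bytes, FastaHandler(out))
    return out.getvalue()


print(convert(SAMPLE), end="")
```

Because the handler forgets each record once it has been written, memory
use stays roughly constant however many records the input holds.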

There are some performance things you need to learn when using XSLT
(or at least tricks specific to this engine).  For example,
    <xsl:for-each
select="sequence_block/SQ_data_block/SQ_data/sequence">
is a lot faster (20-fold or so!) than
    <xsl:for-each select="*//sequence">

It's a good thing that FASTA doesn't mandate that all sequence lines
(excepting the last) must be 65 characters long.  The SWISS-PROT
sequence lines are 60 characters long, and I can't figure out how to
wrap them to different lengths.
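
Outside XSLT the re-wrapping is easy to do as a post-processing step.
A minimal sketch in Python, assuming the sequence lines of one record
have already been collected into a list (the function name and the
default width are my own choices):

```python
def rewrap(sequence_lines, width=65):
    """Join the sequence lines of one record and re-split at `width`."""
    # Drop any embedded whitespace, then glue the pieces together.
    residues = "".join("".join(line.split()) for line in sequence_lines)
    # Slice into fixed-width chunks; the last chunk may be shorter.
    return [residues[i:i + width] for i in range(0, len(residues), width)]


for line in rewrap(["ABCDE FGHIJ", "KLM"], width=4):
    print(line)
```

With a width of 4 the thirteen letters in the example come out as ABCD,
EFGH, IJKL and M.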


On the other hand, it *does* work, and the performance of the engines
should go up over time (e.g., translation into C usually gives about a
factor of 5-10).  Plus, in theory you should be able to make it work
with other XSLT tools.  Anyone want to try it with XT, or one of the
browsers?  (Do Mozilla or Opera support XSLT?)

Better yet, want to start playing around with the BLAST output from
Martel?  :)

			      Andrew
			      dalke at acm.org


