From chapmanb at arches.uga.edu  Sun Oct  1 10:48:31 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Martel based replacement for Fasta _Scanner
Message-ID: <200010011448.KAA83954@archa13.cc.uga.edu>

Hey all;
	As I was talking about yesterday, I went ahead and generated a
Martel-based replacement for the current _Scanner framework that Jeff
wrote for Fasta parsing. I was just interested in doing this so that I
could see how Martel based parsing could fit in with the nice
Scanner/Consumer framework that Jeff set up.

Basically the approach I took was to let Martel do the low level parsing,
and then generate the appropriate scanner events using the SAX handler
that looks at the XML generated by Martel. So basically all I did was
rewrite the _Scanner to use Martel.
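
A minimal sketch of that idea, assuming the standard xml.sax package from current Python rather than Martel itself. The XML element names (record, title, sequence) and the RecordingConsumer class are hypothetical stand-ins for the Martel output and for a real consumer; the consumer method names are likewise assumed for illustration.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class RecordingConsumer:
    """Hypothetical consumer that just records the events it receives."""

    def __init__(self):
        self.events = []

    def start_sequence(self):
        self.events.append(("start_sequence",))

    def title(self, text):
        self.events.append(("title", text))

    def sequence(self, text):
        self.events.append(("sequence", text))

    def end_sequence(self):
        self.events.append(("end_sequence",))


class EventForwarder(ContentHandler):
    """Turn SAX callbacks into scanner-style calls on a consumer."""

    def __init__(self, consumer):
        super().__init__()
        self.consumer = consumer
        self._tag = None      # element whose text is being collected
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "record":
            self.consumer.start_sequence()
        elif name in ("title", "sequence"):
            self._tag = name
            self._chunks = []

    def characters(self, content):
        if self._tag is not None:
            self._chunks.append(content)

    def endElement(self, name):
        if name == self._tag:
            # Forward the collected text to the consumer method of the same name.
            getattr(self.consumer, name)("".join(self._chunks))
            self._tag = None
        elif name == "record":
            self.consumer.end_sequence()


# Hypothetical XML, standing in for what a Martel grammar might emit.
xml_text = (b"<fasta><record><title>seq1 test</title>"
            b"<sequence>ACGT</sequence></record></fasta>")
consumer = RecordingConsumer()
xml.sax.parseString(xml_text, EventForwarder(consumer))
print(consumer.events)
```

The consumer never sees XML; it only receives the same kind of calls a hand-written scanner would make, which is what lets the scanner be swapped out underneath it.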

I attached two files to this mail which show this in action:

1. Fasta.py -> This is a replacement for Bio/Fasta/Fasta.py. It just
replaces _Scanner and adds a SAX handler class to turn the Martel XML
into Scanner events.

2. fasta_format.py -> This should be put in Bio/Fasta, and is the Martel
based regexp for reading fasta files. My regular expressions suck, so
this got pretty ugly, especially when I was trying to deal with that
annoying dos line break stuff in the test suite. I'm quite open to
suggestions for making this nicer!

This should work almost exactly the same as the _Scanner class from
before, except that it parses everything that gets fed into it (instead
of just one record from a file, as before). So all of the tests work with
the new parser, but test_Fasta will fail in the regression test because
of this different behavior.

Feedback on all of this would be very welcome!

Brad


-------------- next part --------------
A non-text attachment was scrubbed...
Name: Fasta.py
Type: application/x-unknown
Size: 11129 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001001/ff075c6e/Fasta.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: fasta_format.py
Type: application/x-unknown
Size: 1219 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001001/ff075c6e/fasta_format.bin
From chapmanb at arches.uga.edu  Sun Oct  1 19:52:25 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Parser problem with blastpgp v. 2.0.13?
In-Reply-To: <Pine.GSO.4.21_heb2.09.0010011255240.1371-300000@new-shum>
Message-ID: <200010012352.TAA24976@archa12.cc.uga.edu>

Iddo wrote:
>  I already submitted a bug report (#16). Basically, i cannot seem to
>  work the NCBIStandalone parser with the output I get. I did run it on
>  similar btXXX files, and that seemed to go well.
>
>  I am using blastpgp V 2.0.13

Hmmm... I took a quick look at this, and I think this is the problem. It
looks from the comments that Jeff has only tested this with v 2.0.10 and
v 2.0.11 so it looks like the output has changed somewhat (of course!).

I think the problem is that _scan_masterslave_alignment isn't figuring
out that it should stop reading alignments, so it is trying to convert
'....' into an integer, which obviously didn't work so hot.

The new break between rounds is a Searching.... line, instead of
the Database line that _scan_masterslave_alignment is looking for, so if
you add a check to break on finding Searching..., then the parse seems to
complete okay.

I was playing with this to look at the results, and it also looks like
the record isn't giving up the data from the multiple alignments, so I
also had a quick patch to fix this. 

Here are the patches, against CVS, that seem to make things look okay for
me. Jeff is the master of Blast, so it is up to him to approve these (or
let me know where I went wrong :-). Hope this helps.

Brad

*** NCBIStandalone.py.orig	Sun Oct  1 18:36:01 2000
--- NCBIStandalone.py	Sun Oct  1 19:47:27 2000
***************
*** 329,335 ****
	  consumer.start_alignment()
	  while 1:
	      line = safe_readline(uhandle)
!	      if line[:10] == '  Database':
		  uhandle.saveline(line)
		  break
	      elif is_blank_line(line):
--- 329,340 ----
	  consumer.start_alignment()
	  while 1:
	      line = safe_readline(uhandle)
!	      # PSIBlast 2.0.13 appears to have a Searching... line after
!	      # rounds instead of a Database line
!	      if line[:9] == 'Searching':
!		  uhandle.saveline(line)
!		  break
!	      elif line[:10] == '  Database':
		  uhandle.saveline(line)
		  break
	      elif is_blank_line(line):
***************
*** 1178,1184 ****
	  _AlignmentConsumer.end_alignment(self)
	  if self._alignment is not None:
	      self._round.alignments.append(self._alignment)
!	  elif self._multiple_alignment is not None:
	      self._round.multiple_alignment = self._multiple_alignment
  
      def end_hsp(self):
--- 1183,1189 ----
	  _AlignmentConsumer.end_alignment(self)
	  if self._alignment is not None:
	      self._round.alignments.append(self._alignment)
!	  if self._multiple_alignment is not None:
	      self._round.multiple_alignment = self._multiple_alignment
  
      def end_hsp(self):


From jchang at SMI.Stanford.EDU  Mon Oct  2 00:16:20 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Parser problem with blastpgp v. 2.0.13?
In-Reply-To: <200010012352.TAA24976@archa12.cc.uga.edu>
Message-ID: <Pine.GSO.4.21.0010012113080.11555-100000@taiyang>

Hi Iddo,

Yes, there's definitely a bug here.  The problem is that the master-slave
alignment code wasn't built with PSI-BLAST in mind.  When it encounters a
master-slave alignment, it keeps on parsing until it finds the database
section.  Unfortunately, this breaks with PSI-BLAST, since it generates
many alignments in a single run.

The fix, as Brad noted, is to allow the masterslave code to recognize
psi-blast output.  This has been checked in and will go into the next
developmental release.

Thanks for the report and the fix!

Jeff




From dalke at acm.org  Mon Oct  9 07:45:41 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Martel-0.3 available
Message-ID: <003f01c031e6$7480a180$99d6cdcf@josiah>

Martel-0.3 is finally available.  This finishes the major cleanup,
has a more SAX-like interface, fixes various problems, adds a framework
and code for parsing a record at a time, and has first draft documentation
about both the internals and a tutorial on how to write a parser.

Excepting bug fixes, this is the last version which will work with Python
1.5.2.  The next one will work with 2.0 and use its new xml package.

There are several changes in this version which are incompatible with
the 0.25 release - mostly so that the SAX names are correct (eg, now
using DocumentHandler instead of ContentHandler, which was just wrong).
There should be very few API breakages in future versions other than the
support for the new XML module and SAX 2.0 and a change/simplification
in how to access parsers for specific formats and format versions.

Martel can be found at http://www.biopython.org/~dalke/Martel/ .  Links
to the new documentation are available from that page.

                    Andrew Dalke
                    dalke@acm.org


From chapmanb at arches.uga.edu  Mon Oct  9 15:49:28 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Martel-0.3 available
In-Reply-To: <003f01c031e6$7480a180$99d6cdcf@josiah>
Message-ID: <200010091949.PAA124060@archa13.cc.uga.edu>

Andrew wrote:
>  Martel-0.3 is finally available.  

Cool! Thanks for this.

> a tutorial on how to write a parser.

Very nice. I wish my Fasta parser looked as nice as yours :-). One thing
that had me concerned about your parser, though, was the use of Str("\n")
to detect end of new lines. I was using this with lots o' luck with all
of my unix formatted files, but it didn't seem to work right for me when
I was using it on the Windows formatted (I think) files in the Fasta test
directory. I ended up having to use Martel.MaxRepeat(Martel.Re("[\s]"),
0, 2) to detect end o' lines, which seems to work properly, but is pretty
ugly looking.

Yours seemed to work okay though at detecting the end of the lines, so
I'm not positive what is going on... Hmmm, I don't know, I'll have to
look at this more, I guess. I don't really know anything at all about
line-break madness.

>  Excepting bug fixes, this is the last version which will work with
>  Python 1.5.2.  The next one will work with 2.0 and use its new xml
>  package.

Since I'm using 2.0 right now, I made the necessary changes to get it
working for me with just the xml packages. The changes I made were in
Parser.py, and are attached as Parser.diff, in case they will be of any
use to you in making these changes. BTW, pyXML-0.6.1 is out, so hopefully
now 1.5.2 with PyXML should work interchangeably with python2.0 alone.

>  mostly so that the SAX names are correct (eg, now
>  using DocumentHandler instead of ContentHandler, which was just
>  wrong).

Really? I didn't even see DocumentHandler in 2.0 -- I think that
ContentHandler is DocumentHandler (at least in 2.0), but I'm not
positive. Hard to follow all of the changes in that stuff...

Thanks again for the new version -- I'm looking forward to having some
time to play around with it more :-).

Brad
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Parser.diff
Type: application/x-unknown
Size: 9161 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001009/a2109b64/Parser.bin
From dalke at acm.org  Mon Oct  9 20:15:34 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Martel-0.3 available
Message-ID: <002e01c0324f$372a5420$46d5cdcf@josiah>

Brad:
> One thing that had me concerned about your parser, though, was the
> use of Str("\n") to detect end of new lines.  I was using this with
> lots o' luck with all of my unix formatted files, but it didn't seem
> to work right for me when I was using it on the Windows formatted (I
> think) files in the Fasta test directory. I ended up having to use 
> Martel.MaxRepeat(Martel.Re("[\s]"), 0, 2) to detect end o' lines,
> which seems to work properly, but is pretty ugly looking.

Yeah, I'm worried about that as well, but I haven't really looked at
the problem.  Dug around for a bit now.  Under MS Windows, reading a
native file (which "od -c" shows as having "\r\n"), open("test.dat").read()
only shows "\n", so it's been translated as I expect.  Using
open("test.dat", "rb").read() shows the "\r\n".

So as long as the file is read in text mode and is used on an OS with the
same line endings, then it will be fine.  However, it does mean my byte
counts will be off, depending on your viewpoint :(
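
A small sketch of the text-mode versus binary-mode difference, written for current Python 3 rather than the 1.5.2/2.0 interpreters discussed here. One assumption to keep in mind: Python 3 text mode applies universal newline translation on every platform, whereas the behavior described above depended on the operating system. The file name is only an example.

```python
import os
import tempfile

# Write DOS-style line endings as raw bytes so nothing is translated on output.
path = os.path.join(tempfile.mkdtemp(), "test.dat")
with open(path, "wb") as handle:
    handle.write(b"line one\r\nline two\r\n")

with open(path, "rb") as handle:
    raw = handle.read()    # bytes exactly as stored on disk

with open(path, "r") as handle:
    text = handle.read()   # text mode: "\r\n" is translated to "\n"

# The two reads differ by one character per line, which is why byte
# offsets computed from translated text do not match the file on disk.
print(len(raw), len(text))
```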

There might be a problem with interoperability between different OSes.
That could be addressed in one of several ways:
  1) require the input to be converted to the local line ending and provide
       no support for doing so
  2) supply some adapters ("FromMac", "FromUnix", "FromDos") but don't use
       them; instead leaving the decision up to the client code
  3) provide a tool which autodetects endings and uses the right adapter
  4) http://members.nbci.com/_XOOM/meowing/python/index.html
  5) define an  EOL = Re(r"\n|\r\n?")

I prefer 2-4, but would like to stick with 1 for now.  I don't like 5
because people will forget to use it.
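
Option 5 can be tried out with the standard re module. This is only a sketch: it assumes Martel's Re() treats this simple pattern the same way re does, and split_lines is a hypothetical helper, not part of Martel.

```python
import re

# The pattern from option 5: a bare "\n" (Unix), or "\r" optionally
# followed by "\n" (old Mac and DOS respectively).
EOL = re.compile(r"\n|\r\n?")


def split_lines(text):
    """Split text on Unix, DOS, or old Mac line endings."""
    return EOL.split(text)


# A single string mixing all three conventions.
mixed = "unix\ndos\r\nmac\rlast"
lines = split_lines(mixed)
print(lines)
```

Because "\r\n?" is greedy, a DOS ending is consumed as one terminator rather than as two empty lines.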

> I don't really know anything at all about line-break madness.

I've been a unix weenie for too long, and agree with you.

> I didn't even see DocumentHandler in 2.0 -- I think that
> ContentHandler is DocumentHandler (at least in 2.0), but I'm not
> positive. Hard to follow all of the changes in that stuff...

According to my XML book, it's Document Handler, and it works with DOM
and the other XML tools, so it's likely correct.

                    Andrew


From dalke at acm.org  Wed Oct 11 21:43:51 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Martel-0.3 available
Message-ID: <018e01c033ed$e0ef0380$d5ab323f@josiah>

Brad:
> Really? I didn't even see DocumentHandler in 2.0 -- I think that
> ContentHandler is DocumentHandler (at least in 2.0), but I'm not
> positive. Hard to follow all of the changes in that stuff...

*sigh* I figured out the problem.  DocumentHandler is the SAX 1.0
interface while ContentHandler is 2.0.  My XML book, which is less
than a year old, covers the 1.0 interface only.  SAX 2 also adds
methods for namespace support like startPrefixMapping. (huh?)

So Martel uses the old SAX API and I've got to figure out the new one.
Yippee.
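
For reference, a minimal sketch of a SAX 2 style handler using the xml.sax package as it ships with current Python, where the handler base class is named ContentHandler. The EventLogger class, the sample document, and the example.org namespace URI are made up for illustration; nothing here is Martel code.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_namespaces


class EventLogger(ContentHandler):
    """Record namespace mappings, element starts, and character data."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.text = []

    def startPrefixMapping(self, prefix, uri):
        # Called when an xmlns declaration comes into scope.
        self.events.append(("prefix", prefix, uri))

    def startElementNS(self, name, qname, attrs):
        # With namespace processing on, name is a (uri, localname) pair.
        self.events.append(("start", name))

    def characters(self, content):
        # SAX 2 passes the text itself, not (ch, start, length).
        self.text.append(content)


document = (b'<rec xmlns:bio="http://example.org/bio">'
            b"<bio:seq>ACGT</bio:seq></rec>")

handler = EventLogger()
parser = xml.sax.make_parser()
parser.setFeature(feature_namespaces, True)
parser.setContentHandler(handler)
parser.parse(io.BytesIO(document))
print(handler.events)
```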

                    Andrew


From dalke at acm.org  Thu Oct 12 03:55:37 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Martel timings
Message-ID: <000901c03421$e502b960$85ab323f@josiah>

I'm starting to compare Martel parsing with the existing biopython code.
I wrote a Martel document handler called SwissProtBuilder.py (attached)
which creates Bio.SwissProt.Record objects.

The output is comparable to the existing code, although this new code
is wrong.  (There are a couple of minor things I need to fix in the
grammar to make the parsing easy.)

The timings are also comparable.  The biopython code is about 8% slower
than the Martel code.  The Martel code takes about 25 minutes to parse
sprot38.

Because of the new RecordReader, it only needed about 4MB of memory.
I assume the biopython code is at least that good.

One of the reasons for the good performance on the Martel side is that
I'm pruning the expression tree to get rid of events which aren't handled
by the callback object.  That eliminates a lot of function call overhead.
I also turned a long if/elif chain in endElement into a dispatch table,
which saved more time because of the conversion from an O(N) lookup to
O(1).
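
The if/elif versus dispatch-table change can be sketched in isolation as follows. This is a reduced illustration in current Python with a hypothetical Handler class and only two element names; the real handler is in the attachment.

```python
class Handler:
    """Two ways of routing an element name to its end_* method."""

    def __init__(self):
        self.seen = []

    def end_entry_name(self):
        self.seen.append("entry_name")

    def end_sequence(self):
        self.seen.append("sequence")

    def end_element_chain(self, name):
        # O(N): each name is compared in turn until one matches.
        if name == "entry_name":
            self.end_entry_name()
        elif name == "sequence":
            self.end_sequence()

    def end_element_dispatch(self, name):
        # O(1) on average: one attribute lookup, whatever the number of names.
        method = getattr(self, "end_" + name, None)
        if method is not None:
            method()


names = ["entry_name", "unknown", "sequence"]

chain = Handler()
for name in names:
    chain.end_element_chain(name)

dispatch = Handler()
for name in names:
    dispatch.end_element_dispatch(name)

print(chain.seen, dispatch.seen)
```

Both routes ignore names without a matching method, so the dispatch version is a drop-in replacement for the chain.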

It turns out there is a bug in the pruning code because the RecordReader
doesn't prune its children.  It doesn't cause an error but just a slowdown,
so I didn't notice it until now.  I've included a patch with this email
which brings Martel-0.3 up to my internal development version.

                    Andrew
                    dalke@acm.org


-------------- next part --------------
"""SwissProtBuilder - create a biopython Bio.SwissProt.Record

This is a first attempt at a Martel interface to create SwissProt
records.  It is incomplete because Martel's SWISS-PROT format
definition is a bit lacking, although not enough to affect timings.

I have a test data set which is the first 200024 lines of sprot38.  It
takes this code 59.9 seconds to parse the file while the existing
biopython code takes 65.1 seconds, so about 8% faster.  There is still
some performance I can eke out of this.

All of sprot38 takes around 25 minutes to parse.  The mxTextTools
analysis takes about 10 minutes so the rest is spent in callbacks and
creation code.

"""

import string
from Bio.SwissProt import KeyWList, SProt
from xml.sax import saxlib

# These are elements whose text I want to get
capture_names = ("entry_name", "data_class_table",
                 "molecule_type", "sequence_length", "ac_number",
                 "day", "month", "year", "description", "gene_names",
                 "organism_species", "organelle",
                 "organism_classification", "reference_number",
                 "reference_position", "reference_comment",
                 "bibliographic_database_name",
                 "bibliographic_identifier", "reference_author",
                 "reference_title", "reference_location",
                 "comment_text", "database_identifier",
                 "primary_identifier", "secondary_identifier",
                 "status_identifier", "keyword", "ft_name", "ft_from",
                 "ft_to", "ft_description", "molecular_weight",
                 "crc32", "sequence", )

# These are all of the element events I'm interested in
select_names = capture_names + \
               ("swissprot38_record", "DT_created", "DT_seq_update",
               "DT_ann_update", "reference", "feature", "ID",
               "reference", "DR", "comment")
                                

class SwissProtBuilder(saxlib.DocumentHandler):
    def __init__(self):
        self.records = []
        self.capture = 0

    def startElement(self, name, attrs):
        # Arranged in order of most used to least
        if name in capture_names:
            self.capture = 1
            self.text = ""
        elif name == "reference":
            self.reference = SProt.Reference()
        elif name == "feature":
            self.ft_desc = ""
        elif name == "comment":
            self.comment = ""
        elif name == "swissprot38_record":
            self.record = SProt.Record()
        elif name == "DT_created":
            self.in_date = "created"
            self.date = []
        elif name == "DT_seq_update":
            self.in_date = "sequence_update"
            self.date = []
        elif name == "DT_ann_update":
            self.in_date = "annotation_update"
            self.date = []
        
    def characters(self, ch, start, length):
        if self.capture:
            self.text = self.text + ch[start:start+length]
            
    def endElement(self, name):
        # Doing the dispatch like this instead of a chain of if/elif
        # statements saved me about 15% because the lookup time goes
        # from O(N) to O(1)
        f = getattr(self, "end_" + name, None)
        if f is not None:
            f()
        
        if self.capture:
            del self.text
            self.capture = 0
            
    def end_swissprot38_record(self):
        self.record.sequence = string.replace(self.record.sequence,
                                              " ", "")
        # Delete for now since I'm just doing timings
        #self.records.append(self.record)
        #print self.record
        del self.record
        
    def end_entry_name(self):
        self.record.entry_name = self.text
    def end_data_class_table(self):
        self.record.data_class = self.text
    def end_molecule_type(self):
        self.record.molecule_type = self.text
    def end_sequence_length(self):
        # Used in both the ID and the SQ lines
        self.seq_len = int(self.text)
    def end_ID(self):
        self.record.sequence_length = self.seq_len
    def end_ac_number(self):
        self.record.accessions.append(self.text)
    def end_day(self):
        self.date.append(self.text)
    def end_month(self):
        self.date.append(self.text)
    def end_year(self):
        self.date.append(self.text)
        setattr(self.record, self.in_date, "%s-%s-%s" % tuple(self.date))
        
    def end_description(self):
        if self.record.description == "":
            self.record.description = self.text
        else:
            self.record.description = self.record.description + self.text
    def end_gene_names(self):
        # XXX parser isn't correct
        self.record.gene_name = self.text
    def end_organism_species(self):
        # XXX parser isn't correct
        self.record.organism = self.text
    def end_organelle(self):
        # XXX parser isn't correct
        self.record.organelle = self.text
    def end_organism_classification(self):
        # XXX parser isn't correct
        self.record.organism_classification.extend(\
                string.split(self.text[:-1], "; "))

    def end_reference(self):
        self.record.references.append(self.reference)
        del self.reference
    def end_reference_number(self):
        self.reference.number = int(self.text)
    def end_reference_position(self):
        # XXX Why is this a list?
        self.reference.positions.append(self.text)
    def end_reference_comment(self):
        # XXX needs to be list of (token, text)
        self.reference.comments.append(self.text)
    def end_bibliographic_database_name(self):
        self.bib_db_name = self.text
    def end_bibliographic_identifier(self):
        self.reference.references.append( (self.bib_db_name, self.text) )
    def end_reference_author(self):
        if self.reference.authors:
            self.reference.authors = self.reference.authors + " " + self.text
        else:
            self.reference.authors = self.text
    def end_reference_title(self):
        if self.reference.title:
            self.reference.title = self.reference.title + " " + self.text
        else:
            self.reference.title = self.text
    def end_reference_location(self):
        if self.reference.location:
            self.reference.location = self.reference.location + " " + self.text
        else:
            self.reference.location = self.text
    def end_comment_text(self):
        if self.comment:
            self.comment = self.comment + " " + self.text
        else:
            self.comment = self.text
    def end_comment(self):
        self.record.comments.append(self.comment)
    def end_database_identifier(self):
        self.db_id = self.text
    def end_primary_identifier(self):
        self.ids = [self.text]
    def end_secondary_identifier(self):
        self.ids.append(self.text)
    def end_status_identifier(self):
        self.ids.append(self.text)
    def end_DR(self):
        self.record.cross_references.append( (self.db_id,) + tuple(self.ids))
    def end_keyword(self):
        # XXX parser isn't correct
        kw = string.split(self.text[:-1], "; ")
        self.record.keywords.extend(kw)
    def end_feature(self):
        self.record.features.append( (self.ft_name, self.ft_from,
                                      self.ft_to, self.ft_desc) )
    def end_ft_name(self):
        self.ft_name = string.rstrip(self.text)
    def end_ft_from(self):
        self.ft_from = string.lstrip(self.text)  # Jeff first tries int ...
    def end_ft_to(self):
        self.ft_to = string.lstrip(self.text)  # Jeff first tries int ...
    def end_ft_description(self):
        if self.ft_desc:
            self.ft_desc = self.ft_desc + " " + self.text
        else:
            self.ft_desc = self.text
    def end_molecular_weight(self):
        self.mw = int(self.text)
    def end_crc32(self):
        self.record.seqinfo = (self.seq_len, self.mw, self.text)
    def end_sequence(self):
        # Strip out spaces in end_swissprot38_record
        self.record.sequence = self.record.sequence + self.text


def test():
    from Martel.formats import swissprot38
    from xml.sax import saxutils
    import Martel
    import time
    t1 = time.time()

    # Send only the events which the callback will use
    # (saves another 32% of performance, after doing the if/elif speedup)
    format = Martel.select_names(swissprot38.format, select_names)

    parser = format.make_parser()
    dh = SwissProtBuilder()
    parser.setDocumentHandler(dh)
    eh = saxutils.ErrorRaiser()
    parser.setErrorHandler(eh)
    
    #infile = open("/home/dalke/src/Martel/examples/sample.swissprot")
    #infile = open("/home/dalke/ftps/swissprot/sprot38.dat")
    infile = open("/home/dalke/ftps/swissprot/smaller_sprot38.dat")

    t2 = time.time()
    parser.parseFile(infile)
    t3 = time.time()
    print "startup", t2-t1
    print "eval", t3-t2
    

if __name__ == "__main__":
    test()

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Martel-0.3.patch
Type: application/octet-stream
Size: 1053 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001012/a846b25c/Martel-0.3.obj
From chapmanb at arches.uga.edu  Thu Oct 12 14:10:36 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Changes in WWW BLAST format
Message-ID: <200010121810.OAA84314@archa12.cc.uga.edu>

Hey all;
	I was working on the BLAST documentation, and while whipping up
an example for WWWBlast, noticed that it seems like the html output has
changed (again). It looks like the culprits are a new <p> tag, and a
</PRE> tag in place of a blank line. The output version is 2.1.1, I
believe.

The diff at the end of this seems to fix things for me, but of course I
wanted to run it by the master o' blast first before committing. If you
can't reproduce the parsing problem, please let me know and I'll send a
script that demonstrates it. 

Yours in blasting,
Brad

*** NCBIWWW.py.orig	Sat Aug 12 16:23:24 2000
--- NCBIWWW.py	Tue Oct 10 21:15:06 2000
***************
*** 148,153 ****
--- 148,156 ----
	  # Read the RID line, for version 2.0.12 (2.0.11?) and above.
	  attempt_read_and_call(uhandle, consumer.noevent, start='RID')
  
+	  # 2.1.1 seems to have another <p> here
+	  attempt_read_and_call(uhandle, consumer.noevent, start='<p>')
+ 
	  # Read the Query lines and the following blank line.
	  read_and_call(uhandle, consumer.query_info, contains='Query=')
	  read_and_call_until(uhandle, consumer.query_info, blank=1)
***************
*** 204,211 ****
	  read_and_call(uhandle, consumer.noevent, blank=1)
  
	  # Read the descriptions and the following blank line.
!	  read_and_call_until(uhandle, consumer.description, blank=1)
!	  read_and_call_while(uhandle, consumer.noevent, blank=1)
  
	  consumer.end_descriptions()
  
--- 207,214 ----
	  read_and_call(uhandle, consumer.noevent, blank=1)
  
	  # Read the descriptions and the following blank line.
!	  read_and_call_until(uhandle, consumer.description, contains = '</PRE>')
!	  read_and_call_while(uhandle, consumer.noevent, contains = '</PRE>')
  
	  consumer.end_descriptions()

From chapmanb at arches.uga.edu  Thu Oct 12 14:27:24 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
Message-ID: <200010121827.OAA33200@archa10.cc.uga.edu>

Hello again;
	More blast stuff from me -- can you tell I've had to parse a lot
of BLAST reports recently? :-).

	Anyways, I've been using the standalone BLAST parser to parse
some big ol' BLAST runs that I'm doing, and I noticed that occasionally
blastall will report an error while running. This is a pretty uninformative
error, and will generally say something about being unable to
calculate parameters during the BLAST. Well, I investigated further and
found out that BLAST quits trying to run a search when it gets to a junk
sequence like this:

>gi|9854647|gb|BE599574.1|BE599574 PI1_77_C09.g1_A002 Pathogen induced 1
(PI1) Sorghum bicolor cDNA, mRNA sequence
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAA

Right, so this is useless junk sequence and BLAST is right to bomb out on
it. 

The report that BLAST generates on something like this is attached.
Basically, the problem is that this is a truncated report, missing all of the
statistics at the end. This causes the parser to run out of lines without
finding the statistics it is looking for and generate a SyntaxError.

What I'd like to propose is that the parser generate a new exception for
these kinds of reports, an NCBIStandalone.BlastError exception, indicating
a failure in Blast, not in the parser.

The reason I want to do this is that I would like to rig the exception up
to return the query that failed in this way, so that I can easily send
some messages to the owners of these sequences, asking them to kindly
remove the sequence from GenBank.

Anyways, attached is a patch (NCBIStandalone.diff) that implements this
type of exception-raising behavior for the BlastParser, which allows you
to parse like this:

try:
     b_record = iterator.next()
except NCBIStandalone.BlastError, info:
     print 'Got a blast error on query', info[1]

Do people think this is a good idea and something that can get into the
standalone parser? Comments are very welcome!
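
A rough sketch of how such an exception could carry the failing query, written in current Python. Everything here is an assumption for illustration: the parse_report function and the STATS_MARKER check are hypothetical stand-ins and are not the code in the attached NCBIStandalone.diff.

```python
class BlastError(Exception):
    """Hypothetical: signals that BLAST itself failed, not the parser."""

    def __init__(self, message, query):
        # Keep (message, query) in args, and expose the query by name too.
        super().__init__(message, query)
        self.query = query


# Assumption: a complete report has a statistics section we can spot
# by some marker line; the marker used here is only illustrative.
STATS_MARKER = "Lambda"


def parse_report(lines, query):
    """Stand-in for the parser: reject reports with no statistics."""
    if not any(line.startswith(STATS_MARKER) for line in lines):
        raise BlastError("truncated report, BLAST probably failed", query)
    return {"query": query}


failed_queries = []
reports = [
    ("good_seq", ["BLASTN", "Query= good_seq", "Lambda     K      H"]),
    ("junk_seq", ["BLASTN", "Query= junk_seq"]),
]
for query, lines in reports:
    try:
        parse_report(lines, query)
    except BlastError as err:
        # The caller can collect failing queries instead of stopping.
        failed_queries.append(err.query)

print(failed_queries)
```

A dedicated exception type lets a long batch run distinguish "BLAST gave up on this query" from a genuine parser bug, and keep going in the first case.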

Brad
-------------- next part --------------
A non-text attachment was scrubbed...
Name: problem.blast
Type: application/x-unknown
Size: 834 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001012/60782c48/problem.bin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: NCBIStandalone.diff
Type: application/x-unknown
Size: 2024 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001012/60782c48/NCBIStandalone.bin
From chapmanb at arches.uga.edu  Thu Oct 12 20:38:05 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Martel-0.3 available
In-Reply-To: <002e01c0324f$372a5420$46d5cdcf@josiah>
Message-ID: <200010130038.UAA49974@archa10.cc.uga.edu>

Andrew wrote:
[my worries about different types of line breaks]
>  There might be a problem with interoperability between different
>  OSes.
>  That could be addressed in one of several ways:
>    1) require the input to be converted to the local line ending and
>  provide no support for doing so
>    2) supply some adapters ("FromMac", "FromUnix", "FromDos") but don't
>  use them; instead leaving the decision up to the client code
>    3) provide a tool which autodetects endings and uses the right
>  adapter
>    4) http://members.nbci.com/_XOOM/meowing/python/index.html
>    5) define an EOL = Re(r"\n|\r\n?")
>  
>  I prefer 2-4, but would like to stick with 1 for now.  I don't like 5
>  because people will forget to use it.

Hmmm, I don't know, I think I like 5 best of all of these options. There
is definitely the problem of people forgetting, as you mention, but it
does have a number of bonuses:

1. Easy to implement, and isn't very likely to break :-).

2. Provided the regexp would recognize Mac line breaks (hmmm, I'm not
positive what those look like) then this could deal with files with
multiple different types of line breaks without whining. There are times
where people have generated files like this in my lab (the sequencer is
running Windows, but they like to play around on the files on a Mac -- I
still don't know how they got a mix of line breaks -- I think by cutting
and pasting between files with different line breaks). Anyways, the point
is that the regexp can deal with "worst case" scenarios, whereas the
other options can bomb out.

Anyways, that is why I am for 5, especially as a short-term solution over
1. 
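
A minimal sketch of option 5, assuming Python's standard re module in
place of Martel's Re wrapper (the sample text is made up). A classic Mac
line break is a bare "\r", which the pattern already covers through the
optional "\n" after "\r":

```python
import re

# Option 5 from the list above: one pattern for Unix, DOS and Mac endings.
EOL = re.compile(r"\n|\r\n?")

# A hypothetical file with all three kinds of line break mixed together.
mixed = "unix line\ndos line\r\nmac line\rlast line"

# Splitting on EOL recovers the logical lines whatever the ending was.
lines = EOL.split(mixed)
print(lines)

# The separators that were actually matched, in order.
print(EOL.findall(mixed))
```

The "\r\n?" alternative consumes a DOS ending as one break instead of
two, so mixed files do not produce spurious empty lines.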

Brad


From chapmanb at arches.uga.edu  Thu Oct 12 20:51:45 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Martel timings
In-Reply-To: <000901c03421$e502b960$85ab323f@josiah>
Message-ID: <200010130051.UAA134710@archa13.cc.uga.edu>

Andrew wrote:
>  I'm starting to compare Martel parsing with the existing biopython
>  code. I wrote a Martel document handler called SwissProtBuilder.py
>  (attached) which creates Bio.SwissProt.Record objects.
>  
>  The biopython code is about 8%
>  slower than the Martel code.  The Martel code takes about 25 minutes
>  to parse sprot38.

Cool! It is great to hear they are both comparable in terms of times. I'm
definitely not a speed freak myself, but it is very nice to have a slight
speed improvement (and at least not a speed decrease) on switching over
to Martel based parser stuff.

>  Because of the new RecordReader, it only needed about 4MB of memory.
>  I assume the biopython code is at least that good.

Hmmm, one side note about RecordReader. I really like the way you can
interface with the parsers in multiple ways in the current Biopython
parsing. I think it is really useful to be able to iterate over a record
and get the record back, instead of automatically having to parse it (I
find this useful for pulling a "bad" record out of a big file of
records). 

Do you think there is a way to make the RecordReader act similar to the
Iterators in this regard? Right now, the fact that it is reading things
one record at a time is kind of hidden inside the parse, and I'm not
exactly positive how you can make the record reader just return the raw
info making up the record that is being parsed.

BTW, I like the StartsWith, EndsWith in the new RecordReader! When I was
doing the FASTA stuff I couldn't figure out any way to recognize new
files with only the EndsWith behavior :-).
  
>  One of the reasons for the good performance on the Martel side is that
>  I'm pruning the expression tree to get rid of events which aren't
>  handled by the callback object.  That eliminates a lot of function
>  call overhead.

Very cool idea to reduce the size of the XML generated and returned.
Nifty stuff!

Brad

From dalke at acm.org  Fri Oct 13 01:54:46 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Martel timings
Message-ID: <01fc01c034da$1837d140$85ab323f@josiah>

Brad:
> Hmmm, one side note about RecordReader. I really like the way you can
> interface with the parsers in multiple ways in the current Biopython
> parsing. I think it is really useful to be able to iterate over a record
> and get the record back, instead of automatically having to parse it

I've got experimental code for that at
http://www.biopython.org/~dalke/SaxRecords.py

It uses a new thread for the callback object and a Queue to send parsed
records back to the iterator interface in the originating thread.
Currently it leaks memory if the thread doesn't go to completion because
the new thread is sitting there waiting for the queue to empty.

> (I find this useful for pulling a "bad" record out of a big file of
> records). 

That's a bit different topic.   Currently all errors are "fatalError"s,
which under the SAX spec means the parser must stop.  However, SAX
also supports "error"s, which are recoverable.  (Of course, the error
handler can raise an exception, which causes a dead stop in the parser.)

Huh, there are some bugs in the record parser code:

            elif isinstance(result, saxlib.SAXException):
                # Wrong format
                self.err_handler.fatalError(result)
                return
            else:
                # did not reach end of string
                pos = filepos + result
                self.err_handler.fatalError(StateTableEOFException(pos))

That last branch should do a "return" to meet the spec, and as I learned
yesterday, both need to send an "endDocument" event after the fatalError.
And I do need to fix the following to give some sort of error event.

            record = reader.next()  # XXX what if an exception is raised?
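
A rough sketch of the corrected control flow, using the standard
xml.sax handler classes. The toy parse_records function, the "//" check
and both handler classes are assumptions for illustration, not Martel's
actual code:

```python
from xml.sax import SAXException
from xml.sax.handler import ContentHandler, ErrorHandler


class CollectingErrorHandler(ErrorHandler):
    """Stores fatal errors instead of raising them."""

    def __init__(self):
        self.fatal = []

    def fatalError(self, exception):
        self.fatal.append(exception)


class EventLog(ContentHandler):
    """Remembers which document events were delivered."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def characters(self, content):
        self.events.append(("characters", content))


def parse_records(records, cont_handler, err_handler):
    """Toy parser: a record counts as valid only if it ends with '//'."""
    cont_handler.startDocument()
    for record in records:
        if not record.endswith("//"):
            err_handler.fatalError(SAXException("bad record: %r" % record))
            cont_handler.endDocument()  # still close the document
            return                      # a fatal error stops the parse
        cont_handler.characters(record)
    cont_handler.endDocument()


log = EventLog()
errors = CollectingErrorHandler()
parse_records(["a//", "b", "c//"], log, errors)
print(log.events)
print(len(errors.fatal))
```

Nothing after the bad record is delivered, but the content handler still
sees a matching endDocument.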

> Do you think there is a way to make the RecordReader act similar to
> the Iterators in this regard?

So yes.  Convert the "fatalError" events to "error" and do recovery by
skipping to the next record.  Then have the SaxRecords code, which does the
Iterator-like interface, return the right information for problematical
records.

Umm, what does the Iterator do for bad records?  It looks like it raises
an exception, but allows you to call next() to get the next record?
That's reasonable to me (since I think I can support it :)
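
The behaviour described here (raise on a bad record, then carry on with
the next call) can be sketched as below. The class is a hypothetical
stand-in, not Biopython's Iterator, and int() plays the role of the
parser. Note that a plain generator would not do, since a generator is
finished once an exception escapes from it:

```python
class RecordIterator:
    """Raises on a bad record but stays usable for the following ones."""

    def __init__(self, raw_records, parse):
        self._raw_records = iter(raw_records)
        self._parse = parse

    def __iter__(self):
        return self

    def __next__(self):
        raw = next(self._raw_records)  # StopIteration ends the iteration
        # The position has already advanced, so an exception raised here
        # does not stop the caller from asking for the next record.
        return self._parse(raw)


records = RecordIterator(["1", "oops", "3"], int)
results = []
while True:
    try:
        results.append(next(records))
    except ValueError:
        results.append("bad record")
    except StopIteration:
        break
print(results)
```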

I'll work on it; unless you want to do it?

> BTW, I like the StartsWith, EndsWith in the new RecordReader! When I was
> doing the FASTA stuff I couldn't figure out any way to recognize new
> files with only the EndsWith behavior :-).

Thanks!  If you didn't notice, it also plays some tricks to read ahead many
lines, which should give better overall performance.  The File.UndoHandle
isn't as tricky but has better guarantees of where it is in the file and
it allows undos, which Martel doesn't need.  I bet changing the code to
read ahead multiple lines would speed up the existing biopython code.

>>  pruning the expression tree 

> reduce the size of the XML generated and returned.

Good point - I hadn't even thought about how it would affect XML output.  I
was more concerned about reducing function call overhead.

                    Andrew
                    dalke@acm.org


From jchang at SMI.Stanford.EDU  Fri Oct 13 02:16:28 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Changes in WWW BLAST format
In-Reply-To: <200010121810.OAA84314@archa12.cc.uga.edu>
Message-ID: <Pine.GSO.4.21.0010122315220.16928-100000@taiyang>

Great!  Thanks for catching this.  Could you send the output?  I'd like to
add it to the suite of blast tests.

> ***************
> *** 204,211 ****
> 	  read_and_call(uhandle, consumer.noevent, blank=1)
>   
> 	  # Read the descriptions and the following blank line.
> !	  read_and_call_until(uhandle, consumer.description, blank=1)
> !	  read_and_call_while(uhandle, consumer.noevent, blank=1)
>   
> 	  consumer.end_descriptions()
>   
> --- 207,214 ----
> 	  read_and_call(uhandle, consumer.noevent, blank=1)
>   
> 	  # Read the descriptions and the following blank line.
> !	  read_and_call_until(uhandle, consumer.description, contains = '</PRE>')
> !	  read_and_call_while(uhandle, consumer.noevent, contains = '</PRE>')
>   
> 	  consumer.end_descriptions()

I don't know for sure, but wouldn't this break compatibility with older
formats?

Jeff


From chapmanb at arches.uga.edu  Fri Oct 13 11:58:45 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Changes in WWW BLAST format
In-Reply-To: <Pine.GSO.4.21.0010122315220.16928-100000@taiyang>
Message-ID: <200010131558.LAA73594@archa11.cc.uga.edu>

Jeff:
>  Could you send the output?  I'd like to add it to the suite of blast
>  tests.

Surely, it's attached. There is also an example in Doc/examples
(www_blast.py) which will give it to you as well (I found the format change
when writing that example).

[patch with no respect for back-compatibility]
Jeff:
>  I don't know for sure, but wouldn't this break compatibility with
>  older formats?

*slaps self in forehead* Doh! Respect for old formats. Sorry, I guess I
was just too excited that I actually figured out how to fix the blast
parser :-).

Here's a better patch which supports the new change, and passes all of
the tests with the old formats. Let me know if there are any better ways
to do things. Thanks for catching my mistake. Now I remember why you are
the master of blast (not that I really needed any reminding :-).

Brad


*** NCBIWWW.py.orig	Sat Aug 12 16:23:24 2000
--- NCBIWWW.py	Fri Oct 13 11:47:12 2000
***************
*** 149,154 ****
--- 149,155 ----
	  attempt_read_and_call(uhandle, consumer.noevent, start='RID')
  
	  # Read the Query lines and the following blank line.
+	  read_and_call_until(uhandle, consumer.noevent, contains='Query=')
	  read_and_call(uhandle, consumer.query_info, contains='Query=')
	  read_and_call_until(uhandle, consumer.query_info, blank=1)
	  read_and_call_while(uhandle, consumer.noevent, blank=1)
***************
*** 203,212 ****
			start='Sequences producing')
	  read_and_call(uhandle, consumer.noevent, blank=1)
  
!	  # Read the descriptions and the following blank line.
!	  read_and_call_until(uhandle, consumer.description, blank=1)
!	  read_and_call_while(uhandle, consumer.noevent, blank=1)
! 
	  consumer.end_descriptions()
  
      def _scan_alignments(self, uhandle, consumer):
--- 204,220 ----
			start='Sequences producing')
	  read_and_call(uhandle, consumer.noevent, blank=1)
  
!	  # Read the descriptions
!	  read_and_call_while(uhandle, consumer.description,
!			      blank = 0, contains = '<a')
! 
!	  # two choices here, either blank lines or a </PRE>
!	  if attempt_read_and_call(uhandle, consumer.noevent, blank = 1):
!	      read_and_call_while(uhandle, consumer.noevent, blank = 1)
!	  # otherwise we've got a </PRE> (introduced in 2.1.1)
!	  else:
!	      read_and_call_while(uhandle, consumer.noevent, contains = '</PRE>')
!	  
	  consumer.end_descriptions()
  
      def _scan_alignments(self, uhandle, consumer):
-------------- next part --------------
A non-text attachment was scrubbed...
Name: m_cold_blast.out
Type: application/x-unknown
Size: 19423 bytes
Desc: not available
Url : http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001013/2e58541a/m_cold_blast.bin
From katel at worldpath.net  Sat Oct 14 04:08:42 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Gui
References: <01fc01c034da$1837d140$85ab323f@josiah>
Message-ID: <009601c035b5$f8976d00$010a0a0a@cadence.com>

  I just integrated SeqGui.py with Translate.py and Transcribe.py.  To
support the testing, I also wrote new unit tests for Transcribe.py in
TranscribeTestCase.py.

                                                   Cayte


From katel at worldpath.net  Sat Oct 14 21:57:53 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] clustalw question
Message-ID: <003201c0364b$55dbf720$c3dc85d0@g0fjl>

 clustal_format.py only allows asterisks and spaces in the last line of an alignment.  I just ran an alignment from:

http://www2.ebi.ac.uk/clustalw/


The equivalent line contained colons and periods, too.

The regexp is 

match_stars = Martel.Group("match_stars",
                           Martel.Re("[ \*]+") +
                           Martel.Opt(Martel.Str("\n")))

I'll send the output if you like.

                                              Cayte
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001014/e6822020/attachment.htm
From chapmanb at arches.uga.edu  Sun Oct 15 03:35:06 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] clustalw question
In-Reply-To: <003201c0364b$55dbf720$c3dc85d0@g0fjl>
Message-ID: <200010150737.DAA129956@archa13.cc.uga.edu>

Cayte wrote:
>  clustal_format.py only allows asterisks and spaces in the last line
>  of an alignment.  I just ran an alignment from:
>  
>  http://www2.ebi.ac.uk/clustalw/
>  
>  The equivalent line contained colons and periods, too.

Thanks for trying it out, and thanks for the catch! I'll happily fix it
to accept this output.

>  The regexp is 
>  
>  match_stars = Martel.Group("match_stars",
>			   Martel.Re("[ \*]+") +
>			   Martel.Opt(Martel.Str("\n")))

So, for a quick fix, you can change the second line to:

Martel.Re("[ :\*\.]+")
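
A quick check of the old and new character classes, using Python's re
module directly instead of Martel.Re. The sample conservation line is
made up:

```python
import re

old_pattern = re.compile(r"[ \*]+")
new_pattern = re.compile(r"[ :\*\.]+")

# Hypothetical match line of the kind printed under a protein alignment.
match_line = "   **:*. *:  **"

# The old class stops at the first colon, so the whole line cannot match.
print(old_pattern.fullmatch(match_line))

# The widened class accepts stars, colons, periods and spaces.
print(new_pattern.fullmatch(match_line) is not None)
```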

>  I'll send the output if you like.

Please do, and I'll add it to the test suite and fix the parser. I just
poked around a bit to see what that line actually means, and stars are
identical residues, colons are conserved substitutions and periods are
semi-conserved substitutions. Neat! I never saw these since I have been
using Clustalw to align nucleic acids and not proteins.

Thanks again for catching this!

Brad


From katel at worldpath.net  Sun Oct 15 15:44:19 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] removing boiler plate
Message-ID: <007c01c036e0$50645220$98dc85d0@g0fjl>

  In using Martel, how do we strip boiler plate that may vary from site to site?  Things like user instructions, legends for graphics, etc.

                       Cayte
-------------- next part --------------
An HTML attachment was scrubbed...
URL: http://portal.open-bio.org/pipermail/biopython-dev/attachments/20001015/24915ca0/attachment.htm
From dalke at acm.org  Sun Oct 15 22:52:28 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] removing boiler plate
Message-ID: <000a01c0371c$36ddee60$a4ab323f@josiah>

Cayte:
>  In using Martel, how do we strip boiler plate that may vary from site to
>  site?  Things like user instructions, legends for graphics, etc.

That's going to depend on the boiler plate.  For example, suppose there's
an arbitrary amount of header text which is site specific, followed by
the site independent text.  Suppose also that the transition occurs with
a line containing 5 =s ("=====").

You can use Re(".*\n") to grab all of the header lines, but this will also
grab the "=====\n" line.  Instead, use a negative lookahead assertion to
match all lines except the =s line, as in  Re("(?!=====).*\n").  Of course,
you'll want to get all of those lines, so

header = Rep(Re("(?!=====).*\n"))

The re documentation covers both positive and negative lookaheads.
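
The same idea with the plain re module, as a small sketch. The site text
is invented, and the Rep(...) repetition is spelled here as a
non-capturing group with "*":

```python
import re

# Any number of lines that do not start with the "=====" separator.
header = re.compile(r"(?:(?!=====).*\n)*")

text = (
    "Welcome to the Example site\n"
    "Please cite us if you use these results\n"
    "=====\n"
    "ID   record one\n"
)

m = header.match(text)
site_header = m.group(0)  # the site specific boiler plate
rest = text[m.end():]     # starts at the "=====" line

print(repr(site_header))
print(repr(rest))
```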

                    Andrew


From jchang at SMI.Stanford.EDU  Thu Oct 19 19:44:01 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:52 2005
Subject: [Biopython-dev] Changes in WWW BLAST format
In-Reply-To: <200010131558.LAA73594@archa11.cc.uga.edu>
Message-ID: <Pine.GSO.4.21.0010191641300.24821-100000@taiyang>

Thanks for the update!  I've incorporated your suggested changes.  Please
let me know if this works out.

Jeff




From jchang at SMI.Stanford.EDU  Thu Oct 19 19:57:40 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
In-Reply-To: <200010121827.OAA33200@archa10.cc.uga.edu>
Message-ID: <Pine.GSO.4.21.0010191645470.24821-100000@taiyang>

I'm not sure what's going on, but it looks like BLAST may be masking out
low-complexity regions and ending up with little or nothing to search
with.  Unfortunately, there's nothing in the output that clearly tells us
what's going on.  For example, it'd be nice if there were a message
explaining why the parameters are missing.

Although something's clearly wrong here, I'm hesitant to try and diagnose
the error within the parser.  I don't know what's a real syntax error and
what's a BLAST error.

However, perhaps we can push the error detection higher up.  Possible 
solutions might be:
1) develop a Parser that could catch a SyntaxError, do some diagnostics
on the Record, and then raise a BlastError
2) make the parameters section optional in the Scanner, and then let the
user either check the Record, or adapt the Consumer to check
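
Option 1 might look roughly like the sketch below. Everything in it is
an assumption for illustration: the class names, the strict_parse
stand-in for the real Scanner/Consumer pair, and the diagnostic itself
(a report with a query line but no "Lambda" statistics line is taken to
be a failed BLAST run):

```python
class BlastError(Exception):
    """BLAST itself failed; the report is incomplete, not malformed."""


class DiagnosingParser:
    """Wraps a strict parser and reclassifies some of its SyntaxErrors."""

    def __init__(self, strict_parse):
        self._strict_parse = strict_parse

    def parse(self, report_text):
        try:
            return self._strict_parse(report_text)
        except SyntaxError:
            # Hypothetical diagnostic, kept outside the strict parser.
            if "Query=" in report_text and "Lambda" not in report_text:
                raise BlastError("BLAST produced a truncated report") from None
            raise  # anything else is still a real syntax error


def strict_parse(report_text):
    """Toy strict parser: needs a query line and a statistics line."""
    if "Query=" not in report_text:
        raise SyntaxError("query line not found")
    if "Lambda" not in report_text:
        raise SyntaxError("statistics section not found")
    return report_text.splitlines()


parser = DiagnosingParser(strict_parse)
outcomes = []
for report in ("Query= seq1\nLambda K H\n", "Query= seq2\n", "garbage\n"):
    try:
        parser.parse(report)
        outcomes.append("ok")
    except BlastError:
        outcomes.append("blast error")
    except SyntaxError:
        outcomes.append("syntax error")
print(outcomes)
```

The strict parser stays untouched, and only the wrapper decides what
counts as a BLAST failure.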

Would either of these be helpful?  Or something else?

Jeff


On 12 Oct 2000, Brad Chapman wrote:

> Hello again;
> 	More blast stuff from me -- can you tell I've had to parse a lot
> of BLAST reports recently? :-).
> 
> 	Anyways, I've been using the standalone BLAST parser to parse
> some big ol' BLAST runs that I'm doing, and I noticed that occasionally
> blastall will report an error while running. This is a pretty uninformative
> error, and will generally say something about being unable to
> calculate parameters during the BLAST. Well, I investigated further and
> found out that BLAST quits trying to run a search when it gets to a junk
> sequence like this:
> 
> >gi|9854647|gb|BE599574.1|BE599574 PI1_77_C09.g1_A002 Pathogen induced 1
> (PI1) Sorghum bicolor cDNA, mRNA sequence
> TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTGTTTTTTTTTTTTTTTTTTTTTTTT
> TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT
> TTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTAAA
> 
> Right, so this is useless junk sequence and BLAST is right to bomb out on
> it. 
> 
> The report that BLAST generates on something like this is attached.
> Basically, the problem is a truncated report missing all of the
> statistics at the end. This causes the parser to run out of lines without
> finding the statistics it is looking for and generate a SyntaxError.
> 
> What I'd like to propose is that the parser generate a new exception for
> these kinds of reports, an NCBIStandalone.BlastError exception, indicating
> a failure in Blast, not in the parser.
> 
> The reason I want to do this is that I would like to rig the exception up
> to return the query that failed in this way, so that I can easily send
> some messages to the owners of these sequences, asking them to kindly
> remove the sequence from GenBank.
> 
> Anyways, attached is a patch (NCBIStandalone.diff) that implements this
> type of exception-raising behavior for the BlastParser, which allows you
> to parse like this:
> 
> try:
>      b_record = iterator.next()
> except NCBIStandalone.BlastError, info:
>      print 'Got a blast error on query', info[1]
> 
> Do people think this is a good idea and something that can get into the
> standalone parser? Comments are very welcome!
> 
> Brad


From chapmanb at arches.uga.edu  Sun Oct 29 11:09:28 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
In-Reply-To: <Pine.GSO.4.21.0010191645470.24821-100000@taiyang>
References: <200010121827.OAA33200@archa10.cc.uga.edu>
	<Pine.GSO.4.21.0010191645470.24821-100000@taiyang>
Message-ID: <14844.19384.149965.283975@taxus.athen1.ga.home.com>

[I was having problems with BLAST failing on really low-quality 
sequences from GenBank]

Jeff:
> I'm not sure what's going on, but it looks like BLAST may be masking out
> low-complexity regions and ending up with little or nothing to search
> with.  Unfortunately, there's nothing in the output that clearly tells us
> what's going on.  For example, it'd be nice if there were a message
> explaining why the parameters are missing.

Agreed. The BLAST report doesn't look like there is really any problem 
(it just looks like it didn't find any hits). There are error messages 
in the xterm when you are running it from the command line, but they
aren't very helpful either, since they don't have any info about which 
sequences are failing.
 
> Although something's clearly wrong here, I'm hesitant to try and diagnose
> the error within the parser.  I don't know what's a real syntax error and
> what's a BLAST error.

This is a very good point. We don't want to clutter the parser trying
to deal with BLAST errors.

> However, perhaps we can push the error detection higher up.  Possible 
> solutions might be:
> 1) develop a Parser that could catch a SyntaxError, do some diagnostics
> on the Record, and then raise a BlastError

I really like this option, and think this is a good way to go. I have
been doing something semi-similar to find the bad records in my big
BLAST files, which basically involves:

1. Using the iterator (without a parser) to grab records one at a time 
from the file. 

2. Copying the handle so we can parse it and have an extra copy to
work with later.

3. Parse the record I got.
   If I get a SyntaxError, figure out what is wrong with the record
(right now I've just been writing it out to a file).
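
The three steps above can be sketched as follows. The "//" end marker,
the function names and the toy parser are assumptions for illustration;
real BLAST reports are split differently. Keeping the raw text of each
record stands in for copying the handle:

```python
import io


def split_records(handle, end_marker="//"):
    """Step 1: yield the raw text of each record without parsing it."""
    lines = []
    for line in handle:
        lines.append(line)
        if line.strip() == end_marker:
            yield "".join(lines)
            lines = []
    if lines:
        yield "".join(lines)


def collect_bad_records(handle, parse):
    """Steps 2 and 3: parse a copy of each record, keep the raw failures."""
    good, bad = [], []
    for raw in split_records(handle):
        try:
            good.append(parse(io.StringIO(raw)))  # parse a fresh handle
        except SyntaxError:
            bad.append(raw)  # the raw text is still available
    return good, bad


def toy_parse(handle):
    """Stand-in parser: rejects records that contain the word BAD."""
    text = handle.read()
    if "BAD" in text:
        raise SyntaxError("cannot parse record")
    return text.split()[0]


data = io.StringIO("rec1 fine\n//\nrec2 BAD\n//\nrec3 fine\n//\n")
good, bad = collect_bad_records(data, toy_parse)
print(good)
print(bad)
```

The bad list can then be written out to a file for later inspection.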

I actually wrote about this in the documentation (section 3.1.7) so
that should give you a better idea of what exactly I'm trying to do.

What do you think about generalizing this somehow to get the kind of
functionality you are talking about? I'm not sure if there is a better 
way to do it, and I don't know how much overhead is introduced by
copying the handle. So I'm very open to suggestions on this...

Thanks!

Brad


From chapmanb at arches.uga.edu  Sun Oct 29 11:35:56 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Martel timings
Message-ID: <14844.20972.867166.469991@taxus.athen1.ga.home.com>

[Martel thread]
I wrote:
> > Hmmm, one side note about RecordReader. I really like the way you can
> > interface with the parsers in multiple ways in the current Biopython
> > parsing. I think it is really useful to be able to iterate over a
> > record and get the record back, instead of automatically having to
> > parse it

Andrew:
> I've got experimental code for that at
> http://www.biopython.org/~dalke/SaxRecords.py
> 
> It uses a new thread for the callback object and a Queue to send
> parsed records back to the iterator interface in the originating
> thread.  Currently it leaks memory if the thread doesn't go to
> completion because the new thread is sitting there waiting for the
> queue to empty.

Hmmmmm, I admit I am having lots of problems grokking this -- I think
my mind must be really cloudy. I just can't exactly see why using
threads is the best way. The way that Biopython parsers work is:

1. Get a handle with the next record in a big file.

2. If a parser is passed, parse the handle and return the results.

   Otherwise (no parser), return the handle itself.

This seems to make more sense (ie. simpler for my simple mind :-), but 
I'm not sure -- what are your thoughts?

[helpful description of errors in Martel]
I wrote:
> > Do you think there is a way to make the RecordReader act similar to
> > the Iterators in this regard?

Andrew:
> So yes.  Convert the "fatalError" events to "error" and do recovery by
> skipping to the next record.  Then have the SaxRecords code, which
> does the Iterator-like interface, return the right information for
> problematical records.
> 
> Umm, what does the Iterator do for bad records?  It looks like it
> raises an exception, but allows you to call next() to get the next
> record?  That's reasonable to me (since I think I can support it :)

Yup, that's the way Iterator works, which would be very nice. It would 
be a serious pain to have a huge parse completely die near the end
because of a single bad record.

There is also the issue I was just discussing with Jeff about getting 
back bad records and trying to find why they are bad (ie. in BLAST
output, but I would imagine it might be helpful in other cases as well 
-- badly formatted GenBank entries that the parser doesn't like?).
 
> I'll work on it; unless you want to do it?

I can try, although I'm not exactly positive about the best way to
proceed. This is related (at least in my mind) with the other problem
I was discussing with Jeff...

[cool new stuff in the RecordReader]
> Thanks!  If you didn't notice, it also plays some tricks to read ahead
> many lines, which should give better overall performance.  The
> File.UndoHandle isn't as tricky but has better guarantees of where 
> it is in the file and
> it allows undos, which Martel doesn't need.  I bet changing the code
> to read ahead multiple lines would speed up the existing biopython code.

Yeah, this stuff is very cool. My mind is still kind of blown away by
both this and Jeff's File.UndoHandle stuff -- it is really nifty that
you can do so much cool stuff with the handles!

Brad


From katel at worldpath.net  Sun Oct 29 20:12:47 2000
From: katel at worldpath.net (Cayte)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Martel
Message-ID: <000701c0420e$84e5c540$010a0a0a@cadence.com>

  I was just writing some unit tests, with my tool,  for Martel.  It failed
the AtEnd test on Windows.  I wonder if this is one of those Unix/Dos
things?

                                                   Cayte


From dalke at acm.org  Sun Oct 29 20:15:12 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Martel
Message-ID: <011401c04210$84c6b9a0$13ac323f@josiah>

Cayte:
>  I was just writing some unit tests, with my tool,  for Martel.  It failed
>the AtEnd test on Windows.  I wonder if this is one of those Unix/Dos
>things?

Yes, it is.  The current code requires "\n" and doesn't allow "\r\n".  I
still haven't sat down to figure out the details of unix vs. dos end-of-line
conventions.

                    Andrew


From dalke at acm.org  Mon Oct 30 02:58:13 2000
From: dalke at acm.org (Andrew Dalke)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Martel timings
Message-ID: <016d01c04247$418f5e80$13ac323f@josiah>

Brad:
>Hmmmmm, I admit I am having lots of problems grokking this -- I think
>my mind must be really cloudy. I just can't exactly see why using
>threads is the best way.

It's the solution for a somewhat different problem.  Suppose you have
an arbitrary SAX interface, where you cannot change the event
generation code, and want to turn it into an iterator interface.

One way to implement it is to store the events on a list then after
the callbacks are finished, scan it to produce the records.  This has
the problem of storing all of the events before processing them, so
there can be some memory problems.

Another way is to spawn off a new thread and do the processing there.
When a record is processed, send it over to the original thread.  (I
believe this would work even better using Stackless Python.)  This is
the most general but is (as you noticed) more complex.
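
A minimal sketch of the thread-based approach, assuming the standard
threading and queue modules. The names iterate_callbacks and toy_parser
are hypothetical, and toy_parser stands in for event-generation code
that cannot be changed:

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the record stream


def iterate_callbacks(run_parser):
    """Turn run_parser(callback) into an iterator over callback values."""
    records = queue.Queue(maxsize=1)

    def worker():
        try:
            # Each callback blocks until the consumer takes the record,
            # so at most one record is waiting at any time.
            run_parser(records.put)
        finally:
            records.put(_DONE)

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = records.get()
        if item is _DONE:
            return
        yield item


def toy_parser(callback):
    """Callback-style parser that hands over one record at a time."""
    for name in ("rec1", "rec2", "rec3"):
        callback(name)


collected = list(iterate_callbacks(toy_parser))
print(collected)
```

If the consumer stops early, the worker thread stays blocked on the
queue, which is the kind of leak mentioned earlier in this thread for
the experimental SaxRecords code.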

I said "somewhat different problem" because we have control over the
Martel definitions.  There's already a specialization (RecordParser)
which has better memory usage for record oriented data.  By definition,
that means it can be used to convert all of a record's callback events
into a list of events, as in the first possibility, then scan the list
to create records.

So what I've done is add a new method to the expression objects called
"make_iterator", just like they have a "make_parser" method.  The
make_iterator takes a string, which is the tag name used at the start and
end of the record.  The object returned has parse(...), parseString(...)
and parseFile(...) methods just like the parser object returned from
"make_parser", except each also takes a second parameter which is used to
make records.  That description is easier to understand as code:

   iterator = format.make_iterator("swissprot38_record")
   for record in iterator.parseString(text, make_biopython_record):
       ...

The implementation uses an EventStream protocol.  An EventStream has a
'.next()' method, which returns a list of events.  If there are no events,
it returns None.  In the standard case, the EventStream converts all of
the input into a list of events and returns it.  For a record reader,
each call of next reads a record and returns its events.

The EventStream object is passed to Iterator class's constructor, which
is a forward iterator for reading records (the 'for record in ...' part
of the above).  When *its* .next() is called, it starts processing the
list of available events, calling the EventStream if more events are needed.
As it scans the list, it looks for the start and end tags.  Everything
inside of those tags is passed to the SAX parser object created by the
factory object passed in (the 'make_biopython_record').  It also sends
startDocument/endDocument events.   The Iterator's next() method returns
the created SAX parser objects.

Again, it's easier to use than describe.
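
A toy model of the protocol just described. The class names, the
(kind, value) event tuples and the Collector handler are assumptions
made for illustration, not Martel's real classes:

```python
class ListEventStream:
    """next() returns one record's list of events, then None at the end."""

    def __init__(self, batches):
        self._batches = iter(batches)

    def next(self):
        return next(self._batches, None)


class Collector:
    """Stand-in for the SAX object built by the factory."""

    def __init__(self):
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def event(self, kind, value):
        self.events.append((kind, value))


class RecordIterator:
    """Forward iterator grouping events between start/end tags."""

    def __init__(self, event_stream, tag, make_handler):
        self._stream = event_stream
        self._tag = tag
        self._make_handler = make_handler
        self._pending = []

    def __iter__(self):
        return self

    def __next__(self):
        current = None
        while True:
            if not self._pending:
                events = self._stream.next()  # ask for more events
                if events is None:
                    raise StopIteration
                self._pending = list(events)
                continue
            kind, value = self._pending.pop(0)
            if kind == "start" and value == self._tag:
                current = self._make_handler()
                current.startDocument()
            elif kind == "end" and value == self._tag and current is not None:
                current.endDocument()
                return current
            elif current is not None:
                current.event(kind, value)


stream = ListEventStream([
    [("start", "record"), ("characters", "A"), ("end", "record")],
    [("start", "record"), ("characters", "B"), ("end", "record")],
])
records = [h.events for h in RecordIterator(stream, "record", Collector)]
print(records)
```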

This approach, BTW, is vaguely similar to the pulldom of Paul Prescod's.

The nice thing about the "make_iterator" API is that it supports both
this event stream approach and also allows threads, if there's no way to
modify the parser code.

>Andrew:
>> Umm, what does the Iterator do for bad records?  It looks like it
>> raises
>> an exception, but allows you to call next() to get the next record?
>> That's reasonable to me (since I think I can support it :)
>
>Yup, that's the way Iterator works, which would be very nice. It would
>be a serious pain to have a huge parse completely die near the end
>because of a single bad record.

After reflection, I've come to a different conclusion about how to handle
bad records.  It's really easy to make a new format which handles swissprot
records as well as errors.

format = ParseRecords(swissprot38.format |
                      Rep(Group("bad_record",
                                Re("^((?!//)[^\n]*\n)*//\n"))),
                      EndsWith("//\n"))

(I don't have the source code available now, so the syntax is probably
a bit off.)
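
To show just the fallback idea without Martel, here is a small sketch using
only Python's re module. The GOOD pattern is a made-up stand-in for a real
record grammar and classify_records is a hypothetical helper; only the
bad_record expression comes from the format above, with the leading '^'
dropped because Pattern.match() already anchors at the given offset.

```python
import re

# Made-up strict grammar: an ID line, an SQ line, then the '//' terminator.
GOOD = re.compile(r"ID   (?P<id>\w+)\nSQ   (?P<seq>[A-Z]+)\n//\n")
# Fallback: any run of lines that do not start with '//', then '//'.
BAD = re.compile(r"((?!//)[^\n]*\n)*//\n")


def classify_records(text):
    """Split text into (kind, record_text) pairs, trying the strict form first."""
    results = []
    pos = 0
    while pos < len(text):
        kind = "record"
        match = GOOD.match(text, pos)
        if match is None:
            kind = "bad_record"
            match = BAD.match(text, pos)
        if match is None:
            raise ValueError("unterminated record at offset %d" % pos)
        results.append((kind, match.group(0)))
        pos = match.end()
    return results


sample = (
    "ID   A1\nSQ   ACGT\n//\n"
    "ID   B2\ngarbage here\n//\n"
    "ID   C3\nSQ   TTGA\n//\n"
)
kinds = [kind for kind, _ in classify_records(sample)]
```

The caller still has to say explicitly what to do with a "bad_record", so
nothing is silently skipped.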

Then the SAX parser for records just needs to know how to handle
swissprot38_record and bad_record records.  I like this because I like
strict code, where you have to be explicit to tell it how to ignore errors.

Plus, if you want to do some recovery with data extraction, it could
switch to a different syntax which might not be as strict (like
'(?P<name>..)   (?P<text>[^\n]*)\n').
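
As an illustration of that kind of recovery, here is a sketch that applies a
loose line pattern to the text of a bad record using plain re. It uses
'[^\n]*' so the text group covers the rest of the line, and it anchors each
match at a line start; salvage_fields is a hypothetical helper name, not
part of Martel.

```python
import re

# Loose line syntax: a two-character code, three spaces, then free text.
LINE = re.compile(r"^(?P<name>..)   (?P<text>[^\n]*)\n", re.MULTILINE)


def salvage_fields(record_text):
    """Return (name, text) pairs for every line that fits the loose syntax."""
    return [(m.group("name"), m.group("text")) for m in LINE.finditer(record_text)]


fields = salvage_fields("ID   B2\ngarbage here\nDE   some protein\n//\n")
```

Lines that do not fit the pattern are simply left out of the result.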

>> I'll work on it; unless you want to do it?
>
>I can try, although I'm not exactly positive about the best way to
>proceed.

I've got the iterator code mostly working.  I'm doing documentation and
adding more regression tests.  How about when I finish I send you a
version to test out?  Don't ask when :(

                    Andrew
                    dalke@acm.org


From thomas at cbs.dtu.dk  Mon Oct 30 22:20:45 2000
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] GenBank parser ?
Message-ID: <14846.14989.117057.444180@delphinus.cbs.dtu.dk>

Hej All,

Do we - or someone else - have a genbank parser ? I remember something came
up in the news groups, but I cannot find it anymore ...

thx
-thomas 


-- 
Sicheritz Ponten Thomas E.  CBS, Department of Biotechnology
thomas@biopython.org        The Technical University of Denmark
CBS:  +45 45 252489         Building 208, DK-2800 Lyngby
Fax   +45 45 931585         http://www.cbs.dtu.dk/thomas

	De Chelonian Mobile ... The Turtle Moves ...


From jchang at SMI.Stanford.EDU  Mon Oct 30 14:25:28 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] GenBank parser ?
In-Reply-To: <14846.14989.117057.444180@delphinus.cbs.dtu.dk>
Message-ID: <Pine.GSO.4.21.0010301124250.14405-100000@riboweb.Stanford.EDU>

No.  The current plan is to use this as a test case for Martel.  Any
takers?  :)

Jeff


On Tue, 31 Oct 2000 thomas@cbs.dtu.dk wrote:

> Hej All,
> 
> Do we - or someone else - have a genbank parser ? I remember something came
> up in the news groups, but I cannot find it anymore ...
> 
> thx
> -thomas 


From chapmanb at arches.uga.edu  Mon Oct 30 16:26:10 2000
From: chapmanb at arches.uga.edu (Brad Chapman)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] GenBank parser ?
In-Reply-To: <Pine.GSO.4.21.0010301124250.14405-100000@riboweb.Stanford.EDU>
References: <14846.14989.117057.444180@delphinus.cbs.dtu.dk>
	<Pine.GSO.4.21.0010301124250.14405-100000@riboweb.Stanford.EDU>
Message-ID: <14845.59250.69115.532080@taxus.athen1.ga.home.com>

Thomas:
> > Do we - or someone else - have a genbank parser ? I remember something came
> > up in the news groups, but I cannot find it anymore ...

Jeff:
> No.

There are a couple of ways around this that I have found which still
allow you to use python to get at GenBank:

1. Use jpython and the biojava libraries for parsing GenBank. I
attached a file which shows a basic example of doing this.

2. Use the python BioCorba interface (biopython-corba). 
 I use a bioperl based server and a biopython client, and this works 
quite well, at least for what I'm doing (parsing out CDS
info). Sometime soon I hope to make a new release of biopython-corba
with documentation on how to do stuff like this. I just need to revise 
the docs, and do some more testing to make sure everything in CVS is
kosher. If you are interested in trying this way, I would definitely be
willing to help (hey, it would be quite exciting to have someone
using biopython-corba besides me :-).

Jeff:  
> The current plan is to use this as a test case for Martel.  Any
> takers?  :)

I think one of our biggest sticking points is that we don't really
have anything in terms of features, which would be really really
useful to parse the GenBank files into. It seems like it is pretty
tricky to have classes which can deal with all of the possible
complexities of GenBank (also EMBL) formats, so it would be nice to
think of and implement some feature classes which do this first. There 
was an interesting discussion about some of this on the biocorba list
(in the October archives under the threads 'Biocorba IDL --
Clarifications' and 'SeqFeatures and the EMBL IDL').

Anyways, I don't have much time at the moment to work on this 100%,
but would be willing to do part o' the coding/hashing things out if
other people are willing to work on it as well. I think once we have 
a feature class, the GenBank parser won't be too incredibly horrible
to do from Martel (fingers crossed :-).

Brad


-------------- next part --------------
#!/usr/bin/env jpython
"""Read info from GenBank files.

This uses jpython and biojava (http://www.biojava.org) to read from a
GenBank file.

This is basically a jpython translation of demos/seq/TestGenbank.java"""
# standard python libs
import os

# java stuff
from java.io import *

# biojava
from org.biojava.bio.seq.io import *
from org.biojava.bio import *
from org.biojava.bio.symbol import *
from org.biojava.bio.seq import *

# set up the files; 'test.gb' is looked up in the current directory
gb_path = os.path.join(os.getcwd(), 'test.gb')

gb_file = File(gb_path)
reader = BufferedReader(InputStreamReader(FileInputStream(gb_file)))

# set up biojava stuff to parse the files
alphabet = DNATools.getDNA()
seq_factory = SimpleSequenceFactory()
parser = alphabet.getParser("token")
gb_format = GenbankFormat()

iterator = StreamReader(reader, gb_format, parser, seq_factory)

while iterator.hasNext():
    seq = iterator.nextSequence()

    print 'name:', seq.getName()
    print 'num features:', seq.countFeatures()


From thomas at cbs.dtu.dk  Tue Oct 31 02:36:14 2000
From: thomas at cbs.dtu.dk (thomas@cbs.dtu.dk)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] genbank parser
Message-ID: <14846.30318.641161.786999@delphinus.cbs.dtu.dk>

ok - now I remember where I have seen a genbank parser ...

Object-oriented parsing of biological databases with Python
Chenna Ramu, Christine Gemuend and Toby J. Gibson
Bioinformatics, Volume 16, Issue 7, Pages 628-638 : July 2000 

http://shag.embl-heidelberg.de:8000/Biopy/


testing it right now ...

c ya
-thomas



From jchang at SMI.Stanford.EDU  Tue Oct 31 21:58:20 2000
From: jchang at SMI.Stanford.EDU (Jeffrey Chang)
Date: Sat Mar  5 14:42:53 2005
Subject: [Biopython-dev] Proposed addition to Standalone BLAST
In-Reply-To: <14844.19384.149965.283975@taxus.athen1.ga.home.com>
Message-ID: <Pine.GSO.4.21.0010311853280.16965-100000@riboweb.Stanford.EDU>

> What do you think about generalizing this somehow to get the kind of
> functionality you are talking about? I'm not sure if there is a better 
> way to do it, and I don't know how much overhead is introduced by
> copying the handle. So I'm very open to suggestions on this...

Sure.  Having some code that would help to diagnose errors in BLAST
reports would be a very nice feature.  Certainly more user friendly than
having SyntaxError this or SyntaxError that.

We would have to build this on top of the current exceptions, though.  
It's still nice to have the SyntaxErrors under the hood, as an explanation
of why the parser is complaining in the first place.

How are you copying the handle?  If you read the contents of the handle as
a string (ummm, could be iffy parsing PSI-BLAST on RAM-starved machines,
but probably not a problem), and then wrapped a StringHandle around it,
there should be little overhead aside from the string containing the blast
results.
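
Something along these lines is what I have in mind, as a sketch only.
io.StringIO stands in for StringHandle here, and parse_with_diagnosis and
toy_parser are hypothetical names rather than anything in the BLAST code.
The handle is read once into a string, the parser gets an in-memory handle
wrapped around it, and the original SyntaxError is kept as the cause of the
friendlier error.

```python
import io


def parse_with_diagnosis(handle, parse):
    """Parse a copy of the handle's contents, adding context on failure."""
    contents = handle.read()  # the whole report is held in memory
    try:
        return parse(io.StringIO(contents))
    except SyntaxError as err:
        first_line = contents.splitlines()[0] if contents else "<empty input>"
        message = "could not parse report starting %r: %s" % (first_line, err)
        # Chain the original error so it is still visible under the hood.
        raise ValueError(message) from err


def toy_parser(handle):
    # Hypothetical parser: only accepts input that starts with 'BLASTP'.
    line = handle.readline()
    if not line.startswith("BLASTP"):
        raise SyntaxError("expected a BLASTP header")
    return line.strip()


header = parse_with_diagnosis(io.StringIO("BLASTP 2.0.10\n"), toy_parser)
```

The only extra cost is the one string holding the report, since the
diagnostic code can look at that string directly instead of a second handle.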

Jeff