[BioPython] new parser questions

Fri, 5 Apr 2002 07:54:49 -0600

Hello,

I have more parser questions too.  I'm a bit behind current progress,
but last time I looked at BioPython a few months ago, the parser I
looked at seemed to be highly line-oriented.  I noticed this when
fixing up a script that downloaded blast output from (I think)
NCBI. The website had added a single newline in a nonsubstantive part
of the html output and the parser choked.  I found it kind of
discouraging to see this lack of robustness (while at the same time
being thankful that someone had done the work in the first place).

Has the framework changed?  I can imagine an approach for dealing with
this sort of html-encapsulated data.

  1. Load the file (use mmap if possible, to deal more easily with
     large datafiles, e.g. hundreds of MB and upwards).

  2. Parse languages that have a grammar (e.g. html/xml) with parsers
     that understand that grammar.  Python has lots of great tools for
     parsing html and xml, and using knowledge of the language grammar
     would avoid the "choke on a changed newline" problem mentioned above.

  3. Apply specialized parsers to data that's embedded within html/xml.
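
To make the three steps concrete, here is a rough sketch using only the
Python standard library (written for present-day Python 3, not the
interpreter of the time).  Everything named here is hypothetical and not
part of BioPython: PreTextExtractor, extract_pre_blocks and parse_hits
are illustrative names, and the "name score" line format is a toy
stand-in rather than real blast output.  It also assumes the interesting
data sits inside <pre> elements and that the file is non-empty and can
be decoded one byte per character.

```python
import mmap
from html.parser import HTMLParser


class PreTextExtractor(HTMLParser):
    """Step 2: let an HTML parser find the text inside <pre> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0    # how many <pre> elements are currently open
        self._parts = []   # text pieces of the <pre> being collected
        self.blocks = []   # one string per outermost <pre> element

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            if self._depth == 0:
                self._parts = []
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.blocks.append("".join(self._parts))

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)


def extract_pre_blocks(path, chunk_size=1 << 16):
    """Step 1: mmap the file and feed it to the HTML parser in chunks."""
    extractor = PreTextExtractor()
    with open(path, "rb") as handle:
        with mmap.mmap(handle.fileno(), 0, access=mmap.ACCESS_READ) as view:
            for start in range(0, len(view), chunk_size):
                # latin-1 maps one byte to one character, so a chunk
                # boundary can never split a character (an assumption
                # about the encoding of the downloaded page).
                chunk = view[start:start + chunk_size]
                extractor.feed(chunk.decode("latin-1"))
    extractor.close()
    return extractor.blocks


def parse_hits(text):
    """Step 3: toy specialized parser for 'name score' lines."""
    hits = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            hits.append((fields[0], float(fields[1])))
    return hits
```

With this split, an extra newline between tags outside the <pre> block
never reaches parse_hits, so only a change to the embedded data itself
can affect the specialized parser.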

Maybe the current code already does this---I'm updating from CVS now.

Thanks in advance.

Ignorant-of-the-current-code-ly yours, chris