[BioPython] Robustness of parsing

Wed May 5 13:54:52 EDT 2004

Hi Stephan;

> I have a question about the robustness of parsing in my case the UniGene 
> at NCBI. 
> I have been assigned to write a parser to put the UniGene flat files 
> into an existing database structure, before starting writing code I thought
> I'd better search the web to find some existing solutions. This is when I 
> opened your site and found a parser for the UniGene.

The Biopython parser is designed to parse the UniGene cluster HTML 
pages which list the EST sequences making up a cluster. Honestly,
I'm not sure whether it is currently well maintained or whether it
will work with the current UniGene pages.

I'm not sure exactly what kind of UniGene information you want to 
be parsing. If by flatfiles you are talking about the downloads
available from:

ftp://ftp.ncbi.nih.gov/repository/UniGene/

then you can parse these with the standard Fasta parser in Biopython,
but you would then need to write some code to pull the UniGene
specific information out of the Fasta headers.
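
As a rough sketch of that second step, something like the following
could pull key/value fields out of a Fasta header. The header layout
(an identifier, a free-text description, then "/key=value" fields) is
an assumption on my part, and the sample record, read_fasta and
parse_unigene_header are made up for illustration; they are not part
of Biopython.

```python
import re

# Hypothetical sample record; the header layout is an assumption.
SAMPLE = """\
>gnl|UG|Hs#S0000001 Example transcript, mRNA /cds=(10,90) /gb=XX000001 /ug=Hs.1 /len=120
ACGTACGTAC
GGTTAACC
"""


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of Fasta lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line.strip():
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


# Matches fields such as " /gb=XX000001"; any key name is accepted, so
# a field added to the format later is picked up rather than rejected.
_FIELD = re.compile(r"\s/(\w+)=(\S+)")


def parse_unigene_header(header):
    """Split a header into id, description and a dict of /key=value fields."""
    fields = dict(_FIELD.findall(header))
    first = _FIELD.search(header)
    front = header[:first.start()] if first else header
    seq_id, _, description = front.partition(" ")
    return {"id": seq_id, "description": description.strip(), "fields": fields}


if __name__ == "__main__":
    for header, sequence in read_fasta(SAMPLE.splitlines()):
        info = parse_unigene_header(header)
        print(info["id"], info["fields"], len(sequence))
```

With Biopython installed, its Fasta parser would take the place of
read_fasta, leaving only the header handling to write.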

> At my work we already have a parser for these flat files written in C, 
> the only problem with this parser is, is that it will not run anymore 
> if the structure of the UniGene changes. For instance if a new field is 
> added or if relations change from a 1-to-1 to 1-to-many.
> My question about biopython; has it the same problems? If that is the case; 
> in what timespan are updates available?

To answer in general terms (due to my lack of understanding about
exactly what you are parsing), most parsers will suffer from this
same problem -- it is really not possible to anticipate all of the
various changes which will happen to flat file formats. The benefits
of writing code in a Biopython-ish (or BioPerl-ish, BioJava-ish --
it applies to them all) way are twofold:

1. There are already existing utilities and structures for parsing
flat files which make writing the parsers easier. In Biopython these
include the Martel regular expression and the Scanner/Consumer
frameworks.

2. Since your code is publicly available and others are using it,
the inevitable job of updating parsers is distributed over multiple
people who can offer suggestions and fixes.
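
To give a feel for the Scanner/Consumer idea from point 1, here is a
minimal sketch in plain Python. It is not Biopython's actual
Scanner/Consumer API; the class and function names, and the record
layout in the sample (a tag, whitespace, a value, with "//" ending a
record), are assumptions for illustration only. Every field is stored
as a list and unknown tags are kept, so a new field or a change from
1-to-1 to 1-to-many does not stop the parser from running.

```python
from collections import defaultdict

# Hypothetical flat-file sample; the layout is assumed, not authoritative.
SAMPLE = """\
ID          Hs.1
TITLE       Example gene
SEQUENCE    ACC=XX000001; NID=g1
SEQUENCE    ACC=XX000002; NID=g2
NEWFIELD    something added later
//
ID          Hs.2
TITLE       Another gene
//
"""


class RecordConsumer:
    """Collects scanner events into one dict of lists per record."""

    def __init__(self):
        self.records = []
        self._current = None

    def start_record(self):
        self._current = defaultdict(list)

    def field(self, tag, value):
        # Always append: a tag that repeats simply gets a longer list,
        # and a tag never seen before gets its own entry.
        self._current[tag].append(value)

    def end_record(self):
        self.records.append(dict(self._current))
        self._current = None


def scan(lines, consumer):
    """Walk the lines and report record and field events to the consumer."""
    in_record = False
    for line in lines:
        line = line.rstrip()
        if not line:
            continue
        if line == "//":
            if in_record:
                consumer.end_record()
                in_record = False
            continue
        if not in_record:
            consumer.start_record()
            in_record = True
        tag, _, value = line.partition(" ")
        consumer.field(tag, value.strip())
    # Close a final record that is missing its "//" terminator.
    if in_record:
        consumer.end_record()


if __name__ == "__main__":
    consumer = RecordConsumer()
    scan(SAMPLE.splitlines(), consumer)
    for record in consumer.records:
        print(record)
```

The scanner knows only the line structure and the consumer decides
what to keep, so a format change usually means touching just one of
the two.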

> If biopython is also lacking these problems I want to write a 
> more generic solution, perhaps in python.

We certainly would welcome a robust solution for parsing UniGene in
Biopython. Please feel free to ask more questions here; sorry if I
don't have a full grasp on exactly what you are parsing but I hope
this helps some.

Brad