[Biopython] Bug in Geo.parser when reading some GDS files
Erik C
erikclarke at gmail.com
Mon Apr 23 19:54:20 EDT 2012
Hi all,
When parsing an NCBI GEO dataset (GDS) file such as this:
ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/GDS_full/GDS1962_full.soft.gz
the Bio.Geo.parse(handle) function fails with an AssertionError. Example
code:
>>> from Bio import Geo
>>> for record in Geo.parse(open('GDS1962_full.soft')): print record
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Geo/__init__.py", line 54, in parse
    assert key not in record.col_defs
AssertionError
This appears to be caused by the parser's assumption that each column
header occurs only once. GDS files commonly contain two columns each titled
GO:Function, GO:Process, and GO:Component. The first column of each pair
holds the Gene Ontology terms for the probe in that row, and the second
holds the GO IDs for those terms.
From GDS3646_full.soft:
#GO:Function = Gene Ontology Function term
#GO:Process = Gene Ontology Process term
#GO:Component = Gene Ontology Component term
#GO:Function = Gene Ontology Function identifier
#GO:Process = Gene Ontology Process identifier
#GO:Component = Gene Ontology Component identifier
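The repeated keys in the header lines above can be confirmed without Biopython. The following is a minimal sketch: duplicate_column_keys is a hypothetical helper, not part of Bio.Geo, and it assumes column definitions always have the form "#key = description".

```python
from collections import Counter


def duplicate_column_keys(lines):
    """Return the sorted column keys defined more than once.

    Hypothetical helper (not Biopython code). Assumes each column
    definition line looks like '#key = description'.
    """
    keys = []
    for line in lines:
        if line.startswith("#") and "=" in line:
            # Drop the leading '#', keep the text left of the first '='.
            keys.append(line[1:].split("=", 1)[0].strip())
    counts = Counter(keys)
    return sorted(key for key, count in counts.items() if count > 1)


# The six header lines quoted from GDS3646_full.soft.
header = [
    "#GO:Function = Gene Ontology Function term",
    "#GO:Process = Gene Ontology Process term",
    "#GO:Component = Gene Ontology Component term",
    "#GO:Function = Gene Ontology Function identifier",
    "#GO:Process = Gene Ontology Process identifier",
    "#GO:Component = Gene Ontology Component identifier",
]

print(duplicate_column_keys(header))
# prints ['GO:Component', 'GO:Function', 'GO:Process']
```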
While duplicate header names are not ideal for tabular data, these GO
columns appear regularly in GDS files (see GDS1962, GDS3646, and others)
and they consistently break the parser. The assertion should either be
disabled for this particular case or replaced with a more flexible column
header check. I suggest applying the assertion only to the sample columns
(those prefixed with GSM).
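As a rough illustration of the suggestion above, the check could look something like the sketch below. It is not a patch against Bio/Geo/__init__.py: add_column_definition is a hypothetical helper, col_defs is a plain dict here rather than the attribute on the Biopython record, storing descriptions in a list is my own assumption, and the GSM ID in the usage lines is a made-up placeholder.

```python
def add_column_definition(col_defs, key, description):
    """Store a column description, rejecting only repeated sample columns.

    Hypothetical helper (not Biopython code). col_defs maps each column
    key to a list of descriptions, so a key such as 'GO:Function' can
    keep both its 'term' and its 'identifier' description.
    """
    if key.startswith("GSM") and key in col_defs:
        # Sample columns must stay unique, as the original assertion intended.
        raise ValueError("Duplicate sample column: %s" % key)
    col_defs.setdefault(key, []).append(description)
    return col_defs


col_defs = {}
add_column_definition(col_defs, "GO:Function", "Gene Ontology Function term")
add_column_definition(col_defs, "GO:Function", "Gene Ontology Function identifier")
# 'GSM0000001' is a placeholder sample ID used only for illustration.
add_column_definition(col_defs, "GSM0000001", "Value for sample")
```

With this approach a repeated GO column is recorded twice under the same key, while a second definition of the same GSM column raises ValueError instead of being silently accepted.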
I'm using Biopython 1.59 (the issue also exists in the Git repository) with
Python 2.7.1 on Mac OS X 10.7.3.
Cheers,
Erik