[Biopython-dev] GEO SOFT parser

Phillip Garland pgarland at gmail.com
Sun May 15 01:13:28 UTC 2011


Hello,

I've created a new parser for GEO SOFT files, a fairly simple
line-oriented format used by NCBI's Gene Expression Omnibus for
holding gene expression data, information about the experimental
platform used to generate the data, and associated metadata. At the
moment it parses platform (GPL), series (GSE), sample (GSM), and
dataset (GDS) files into objects, with access to the metadata and
data table entries.
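
To make "line-oriented" concrete, here is a minimal sketch of how a
single SOFT line can be classified by its first character ('^' for an
entity, '!' for an attribute, '#' for a data table column
description, anything else a tab-separated table row). This is not
the code in the branch; the function name classify_soft_line and the
sample lines are made up for illustration.

```python
def classify_soft_line(line):
    """Return (kind, label, value) for one line of a SOFT file.

    Hypothetical helper, not part of Bio.Geo.
    """
    line = line.rstrip("\r\n")
    kinds = {"^": "entity", "!": "attribute", "#": "column"}
    kind = kinds.get(line[:1])
    if kind is None:
        # No marker character: treat the line as a tab-separated table row.
        return ("table_row", None, line.split("\t"))
    # Marker lines look like "<marker>LABEL = value"; the value may be absent
    # (e.g. table begin/end markers), in which case it comes back as "".
    label, _, value = line[1:].partition("=")
    return (kind, label.strip(), value.strip())


# Illustrative lines only; "PLACEHOLDER" stands in for a real identifier.
print(classify_soft_line("^DATASET = GDS858"))
print(classify_soft_line("!dataset_channel_count = 1"))
print(classify_soft_line("1007_s_at\tPLACEHOLDER\t3736.900"))
```

A real parser also has to track which entity it is currently inside
and where the data table begins and ends; the sketch only covers the
per-line dispatch.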

It's accessible through my github biopython repo:
https://github.com/pgarland/biopython
git://github.com/pgarland/biopython.git

Branch:
new-geo-soft-parser

All the changed files are in the Bio/Geo directory.

The existing parser has the virtue of being simple and short. The
parser I've written is less parsimonious, but should handle everything
specified by NCBI, as well as some unspecified quirks, and documents
what GEO SOFT files are expected to contain. I'm taking a look at Sean
Davis's GEOquery Bioconductor package for ideas for the interface.

There is a class for each GEO record type: GSM, GPL, GSE, and GDS.
After instantiating each of these, you can call the parse method on
the resulting object to parse the file, e.g.:

>>> from Bio import Geo
>>> gds858 = Geo.GDS()
>>> gds858.parse('GDS858_full.soft')

Each object has a dictionary named 'meta' that contains the file's metadata:

>>> gds858.meta['channel_count']
1

Each attribute has a hook on which you can hang a function that
performs additional parsing of its value, but most values are stored
as strings.
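
One way to picture such a hook is a table mapping attribute names to
converter functions, with str as the fallback. This is only a sketch
under that assumption; the names converters and convert_value are
hypothetical and do not come from the branch.

```python
# Hypothetical hook table: attribute name -> function applied to the raw string.
converters = {"channel_count": int}


def convert_value(name, raw, hooks=converters):
    """Apply the hook registered for `name`, or keep the raw string."""
    return hooks.get(name, str)(raw)


print(convert_value("channel_count", "1"))   # parsed by the int hook
print(convert_value("title", "some title"))  # no hook registered, stays a string
```

With a table like this, meta['channel_count'] can come back as an
integer while attributes without a hook remain strings.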

There is also a parseMeta() method if you just need the file's
metadata (the entity attributes and data table column descriptions)
and not the data table.

There is also a rudimentary __str__ method to print the metadata.

For files that can have data tables (GSM, GPL, and GDS files), there
is currently just one method for accessing values, getTableValue(),
which takes an ID and a column name and returns the associated value:

>>> gds858.getTableValue('1007_s_at', 'GSM14498')
3736.9000000000001

but I will implement other methods to provide more convenient access
to the data table.

Right now, the data table is just a 2D array and can be accessed like
any 2D array:

>>> gds858.table[0][2]
'3736.900'

There are dictionaries that map IDs to row indices and column names
to column indices:

>>> gds858.idDict['1007_s_at']
0

>>> gds858.columnDict['GSM14498']
2

It is possible that the underlying representation of the data table
could change, though.
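
To show how the table and the two dictionaries fit together, here is
a standalone sketch of a lookup in the spirit of getTableValue(). It
is not the method from the branch: the function get_table_value, the
conversion to float, the ID_REF and IDENTIFIER column names, and the
"PLACEHOLDER" cell are assumptions for illustration. Only the ID, the
sample column, the indices, and the value come from the examples
above.

```python
# One-row stand-in for the parsed data table (cells are stored as strings).
table = [["1007_s_at", "PLACEHOLDER", "3736.900"]]

# ID -> row index, and column name -> column index.
id_dict = {"1007_s_at": 0}
column_dict = {"ID_REF": 0, "IDENTIFIER": 1, "GSM14498": 2}


def get_table_value(table, id_dict, column_dict, row_id, column_name):
    """Look up one cell by ID and column name and return it as a float."""
    row = id_dict[row_id]
    col = column_dict[column_name]
    return float(table[row][col])


print(get_table_value(table, id_dict, column_dict, "1007_s_at", "GSM14498"))
```

Keeping the lookup behind a function like this means callers would
not need to change if the table representation does.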

On my dual-core laptop with 4GB of RAM and a 7200RPM hard drive,
parsing single files is more than fast enough, but I haven't
benchmarked it or looked at RAM consumption. If it's a problem for
computers with less RAM or use cases that require having a lot of GEO
SOFT objects in memory, I can take a look at changing the data table
representation.

If this parser is incorporated in Biopython, I'm happy to maintain it.
The code is well-commented, but I still need to write the
documentation. I've tested it on a few files of each type, but I still
need to write unit tests. Since SOFT files can be fairly large (a few
MB gzipped, tens of MB unzipped), it seems undesirable to package them
with the Biopython source code. I could make the unit test optional
and have interested users supply their own files and/or have the test
download files from NCBI and unzip them.
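
As a rough sketch of the "optional test" idea, the test case below is
skipped unless the user points it at a SOFT file of their own. The
environment variable name GEO_SOFT_TEST_FILE and the test class are
hypothetical, the test body is a placeholder rather than a real
parser test, and unittest.skipUnless requires Python 2.7 or later.

```python
import os
import unittest

# Hypothetical: path to a user-supplied SOFT file, e.g. GDS858_full.soft.
SOFT_FILE = os.environ.get("GEO_SOFT_TEST_FILE", "")


@unittest.skipUnless(os.path.isfile(SOFT_FILE), "no GEO SOFT test file supplied")
class SoftFileTest(unittest.TestCase):
    def test_file_is_not_empty(self):
        # Placeholder check; a real test would parse the file and
        # compare metadata and table values against known answers.
        self.assertGreater(os.path.getsize(SOFT_FILE), 0)


if __name__ == "__main__":
    unittest.main()
```

Without the variable set, the test is reported as skipped rather than
failed, so the regular test suite stays usable without large data
files.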

~ Phillip


