[Biopython-dev] Unigene flat file parser

Davis, Sean (NIH/NCI) [E] sdavis2 at mail.nih.gov
Thu Oct 26 20:15:52 UTC 2006


Michiel,

It looks to me like the existing module (Bio/UniGene/__init__.py) parses an HTML file downloaded from the NCBI website containing a single UniGene record of interest--potentially useful if one knows what one needs.

I, on the other hand, have always just used the flat files as the source for unigene, as I typically want ALL the data for one or several species available.  A single flat file is available for each organism and contains ALL the unigene entries and their associated information for that organism.  By concatenating several files (they are simple text files), one can parse the entire unigene database.  
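As a rough illustration of the flat-file approach (this is not the parser discussed in this thread), the sketch below streams records out of one or more concatenated files. It assumes each record ends with a line containing only "//" and that every other line is a field tag followed by whitespace and a value; those layout details, and the names iter_files and iter_unigene_records, are assumptions made for the example.

```python
from collections import defaultdict


def iter_files(paths):
    """Chain several flat files into a single stream of lines."""
    for path in paths:
        with open(path) as handle:
            yield from handle


def iter_unigene_records(lines):
    """Yield one dict per record, mapping each field tag to a list of values.

    Assumes a record is terminated by a line holding only "//" and that
    each data line looks like "TAG   value" (layout assumed, not verified).
    """
    record = defaultdict(list)
    for line in lines:
        stripped = line.strip()
        if stripped == "//":
            if record:
                yield dict(record)
            record = defaultdict(list)
        elif stripped:
            parts = stripped.split(None, 1)  # split tag from value on any whitespace
            value = parts[1] if len(parts) > 1 else ""
            record[parts[0]].append(value)
    if record:  # tolerate a missing final "//"
        yield dict(record)
```

Because both functions are generators, something like iter_unigene_records(iter_files(["Hs.data", "Mm.data"])) (example paths) would process several organisms in one pass without loading a whole file into memory.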

So, in short, I don't see this unigene parser as a replacement for the current module.  They fill different needs.  This one fills a need that I have and is useful for whole-genome work, multiple-species work, or microarray analyses.  Whether and where it fits into Biopython is really up to the community.

Just a quick comment on speed for the parser--it parses Hs.data (the largest flat file in unigene: 84,000 entries, with just under 7,000,000 sequence entries, 150 MB file size) in just under 5 minutes on my Xeon desktop.

Sean



-----Original Message-----
From: Michiel Jan Laurens de Hoon [mailto:mdehoon at c2b2.columbia.edu]
Sent: Thu 10/26/2006 2:01 PM
To: Davis, Sean (NIH/NCI) [E]
Cc: biopython-dev at lists.open-bio.org
Subject: Re: [Biopython-dev] Unigene flat file parser
 
Sean Davis wrote:
> I have put together a parser for the Unigene flat file format described here:

Perhaps a silly question from a non-Unigene user, but what is the 
relation between your parser and the one in Bio/UniGene/__init__.py? The 
latter seems to parse HTML files (see the example in 
Tests/test_unigene.py) instead of flat files. Is your parser intended as 
a replacement for Bio/UniGene/__init__.py?

--Michiel.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032




