[Bioperl-l] Re: entrezgene binary ASN

Stefan Kirov skirov at utk.edu
Fri Sep 30 15:06:43 EDT 2005


Michael,
Sean Davis had a problem similar to what Mingyi is describing... Mixing 
two possible types of error is something I would not like to have 
myself. Don't forget that many people who may use the parser are not 
really that proficient (and should not need to be).
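
To keep a converter failure from being confused with a parse failure, 
the external step can raise its own error type. A minimal Python sketch, 
not anything that exists in Bio::ASN1::EntrezGene: ConversionError and 
run_converter are hypothetical names, and the command line is passed in 
by the caller, so no gene2xml options are assumed here.

```python
import subprocess


class ConversionError(RuntimeError):
    """The external converter failed (could not start, or non-zero exit)."""


def run_converter(cmd):
    """Run an external converter command and return its stdout as bytes.

    `cmd` is a list such as ["some-converter", "arg"]; it is supplied by
    the caller, nothing about the real tool's options is assumed.
    """
    try:
        proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    except OSError as exc:  # e.g. the program is not installed
        raise ConversionError(f"could not start {cmd[0]!r}: {exc}") from exc
    if proc.returncode != 0:
        # Pass the converter's own STDERR text on to the caller.
        msg = proc.stderr.decode(errors="replace").strip()
        raise ConversionError(f"{cmd[0]!r} exited with {proc.returncode}: {msg}")
    return proc.stdout
```

A caller can then catch ConversionError separately from whatever the 
parser raises, so the two kinds of failure are reported differently.
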
As for the multiprocessor... You can distribute the parsing based on the 
species files (All_Data takes >12 hours to parse anyway), which will 
lead to far better results (especially if you have a couple of machines 
with a few processors each).
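
A rough sketch of such a split, assuming file size is a fair proxy for 
parse time: hand the largest files out first, always to the worker with 
the least data so far. assign_files is a hypothetical helper and the 
file names in the usage note are made up.

```python
import heapq


def assign_files(file_sizes, n_workers):
    """Split files across n_workers, largest first, balancing total size.

    file_sizes maps file name -> size in bytes.
    Returns a list of n_workers lists of file names.
    """
    if n_workers < 1:
        raise ValueError("n_workers must be at least 1")
    # Heap of (bytes assigned so far, worker index); lightest worker pops first.
    heap = [(0, i) for i in range(n_workers)]
    heapq.heapify(heap)
    batches = [[] for _ in range(n_workers)]
    # Largest files first; ties broken by name so the result is deterministic.
    for name, size in sorted(file_sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        batches[idx].append(name)
        heapq.heappush(heap, (load + size, idx))
    return batches
```

For example, assign_files({"human": 900, "mouse": 700, "rat": 300, 
"fly": 200}, 2) puts "human" and "fly" in one batch and "mouse" and "rat" 
in the other. Each batch can then go to its own process or machine.
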
As for the indexing: if you have to rebuild the index of All_Data... it 
will take a while. Better to have a static file, I think (if I 
understood correctly).
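
Since a rebuild is that expensive, it helps to check whether one is 
needed at all. A minimal sketch, assuming modification times are a good 
enough signal; index_is_stale is a hypothetical helper, not taken from 
any Bioperl indexer.

```python
import os


def index_is_stale(data_path, index_path):
    """Return True if the index is missing or older than the data file."""
    if not os.path.exists(index_path):
        return True
    # Rebuild only when the data file changed after the index was written.
    return os.path.getmtime(data_path) > os.path.getmtime(index_path)
```

With a static, already converted text file this check stays False after 
the first build, so the index is only rebuilt after a new download.
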

Best regards,
Stefan

Michael Seewald wrote:

>
> Hi Mingyi,
>
> I am not sure what you mean. The piping (in my example) already 
> worked nicely. Did it not work for you?
>
> With respect to gene2xml failures: This is nothing the module has to 
> care about. It *might* check for correct ASN1 syntax, but this is as 
> far as I would go. Otherwise, I would just try to make sure that any 
> errors gene2xml throws are caught and passed on. It is the duty of the 
> module and/or the person running the script to watch STDERR output!
>
> With respect to the indexing: Again I do not think this would break 
> anything. Both gunzipping and transforming with gene2xml are 
> transparent to the module. The index must not care about it! The 
> indexer should recognize, however, if the index has to be rebuilt. 
> (This is something that some bioperl modules have problems with AFAICR.)
>
> With respect to disc i/o: This is definitely a time-saver as more and 
> more of us are running multi-CPU machines.
>
> Just my 2p,
> Michael
>
>
> On 9/30/05, Mingyi Liu <mingyi.liu at gpc-biotech.com> wrote:
>
>     I was half way through adding the support for pipe in
>     Bio::ASN1::EntrezGene before I realized that this is not a good
>     solution.  The problem I have with the pipe thing is that it merely
>     added more troubles and did not really save anything.
>
>     I mean, one superficial advantage of using pipe directly would be that
>     you don't need to first launch gene2xml.  But 1. Nobody needs to
>     manually launch gene2xml.  In any shell/perl script that does the
>     automatic download of the NCBI binary ASN files, just add a line to
>     launch gene2xml right after download.  2. Having EntrezGene module
>     deal with it transparently would force it to deal with multiple
>     failure possibilities (no gene2xml installed? gene2xml choked? ...),
>     let alone hassles of changing syntax in input_file.  Simply put,
>     it's not worth it.
>
>     Another proposed advantage is saving disk I/O. In a sense it does
>     (the gzipped binary files are much smaller), but that does not
>     necessarily lead to shorter processing time, since the time gene2xml
>     spends doing its work on the fly should be counted as well.  Not to
>     mention what happens if gene2xml chokes for whatever reason.
>
>     A major disadvantage of using pipe would be doing any sort of seeking
>     operation on the file - the performance would be abysmal.  For
>     indexing and indexed entry retrieval, one simply has to do the
>     pre-conversion of those binary gzipped files.
>
>     As such I feel there are compelling reasons for one to first
>     convert the binary gzip files to text files, then use the existing
>     Bioperl modules to parse, index, retrieve.  Any further
>     input/discussion on the matter is welcome!
>
>     Thanks,
>
>     Mingyi
>
>     Michael Seewald wrote:
>
>     >Hi Stefan,
>     >
>     >There are ways to capture these errors. Perl exception handling might
>     >be a way to do it.
>     >
>     >On the other hand: Wouldn't incomplete .gz downloads throw an error
>     >right away? I have to check (but can't right now).
>     >
>     >Michael
>     >
>
>
> -- 
> Dr. Michael Seewald
> Bioinformatics
> Bayer HealthCare AG


-- 
Stefan Kirov, Ph.D.
University of Tennessee/Oak Ridge National Laboratory
5700 bldg, PO BOX 2008 MS6164
Oak Ridge TN 37831-6164
USA
tel +865 576 5120
fax +865-576-5332
e-mail: skirov at utk.edu
sao at ornl.gov

"And the wars go on with brainwashed pride
For the love of God and our human rights
And all these things are swept aside"


