[Bioperl-l] Genome scanning questions/strategies

Chris Fields cjfields at illinois.edu
Wed Sep 16 13:22:00 UTC 2009


On Sep 15, 2009, at 3:05 AM, Robert Bradbury wrote:

> I have several applications which require scanning multiple genomes.
> In some cases I can get away with scanning the protein sequences; in
> other cases I need to scan the mRNA, or in the worst case the DNA
> sequences themselves.  I have most of the available genomes on my
> hard drive, but in cases where they are not complete or undergo
> frequent revisions I may need to interface with the GenBank, Ensembl,
> or JGI (or other?) databases.
>
> Some of the applications are basic counting statistics:
> 1) How many proteins?
> 2) How many amino acids in the proteins?
> 3) What are the species-specific codon frequencies?
> 4) What fraction of the genome is ncRNA, junk DNA, etc.?
>
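
For the basic counting you shouldn't need anything remote at all.  An
untested sketch using Bio::SeqIO and Bio::Tools::SeqStats (file names
are placeholders; count_codons assumes in-frame coding sequences):

  use Bio::SeqIO;
  use Bio::Tools::SeqStats;

  # proteins and total residues from a proteome FASTA file
  my $in = Bio::SeqIO->new(-file => 'proteome.fa', -format => 'fasta');
  my ($nprot, $naa) = (0, 0);
  while (my $seq = $in->next_seq) {
      $nprot++;
      $naa += $seq->length;
  }
  print "$nprot proteins, $naa residues\n";

  # codon counts pooled over a file of CDS sequences
  my $cds = Bio::SeqIO->new(-file => 'cds.fa', -format => 'fasta');
  my %codons;
  while (my $seq = $cds->next_seq) {
      my $counts = Bio::Tools::SeqStats->count_codons($seq);
      $codons{$_} += $counts->{$_} for keys %$counts;
  }
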
> Other applications involve some functional analysis, e.g. find all
> specified protein domains of interest (presumably some HMM matching
> or equivalent), find all signal sequences (nuclear targeting,
> mitochondrial targeting, ER targeting, etc.), find all mRNA
> restriction enzyme cut sites, etc.
>
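
On the domain side, BioPerl won't run the HMM search itself, but if
you run HMMER locally (hmmpfam or hmmsearch against Pfam, say),
Bio::SearchIO parses the report.  A rough sketch (the report file
name is a placeholder):

  use Bio::SearchIO;

  my $searchio = Bio::SearchIO->new(-format => 'hmmer',
                                    -file   => 'query_vs_pfam.out');
  while (my $result = $searchio->next_result) {
      while (my $hit = $result->next_hit) {
          while (my $hsp = $hit->next_hsp) {
              printf "%s\t%s\t%s\n",
                  $result->query_name, $hit->name, $hsp->evalue;
          }
      }
  }
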
> Questions are:
> 1) Are there "remote" functions that use genome center
> "supercomputers" (other than, say, Remote Blast) that can be used
> for some of these purposes and are interfaced in some way to BioPerl?

Re: remote tasks, there are a few tools for that.  See the
Bio::Tools::Analysis modules, which access remote analysis servers,
or the HOWTO:

http://www.bioperl.org/wiki/HOWTO:Simple_web_analysis
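
A minimal sketch along the lines of that HOWTO (these modules ship in
the bioperl-run package, and the sequence here is just a dummy):

  use Bio::PrimarySeq;
  use Bio::Tools::Analysis::Protein::HNN;

  my $seq = Bio::PrimarySeq->new(
      -id       => 'test',
      -alphabet => 'protein',
      -seq      => 'MSADQRWRQDSQDSFGDSFDGDSFFGSDFDGDS');

  # submit the sequence to the remote HNN secondary structure server
  my $hnn = Bio::Tools::Analysis::Protein::HNN->new(-seq => $seq);
  $hnn->run;
  my @features = $hnn->result('Bio::SeqFeatureI');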

Setting up modules for these services can be risky, though, as we have  
no control over the continued evolution of the remote servers in  
question.  For instance, we had a set of Pise modules (around 100 I  
think) for remotely accessing services at any Pise server; however,  
these are now obsolete in favor of Mobyle.  I have long thought of
setting something up to interface with either that service or Galaxy
(which may be a more stable alternative), but I just haven't had the
time.

Re databases: we have access to NCBI, EMBL, UniProt, and many others.
NCBI's EUtils are available via Bio::DB::EUtilities.  You can use the
Ensembl Perl API for accessing Ensembl (including Compara and others),
and Mark Jensen added Bio::DB::HIV for accessing data from the LANL
HIV Sequence Database.  These were all working with BioPerl 1.6 last
I tried (Ensembl's API is separate and available from their website).
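
For example, pulling a few protein records as FASTA through EUtilities
looks something like this (the GI numbers and email are placeholders):

  use Bio::DB::EUtilities;

  my $factory = Bio::DB::EUtilities->new(
      -eutil   => 'efetch',
      -db      => 'protein',
      -rettype => 'fasta',
      -email   => 'me@example.org',   # NCBI asks for a contact address
      -id      => [qw(1621261 89318838)]);

  # write the returned records to a file
  $factory->get_Response(-file => 'myseqs.fa');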

We don't have much beyond that, primarily because most other centers
are very particular about remote queries and will block IPs that spam
their servers without an adequate delay between requests.  That's
completely understandable from a webadmin's perspective (think:
possible denial-of-service attack).

> 2) Will I incur genome center wrath by running all my queries
> "remotely" (i.e. I do the computing, but they handle the database
> retrieval & network distribution)?  If not, what is a good "max
> query frequency"?  [I'm on a DSL line, so I can't push most servers
> very hard from an I/O standpoint.]

You may if you ignore a site's specified rate limit.  UCSC and NCBI
have both been known to block IPs, but the limits are quite different
(NCBI just relaxed theirs to three queries per second, whereas last I
heard UCSC's was one query per 30 seconds).

The best thing to do is check the documentation for the site in
question, or contact the webadmin to see if there is a requested
delay between queries.
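
For batch retrieval the simplest safeguard is to throttle yourself on
the client side.  An untested sketch with Bio::DB::GenBank (the
accessions are placeholders):

  use Bio::DB::GenBank;
  use Time::HiRes qw(sleep);   # fractional-second sleep

  my $gb   = Bio::DB::GenBank->new;
  my @accs = qw(AB077698 AF303112);   # placeholder accessions
  for my $acc (@accs) {
      my $seq = $gb->get_Seq_by_acc($acc);
      # ... analyze $seq here ...
      sleep(0.4);   # ~2.5 queries/sec, under NCBI's 3/sec limit
  }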

> Finally, is there any "archive of experience" documenting the
> limitations of various information systems for various
> bioinformatics applications?  I.e., for I/O and/or CPU requirements,
> is BLAST < HMM-domain-searching < inter-genome-signal-scanning/
> matching?  This relates to the question of when home-based
> bioinformaticians need to begin considering switching from DSL to
> Cable to FIOS, and/or whether 1/3/4/6/8-core machines/clusters can
> handle the workload.
>
> Thank you,
> Robert Bradbury

On that I'm not sure, but I would tend to think they don't want you
taxing their local servers, so there is probably some prioritization
of tasks.

From my perspective, if I were a home-based bioinformatician I would
look seriously at cloud computing for most high-end tasks (Mark has
even set up an image for bioperl, bioperl-max).  It has a cost, but
it's very reasonable considering the expense of setting up a local
cluster, maintenance and repairs, etc.  In fact, we have been putting
serious thought into testing that direction instead of putting money
into another high-cost local cluster, which would be obsolete in,
say, 3-4 years, or when we get Blue Waters in a couple of years.

chris



