[Bioperl-l] Genome scanning questions/strategies
Chris Fields
cjfields at illinois.edu
Wed Sep 16 13:22:00 UTC 2009
On Sep 15, 2009, at 3:05 AM, Robert Bradbury wrote:
> I have several applications which require scanning multiple genomes. In
> some cases I can get away with scanning the protein sequences; in other
> cases I need to scan the mRNA or, in the worst case, the DNA sequences
> themselves. I have most of the available genomes on my hard drive, but
> where they are not complete or undergo frequent revisions, I may need
> to interface with the GenBank, Ensembl, or JGI (or other?) databases.
>
> Some of the applications are basic counting statistics:
> 1) How many proteins?
> 2) How many amino acids in the proteins?
> 3) What are the species-specific codon frequencies?
> 4) What fraction of the genome is ncRNA, junk DNA, etc.?
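For the counting statistics, plain BioPerl over local files goes a long
way. A minimal sketch, assuming local FASTA files of proteins and of CDS
nucleotide sequences (the file names here are just placeholders):

  use strict;
  use warnings;
  use Bio::SeqIO;
  use Bio::Tools::SeqStats;

  # 1) and 2): count proteins and total residues
  my $in = Bio::SeqIO->new(-file => 'proteins.fa', -format => 'fasta');
  my ($nprot, $naa) = (0, 0);
  while (my $seq = $in->next_seq) {
      $nprot++;
      $naa += $seq->length;
  }
  print "$nprot proteins, $naa amino acids\n";

  # 3): tally codon usage over all CDS sequences
  my %codons;
  my $cds = Bio::SeqIO->new(-file => 'cds.fa', -format => 'fasta');
  while (my $seq = $cds->next_seq) {
      my $counts = Bio::Tools::SeqStats->count_codons($seq);
      $codons{$_} += $counts->{$_} for keys %$counts;
  }
  printf "%s\t%d\n", $_, $codons{$_} for sort keys %codons;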
>
> Other applications involve some functional analysis, e.g. find all
> specified protein domains of interest (presumably some HMM matching or
> equivalent), find all signal sequences (nuclear targeting,
> mitochondrial targeting, ER targeting, etc.), and find all mRNA
> restriction enzyme cut sites.
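For the domain scanning, the usual approach is to run hmmscan/hmmpfam
locally against Pfam and parse the report with Bio::SearchIO;
restriction sites can be handled with Bio::Restriction::Analysis. A
rough sketch (the report file name is a placeholder):

  use strict;
  use warnings;
  use Bio::SearchIO;
  use Bio::PrimarySeq;
  use Bio::Restriction::Analysis;

  # Parse a locally generated HMMER report
  my $in = Bio::SearchIO->new(-format => 'hmmer', -file => 'hmmscan.out');
  while (my $result = $in->next_result) {
      while (my $hit = $result->next_hit) {
          while (my $hsp = $hit->next_hsp) {
              printf "%s\t%s\t%s\n",
                  $result->query_name, $hit->name, $hsp->evalue;
          }
      }
  }

  # Restriction enzyme cut sites on a nucleotide sequence
  my $seq = Bio::PrimarySeq->new(-seq      => 'AAGCTTGAATTCGGATCC',
                                 -alphabet => 'dna');
  my $ra  = Bio::Restriction::Analysis->new(-seq => $seq);
  my @positions = $ra->positions('EcoRI');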
>
> Questions are:
> 1) Are there "remote" functions that use genome center
> "supercomputers" (other than, say, remote BLAST) that can be used for
> some of these purposes and are interfaced in some way to BioPerl?
Re: remote tasks, there are a few tools for that. See
Bio::Tools::Analysis modules for ones that access remote servers, or
the HOWTO:
http://www.bioperl.org/wiki/HOWTO:Simple_web_analysis
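The basic pattern from that HOWTO looks roughly like this (NetPhos is
just one example service, and of course the remote server has to be up):

  use Bio::PrimarySeq;
  use Bio::Tools::Analysis::Protein::NetPhos;

  my $seq = Bio::PrimarySeq->new(-seq => 'MSKRQSPLVECLRSKILDCFQSLEE',
                                 -id  => 'test_prot');

  # Submit to the remote server and block until the results return
  my $tool = Bio::Tools::Analysis::Protein::NetPhos->new(-seq => $seq);
  $tool->run;

  # Results are available raw, parsed, or as SeqFeature objects
  my @features = $tool->result('Bio::SeqFeatureI');
  printf "%d..%d\n", $_->start, $_->end for @features;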
Setting up modules for these services can be risky, though, as we have
no control over the continued evolution of the remote servers in
question. For instance, we had a set of Pise modules (around 100, I
think) for remotely accessing services at any Pise server; these are
now obsolete in favor of Mobyle. I have long thought of setting
something up to interface with either that service or Galaxy (which may
be a more stable alternative); I just haven't had the time.
Re databases: we have access to NCBI, EMBL, UniProt, and many others.
The NCBI eutils are available via Bio::DB::EUtilities, you can use the
Ensembl Perl API for accessing Ensembl (including Compara and others),
and Mark Jensen added Bio::DB::HIV for accessing the LANL HIV Sequence
Database. These were all working with BioPerl 1.6 the last time I tried
(Ensembl's API is separate and available from their website).
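As a taste of the eutils interface, fetching a batch of records looks
something like this (the IDs and email address are placeholders; NCBI
asks that you supply a real address):

  use Bio::DB::EUtilities;

  my $factory = Bio::DB::EUtilities->new(-eutil   => 'efetch',
                                         -db      => 'protein',
                                         -rettype => 'fasta',
                                         -email   => 'me@example.org',
                                         -id      => [qw(1621261 89318838)]);

  # Write the returned FASTA records straight to a file
  $factory->get_Response(-file => 'myseqs.fa');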
We don't have much beyond that, primarily because most other centers
are very particular about remote queries and will block IPs that spam
their servers without an adequate delay between requests. That's
completely understandable from a webadmin's perspective (think:
possible denial-of-service attack).
> 2) Will I incur genome center wrath by running all my queries
> "remotely" (i.e. I do the computing, but they handle the database
> retrieval and network distribution)? If not, what is a good "max query
> frequency"? [I'm on a DSL line, so I can't push most servers very hard
> from an I/O standpoint.]
You may if you exceed a site's specified query rate. UCSC and NCBI have
both been known to block IPs, but the limits are quite different
between the two (NCBI just reduced theirs to three queries per second,
whereas the last I heard UCSC's was one query per 30 seconds).
The best thing to do is check the documentation for the site in
question, or contact the webadmin to ask whether there is a requested
delay between queries.
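If you're scripting your own retrieval loop, it's cheap insurance to
build the delay in yourself. A trivial throttle, assuming NCBI's
three-queries-per-second guideline (adjust the interval to whatever the
site in question asks for):

  use strict;
  use warnings;
  use Time::HiRes qw(sleep time);

  my $min_interval = 1/3;   # seconds between requests
  my $last_request = 0;

  sub throttled {
      my ($request) = @_;   # code ref wrapping one remote call
      my $wait = $min_interval - (time() - $last_request);
      sleep($wait) if $wait > 0;
      $last_request = time();
      return $request->();
  }

  # e.g. my $seq = throttled( sub { $gb->get_Seq_by_acc($acc) } );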
> Finally, is there any "archive of experience" documenting the
> limitations of various information systems for various bioinformatics
> applications? I.e., in terms of I/O and/or CPU requirements, is BLAST
> < HMM domain searching < inter-genome signal scanning/matching? This
> relates to the question of when home-based bioinformaticians need to
> start considering switching from DSL to cable to FiOS, and/or whether
> 1/3/4/6/8-core machines/clusters can handle the workload.
>
> Thank you,
> Robert Bradbury
On that I'm not sure, but I would tend to think they don't want you
taxing their local servers, so there is probably some prioritization of
tasks.
From my perspective, if I were a home-based bioinformatician I would
look seriously at cloud computing for most high-end tasks (Mark has
even set one up for BioPerl, bioperl-max). It has a cost, but it's very
reasonable compared with the cost of setting up a local cluster and
handling maintenance, repairs, etc. In fact, we have been putting
serious thought into testing that direction instead of putting money
into another high-cost local cluster that will be obsolete in, say, 3-4
years, or whenever we get Blue Waters in a couple of years.
chris