[Bioperl-l] Reciprocal best hits using Bioperl?
Jason Stajich
jason at bioperl.org
Mon Jan 18 15:24:33 EST 2010
On Jan 18, 2010, at 8:12 AM, Chris Fields wrote:
> (my small rant on this)
>
> On Jan 18, 2010, at 4:20 AM, Robert Bradbury wrote:
>
>> My comment might be that the problem with OrthoMCL is that it
>> primarily covers lower organisms. The problem with Ensembl (and some
>> other databases) is that it primarily covers higher organisms (though
>> they do include Drosophila, C. elegans and yeast).
>
> OrthoMCL v2 handles both lower and higher organisms; I've used it for
> both, with decent success. Most other ortholog tools do as well (if
> I'm not mistaken, Ensembl also uses MCL under the hood, unless
> that's changed). I don't believe one should be completely bound to
> one toolset, particularly in this case (there are lots of nice
> ortholog clustering tools using various means of comparison out
> there), but I do think OrthoMCL is very good as an initial pass. If
> anything, I would like a set of (possibly bioperl-based, definitely
> DB-based) modules that can deal with this information.
>
> The more imperative issue in my opinion is that one is prisoner to
> the gene models for those specific organisms of interest, and this
> may vary widely depending on the source of those gene models
> (Ensembl, UCSC, NCBI, EBI, centralized MODs like FlyBase, etc). For
> instance, if gene models are poorly curated or rarely updated, the
> comparisons may be significantly flawed. Some of these issues may
> also be (somewhat) alleviated once more transcriptome data is
> available that helps clear up gene model ambiguities, but that won't
> be true for all organisms, at least initially.
>
> Note this isn't meant as a slam on any specific DBs or MODs in
> general; the problem is born of the fact that there isn't a
> single, centralized, trusted, consistently updated source for this
> data, specifically something that will handle moderated third-party
> annotation. That's a very difficult problem to solve effectively.
> Some of these very issues cropped up at the GMOD conference, and there
> appears to be consensus that a real attempt is needed to address this.
>
> I don't know, maybe it's just unicorns and rainbows. Personally I
> do think the situation will improve, as there seems to be great
> demand for it, but it requires time, resources, manpower, money, cat
> herding, etc.
>
>> The problem arises when one wants to cross those boundaries. For
>> example the 5-10 antioxidant proteins, the ~150 DNA repair proteins,
>> many of the mitochondrial (ETC) proteins, the rRNAs & tRNAs, and the
>> fundamental biochemistry (EC) proteins are homologous all the way
>> from the most ancient bacteria through H. sapiens. The only way to
>> play in the mixed arena of prokaryotes and eukaryotes involving
>> fundamental vectors in evolution is to either construct one's own
>> databases (which presumably means getting involved with MySQL, and
>> probably spending some $$$ on hardware) or to develop some BioPerl
>> modules that can do the SpeciesX vs. SpeciesY comparisons on demand
>> using some part of the cloud. This problem isn't going to get
>> smaller, it's only going to get larger, now that the cost of
>> sequencing (pseudo-resequencing) a vertebrate genome is starting to
>> come in under $10,000 and people are starting to seriously talk
>> about 10,000 vertebrate genomes. 10,000 x 10,000 x 20,000 (genes)
>> isn't something people are going to undertake very soon.
>>
>> Robert
>
> They're already undertaking it now using a broad range of organisms,
> in and out of the cloud. In most cases one can amend a prior
> reciprocal comparative analysis with new data fairly easily, if one
> takes care to do so early on (i.e. set up the BLAST databases with a
> fixed, explicitly specified size so that statistics are comparable
> between separate analyses). OrthoMCL v2 describes a procedure to do
> this, and I believe others have similar methodology.
>
> I could also see possible ways one can further optimize this, for
> instance in cases where two very closely-related organisms are
> compared, where translated seqs are 100% identical, etc. IIRC, the
> OrthoMCL DB site already has a way to upload custom sets of protein
> data for mapping to (already pre-run) clusters. Just the fact that
> the tools are available as open source, are semi-automated, and can be
> generically applied to data of personal interest is a great boon.
> Not sure I see the downside of that, and I'm pretty confident the
> scalability issues will be addressed in some way.
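
On the point above about fixing the database size: BLAST E-values scale
with the search space, so the same alignment gets a different E-value
once more genomes are added to the database. A minimal sketch of why,
using the standard Karlin-Altschul relation E = m * n * 2^(-S') for a
bit score S'. The lengths and score below are made-up numbers, and this
ignores the effective-length (edge) corrections that BLAST applies.

```python
def expected_hits(bit_score, query_len, db_len):
    """Karlin-Altschul expectation E = m * n * 2**(-S') for a bit score S'.

    query_len (m) and db_len (n) are in residues; edge-effect
    corrections used by real BLAST are deliberately ignored here.
    """
    return query_len * db_len * 2.0 ** (-bit_score)


# Hypothetical numbers: one 50-bit hit from a 300-residue query.
BITS, QUERY_LEN = 50, 300

e_small = expected_hits(BITS, QUERY_LEN, 5_000_000)    # original database
e_grown = expected_hits(BITS, QUERY_LEN, 50_000_000)   # after adding genomes

# Pinning n to one agreed value makes both runs report the same E-value.
FIXED_DB_LEN = 100_000_000  # assumed, arbitrary choice
e_fixed_run1 = expected_hits(BITS, QUERY_LEN, FIXED_DB_LEN)
e_fixed_run2 = expected_hits(BITS, QUERY_LEN, FIXED_DB_LEN)

print(e_small, e_grown, e_fixed_run1 == e_fixed_run2)
```

With the real tools the size is pinned with a command-line option (I
believe -z for legacy blastall and -dbsize for BLAST+), rather than
recomputed by hand like this.
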
I think that the approach that Paul Thomas's group at SRI
(http://www.ai.sri.com/esb/) is taking is really what you'd want to
focus on if you are only interested in a particular set of gene
families rather than de novo clustering. That or the PhyloFacts
approach (http://phylogenomics.berkeley.edu/phylofacts/). That is
where HMMs are more appropriate: you focus on your initial seed set of
protein families and build HMMs for those families, with some automated
clustering initially to get better resolution. Once you start throwing
several million (multiple 10^6) proteins at it, the unsupervised
clustering approach may not be able to give as accurate or timely
results, but it can be a good initial filtering step depending on how
much initial knowledge you are starting with. Using HMMs won't be as
computationally expensive either, if you are compute limited.
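
As a rough illustration of unsupervised clustering as a filtering
step, here is a minimal sketch that groups proteins by single linkage
over pairwise hits above a score cutoff. This is not MCL or what
OrthoMCL does; it is just connected components, and the cutoff, the
(protein, protein, bit_score) input format and the IDs are all
assumptions for the example.

```python
def single_linkage_clusters(hits, min_bits=50.0):
    """Group proteins into connected components of hits >= min_bits.

    hits: iterable of (protein_a, protein_b, bit_score) triples.
    Returns a list of sorted ID lists, largest cluster first.
    """
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, bits in hits:
        root_a, root_b = find(a), find(b)  # registers both proteins
        if bits >= min_bits and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for protein in list(parent):
        clusters.setdefault(find(protein), set()).add(protein)
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Hypothetical hits; the 30-bit pair falls below the cutoff.
example_hits = [("hsa1", "mmu1", 200.0),
                ("mmu1", "dme1", 80.0),
                ("hsa2", "sce9", 30.0)]
clusters = single_linkage_clusters(example_hits)
print(clusters)
```

Components that contain one of your seed proteins could then be handed
to the HMM step, and the rest set aside.
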
TreeFam (http://www.treefam.org/) also provides curated phylogenies of
gene families that span the opisthokonts, in that a few fungi are
sprinkled in. Also, things like http://boinc.bio.wzw.tum.de/boincsimap/
provide ways to use distributed computing to calculate the matrix of
similarities among proteins if you are interested in the exhaustive
approach.
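
Coming back to the subject line: once you have pairwise similarities
in both directions, picking reciprocal best hits is a small step. A
minimal sketch, assuming the hits have already been parsed (e.g. from
tabular BLAST output) into (query, subject, bit_score) triples. The
function names and IDs are made up for illustration, and ties simply
go to the first hit seen.

```python
def best_hits(hits):
    """Map each query to its top-scoring subject.

    hits: iterable of (query, subject, bit_score) triples.
    On equal scores the first hit encountered is kept.
    """
    best = {}
    for query, subject, bits in hits:
        if query not in best or bits > best[query][1]:
            best[query] = (subject, bits)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where each is the other's best hit."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical hits for species A searched against B, and B against A.
a_vs_b = [("A1", "B1", 250.0), ("A1", "B2", 90.0),
          ("A2", "B2", 120.0), ("A3", "B1", 200.0)]
b_vs_a = [("B1", "A1", 245.0), ("B2", "A2", 118.0), ("B2", "A1", 85.0)]

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh)  # [('A1', 'B1'), ('A2', 'B2')]
```

A3 hits B1 best, but B1 prefers A1, so that pair is dropped. In
practice you would also want E-value and coverage cutoffs before
calling anything an ortholog.
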
-jason
>
> chris
--
Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org
http://fungalgenomes.org/