[Bioperl-l] Reciprocal best hits using Bioperl?

Mon Jan 18 05:20:33 EST 2010

My comment would be that the problem with OrthoMCL is that it
primarily covers lower organisms.  The problem with Ensembl (and some
other databases) is that it primarily covers higher organisms (though
they do include Drosophila, C. elegans and yeast).

The problem arises when one wants to cross those boundaries.  For
example, the 5-10 antioxidant proteins, the ~150 DNA repair proteins,
many of the mitochondrial (ETC) proteins, the rRNAs and tRNAs, and
the fundamental biochemistry (EC) proteins are homologous all the way
from the most ancient bacteria through H. sapiens.  The only way to
play in the mixed arena of prokaryotes and eukaryotes involving
fundamental vectors in evolution is either to construct one's own
databases (which presumably means getting involved with MySQL, and
probably spending some $$$ on hardware) or to develop some BioPerl
modules that can do the SpeciesX vs. SpeciesY comparisons on demand
using some part of the cloud.  This problem isn't going to get
smaller; it's only going to get larger, now that the cost of
sequencing (pseudo-resequencing) a vertebrate genome is starting to
come in under $10,000 and people are starting to seriously talk about
10,000 vertebrate genomes.  10,000 x 10,000 x 20,000 (genes) isn't
something people are going to undertake very soon.
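
To make the "SpeciesX vs. SpeciesY on demand" idea concrete, here is
a minimal sketch of reciprocal best hits. It is written in Python
rather than as a BioPerl module and is only an illustration, not an
existing tool. It assumes two BLAST runs (X against Y and Y against
X) saved in the standard 12-column tabular format, with the bit score
in the last column; the function names, the row() helper and the toy
protein IDs are all made up for the example.

```python
def best_hits(blast_lines):
    """Map each query ID to the subject ID with the highest bit score."""
    best = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        # Assumed column layout: query, subject, ..., evalue, bitscore.
        query, subject, bitscore = fields[0], fields[1], float(fields[11])
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(x_vs_y_lines, y_vs_x_lines):
    """Return sorted (x, y) pairs that are each other's best hit."""
    forward = best_hits(x_vs_y_lines)
    backward = best_hits(y_vs_x_lines)
    return sorted((x, y) for x, y in forward.items() if backward.get(y) == x)


def row(query, subject, evalue, bitscore):
    """Hypothetical helper: build one 12-column tabular line for the demo."""
    filler = ["90.0", "100", "10", "0", "1", "100", "1", "100"]
    return "\t".join([query, subject] + filler + [evalue, str(bitscore)])


# Toy data: xC's best hit is yB, but yB's best hit is xB, so xC is dropped.
x_vs_y = [
    row("xA", "yA", "1e-50", 200),
    row("xA", "yB", "1e-10", 80),
    row("xB", "yB", "1e-40", 150),
    row("xC", "yB", "1e-05", 60),
]
y_vs_x = [
    row("yA", "xA", "1e-50", 198),
    row("yB", "xB", "1e-40", 149),
    row("yB", "xC", "1e-05", 55),
]

if __name__ == "__main__":
    for x_id, y_id in reciprocal_best_hits(x_vs_y, y_vs_x):
        print(x_id, y_id)
```

This ranks hits by bit score only and does nothing about inparalogs,
ties, or short and fragmentary proteins, so it is closer to plain
pairwise reciprocal best hits than to what OrthoMCL does.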

Robert

On 1/17/10, Tristan Lefebure <tristan.lefebure at gmail.com> wrote:
> On Sunday 17 January 2010 18:59:05 Jason Stajich wrote:
>> yes - but mcl alone is something slightly different in
>>  that it doesn't correct for inparalogs, but for
>>  incomplete genomes this is probably okay.
>
> Interestingly, my experience with not too divergent
> bacterial genomes (same genus) does not support the
> normalization used in orthoMCL (which, as far as I
> understand, is a standardization of the -log10(evalue) per
> taxon combination, including a taxon with itself). MCL, which
> does not do any normalization (just -log10(evalue)), gives
> about the same number of false negatives (i.e. missed
> orthologs), but far fewer false positives (false orthologs).
> In other words, you get many fake singletons. I don't know
> exactly if the problem lies in the normalization process or
> the fact that orthoMCLv1.x is using a very old version of
> MCL. What I do know is that many false positives are made of
> short or incomplete proteins that are very common in draft
> genomes and automatic annotations... Things might be
> completely different with more divergent and globally longer
> proteins. Testing orthoMCLv2 on the same data set would
> probably give the answer.
>
> --Tristan