[Bioperl-l] Generalized reciprocal blast

Wed Aug 26 15:55:04 UTC 2009

Robert -

BioPerl has traditionally been a toolkit for building these types
of pipelines and is not necessarily intended to be a place for larger
systems.  That said, BRH (best reciprocal hit) is a pretty easy
algorithm that could be applied with the tools in place; the main
issue is what kind of lookup table you want to use for establishing
the BRH.  Hashes are okay, but I think BDB or SQLite end up being more
scalable and allow for persistence.
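
As a rough illustration of the SQLite-backed lookup table idea, here is a
minimal sketch.  It is written in Python rather than Perl purely for
brevity, and it assumes the BLAST reports have already been parsed into
(query, subject, bitscore) tuples.  The table layout, the 'AB' / 'BA'
direction labels, and the function names load_hits and
best_reciprocal_hits are all hypothetical choices for this example, not
part of BioPerl or any existing tool.

```python
import sqlite3


def load_hits(conn, direction, hits):
    """Store parsed hits for one search direction ('AB' or 'BA').

    `hits` is an iterable of (query, subject, bitscore) tuples; how they
    are parsed out of the BLAST report is left to the caller.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS hits "
        "(direction TEXT, query TEXT, subject TEXT, bitscore REAL)"
    )
    # The index keeps the per-query best-hit lookup cheap on large tables.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS hits_by_query ON hits (direction, query)"
    )
    conn.executemany(
        "INSERT INTO hits VALUES (?, ?, ?, ?)",
        [(direction, q, s, b) for q, s, b in hits],
    )
    conn.commit()


def best_reciprocal_hits(conn):
    """Return (a_id, b_id) pairs that are each other's top-scoring hit.

    Ties on bitscore are all kept, so a query can appear in more than
    one pair if its best score is shared by several subjects.
    """
    sql = """
    WITH best AS (
        SELECT h.direction, h.query, h.subject
        FROM hits AS h
        WHERE h.bitscore = (
            SELECT MAX(i.bitscore) FROM hits AS i
            WHERE i.direction = h.direction AND i.query = h.query
        )
    )
    SELECT ab.query, ab.subject
    FROM best AS ab
    JOIN best AS ba
      ON ba.query = ab.subject AND ba.subject = ab.query
    WHERE ab.direction = 'AB' AND ba.direction = 'BA'
    ORDER BY ab.query, ab.subject
    """
    return conn.execute(sql).fetchall()


if __name__ == "__main__":
    # Use a file path instead of ":memory:" to get persistence.
    conn = sqlite3.connect(":memory:")
    load_hits(conn, "AB", [("a1", "b1", 200), ("a1", "b2", 150), ("a2", "b2", 90)])
    load_hits(conn, "BA", [("b1", "a1", 190), ("b2", "a1", 140), ("b2", "a2", 80)])
    print(best_reciprocal_hits(conn))
```

Passing a filename to sqlite3.connect instead of ":memory:" is what gives
you the persistence: the hits only need to be loaded once and the BRH
query can be rerun later with different criteria.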

Really, I would use something like OrthoMCL rather than reciprocal  
BLAST to identify families anyways.

OrthoMCL uses BioPerl under the hood for parsing.  Although it suffers
from some pretty inefficient management of the lookup table for the BRH
part of the algorithm, it can be run on your own customized datasets
to integrate public and private data.

You might also have better luck building good alignments for the
key members of your target gene family of interest and then using a
profile HMM (or even just the new HMMER3 jackhmmer or phmmer, which
don't require an MSA) to identify the full set of homologs in all the
databases.  If this is the only set of families you care about, it is a
lot less computational work to go through and pull these out with an
HMM or HMMER search and build trees from these results than to deal
with the computational time of the all-vs-all DB searches that
you are proposing.

-jason
On Aug 26, 2009, at 8:38 AM, Robert Bradbury wrote:

> I would like to know whether or not anyone has attempted to create a
> "generalized" reciprocal blast component for BioPerl?
>
> One sees papers all the time where they discuss running reciprocal blasts to
> compare a new species to an old "standard" species or a set of species, or
> running an all-to-all set of comparisons to match up all of the "known"
> proteins from species and determine which are outliers (and therefore
> "novel").  There are also accumulating merged sets in NCBI HomoloGene (which
> seems to be a strict subset (perhaps a dozen) of "well sequenced" genomes)
> and Ensembl (which seems to be working with a much larger set of 40-50
> genomes, some of which may be somewhat incomplete and are certainly poorly
> "explored").
>
> I have, I believe, seen code "fragments" from various authors,  
> perhaps some
> on the BioPerl list, which perform some major subset of a typical
> "reciprocal blast".
>
> Now what I am looking for is a relatively generalizable some-to-some
> reciprocal blast utility.  I want to be able to specify the genes  
> (or gene
> family), e.g. some of the ~150 known DNA repair genes.  It would be  
> helpful
> to also specify how "tolerant" the blast "true reciprocal" criteria  
> are.
> There are some genes where there is a very strict 1-to-1 relationship across
> many genomes.  But for genes which involve relatively standard domains, e.g.
> "helicase" domains, the 1-to-1 relationship becomes cloudy -- in mammals, for
> example, it's more like 5-to-5 and it would be really nice to be able to
> specify the strictness or quality level [1] for "matching" genes (and even
> which genes are to be excluded because they are known to be false
> homologues).
>
> Then to top this off I want to be able to combine known public databases
> (e.g. HomoloGene / UniGene / Ensembl) with perhaps local private
> databases or database subsets (e.g. emerging or specialized genomes).
>
> The goal here is of course to determine the precise phylogenetic relationships
> between all of the DNA repair genes and how there may be gain / loss /
> evolution of function that can be related to species characteristics (size,
> longevity, etc.).
>
> Is there a generalized reciprocal blast component in BioPerl?  Or is  
> it a
> "build-it-yourself" situation (that I have to believe has been built
> probably a few dozen times by various researchers / organizations /
> companies)?
>
> Thanks,
> Robert Bradbury
>
> 1. This would be handled in BioPerl with a customizable user  
> function which
> could be tailored to handle specific cases -- for example a function  
> which
> when handed a set of 100 potential "matches" could go through those  
> 100
> matches, identify common domains, and then "re-rate" matches based on
> considerations such as the type and number of common domains,  
> domains being
> in the same order, etc.  I.e. criteria which may be difficult to  
> completely
> generalize across entire genomes but are fairly obvious if you are  
> looking
> at a graphical replication of a gene set in HomoloGene.

--
Jason Stajich
jason.stajich at gmail.com
jason at bioperl.org