[Bioperl-l] Problems with "tblastx"-like output and GFF
    Neil Saunders 
    neil.saunders at unsw.edu.au
       
    Tue Nov  2 20:49:32 EST 2004
    
    
  
Dear all,
I am working on ways to visualise conserved gene clusters (or gene 
synteny) in microbial genomes.  Basically, what I want to do is (1) 
compare a set of contigs to 1 or more finished genomes using one or all 
of blast, blat and MUMmer, (2) convert the output to GFF including 
'match' lines and (3) visualise in Generic Genome Browser.  Along the 
way I've come across a few problems to share with you - I'm 
cross-posting to bioperl and gbrowse lists.
For comparing genomes in this way a "tblastx-like" approach is best, as 
proteins are more conserved than DNA.  This requires use of:
(1) tblastx
(2) blat with "t=dnax -q=dnax" (which I'll call "blatx")
(3) promer from the MUMmer package
So far so good.  Problems arise when parsing these outputs to generate 
GFF.  My problems are:
(1) Bio::Tools::Blat.
When I use this module to parse "blatx" output (just following the 
example in the module docs), I get errors like so:
Argument "++" isn't numeric in numeric ne (!=) at 
/usr/local/share/perl/5.8.4/Bio/Location/Atomic.pm line 170, <GEN0> line 
228.
"Blatx" has a strand column containing both query and target strand 
(e.g. "++") and it seems Blat.pm is not happy with this?  This looks 
like an easy fix.
(2) Bio::SearchIO::psl
This module gives me errors such as:
Use of uninitialized value in pattern match (m//) at 
/usr/local/share/perl/5.8.4/Bio/SearchIO/psl.pm line 176, <GEN1> line 
2503.
I'm not entirely sure what to make of this, as the psl file looks OK.
(3) search2gff and blast2gff
I've used the search2gff script from bioperl and the blast2gff script 
from gbrowse.  What I found was:
- they both work fine on blastn output
- search2gff with the "--match" switch works fine on blastn output, but 
gives errors on tblastx output
- blast2gff gives similar errors to search2gff when trying to construct 
a match line from tblastx output
The errors when trying to create "match" lines look like this:
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Undefined sub-sequence (2400764,2400765). Valid range = 2400764 - 
2400871
STACK: Error::throw
STACK: Bio::Root::Root::throw 
/usr/local/share/perl/5.8.4/Bio/Root/Root.pm:328
STACK: Bio::Search::HSP::HSPI::matches 
/usr/local/share/perl/5.8.4/Bio/Search/HSP/HSPI.pm:711
STACK: Bio::Search::SearchUtils::_adjust_contigs 
/usr/local/share/perl/5.8.4/Bio/Search/SearchUtils.pm:389
STACK: Bio::Search::SearchUtils::tile_hsps 
/usr/local/share/perl/5.8.4/Bio/Search/SearchUtils.pm:182
STACK: Bio::Search::Hit::GenericHit::strand 
/usr/local/share/perl/5.8.4/Bio/Search/Hit/GenericHit.pm:1440
STACK: /usr/bin/bp_search2gff.pl:177
-----------------------------------------------------------
and the values in parentheses after "Undefined sub-sequence" always 
differ by one (as above, 2400764,2400765).
I had some discussion with Jason S. about this issue some months ago and 
we concluded there was some complicated problem in the internals of 
tile_hsps.
What all this boils down to is that "tblastx-like" output is quite hard 
to deal with.  I have taken to writing my own scripts where I:
1) store GFF lines for single HSPs in an array
2) loop through and create a hash of hashes of arrays with keys query_id 
+ target_id and an array of starts/ends
3) sort the array and use the first+last values to generate the "match" 
lines.
This works OK (I've done this for promer too), but is a bit basic (e.g. 
"match" lines are not given scores).  I have an example of how this 
looks for promer at:
http://psychro.bioinformatics.unsw.edu.au/perl/gbrowse_img/sim?name=Contig70:38893..138892;type=CDS+PROMER;width=1024
If anyone is interested, I can provide output files and sample scripts 
and perhaps we can work this issue.  I use:
Debian Linux (unstable)
Latest CVS bioperl and gbrowse
BLAST 2.2.8
MUMmer 3.15
BLAT v. 17
thanks for your ideas,
Neil
-- 
 School of Biotechnology and Biomolecular Sciences,
 The University of New South Wales,
 Sydney 2052,
 Australia
http://psychro.bioinformatics.unsw.edu.au/neil/index.php
    
    
More information about the Bioperl-l
mailing list