[Bioperl-l] Converting between allowed SearchIO formats?

Tue Sep 22 14:00:44 UTC 2009

On Sep 22, 2009, at 8:09 AM, Dan Bolser wrote:

> Hi all,
>
> I'm reading in a blasttable format blast result, filtering, and hoping
> to write out a similarly formatted result. Based on experience with
> SeqIO, I expected to do something like the following:
>
> use Bio::SearchIO;
>
> ## Open the sequence search report
> my $seqI = Bio::SearchIO->
> new( -file   => $file,
>      -format => $format,
>    );
>
> ## Open the output report
> my $seqO = Bio::SearchIO->
> new( -file   => ">OUTPUT",
>      -format => $format,
>    );
>
> while( my $result = $seqI->next_result ) {
> ## Do some filtering...
>
> $seqO->write_result( $result );
> }
>
>
> However, the above method does not work here. Is this for some deep
> reason, or could the above method (based on the way SeqIO works) be
> made to work? I'm guessing that the SearchIO object conversion is
> simply harder to do than with SeqIO?

This is something Jason could probably speak up on, but from my  
perspective it comes down to 'why?'.  This opens up a very hard-to- 
implement door (converting to and from, for instance, BLAST to HMMER),  
which doesn't make sense from the end-user perspective.  What most  
users want out of those formats is getting at the data in an easily  
accessible way, to further process them (filter, to GFF, etc), or to  
have them summarized.  the Writer classes take care of the latter.

There is a very generic, all-purpose write_result in Bio::SearchIO  
that just calls the a ResultWriter object (and dies if it isn't  
present).  Note that this expects a ResultWriter, not a Hit/HSPWriter;  
it is write_result() after all. I think this kind of goes against the  
well-established API that exists with the other write_foo  
implementations for the IO classes, where the input/output format  
should match, but there you have it.

> So now I'm trying to use the correct method, via
> Bio::SearchIO::Writer::HSPTableWriter. The problem is, I can't find a
> 1 to 1 correspondence between the fields in the blasttable and the
> columns provided by the writer. So far I have something like this:
>
> blasttable ->		HSPTableWriter
>
> (result) query_name	query_name
> (hit) name		hit_name
> (hsp) frac_identical	frac_identical_query?
> 			frac_identical_hit?
> (hsp) hsp_length	length_aln_query?
> 			length_aln_hit?
> (?) mismatches		?
> (hsp) gaps		?
> 			gaps_query?
> 			gaps_hit?
> 			gaps_total?
> (hsp) start('query')	start_query
> (hsp) end('query')	end_query
> (hsp) start('hit')	start_hit
> (hsp) end('hit')	end_hit
> (hsp) significance	expect
> (hsp) bits		bits
>
>
> For (hsp) frac_identical, it seems as if the (undocumented)
> frac_identical_total column is giving the right value, however, I'ts
> hard to be certain because the format is of the value is different
> (the blasttable says 93.51 while HSPTableWriter says 0.94). How can I
> change the output format of HSPTableWriter?

Not sure but it appears hard-coded.  This could probably be rewritten  
to spit out certain data attributes by name (e.g. you could ask for  
percent_identity), but I'm not sure.

> Is there any improvement on the above mapping? It seems strange that I
> can read in a blasttable, but I can't write one out (using a generic
> object interface). For example, where do I get the hsp length from
> (which column)?
>
> I'm sure this has come up before, so apologies for not being able to
> track down the appropriate docs.

 From the POD:

'Here are the columns that can be specified in the -columns
parameter when creating a HSPTableWriter object.  If a -columns  
parameter
is not specified, this list, in this order, will be used as the  
default.'

In other words, you keep track of the columns (which appear 1-based).

> Thanks for any help,
> Dan.
> P.S. when dumping a blasttable from a blasttable using HSP methods,
> how should I calculate the number of mismatches? Currently I'm trying:
>
>     my $len = $hsp->length;
>     my $match = $len * $hsp->frac_identical;
>     my $mismatch = $len - $match;
>
> but the resulting values differ from those in the original blasttable.
> I have the feeling this is a FAQ ...

Maybe use seq_inds instead?

BTW, HSP length() defaults on the 'total' length (includes gaps).  The  
above calculation doesn't account for that.

With seq_inds, 'mismatch' are residue-only (no gaps); 'no_match' is  
mismatched residues + gaps (you have to also indicate whether this is  
based on the query or hit).

Also note that seq_inds deals with (1) mapping differences, e.g. any  
query that requires translation, and (2) frameshifts, such as from  
FASTX/Y output (again translated sequence output).  If you are dealing  
with a translated sequence you will want to account for those bits as  
well.

chris