[Bioperl-l] Converting between allowed SearchIO formats?
Chris Fields
cjfields at illinois.edu
Tue Sep 22 14:00:44 UTC 2009
On Sep 22, 2009, at 8:09 AM, Dan Bolser wrote:
> Hi all,
>
> I'm reading in a blasttable format blast result, filtering, and hoping
> to write out a similarly formatted result. Based on experience with
> SeqIO, I expected to do something like the following:
>
> use Bio::SearchIO;
>
> ## Open the sequence search report
> my $seqI = Bio::SearchIO->
> new( -file => $file,
> -format => $format,
> );
>
> ## Open the output report
> my $seqO = Bio::SearchIO->
> new( -file => ">OUTPUT",
> -format => $format,
> );
>
> while( my $result = $seqI->next_result ) {
> ## Do some filtering...
>
> $seqO->write_result( $result );
> }
>
>
> However, the above method does not work here. Is this for some deep
> reason, or could the above method (based on the way SeqIO works) be
> made to work? I'm guessing that the SearchIO object conversion is
> simply harder to do than with SeqIO?
This is something Jason could probably speak up on, but from my
perspective it comes down to 'why?'. This opens up a very hard-to-
implement door (converting to and from, for instance, BLAST to HMMER),
which doesn't make sense from the end-user perspective. What most
users want out of those formats is getting at the data in an easily
accessible way, to further process them (filter, to GFF, etc), or to
have them summarized. the Writer classes take care of the latter.
There is a very generic, all-purpose write_result in Bio::SearchIO
that just calls the a ResultWriter object (and dies if it isn't
present). Note that this expects a ResultWriter, not a Hit/HSPWriter;
it is write_result() after all. I think this kind of goes against the
well-established API that exists with the other write_foo
implementations for the IO classes, where the input/output format
should match, but there you have it.
> So now I'm trying to use the correct method, via
> Bio::SearchIO::Writer::HSPTableWriter. The problem is, I can't find a
> 1 to 1 correspondence between the fields in the blasttable and the
> columns provided by the writer. So far I have something like this:
>
> blasttable -> HSPTableWriter
>
> (result) query_name query_name
> (hit) name hit_name
> (hsp) frac_identical frac_identical_query?
> frac_identical_hit?
> (hsp) hsp_length length_aln_query?
> length_aln_hit?
> (?) mismatches ?
> (hsp) gaps ?
> gaps_query?
> gaps_hit?
> gaps_total?
> (hsp) start('query') start_query
> (hsp) end('query') end_query
> (hsp) start('hit') start_hit
> (hsp) end('hit') end_hit
> (hsp) significance expect
> (hsp) bits bits
>
>
> For (hsp) frac_identical, it seems as if the (undocumented)
> frac_identical_total column is giving the right value, however, I'ts
> hard to be certain because the format is of the value is different
> (the blasttable says 93.51 while HSPTableWriter says 0.94). How can I
> change the output format of HSPTableWriter?
Not sure but it appears hard-coded. This could probably be rewritten
to spit out certain data attributes by name (e.g. you could ask for
percent_identity), but I'm not sure.
> Is there any improvement on the above mapping? It seems strange that I
> can read in a blasttable, but I can't write one out (using a generic
> object interface). For example, where do I get the hsp length from
> (which column)?
>
> I'm sure this has come up before, so apologies for not being able to
> track down the appropriate docs.
From the POD:
'Here are the columns that can be specified in the -columns
parameter when creating a HSPTableWriter object. If a -columns
parameter
is not specified, this list, in this order, will be used as the
default.'
In other words, you keep track of the columns (which appear 1-based).
> Thanks for any help,
> Dan.
> P.S. when dumping a blasttable from a blasttable using HSP methods,
> how should I calculate the number of mismatches? Currently I'm trying:
>
> my $len = $hsp->length;
> my $match = $len * $hsp->frac_identical;
> my $mismatch = $len - $match;
>
> but the resulting values differ from those in the original blasttable.
> I have the feeling this is a FAQ ...
Maybe use seq_inds instead?
BTW, HSP length() defaults on the 'total' length (includes gaps). The
above calculation doesn't account for that.
With seq_inds, 'mismatch' are residue-only (no gaps); 'no_match' is
mismatched residues + gaps (you have to also indicate whether this is
based on the query or hit).
Also note that seq_inds deals with (1) mapping differences, e.g. any
query that requires translation, and (2) frameshifts, such as from
FASTX/Y output (again translated sequence output). If you are dealing
with a translated sequence you will want to account for those bits as
well.
chris
More information about the Bioperl-l
mailing list