[Bioperl-l] Parsing FASTA m10 output

Sun Apr 22 20:24:23 UTC 2007

I do think that m9 is pretty compact if you don't need to see the  
alignment and just want the pairwise statistics and is analogous to  
BLAST m8/9 format.   I typically just use that + fastam9_to_table for  
input to MCL and other systems that can process tabular formats.

I cleaned up a few things in SearchIO::fasta but have not been able  
to see whether we can auto-detect m10 format and insert the necessary  
code just yet.

-jason
On Apr 22, 2007, at 10:11 AM, Ioannis Kirmitzoglou wrote:

> I agree with Jason. Both scripts (fastam9_to_table and  
> fastam10_to_table)
> are way faster and easier to use than the searchIO. Still, there  
> are a lot
> of cases where searchIO support for m10 would be useful (e.g when  
> trying to
> represent the alignment in a graphical way).
> Nevertheless I do think that FASTA needs an output similar to the  
> BLAST m8
> one which is really compact. Although I haven't tried it yet I do  
> believe
> that both scripts can be piped, so one easy and rather fast way to  
> produce
> an tabular output from FASTA would be to pipe its output directly  
> to one of
> the scripts.
> -- 
>
> *Ioannis Kirmitzoglou*, MSc
> PhD. Student,
> Bioinformatics Research Laboratory
> Department of Biological Sciences
> University of Cyprus
>
>
>
> On 21/04/07, Chris Fields <cjfields at uiuc.edu> wrote:
>>
>> Ioannis's fastm10_to_table script is available in the bugzilla
>> enhancement request (as an attachment) if anyone's interested:
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2278
>>
>> I haven't had a chance to really look into m10 output yet but it
>> looks easy enough to parse; may not be hard to get something  
>> SearchIO-
>> based up and running.
>>
>> chris
>>
>> On Apr 21, 2007, at 12:44 PM, Jason Stajich wrote:
>>
>> > We don't have one yet. This is a new format introduced in the most
>> > recent release of FASTA.  Hopefully someone can make some time  
>> to add
>> > some code to SearchIO::fasta for it.
>> >
>> > I do find that I when I need a fast FASTA to TAB converter that the
>> > simple script (fastam9_to_table) is more efficient that SearchIO
>> > framework so Ioannis is making a parallel one for the new m10
>> > output.  So I think having both is useful.
>> >
>> > -jason
>> > On Apr 21, 2007, at 10:14 AM, Hilmar Lapp wrote:
>> >
>> >> I haven't kept track of this - did this go anywhere? Do we not  
>> have
>> >> an -m10 fasta output parser in SearchIO? (I.e., my first thought
>> >> would be that that would be the desired solution; am I misled in
>> >> this?)
>> >>
>> >>      -hilmar
>> >>
>> >> On Apr 19, 2007, at 10:06 AM, Ioannis Kirmitzoglou wrote:
>> >>
>> >>> I have reported it as a bug on the bugzilla but due to bugzilla
>> >>> problems I
>> >>> was not able to attach my code and/or sample m10 files.
>> >>> Nevertheless here is the code that converts an m10 fasta  
>> output to
>> >>> an m8
>> >>> BLAST output which is parseable by the vast majority of software.
>> >>>
>> >>> <----------- CODE BEGINS HERE ------------------->
>> >>>
>> >>> #!/usr/bin/perl -w
>> >>>
>> >>> =head1 NAME
>> >>>
>> >>> fastam10_to_table  - turn FASTA -m 10 output into NCBI -m 8  
>> tabular
>> >>> output
>> >>>
>> >>> =head1 SYNOPSIS
>> >>>
>> >>>  fastam10_to_table [--header] [-o outfile] inputfile1  
>> inputfile2 ...
>> >>>
>> >>> =head1 DESCRIPTION
>> >>>
>> >>> Command line options:
>> >>>   --header                -- boolean flag to print column header
>> >>>   -o/--out                -- optional outputfile to write data,
>> >>>                              otherwise will write to STDOUT
>> >>>   -h/--help               -- show this documentation
>> >>>
>> >>> Not technically a SearchIO script as this doesn't use any Bioperl
>> >>> components but is a useful and fast.  The output is tabular  
>> output
>> >>> with the standard NCBI -m8 columns.
>> >>>
>> >>>  queryname
>> >>>  hit name
>> >>>  percent identity
>> >>>  alignment length
>> >>>  number mismatches
>> >>>  number gaps
>> >>>  query start  (if on rev-strand start > end)
>> >>>  query end
>> >>>  hit start (if on rev-strand start > end)
>> >>>  hit end
>> >>>  evalue
>> >>>  bit score
>> >>>
>> >>> Additionally 4 more columns are provided
>> >>>  percent similar
>> >>>  query length
>> >>>  hit length
>> >>>  query gaps
>> >>>  hit gaps
>> >>>
>> >>> =head1 AUTHOR - Ioannis Kirmitzoglou
>> >>>
>> >>> Ioannis Kirmitzoglou IoannisKirmitzoglou_at_gmail-dot-org
>> >>>
>> >>> =head1 ACKNOWLEDGMENTS - Ioannis Kirmitzoglou
>> >>>
>> >>> Headers as well as portions of code were taken
>> >>>> from fastam9_to_table.pl by Jason Stajich
>> >>>
>> >>> =head1 DISCLAIMER
>> >>>
>> >>> Copyright (c) <2007> <Ioannis Kirmitzolgou>
>> >>>
>> >>> Permission to use, copy, modify, merge, publish and distribute
>> >>> this software and its documentation, with or without  
>> modification,
>> >>> for any purpose, and without fee or royalty to the copyright  
>> holder
>> >>> (s)
>> >>> is hereby granted with no restictions and/or prerequisites.
>> >>>
>> >>> THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
>> >>> EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE  
>> WARRANTIES OF
>> >>> MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND
>> >>> NONINFRINGEMENT.
>> >>> IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE  
>> FOR ANY
>> >>> CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF  
>> CONTRACT,
>> >>> TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
>> >>> SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
>> >>>
>> >>> =cut
>> >>>
>> >>> use strict;
>> >>> use Getopt::Long;
>> >>>
>> >>> my %data=();
>> >>>
>> >>> my $outfile=''; my $header='';
>> >>> GetOptions(
>> >>>     'header'              => \$header,
>> >>>     'o|out|outfile:s'     => \$outfile,
>> >>>     'h|help'              => sub { exec('perldoc',$0); exit; }
>> >>>        );
>> >>>
>> >>> my $outfh;
>> >>> if( $outfile ) {
>> >>>     open($outfh, ">$outfile") || die("$outfile: $!");
>> >>> } else {
>> >>>     $outfh = \*STDOUT;
>> >>> }
>> >>>
>> >>>
>> >>> $/="\n>>>";
>> >>>
>> >>> my @fields = qw(qname hname percid alen mmcount gapcount
>> >>>         qstart qend hstart hend evalue bits percsim qlen hlen  
>> qgap
>> >>> hgap);
>> >>>
>> >>> print $outfh "#",uc(join("", map{ sprintf("%-10s",$_) }  
>> @fields)),
>> >>> "\n" if
>> >>> $header;
>> >>>
>> >>> while (<>) {
>> >>>
>> >>>         chomp;
>> >>>         if ($_=~/^>/ || $_=~/^\#/) {next;}
>> >>>         my @hits = split(/\d+>>/, $_);
>> >>>         @hits= split("\n>>", $hits[0]);
>> >>>
>> >>>         my $hit = shift @hits;
>> >>>
>> >>>         ($data{'qname'}, $data{'qlen'} ) = ($hit=~ (/(\S+)\,\s(\d
>> >>> +)/));
>> >>>
>> >>>         foreach my $align (@hits) {
>> >>>
>> >>>             my @details= split ("\n>", $align);
>> >>>            my $detail = shift @details;
>> >>>             ($data{'hname'}) = ($detail =~ (/^(\S+)\s/));
>> >>>             $detail=~ /\;\s(?:fa|sw)\_bits\:\s+(\S+)/;
>> >>>             $data{'bits'}=$1;
>> >>>             $detail=~ /\;\s(?:fa|sw)\_expect\:\s+(\S+)/;
>> >>>             $data{'evalue'}=$1;
>> >>>
>> >>>             my $term = quotemeta("; sw_score");
>> >>>             $term =~ s/\\ /\\s/;
>> >>>             $detail=~ /$term\s+(\S+)/;
>> >>>             $data{'score'}=$1;
>> >>>
>> >>>             $term = quotemeta("; sw_ident:");
>> >>>             $term =~ s/\\ /\\s/;
>> >>>             $detail=~ /$term\s+(\S+)/;
>> >>>             $data{'percid'}=$1;
>> >>>
>> >>>             $term = quotemeta("; sw_sim:");
>> >>>             $term =~ s/\\ /\\s/;
>> >>>             $detail=~ /$term\s+(\S+)/;
>> >>>             $data{'percsim'}=$1;
>> >>>
>> >>>             $term = quotemeta("; sw_overlap:");
>> >>>             $term =~ s/\\ /\\s/;
>> >>>             $detail=~ /$term\s+(\S+)/;
>> >>>             $data{'alen'}=$1;
>> >>>
>> >>>             $detail = shift @details;
>> >>>
>> >>>             $term = quotemeta("; al_start:");
>> >>>             $term =~ s/\\ /\\s/;
>> >>>             $detail=~ /$term\s+(\S+)/;
>> >>>             $data{'qstart'}=$1;
>> >>>
>> >>>             $term = quotemeta("; al_stop:");
>> >>>             $term =~ s/\\ /\\s/;
>> >>>             $detail=~ /$term\s+(\S+)/;
>> >>>             $data{'qend'}=$1;
>> >>>
>> >>>             $term = quotemeta("; al_display_start:");
>> >>>             $term =~ s/\\ /\\s/;
>> >>>             my $lakis ='';
>> >>>             $detail=~ /$term.+\s\-*([\-\w\s]+)/g;
>> >>>
>> >>>             $data{'qgap'}=($1 =~ tr/\-//);
>> >>>
>> >>>             $detail = shift @details;
>> >>>
>> >>>             $term = quotemeta("; sq_len:");
>> >>>             $term =~ s/\\ /\\s/;
>> >>>             $detail=~ /$term\s+(\S+)/;
>> >>>             $data{'hlen'}=$1;
>> >>>
>> >>>             $term = quotemeta("; al_start:");
>> >>>             $term =~ s/\\ /\\s/;
>> >>>             $detail=~ /$term\s+(\S+)/;
>> >>>             $data{'hstart'}=$1;
>> >>>
>> >>>             $term = quotemeta("; al_stop:");
>> >>>             $term =~ s/\\ /\\s/;
>> >>>             $detail=~ /$term\s+(\S+)/;
>> >>>             $data{'hend'}=$1;
>> >>>
>> >>>             $term = quotemeta("; al_display_start:");
>> >>>             $term =~ s/\\ /\\s/;
>> >>>             $detail=~ /$term.+\s\-*([\-\w\s]+)/g;
>> >>>             $data{'hgap'}=($1 =~ tr/-//);
>> >>>             $data{'gapcount'} = $data{'qgap'} + $data{'hgap'};
>> >>>             $data{'mmcount'} = $data{'alen'} - ( int($data
>> >>> {'percid'} *
>> >>> $data{'alen'}) + $data{'gapcount'});
>> >>>
>> >>> for ( $data{'percid'}, $data{'percsim'} ) {
>> >>>     $_ = sprintf("%.2f",$_*100);
>> >>> }
>> >>>
>> >>>             print $outfh join( "\t",map { $data{$_} }  
>> @fields),"\n"
>> >>>         }
>> >>>
>> >>> }
>> >>>
>> >>> <----------------- CODE ENDS HERE ---------------------->
>> >>>
>> >>> --
>> >>>
>> >>> *Ioannis Kirmitzoglou*, MSc
>> >>> PhD. Student,
>> >>> Bioinformatics Research Laboratory
>> >>> Department of Biological Sciences
>> >>> University of Cyprus
>> >>> _______________________________________________
>> >>> Bioperl-l mailing list
>> >>> Bioperl-l at lists.open-bio.org
>> >>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >>
>> >> --
>> >> ===========================================================
>> >> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
>> >> ===========================================================
>> >>
>> >>
>> >>
>> >>
>> >>
>> >> _______________________________________________
>> >> Bioperl-l mailing list
>> >> Bioperl-l at lists.open-bio.org
>> >> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>> >
>> > --
>> > Jason Stajich
>> > jason at bioperl.org
>> > http://jason.open-bio.org/
>> >
>> >
>> > _______________________________________________
>> > Bioperl-l mailing list
>> > Bioperl-l at lists.open-bio.org
>> > http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>> Christopher Fields
>> Postdoctoral Researcher
>> Lab of Dr. Robert Switzer
>> Dept of Biochemistry
>> University of Illinois Urbana-Champaign
>>
>>
>>
>>

--
Jason Stajich
jason at bioperl.org
http://jason.open-bio.org/