[Bioperl-l] Changes in FASTA output format

William R. Pearson wrp at virginia.edu
Fri Mar 30 18:05:15 UTC 2007


The next major revision of the FASTA program package will have some  
major improvements to the strategy for calculating statistical  
significance, particularly when a small library is being searched  
(high scoring sequences will be shuffled and used to estimate a  
second set of statistical parameters).

As a result, I am considering some changes in FASTA output.

(1) I would like to expand the line that shows the algorithm and  
scoring matrix parameters to multiple lines.  Currently it looks like:

  Smith-Waterman (SSE2, Michael Farrar 2006) (6.0 Mar 2007) function [BL50 matrix (15:-5)], open/ext: -12/-2
Scan time:  2.140

I would like to allow at least two lines here, one for the algorithm  
and version, a second for the scoring parameters:

  Smith-Waterman (SSE2, Michael Farrar 2006) (6.0 Mar 2007) function
  BL50 matrix (15:-5), open/ext: -12/-2
Scan time:  2.140

I could even imagine tagging the lines:

  Algorithm:  Smith-Waterman (SSE2, Michael Farrar 2006) (6.0 Mar 2007)
  Parameters:  BL50 matrix (15:-5), open/ext: -12/-2
Scan time:  2.140

I don't think this would break many FASTA parsers, but I wanted to  
check.
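
For parser authors who want to check this, here is a minimal Python sketch of a reader that accepts all three layouts shown above (current single line, two lines, tagged lines). The function name and the regular expression are illustrative assumptions based only on the sample lines in this message; they are not taken from the FASTA package or from any existing parser.

```python
import re

# Assumed shape of the scoring parameters, taken from the sample above:
#   "[BL50 matrix (15:-5)], open/ext: -12/-2"  (brackets optional)
PARAM_RE = re.compile(
    r"\[?(\S+ matrix \([^)]*\))\]?,\s*(open/ext:\s*\S+)"
)
TAG_RE = re.compile(r"^(Algorithm|Parameters):\s*")


def parse_algorithm_block(lines):
    """Return (algorithm, parameters) from the lines before 'Scan time:'.

    Hypothetical helper: handles the one-line, two-line and tagged
    layouts by joining the lines and splitting at the scoring matrix.
    """
    parts = []
    for line in lines:
        stripped = line.strip()
        if stripped.startswith("Scan time:"):
            break
        if stripped:
            # Drop an optional "Algorithm:" / "Parameters:" tag.
            parts.append(TAG_RE.sub("", stripped))
    joined = " ".join(parts)

    match = PARAM_RE.search(joined)
    if match is None:
        # Unknown layout: return everything as the algorithm string.
        return joined, None

    algorithm = joined[:match.start()].strip()
    # The untagged layouts end the algorithm text with the word "function".
    if algorithm.endswith("function"):
        algorithm = algorithm[: -len("function")].strip()
    parameters = f"{match.group(1)}, {match.group(2)}"
    return algorithm, parameters
```

A parser written this way keys on the "Scan time:" line and the matrix description instead of on a fixed line count, so it would return the same pair of strings for all three layouts.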

(2)  I am also thinking about displaying multiple E()-values,  
depending on whether they are calculated from the similarity search  
or the shuffled high scores, e.g., going from:

The best scores are:                                       s-w bits E(231210)
gi|121716|sp|P10649|GSTM1_MOUSE Glutathione S-tran ( 218) 1497 349.6 6.1e-96
gi|121717|sp|P04905|GSTM1_RAT Glutathione S-transf ( 218) 1413 330.4 3.8e-90
gi|399829|sp|Q00285|GSTMU_CRILO Glutathione S-tran ( 218) 1354 316.9 4.5e-86

To:

The best scores are:                                       s-w bits E(231210) ES()
gi|121716|sp|P10649|GSTM1_MOUSE Glutathione S-tran ( 218) 1497 349.6 6.1e-96 5.5e-95
gi|121717|sp|P04905|GSTM1_RAT Glutathione S-transf ( 218) 1413 330.4 3.8e-90 2.2e-89
gi|399829|sp|Q00285|GSTMU_CRILO Glutathione S-tran ( 218) 1354 316.9 4.5e-86 8.3e-85

I think this output would break many more FASTA parsers, and one  
option would be (initially) to add it only to the alignment output.
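
To make the risk concrete, here is a hedged Python sketch of a "best scores" line reader that tolerates the extra column. It assumes each hit is on one unwrapped line laid out as in the samples above (description, length in parentheses, score, bits, E-value, optional second E-value); the pattern and field names are illustrative assumptions, and output with additional columns (for example a frame column in translated searches) would need a different pattern.

```python
import re

# Assumed layout of one hit line, based only on the samples in this message:
#   <description> ( <length>) <score> <bits> <E()> [<ES()>]
HIT_RE = re.compile(
    r"^(?P<desc>.+?)\s*\(\s*(?P<length>\d+)\)\s+"
    r"(?P<score>\d+)\s+(?P<bits>[\d.]+)\s+"
    r"(?P<evalue>\S+)(?:\s+(?P<es_evalue>\S+))?\s*$"
)


def parse_hit(line):
    """Parse one 'best scores' line; return a dict, or None if no match.

    'es_evalue' is None when the line has the classic single E() column.
    """
    match = HIT_RE.match(line)
    if match is None:
        return None
    es_text = match.group("es_evalue")
    return {
        "desc": match.group("desc"),
        "length": int(match.group("length")),
        "score": int(match.group("score")),
        "bits": float(match.group("bits")),
        "evalue": float(match.group("evalue")),
        "es_evalue": float(es_text) if es_text is not None else None,
    }
```

A parser that instead anchors the E-value at the end of the line, or splits on whitespace and takes the last field, would silently report the ES() value as E() under the new format, which is the kind of breakage worth checking for.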

Naturally, it will initially be easy to revert to the classic format.

I would appreciate any comments on the problems these changes might  
cause.

Bill Pearson






