Bioperl: Re: Bio::Tools::Blast

Steven E. Brenner brenner@hyper.stanford.edu
Thu, 27 Aug 1998 18:14:07 -0700


I'm not sure why people seem to think that the FASTA format isn't defined.
Bill Pearson does define it in his FASTA package (albeit not as precisely
as one might like). It consists of the following items in sequence:

1) '>' character
2) identifier string without whitespace
3) whitespace other than CR/LF
4) description (free text, optional)
5) CR and/or LF
6) sequence, possibly including CR/LF
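
As an illustration of the six items above, here is a minimal sketch of a
reader in Python. The names HEADER and parse_fasta are made up for this
example and are not part of Bioperl or of Pearson's FASTA package. It
assumes the whole file fits in memory and that the identifier is
non-empty.

```python
import re

# Items 1-4: '>' followed by a whitespace-free identifier, then optional
# spaces/tabs and a free-text description (which may be empty).
HEADER = re.compile(r"^>(\S+)[ \t]*(.*)$")


def parse_fasta(text):
    """Yield (identifier, description, sequence) tuples from FASTA text."""
    ident, desc, chunks = None, "", []
    # Item 5: splitlines() accepts CR, LF, and CRLF line endings.
    for line in text.splitlines():
        if line.startswith(">"):
            if ident is not None:
                yield ident, desc, "".join(chunks)
            match = HEADER.match(line)
            if match is None:
                raise ValueError("malformed header: %r" % line)
            ident, desc = match.group(1), match.group(2).strip()
            chunks = []
        elif ident is not None:
            # Item 6: the sequence may span several lines; join them.
            chunks.append(line.strip())
        # Lines before the first header are silently ignored.
    if ident is not None:
        yield ident, desc, "".join(chunks)
```

For example, list(parse_fasta(">id1 some text\nACGT\nTTGA\n")) gives one
record whose sequence is the two lines joined together.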


In practice, there are some additional conventions, which aren't documented:
1) whitespace often is permitted (but not desired) between the '>' and the
   identifier
2) the sequence is broken into lines of 60 characters
3) one can terminate the sequence with a '*' character, but this is
   not desirable
4) one can store multiple alignments by putting dashes in the
   sequences
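
A tolerant reader and a conventional writer for these undocumented
conventions might look like the following sketch. The function names
(split_header, clean_sequence, format_record) are hypothetical. The
header pattern is the one quoted further down in this thread from
Bio::PreSeq::parse_fasta(), which already allows whitespace after the
'>'.

```python
import re

# Same pattern as the Perl regex quoted in this thread: optional
# spaces/tabs after '>', then the identifier, then the description.
LENIENT_HEADER = re.compile(r"^>[ \t]*(\S*)[ \t]*(.*)$")


def split_header(line):
    """Return (identifier, description) from a header line."""
    match = LENIENT_HEADER.match(line)
    if match is None:
        raise ValueError("not a FASTA header: %r" % line)
    return match.group(1), match.group(2)


def clean_sequence(seq):
    """Drop an optional trailing '*' terminator; keep '-' gap characters."""
    return seq[:-1] if seq.endswith("*") else seq


def format_record(ident, desc, seq, width=60):
    """Write one record with the sequence wrapped at `width` characters."""
    header = ">" + ident + (" " + desc if desc else "")
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join([header] + lines) + "\n"
```

Reading leniently and writing strictly keeps the output within the
defined format even when the input relies on the looser conventions.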


NCBI has added additional information to the FASTA format, by using
structured identifiers, which indicate what source database an entry came
from. However, this format is entirely backwards compatible with the
standard FASTA format.  I have seen this structure documented somewhere on
their web site.
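
To illustrate the backwards compatibility: a structured identifier such
as the "gb|AC000254" quoted below contains no whitespace, so a generic
parser still sees it as one identifier, and the structure can be
unpacked afterwards. The sketch below only assumes that fields are
separated by '|' and that the first field names the source database.
The exact field layout is NCBI's to define and is not taken from this
message. split_structured_id is a hypothetical helper name.

```python
def split_structured_id(identifier):
    """Return (database_tag, remaining_fields) for a '|'-delimited identifier.

    A plain identifier with no '|' yields (None, [identifier]), so callers
    can treat structured and unstructured identifiers uniformly.
    """
    fields = identifier.split("|")
    if len(fields) < 2:
        return None, fields
    # Assumption: the first field is the source database tag.
    return fields[0], fields[1:]
```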

I think that there are more than enough file formats out there, and I
would be highly reluctant to introduce a new one, unless it served a very
specific need.  For uses where little information is needed besides the
sequence and identifier, I think that FASTA has the benefits of simplicity
and convenience.  Moreover, it is ubiquitous: virtually every database I
know of is available in compliant FASTA format, in part because it is so
easy to produce accurately. (PIR is a notable exception.)

Steve





On Thu, 27 Aug 1998, Lincoln Stein wrote:

> Why don't we come up with a better sequence file standard that
> encapsulates these ideas of identifier, description, source database,
> etc., rather than relying on ad hoc conventions in the FASTA format?
> To deal with legacy FASTA files we could have a little reusable
> conversion filter to do the up-front work.
> 
> Lincoln
> 
> Georg Fuellen writes:
>  > 
>  > Eitan wrote,
>  > > I would like to warn all Bio::PreSeq::parse_fasta() users. Some fasta 
>  > 
>  > Unless I'm mistaken, there is no reason for any warning, see below.
>  > 
>  > > databases (such as RepBase, if I'm not mistaken) are using non \S letters 
>  > > in their naming scheme. Most fasta parsers fail when they see
>  > > >gb|AC000254 blah blah blah
>  > > >AC000254_1 blash blah blah
>  > 
>  > I get:
>  >   DB<54> $head = ">gb|AC000254 blah blah blah"          
>  > 
>  >   DB<55> ($id, $desc) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
>  > 
>  >   DB<56> x ($id, $desc)                                     
>  > 0  'gb|AC000254'
>  > 1  'blah blah blah'
>  > 
>  > I think this works as it should ?!
>  > Non \S letters (i.e. whitespace letters, since \S matches
>  > a non-whitespace character) are [ \t\n\r\f]. (that's not the same as \W)
>  > When I worked on the fasta pattern-matchings, the expression
>  > used above seemed to be the most general I could come up with--
>  > and it seems to me that the unusual case of ``identifiers with
>  > space'' can only be dealt with by using a special flag that
>  > says ``use the whole header as an id'' which means that we
>  > assume the description is part of the id.
>  > However, I'd suggest the user should take care of this case.
>  > 
>  > > In my case I overcome this with sed 's/^>gb|//' etc. or with perl 
>  > > scripts. It may pose a serious problem though if you want the 
>  > > Bio::PreSeq package to be universal.
>  > 
>  > Please respond if I'm overlooking the problem that you seem to see.
>  > 
>  > best wishes,
>  > Georg Fuellen,
>  > Univ. Bielefeld, Research Group in Practical Comp. Science
>  > http://www.techfak.uni-bielefeld.de/bcd/welcome.html
>  > 
>  > > 
>  > >        Eitan.  
>  > > 
>  > > 
>  > > ======================================================================
>  > > Eitan Rubin,
>  > > Plant Genetics, Weizmann Inst of Science, Rehovot, Israel.  
>  > > EMail: bcrubin@dapsas1.weizmann.ac.il
>  > > Tel: (00972)-(8)9342421 Fax: (00972)-(8)9344181
>  > > EitanR@BioMOO (http://bioinfo.weizmann.ac.il/BioMOO) - visit the
>  > > GCG help desk
>  > > 
>  > > in vivo -> in vitro -> in silico
>  > > ======================================================================
>  > > 
>  > > On Wed, 26 Aug 1998, Steve A. Chervitz wrote:
>  > > 
>  > > > 
>  > > > Lincoln, 
>  > > > 
>  > > > Spaces are not permitted in identifiers in Blast.pm. In the Fasta
>  > > > files I've seen, a space is used to separate the identifier from the 
>  > > > description line. Here's how Bio::PreSeq::parse_fasta() grabs the 
>  > > > identifier and description:
>  > > > 
>  > > > ($self->{"id"}, $self->{"desc"}) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
>  > > > 
>  > > > BTW, I just updated the Blast distribution (now 0.061). It includes   
>  > > > an important memory management fix that helps when crunching lots of 
>  > > > reports. 
>  > > > 
>  > > > Steve Chervitz
>  > > > sac@genome.stanford.edu
>  > > > 
>  > > > 
>  > > > On 26 Aug 1998, Lincoln Stein wrote:
>  > > > 
>  > > > > Hi Steve,
>  > > > > 
>  > > > > Does Blast.pm not deal correctly with sequence identifiers that
>  > > > > contain spaces?  I just tried to blast a database made from
>  > > > > identifiers like this:
>  > > > > 
>  > > > > >notch4 exon #1
>  > > > > atgcagccccagttgctgctgctgctgctcttgccactcaatttccctgtcatcctgacc
>  > > > > agag
>  > > > > 
>  > > > > >notch4 exon #2
>  > > > > agcttctgtgtggaggatccccagagccctgtgccaacggaggcacctgcctgaggctat
>  > > > > ctcggggacaagggatctgcca
>  > > > > 
>  > > > > >notch4 exon #3
>  > > > > gtgtgcccctggatttctgggtgagacttgccagtttcctgacccctgcagggataccca
>  > > > > actctgcaagaatggtggcagctgccaagccctgctccccacacccccaagctcccgtag
>  > > > > tcctacttctccactgacccctcacttctcctgcacctgcccctctggcttcaccggtga
>  > > > > tcgatgccaaacccatctggaagagctctgtccaccttctttctgttccaacgggggtca
>  > > > > ctgctatgttcaggcctcaggccgcccacagtgctcctgcgagcctgggtggacag
>  > > > > 
>  > > > > but I only got "notch4" as the hit.  When I changed the spaces to
>  > > > > dots, I got the full identifier.
>  > 
>  > 
> -- 
> ========================================================================
> Lincoln D. Stein                           Cold Spring Harbor Laboratory
> lstein@cshl.org			                  Cold Spring Harbor, NY
> ========================================================================
> =========== Bioperl Project Mailing List Message Footer =======
> Project URL: http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/
> For info about how to (un)subscribe, where messages are archived, etc:
> http://www.techfak.uni-bielefeld.de/bcd/Perl/Bio/vsns-bcd-perl.html
> ====================================================================
> 
