Bioperl: Re: Bio::Tools::Blast

Lincoln Stein lstein@cshl.org
Thu, 27 Aug 1998 10:43:36 -0400


Why don't we come up with a better sequence file standard that
encapsulates these ideas of identifier, description, source database,
etc., rather than relying on ad hoc conventions in the FASTA format?
To deal with legacy FASTA files we could have a little reusable
conversion filter to do the up-front work.
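
A minimal sketch of what such a conversion filter might look like, using
the header pattern quoted further down in this thread. It assumes legacy
headers such as ">notch4 exon #1", where the whole header is meant as the
identifier. The names split_fasta_header, normalize_header and the
$whole_header_as_id flag are hypothetical and are not part of Bio::PreSeq
or Blast.pm.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split a FASTA header line into (id, description) using the pattern
# quoted in this thread. If $whole_header_as_id is true (a hypothetical
# flag), the entire header text is returned as the id and the
# description is left empty.
sub split_fasta_header {
    my ($head, $whole_header_as_id) = @_;
    if ($whole_header_as_id) {
        my ($id) = $head =~ /^>[ \t]*(.*?)[ \t]*$/;
        return defined $id ? ($id, '') : ();
    }
    my ($id, $desc) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
    return defined $id ? ($id, $desc) : ();
}

# Rewrite a legacy header so that the identifier contains no whitespace.
# Assumption: the whole header is the identifier, so every run of spaces
# or tabs inside it is replaced by a single dot.
sub normalize_header {
    my ($head) = @_;
    my ($id) = split_fasta_header($head, 1);
    return $head unless defined $id;    # not a header line: leave as-is
    (my $safe = $id) =~ s/[ \t]+/./g;
    return ">$safe";
}

# Small in-memory demonstration; a real filter would read from STDIN.
my @legacy = (">notch4 exon #1", "atgcagccccagttgctgctg");
for my $line (@legacy) {
    my $out = ($line =~ /^>/) ? normalize_header($line) : $line;
    print "$out\n";
}
```

With this sketch, ">notch4 exon #1" becomes ">notch4.exon.#1" and
sequence lines pass through unchanged. Headers that already follow the
"id, whitespace, description" convention should not be run through
normalize_header, since their description would be folded into the id.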

Lincoln

Georg Fuellen writes:
 > 
 > Eitan wrote,
 > > I would like to warn all Bio::PreSeq::parse_fasta() users. Some fasta 
 > 
 > Unless I'm mistaken, there is no reason for any warning, see below.
 > 
 > > databases (such as RepBase, if I'm not mistaken) are using non \S letters 
 > > in their naming scheme. Most fasta parsers fail when they see
 > > >gb|AC000254 blah blah blah
 > > >AC000254_1 blah blah blah
 > 
 > I get:
 >   DB<54> $head = ">gb|AC000254 blah blah blah"          
 > 
 >   DB<55> ($id, $desc) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
 > 
 >   DB<56> x ($id, $desc)                                     
 > 0  'gb|AC000254'
 > 1  'blah blah blah'
 > 
 > I think this works as it should?!
 > Non-\S characters (i.e. whitespace, since \S matches any
 > non-whitespace character) are [ \t\n\r\f]; that is not the same
 > set as the characters matched by \W.
 > When I worked on the FASTA pattern matching, the expression
 > used above was the most general I could come up with,
 > and it seems to me that the unusual case of ``identifiers with
 > spaces'' can only be dealt with by a special flag that
 > says ``use the whole header as an id'', which means that we
 > assume the description is part of the id.
 > However, I'd suggest that the user take care of this case.
 > 
 > > In my case I work around this with sed 's/^>gb|//' etc. or with Perl
 > > scripts. It may pose a serious problem, though, if you want the
 > > Bio::PreSeq package to be universal.
 > 
 > Please respond if I'm overlooking the problem that you seem to see.
 > 
 > best wishes,
 > Georg Fuellen,
 > Univ. Bielefeld, Research Group in Practical Comp. Science
 > http://www.techfak.uni-bielefeld.de/bcd/welcome.html
 > 
 > > 
 > >        Eitan.  
 > > 
 > > 
 > > ======================================================================
 > > Eitan Rubin,
 > > Plant Genetics, Weizmann Inst of Science, Rehovot, Israel.  
 > > EMail: bcrubin@dapsas1.weizmann.ac.il
 > > Tel: (00972)-(8)9342421 Fax: (00972)-(8)9344181
 > > EitanR@BioMOO (http://bioinfo.weizmann.ac.il/BioMOO) - visit the
 > > GCG help desk
 > > 
 > > in vivo -> in vitro -> in silico
 > > ======================================================================
 > > 
 > > On Wed, 26 Aug 1998, Steve A. Chervitz wrote:
 > > 
 > > > 
 > > > Lincoln, 
 > > > 
 > > > Spaces are not permitted in identifiers in Blast.pm. In the Fasta
 > > > files I've seen, a space is used to separate the identifier from the 
 > > > description line. Here's how Bio::PreSeq::parse_fasta() grabs the 
 > > > identifier and description:
 > > > 
 > > > ($self->{"id"}, $self->{"desc"}) = $head =~ /^>[ \t]*(\S*)[ \t]*(.*)$/;
 > > > 
 > > > BTW, I just updated the Blast distribution (now 0.061). It includes   
 > > > an important memory management fix that helps when crunching lots of 
 > > > reports. 
 > > > 
 > > > Steve Chervitz
 > > > sac@genome.stanford.edu
 > > > 
 > > > 
 > > > On 26 Aug 1998, Lincoln Stein wrote:
 > > > 
 > > > > Hi Steve,
 > > > > 
 > > > > Does Blast.pm not deal correctly with sequence identifiers that
 > > > > contain spaces?  I just tried to blast a database made from
 > > > > identifiers like this:
 > > > > 
 > > > > >notch4 exon #1
 > > > > atgcagccccagttgctgctgctgctgctcttgccactcaatttccctgtcatcctgacc
 > > > > agag
 > > > > 
 > > > > >notch4 exon #2
 > > > > agcttctgtgtggaggatccccagagccctgtgccaacggaggcacctgcctgaggctat
 > > > > ctcggggacaagggatctgcca
 > > > > 
 > > > > >notch4 exon #3
 > > > > gtgtgcccctggatttctgggtgagacttgccagtttcctgacccctgcagggataccca
 > > > > actctgcaagaatggtggcagctgccaagccctgctccccacacccccaagctcccgtag
 > > > > tcctacttctccactgacccctcacttctcctgcacctgcccctctggcttcaccggtga
 > > > > tcgatgccaaacccatctggaagagctctgtccaccttctttctgttccaacgggggtca
 > > > > ctgctatgttcaggcctcaggccgcccacagtgctcctgcgagcctgggtggacag
 > > > > 
 > > > > but I only got "notch4" as the hit.  When I changed the spaces to
 > > > > dots, I got the full identifier.
 > 
 > 
-- 
========================================================================
Lincoln D. Stein                           Cold Spring Harbor Laboratory
lstein@cshl.org                                   Cold Spring Harbor, NY
========================================================================