[Bioperl-l] Sequence Validation

Thu Jun 12 13:33:20 EDT 2003

This is because of the optimization code, validate incurrs a
performance hit which (we've assumed) most people don't want.

You have (at least) two options, call validate_seq in your loop
$seq->validate_seq

 OR

Set the sequence factory object to use slower factory object.

This code is all you need
# you can make type be 'Bio::PrimarySeq' if you want smaller
# objects which will never have seq features attached
# or use Bio::Seq if you will be attachinh features, annotations,
# etc and potentially writing it out in a rich seq format (genbank,etc)

use Bio::Seq::SeqFactory;
my $sf = Bio::Seq::SeqFactory->new(-type => 'Bio::Seq');
my $in = new Bio::SeqIO(...
	                -seqfactory => $sf);

By default the fasta parser uses SeqFastaSpeedFactory which does a direct
creation of objects bypassing the validation and even the object hierarchy
to achieve as must speed as possible.

-jason

On Thu, 12 Jun 2003, Matthew Laird wrote:

> I was also surprised it didn't throw an exception.  Here is the sequence:
> >gi|28199678|ref|NP_779992.1|
> ARKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> 5JNGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
>
> (notice how there's also a J in there right after the 5)
>
> And here is the code fragment:
> my $in = Bio::SeqIO->new(-file => "test.fas" , '-format' => 'Fasta');
> my $i = 0; my $seq;
> while ( $seq = $in->next_seq() ) {
>     print "type: " . $seq->alphabet . "\n";
>     print "$i: " . $seq->seq() . "\n";
>     $i++;
> }
>
> A very simple script just to test how Bio::SeqIO works.  In the printed
> sequence the 5 actually shows up, so it seems by using Bio::SeqIO the
> sequence doesn't get put through the same validation process as you've
> demonstrated.
>
> This is all using BioPerl 1.2.1 just for reference.
>
> On Thu, 12 Jun 2003, Brian Osborne wrote:
>
> > Matthew,
> >
> > How did you put the number into the sequence exactly? It's clear from
> > PrimarySeq::validate_seq that numbers aren't allowed, so this happens:
> >
> > ~/scripts>perl -e 'use Bio::Seq; $seqobj = Bio::Seq->new(-seq=>"aaa1aaa");'
> >
> > -------------------- WARNING ---------------------
> > MSG: seq doesn't validate, mismatch is 1
> > ---------------------------------------------------
> >
> > ------------- EXCEPTION  -------------
> > MSG: Attempting to set the sequence to [aaa1aaa] which does not look healthy
> > STACK Bio::PrimarySeq::seq
> > /usr/lib/perl5/site_perl/5.8.0/Bio/PrimarySeq.pm:264
> > STACK Bio::PrimarySeq::new
> > /usr/lib/perl5/site_perl/5.8.0/Bio/PrimarySeq.pm:214
> > STACK Bio::Seq::new /usr/lib/perl5/site_perl/5.8.0/Bio/Seq.pm:498
> > STACK toplevel -e:1
> >
> > And validate_seq() contains:
> >
> > if((CORE::length($seqstr) > 0) && ($seqstr !~ /^([A-Za-z\-\.\*\?]+)$/)) {
> >      $self->warn("seq doesn't validate, mismatch is " .
> >                 ($seqstr =~ /([^A-Za-z\-\.\*\?]+)/g));
> >      return 0;
> >  }
> >
> >
> > Brian O.
> >
> >
> > -----Original Message-----
> > From: bioperl-l-bounces at portal.open-bio.org
> > [mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Matthew Laird
> > Sent: Wednesday, June 11, 2003 1:55 PM
> > To: Jason Stajich
> > Cc: bioperl-l at portal.open-bio.org
> > Subject: Re: [Bioperl-l] Sequence Validation
> >
> > Ahh, thank you.  Using 1.2.1 works just fine, it seems we had 1.0.1
> > installed.
> >
> > The next issue in validation I've noticed (in my attempts to break things)
> > is the alphabet function in Bio:Seq.  I tried putting a 'J' and the
> > number '5' into a sequence and it was stilled reported as a protein
> > sequence.  Is this not the correct method to ensure a sequence uses only
> > the allowed characters?  validate_seq() seems to general for the task.  Or
> > again, would writing a quick little homebrew function be the easiest?
> >
> > Thanks again.
> >
> > On Wed, 11 Jun 2003, Jason Stajich wrote:
> >
> > > Which version of bioperl are you using? 1.2 branch and the main-trunk code
> > > (soon to be 1.3 branch)  parse that seqeunce just fine for me, although
> > > could be linefeeds are causing problems I guess.
> > >
> > > use Bio::SeqIO;
> > > my $in = new Bio::SeqIO(-fh => \*DATA);
> > > my $seq = $in->next_seq;
> > > print $seq->display_id, "\n";
> > > print $seq->seq(), "\n";
> > > __DATA__
> > > >
> > > BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> > > NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
> > >
> > >
> > > As for validating, SeqIO will throw an error if something is unparseable,
> > > what we have suggested to people in the past is to use a eval block for
> > > these.
> > >
> > > If you still want a validator I would suggest a small lightweight method
> > > which given a string will attempt to guess the format and/or validate it
> > > rather than relying on SeqIO for this just yet.
> > >
> > > Eventually we could think of a supporting a validator slot in SeqIO to use
> > > this type of method I guess although it would be an additional
> > > performance hit.
> > >
> > > -jason
> > >
> > > On Wed, 11 Jun 2003, Matthew Laird wrote:
> > >
> > > > Hello, I hope this is the correct place to ask this...
> > > >
> > > > I've been looking through the BioPerl documentation and the mailing list
> > > > archives and am wondering if there is anything built to do sequence
> > > > validation.
> > > >
> > > > What I mean is this, there are functions as I see to do things such as
> > > > read in FASTA files (Bio::SeqIO) but how would one test if the file is
> > > > valid?  We're attempting to create a web interface where people can
> > submit
> > > > sequences for analysis, however people could submit faulty formatted
> > > > files.  Example:
> > > > >
> > > > BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> > > > NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
> > > >
> > > > Bio:SeqIO doesn't throw any error on this, what it does do is begin at
> > the
> > > > line starting with "NGKN" as the beginning of the sequence.  Yes this
> > > > sequence violates the FASTA format, but in web interfaces you can't be
> > > > sure people will submit a perfectly formatted file.
> > > >
> > > > Can anyone point me in the direction of a module which will validate the
> > > > file as it's read for both format and that only allowed sequence letters
> > > > are included?  Or is this something which needs to be written?  Ideally
> > > > this should work for multiple formats as well.
> > > >
> > > > If such a module doesn't exist I suppose I'll begin working on one and
> > > > submit the results to the collective since this seems like such a useful
> > > > tool.
> > > >
> > > > Thanks.
> > > >
> > > >
> > >
> > > --
> > > Jason Stajich
> > > Duke University
> > > jason at cgt.mc.duke.edu
> > >
> >
> > --
> > Matthew Laird
> > SysAdmin/Web Developer, Brinkman Laboratory, MBB Dept.
> > Simon Fraser University
> >
> >
> > _______________________________________________
> > Bioperl-l mailing list
> > Bioperl-l at portal.open-bio.org
> > http://portal.open-bio.org/mailman/listinfo/bioperl-l
> >
> >
> >
>
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu