[Bioperl-l] Sequence Validation

Matthew Laird lairdm at sfu.ca
Thu Jun 12 10:07:00 EDT 2003


I was also surprised it didn't throw an exception.  Here is the sequence:
>gi|28199678|ref|NP_779992.1|
ARKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
5JNGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS

(notice how there's also a J in there right after the 5)

And here is the code fragment:
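use strict;
use warnings;
use Bio::SeqIO;

# Read every record from the FASTA file and print its guessed alphabet and sequence.
my $in = Bio::SeqIO->new(-file => "test.fas", -format => 'Fasta');
my $i = 0;
while ( my $seq = $in->next_seq() ) {
    print "type: " . $seq->alphabet . "\n";
    print "$i: " . $seq->seq() . "\n";
    $i++;
}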

A very simple script just to test how Bio::SeqIO works.  In the printed
sequence the 5 actually shows up, so it seems that a sequence read through
Bio::SeqIO doesn't go through the same validation as the one in your
Bio::Seq->new example.

This is all using BioPerl 1.2.1 just for reference.
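
For illustration, here is a minimal sketch of the sort of "quick little homebrew function" mentioned further down the thread. It uses core Perl only. The name find_invalid_residues and the default alphabet (the 20 standard amino acids plus X and *) are assumptions made for this example, not part of BioPerl, and a real validator would probably want a wider or configurable alphabet.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl: returns a list of
# [position, character] pairs for every character outside the allowed set.
# The default alphabet (20 standard amino acids plus X and *) is an
# assumption; widen it if B, Z, U, O or gap characters should be accepted.
sub find_invalid_residues {
    my ($seqstr, $allowed) = @_;
    $allowed = 'ACDEFGHIKLMNPQRSTVWYX*' unless defined $allowed;
    my %ok = map { $_ => 1 } split //, uc $allowed;
    my @bad;
    my $pos = 0;
    for my $char (split //, uc $seqstr) {
        $pos++;    # 1-based position within the sequence string
        push @bad, [$pos, $char] unless $ok{$char};
    }
    return @bad;
}

# Second line of the test record above, including the stray 5 and J.
my $line = '5JNGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS';
for my $hit (find_invalid_residues($line)) {
    printf "invalid character '%s' at position %d\n", $hit->[1], $hit->[0];
}
```

Inside the while loop above, the same function could be given $seq->seq() for each record, and the record rejected whenever the returned list is not empty.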

On Thu, 12 Jun 2003, Brian Osborne wrote:

> Matthew,
> 
> How did you put the number into the sequence exactly? It's clear from
> PrimarySeq::validate_seq that numbers aren't allowed, so this happens:
> 
> ~/scripts>perl -e 'use Bio::Seq; $seqobj = Bio::Seq->new(-seq=>"aaa1aaa");'
> 
> -------------------- WARNING ---------------------
> MSG: seq doesn't validate, mismatch is 1
> ---------------------------------------------------
> 
> ------------- EXCEPTION  -------------
> MSG: Attempting to set the sequence to [aaa1aaa] which does not look healthy
> STACK Bio::PrimarySeq::seq
> /usr/lib/perl5/site_perl/5.8.0/Bio/PrimarySeq.pm:264
> STACK Bio::PrimarySeq::new
> /usr/lib/perl5/site_perl/5.8.0/Bio/PrimarySeq.pm:214
> STACK Bio::Seq::new /usr/lib/perl5/site_perl/5.8.0/Bio/Seq.pm:498
> STACK toplevel -e:1
> 
> And validate_seq() contains:
> 
> if((CORE::length($seqstr) > 0) && ($seqstr !~ /^([A-Za-z\-\.\*\?]+)$/)) {
>      $self->warn("seq doesn't validate, mismatch is " .
>                 ($seqstr =~ /([^A-Za-z\-\.\*\?]+)/g));
>      return 0;
>  }
> 
> 
> Brian O.
> 
> 
> -----Original Message-----
> From: bioperl-l-bounces at portal.open-bio.org
> [mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Matthew Laird
> Sent: Wednesday, June 11, 2003 1:55 PM
> To: Jason Stajich
> Cc: bioperl-l at portal.open-bio.org
> Subject: Re: [Bioperl-l] Sequence Validation
> 
> Ahh, thank you.  Using 1.2.1 works just fine, it seems we had 1.0.1
> installed.
> 
> The next issue in validation I've noticed (in my attempts to break things)
> is the alphabet function in Bio::Seq.  I tried putting a 'J' and the
> number '5' into a sequence and it was still reported as a protein
> sequence.  Is this not the correct method to ensure a sequence uses only
> the allowed characters?  validate_seq() seems too general for the task.  Or
> again, would writing a quick little homebrew function be the easiest?
> 
> Thanks again.
> 
> On Wed, 11 Jun 2003, Jason Stajich wrote:
> 
> > Which version of bioperl are you using? 1.2 branch and the main-trunk code
> > (soon to be 1.3 branch) parse that sequence just fine for me, although
> > it could be that linefeeds are causing problems, I guess.
> >
> > use Bio::SeqIO;
> > my $in = new Bio::SeqIO(-fh => \*DATA);
> > my $seq = $in->next_seq;
> > print $seq->display_id, "\n";
> > print $seq->seq(), "\n";
> > __DATA__
> > >
> > BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> > NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
> >
> >
> > As for validating, SeqIO will throw an error if something is unparseable;
> > what we have suggested to people in the past is to use an eval block for
> > these.
> >
> > If you still want a validator I would suggest a small lightweight method
> > which given a string will attempt to guess the format and/or validate it
> > rather than relying on SeqIO for this just yet.
> >
> > Eventually we could think of supporting a validator slot in SeqIO to use
> > this type of method, I guess, although it would be an additional
> > performance hit.
> >
> > -jason
> >
> > On Wed, 11 Jun 2003, Matthew Laird wrote:
> >
> > > Hello, I hope this is the correct place to ask this...
> > >
> > > I've been looking through the BioPerl documentation and the mailing list
> > > archives and am wondering if there is anything built to do sequence
> > > validation.
> > >
> > > What I mean is this, there are functions as I see to do things such as
> > > read in FASTA files (Bio::SeqIO) but how would one test if the file is
> > > valid?  We're attempting to create a web interface where people can
> > > submit sequences for analysis, however people could submit badly
> > > formatted files.  Example:
> > > >
> > > BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> > > NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
> > >
> > > Bio::SeqIO doesn't throw any error on this; instead it treats the line
> > > starting with "NGKN" as the beginning of the sequence.  Yes, this
> > > sequence violates the FASTA format, but in web interfaces you can't be
> > > sure people will submit a perfectly formatted file.
> > >
> > > Can anyone point me in the direction of a module which will validate the
> > > file as it's read for both format and that only allowed sequence letters
> > > are included?  Or is this something which needs to be written?  Ideally
> > > this should work for multiple formats as well.
> > >
> > > If such a module doesn't exist I suppose I'll begin working on one and
> > > submit the results to the collective since this seems like such a useful
> > > tool.
> > >
> > > Thanks.
> > >
> > >
> >
> > --
> > Jason Stajich
> > Duke University
> > jason at cgt.mc.duke.edu
> >
> 
> --
> Matthew Laird
> SysAdmin/Web Developer, Brinkman Laboratory, MBB Dept.
> Simon Fraser University
> 
> 

-- 
Matthew Laird
SysAdmin/Web Developer, Brinkman Laboratory, MBB Dept.
Simon Fraser University




