[Bioperl-l] validate_seq() note

Brian Osborne brian_osborne at cognia.com
Thu Jun 12 10:22:41 EDT 2003


Bioperl-l,

In PrimarySeq::validate_seq:

if((CORE::length($seqstr) > 0) && ($seqstr !~ /^([A-Za-z\-\.\*\?]+)$/)) {
     $self->warn("seq doesn't validate, mismatch is " .
                ($seqstr =~ /([^A-Za-z\-\.\*\?]+)/g));
     return 0;
 }

I don't think it's necessary to escape the "-", "*", "?", or "." in the
brackets, though it appears to be harmless. Generally speaking, I think the
only things that need to be escaped in the brackets are "[" and "]"; correct
me if I'm wrong. So a valid sequence now is "aaaa\?aaaa" or "gggg\-gggg".
"aaaa\aaa" is not, though, of course. When one creates a Seq object with
"aaaa\+aaa" and prints it, one sees "aaaa+aaa", and its length() is correct.
Apparently no harm is done, according to these tests. Presumably I'm
bringing up something that's been discussed before...
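
A minimal sketch of the point above, using plain Perl and no BioPerl. It
compares the pattern as quoted with a variant that drops the escapes, and
shows that the backslash in a double-quoted string such as "aaaa\?aaaa" is
consumed by the string literal before the regex ever sees it. The names
$escaped, $unescaped and validates() are made up for illustration, and the
unescaped variant puts "-" last in the class so it cannot be read as a range.

```perl
use strict;
use warnings;

# Pattern as quoted from validate_seq(), escapes included.
my $escaped   = qr/^([A-Za-z\-\.\*\?]+)$/;
# Same class without escapes; '-' is placed last so it stays literal.
my $unescaped = qr/^([A-Za-z.*?-]+)$/;

# Hypothetical helper: returns 1 if the string validates, 0 otherwise,
# and dies if the two patterns ever disagree.
sub validates {
    my ($s) = @_;
    my $old = ($s =~ $escaped)   ? 1 : 0;
    my $new = ($s =~ $unescaped) ? 1 : 0;
    die "patterns disagree on input\n" if $old != $new;
    return $old;
}

# In double quotes "\?" is just "?" and "\-" is just "-", while "\a" is
# the alarm character (0x07), which is not in the allowed class.
for my $s ("aaaa\?aaaa", "gggg\-gggg", "aaaa\aaa") {
    printf "length=%d valid=%d\n", length($s), validates($s);
}
```

If this holds up, the escapes inside the class make no difference for these
inputs, and "aaaa\aaa" fails only because "\a" becomes a control character.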

Brian O.


-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Brian Osborne
Sent: Thursday, June 12, 2003 8:56 AM
To: Matthew Laird; Jason Stajich
Cc: bioperl-l at portal.open-bio.org
Subject: RE: [Bioperl-l] Sequence Validation

Matthew,

How did you put the number into the sequence exactly? It's clear from
PrimarySeq::validate_seq that numbers aren't allowed, so this happens:

~/scripts>perl -e 'use Bio::Seq; $seqobj = Bio::Seq->new(-seq=>"aaa1aaa");'

-------------------- WARNING ---------------------
MSG: seq doesn't validate, mismatch is 1
---------------------------------------------------

------------- EXCEPTION  -------------
MSG: Attempting to set the sequence to [aaa1aaa] which does not look healthy
STACK Bio::PrimarySeq::seq
/usr/lib/perl5/site_perl/5.8.0/Bio/PrimarySeq.pm:264
STACK Bio::PrimarySeq::new
/usr/lib/perl5/site_perl/5.8.0/Bio/PrimarySeq.pm:214
STACK Bio::Seq::new /usr/lib/perl5/site_perl/5.8.0/Bio/Seq.pm:498
STACK toplevel -e:1

And validate_seq() contains:

if((CORE::length($seqstr) > 0) && ($seqstr !~ /^([A-Za-z\-\.\*\?]+)$/)) {
     $self->warn("seq doesn't validate, mismatch is " .
                ($seqstr =~ /([^A-Za-z\-\.\*\?]+)/g));
     return 0;
 }
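
A side note on the warn() call, offered as a sketch rather than a statement
about what the library intends: as far as I know, the "." operator puts its
operands in scalar context, so the parenthesized match there yields the
match's truth value rather than the captured characters. For "aaa1aaa" the
two happen to look the same. The helper mismatches() and the variable names
below are made up for illustration; the character class is copied from the
excerpt above.

```perl
use strict;
use warnings;

# Negated character class copied from the validate_seq() excerpt.
my $invalid = qr/[^A-Za-z\-\.\*\?]+/;

# Hypothetical helper: returns every run of disallowed characters.
sub mismatches {
    my ($seqstr) = @_;
    my @runs = $seqstr =~ /($invalid)/g;   # list context: all captures
    return @runs;
}

my $seqstr = "aaa7aa#9a";

# List context: the offending runs themselves.
my $list = "mismatch is " . join(", ", mismatches($seqstr));

# Scalar context, as imposed by '.': only the success of the match.
my $scalar = "mismatch is " . ($seqstr =~ /($invalid)/g);

print "$list\n$scalar\n";
```

Joining the list explicitly is one way to get every offending run into the
message; whether that is worth changing in the module is a separate question.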


Brian O.


-----Original Message-----
From: bioperl-l-bounces at portal.open-bio.org
[mailto:bioperl-l-bounces at portal.open-bio.org]On Behalf Of Matthew Laird
Sent: Wednesday, June 11, 2003 1:55 PM
To: Jason Stajich
Cc: bioperl-l at portal.open-bio.org
Subject: Re: [Bioperl-l] Sequence Validation

Ahh, thank you.  Using 1.2.1 works just fine, it seems we had 1.0.1
installed.

The next issue in validation I've noticed (in my attempts to break things)
is the alphabet function in Bio::Seq.  I tried putting a 'J' and the
number '5' into a sequence and it was still reported as a protein
sequence.  Is this not the correct method to ensure a sequence uses only
the allowed characters?  validate_seq() seems too general for the task.  Or
again, would writing a quick little homebrew function be the easiest?
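
For what it's worth, a quick homebrew check along those lines might look
like the sketch below. It is plain Perl and does not use BioPerl. The
allowed alphabet (the 20 standard amino acids plus B, Z, X and "*") is an
assumption for illustration, not BioPerl's own definition, and the function
name invalid_protein_chars() is made up.

```perl
use strict;
use warnings;

# Assumed protein alphabet: 20 standard residues plus B, Z, X and '*'.
# Anything outside this set (case-insensitively) counts as invalid.
my $not_protein = qr/[^ACDEFGHIKLMNPQRSTVWYBZX*]/i;

# Hypothetical helper: returns the distinct invalid characters in the
# order they first appear; an empty list means the string passed.
sub invalid_protein_chars {
    my ($seq) = @_;
    my %seen;
    my @bad = grep { !$seen{$_}++ } ($seq =~ /$not_protein/g);
    return @bad;
}

my @bad = invalid_protein_chars("MKVJLA5J");
print @bad ? "invalid characters: @bad\n" : "sequence looks fine\n";
```

Returning the offending characters rather than a plain true/false makes it
easier to tell a web user what to fix.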

Thanks again.

On Wed, 11 Jun 2003, Jason Stajich wrote:

> Which version of bioperl are you using? The 1.2 branch and the main-trunk
> code (soon to be the 1.3 branch) parse that sequence just fine for me,
> although it could be that linefeeds are causing problems, I guess.
>
> use Bio::SeqIO;
> my $in = new Bio::SeqIO(-fh => \*DATA);
> my $seq = $in->next_seq;
> print $seq->display_id, "\n";
> print $seq->seq(), "\n";
> __DATA__
> >
> BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
>
>
> As for validating, SeqIO will throw an error if something is unparseable;
> what we have suggested to people in the past is to use an eval block for
> these.
>
> If you still want a validator I would suggest a small lightweight method
> which given a string will attempt to guess the format and/or validate it
> rather than relying on SeqIO for this just yet.
>
> Eventually we could think of supporting a validator slot in SeqIO to use
> this type of method, I guess, although it would be an additional
> performance hit.
>
> -jason
>
> On Wed, 11 Jun 2003, Matthew Laird wrote:
>
> > Hello, I hope this is the correct place to ask this...
> >
> > I've been looking through the BioPerl documentation and the mailing list
> > archives and am wondering if there is anything built to do sequence
> > validation.
> >
> > What I mean is this: as far as I can see, there are functions to do things
> > such as read in FASTA files (Bio::SeqIO), but how would one test whether
> > the file is valid?  We're attempting to create a web interface where people
> > can submit sequences for analysis; however, people could submit badly
> > formatted files.  Example:
> > >
> > BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> > NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
> >
> > Bio::SeqIO doesn't throw any error on this; what it does do is begin at the
> > line starting with "NGKN" as the beginning of the sequence.  Yes, this
> > sequence violates the FASTA format, but in web interfaces you can't be
> > sure people will submit a perfectly formatted file.
> >
> > Can anyone point me in the direction of a module which will validate the
> > file as it's read for both format and that only allowed sequence letters
> > are included?  Or is this something which needs to be written?  Ideally
> > this should work for multiple formats as well.
> >
> > If such a module doesn't exist I suppose I'll begin working on one and
> > submit the results to the collective since this seems like such a useful
> > tool.
> >
> > Thanks.
> >
> >
>
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
>

--
Matthew Laird
SysAdmin/Web Developer, Brinkman Laboratory, MBB Dept.
Simon Fraser University


_______________________________________________
Bioperl-l mailing list
Bioperl-l at portal.open-bio.org
http://portal.open-bio.org/mailman/listinfo/bioperl-l

