[Bioperl-l] Sequence Validation

Matthew Laird lairdm at sfu.ca
Wed Jun 11 11:54:56 EDT 2003


Ahh, thank you.  Using 1.2.1 works just fine; it seems we had 1.0.1 
installed.

The next validation issue I've noticed (in my attempts to break things) 
is with the alphabet function in Bio::Seq.  I tried putting a 'J' and the 
number '5' into a sequence and it was still reported as a protein 
sequence.  Is this not the correct method to ensure a sequence uses only 
the allowed characters?  validate_seq() seems too general for the task.  Or 
again, would writing a quick little homebrew function be the easiest?
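
If it does come to a homebrew function, a minimal sketch might look like
the one below.  It does not call BioPerl at all.  The name
invalid_protein_chars and the allowed character set (the 20 standard
residues plus B, Z, X and '*') are my own assumptions rather than anything
taken from Bio::Seq, so the set would need adjusting to whatever we decide
to accept.

```perl
use strict;
use warnings;

# Assumed protein alphabet: 20 standard residues plus the ambiguity
# codes B, Z, X and the '*' stop symbol.  Not taken from BioPerl.
my $PROTEIN_CHARS = 'ACDEFGHIKLMNPQRSTVWYBZX*';

# Hypothetical helper: returns the distinct characters in $seq that are
# not in the assumed alphabet (case-insensitive, whitespace ignored).
sub invalid_protein_chars {
    my ($seq) = @_;
    my %seen;
    # Delete every allowed character and all whitespace; what remains
    # is, by definition, not allowed.
    (my $rest = uc $seq) =~ s/[\Q$PROTEIN_CHARS\E\s]//g;
    # Report each offending character once, in order of first appearance.
    return grep { !$seen{$_}++ } split //, $rest;
}

for my $s ('MKTAYIAKQR', 'MKTJAYI5KQR') {
    my @bad = invalid_protein_chars($s);
    print "$s: ", (@bad ? "invalid characters: @bad" : "ok"), "\n";
}
```

An empty return list means the string passed, so the same function can gate
a web submission before the sequence is handed to anything else.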

Thanks again.

On Wed, 11 Jun 2003, Jason Stajich wrote:

> Which version of bioperl are you using? 1.2 branch and the main-trunk code
> (soon to be 1.3 branch) parse that sequence just fine for me, although
> it could be that linefeeds are causing problems, I guess.
> 
> use Bio::SeqIO;
> my $in = new Bio::SeqIO(-fh => \*DATA);
> my $seq = $in->next_seq;
> print $seq->display_id, "\n";
> print $seq->seq(), "\n";
> __DATA__
> >
> BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
> 
> 
> As for validating, SeqIO will throw an error if something is unparseable;
> what we have suggested to people in the past is to use an eval block for
> these.
> 
> If you still want a validator I would suggest a small lightweight method
> which given a string will attempt to guess the format and/or validate it
> rather than relying on SeqIO for this just yet.
> 
> Eventually we could think of supporting a validator slot in SeqIO to use
> this type of method, I guess, although it would be an additional
> performance hit.
> 
> -jason
> 
> On Wed, 11 Jun 2003, Matthew Laird wrote:
> 
> > Hello, I hope this is the correct place to ask this...
> >
> > I've been looking through the BioPerl documentation and the mailing list
> > archives and am wondering if there is anything built to do sequence
> > validation.
> >
> > What I mean is this, there are functions as I see to do things such as
> > read in FASTA files (Bio::SeqIO) but how would one test if the file is
> > valid?  We're attempting to create a web interface where people can submit
> > sequences for analysis, however people could submit faulty formatted
> > files.  Example:
> > >
> > BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> > NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
> >
> > Bio::SeqIO doesn't throw any error on this; what it does do is begin at the
> > line starting with "NGKN" as the beginning of the sequence.  Yes this
> > sequence violates the FASTA format, but in web interfaces you can't be
> > sure people will submit a perfectly formatted file.
> >
> > Can anyone point me in the direction of a module which will validate the
> > file as it's read for both format and that only allowed sequence letters
> > are included?  Or is this something which needs to be written?  Ideally
> > this should work for multiple formats as well.
> >
> > If such a module doesn't exist I suppose I'll begin working on one and
> > submit the results to the collective since this seems like such a useful
> > tool.
> >
> > Thanks.
> >
> >
> 
> --
> Jason Stajich
> Duke University
> jason at cgt.mc.duke.edu
> 

-- 
Matthew Laird
SysAdmin/Web Developer, Brinkman Laboratory, MBB Dept.
Simon Fraser University
