[Bioperl-l] Sequence Validation

Jason Stajich jason at cgt.duhs.duke.edu
Wed Jun 11 14:27:43 EDT 2003


Which version of BioPerl are you using? Both the 1.2 branch and the
main-trunk code (soon to be the 1.3 branch) parse that sequence just fine
for me, although it could be that linefeeds are causing problems.

use strict;
use Bio::SeqIO;

# Read FASTA records from the DATA section below
my $in  = Bio::SeqIO->new(-fh => \*DATA, -format => 'fasta');
my $seq = $in->next_seq;
print $seq->display_id, "\n";
print $seq->seq, "\n";
__DATA__
>
BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS


As for validating, SeqIO will throw an error if something is unparseable.
What we have suggested to people in the past is to wrap these calls in an
eval block.
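
A minimal sketch of that eval pattern, in case it helps. The helper name
try_parse is hypothetical (it is not part of BioPerl); it simply runs any
parsing callback inside eval and hands back either the result or the error
text. The SeqIO call in the comment assumes BioPerl is installed and that
$fh is an already opened filehandle.

```perl
use strict;
use warnings;

# Hypothetical helper (not a BioPerl function): run a parsing callback
# inside eval. Returns ($result, undef) on success, or
# (undef, $error_text) if the callback died.
sub try_parse {
    my ($parser) = @_;
    my $result;
    my $ok = eval { $result = $parser->(); 1 };
    return ($result, undef) if $ok;
    my $err = $@ || 'unknown parse error';
    return (undef, "$err");    # stringify in case $@ is an exception object
}

# With BioPerl available, the callback could wrap SeqIO, for example:
#   my ($seq, $err) = try_parse(sub {
#       Bio::SeqIO->new(-fh => $fh, -format => 'fasta')->next_seq;
#   });

# Stand-in parser that always fails, to show the error path.
my ($value, $error) = try_parse(sub { die "unparseable record\n" });
print "rejected: $error" if defined $error;
```

In a web interface the error text would go back to the submitter instead of
killing the CGI script. Note that this only catches input on which the
parser actually dies; input that parses "successfully" but wrongly still
needs a separate check.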

If you still want a validator, I would suggest a small, lightweight method
which, given a string, will attempt to guess the format and/or validate it,
rather than relying on SeqIO for this just yet.
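
Something along these lines might be a starting point for FASTA. This is
only a sketch under stated assumptions: validate_fasta is a hypothetical
name (not an existing BioPerl method), the rules it enforces (non-empty
header after '>', no data before the first header, at least one residue per
record) are one possible reading of the format, and the default alphabet
(IUPAC amino-acid codes plus '*' and '-') is a guess at what a protein
submission form would want to allow.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of BioPerl. Returns a list of problems
# found in a FASTA-formatted string; an empty list means it passed.
sub validate_fasta {
    my ($text, $allowed) = @_;
    # Assumed default alphabet: IUPAC amino-acid codes plus '*' and '-'.
    $allowed = 'ABCDEFGHIKLMNPQRSTUVWXYZ*-' unless defined $allowed;
    my $bad = qr/[^\Q$allowed\E]/i;    # any character outside the alphabet

    my @errors;
    my ($records, $residues, $line_no) = (0, 0, 0);
    my $header_line;                   # line number of the current header

    for my $line (split /\r\n|\r|\n/, $text) {   # tolerate any linefeed style
        $line_no++;
        next if $line =~ /^\s*$/;                # skip blank lines
        if (my ($header) = $line =~ /^>(.*)$/) {
            # A new record starts; the previous one must have had residues.
            push @errors, "line $header_line: record has no sequence"
                if $records && !$residues;
            $records++;
            $residues    = 0;
            $header_line = $line_no;
            push @errors, "line $line_no: empty header after '>'"
                unless $header =~ /\S/;
        }
        elsif (!$records) {
            push @errors, "line $line_no: sequence data before first '>' header";
        }
        else {
            (my $seq = $line) =~ s/\s+//g;       # ignore embedded whitespace
            push @errors, "line $line_no: illegal character '$1'"
                if $seq =~ /($bad)/;
            $residues += length $seq;
        }
    }
    push @errors, "no FASTA records found" unless $records;
    push @errors, "line $header_line: record has no sequence"
        if $records && !$residues;
    return @errors;
}

# The submission from the original question: a bare '>' with no identifier.
my @problems = validate_fasta(">\nBRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK\n");
print "$_\n" for @problems;
```

Passing a different alphabet string as the second argument (for example
'ACGTUN-' for nucleotides) would reuse the same checks for DNA or RNA
input. Other formats would need their own rules.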

Eventually we could think about supporting a validator slot in SeqIO that
uses this type of method, I guess, although it would be an additional
performance hit.

-jason

On Wed, 11 Jun 2003, Matthew Laird wrote:

> Hello, I hope this is the correct place to ask this...
>
> I've been looking through the BioPerl documentation and the mailing list
> archives and am wondering if there is anything built to do sequence
> validation.
>
> What I mean is this, there are functions as I see to do things such as
> read in FASTA files (Bio::SeqIO) but how would one test if the file is
> valid?  We're attempting to create a web interface where people can submit
> sequences for analysis, however people could submit faulty formatted
> files.  Example:
> >
> BRKISLIGLATMSMLAFNTSAFALGTASSNSGASGKHWSVVGGAALVQPK
> NGKNAAQNTVKFGGDVAPTLSVTYYINDNVGFELWGITKKLSYTAKTDAS
>
> Bio::SeqIO doesn't throw any error on this; what it does do is begin at the
> line starting with "NGKN" as the beginning of the sequence.  Yes this
> sequence violates the FASTA format, but in web interfaces you can't be
> sure people will submit a perfectly formatted file.
>
> Can anyone point me in the direction of a module which will validate the
> file as it's read for both format and that only allowed sequence letters
> are included?  Or is this something which needs to be written?  Ideally
> this should work for multiple formats as well.
>
> If such a module doesn't exist I suppose I'll begin working on one and
> submit the results to the collective since this seems like such a useful
> tool.
>
> Thanks.
>
>

--
Jason Stajich
Duke University
jason at cgt.mc.duke.edu

