[Bioperl-l] How does '-alphabet' help? Is there any function which could remove "wrong" characters?

Sun Oct 13 04:09:40 UTC 2013

A more interesting question is: should there be (at least, should there be within Bioperl)?  The assumption made for creating a Bio::Seq is that the characters passed are valid *prior* to creating the instance, and that the parsers (or user, if a parser isn't used) are generally in charge of dealing with such issues.  Some attempts to set up simple validation of strings are used within Bio::Tools::IUPAC and I think Bio::Tools::SeqPattern, if you want to delve into that code.

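For removing carriage returns, note that 'chomp' alone only strips the trailing newline ($/), so a Windows "\r\n" line ending read on a Unix system keeps its "\r".  Strip both explicitly:

    my $text = <INFILE>;
    $text =~ s/\r?\n\z//;    # removes "\n" or "\r\n" at the end of the line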

As for dealing with invalid characters, it depends on what you mean by 'invalid'.  All letters are valid IUPAC for protein seqs, and ACGTUMRWSYKVHDBXN are valid IUPAC nucleotide characters (for simplicity, we won't include the other possible symbols for gaps, frameshifts, etc.).  You may want to leave out ambiguous characters for your case.  You could maybe generate a regex from Bio::Tools::IUPAC for valid chars and use the inverse of that to 'clean' a sequence, but a straightforward way is to simply build your own string of valid chars and remove everything that doesn't match it, as Jing's example does.
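
A minimal sketch of that "straightforward way", in plain Perl with no BioPerl calls.  The helper name clean_seq is made up for illustration (it is not a BioPerl function), and folding the input to upper case is an assumption on my part; drop the uc if case matters to you.  The default set of valid characters is the nucleotide list above.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): delete every character that is
# not in the allowed set. Defaults to the IUPAC nucleotide list given above.
sub clean_seq {
    my ($raw, $valid) = @_;
    $valid = 'ACGTUMRWSYKVHDBXN' unless defined $valid;
    # Assumption: fold to upper case first, then strip anything not allowed.
    (my $clean = uc $raw) =~ s/[^\Q$valid\E]//g;
    return $clean;
}

# A Windows-style line: trailing "\r\n" plus a stray space and digit.
my $text = "acgtn 1RYK\r\n";
print clean_seq($text), "\n";            # keeps ambiguity codes
print clean_seq($text, 'ACGT'), "\n";    # unambiguous bases only
```

The cleaned string can then be passed as -seq to Bio::Seq->new as in the original script.  Because the carriage return is not in the valid set, it is stripped along with everything else, so no separate line-ending handling is needed in that case.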

chris

On Oct 12, 2013, at 10:24 PM, Vasily Aushev <vaushev at gmail.com> wrote:

> well, in this particular case, this is the format of input file which I
> can't change: it is not Fasta format but just the sequence in one (first)
> line of the file.
> But I am interested in more general question - is there a function which
> removes all invalid characters from the string.
> 
> 
> On Sun, Oct 13, 2013 at 6:56 AM, Jing Yu <logust79 at googlemail.com> wrote:
> 
>> Hi,
>> 
>> my $text = <INFILE>; only reads a line.
>> 
>> Why not just do:
>> 
>> my $seq1 = Bio::SeqIO->new(-file => 'yourfile', -format => 'Fasta')->next_seq;
>> 
>> 
>> On 13 Oct 2013, at 10:48, Vasily Aushev <vaushev at gmail.com> wrote:
>> 
>>> in my very simple script, I am reading the sequence from the file by
>>> my $text = <INFILE>;
>>> and then making a new sequence object:
>>> my $seq1 = Bio::Seq->new(-seq => $text, -alphabet => 'dna' );
>>> After spending some time, I found that the 'carriage return' character
>>> (0x0D) which occurs at the end of my string (it's a Windows file) causes
>>> problems (exceptions) on further processing. I thought that defining the
>>> -alphabet for the sequence object should remove this "wrong" character, but
>>> it's not the case. So, my question - is there any function for removing all
>>> characters which are not part of defined alphabet?
>>> 
>>> Thanks in advance!