[Bioperl-l] RE: SeqIO fails on masked sequences

Wes Barris wes.barris at csiro.au
Sun Jan 9 18:42:33 EST 2005


Hilmar Lapp wrote:
> You should not require by default that all sequences in one file be of 
> the same type (alphabet). We never have required this, nor documented 
> that it is a (not enforced) requirement, and so there may be people out 
> there relying on this 'feature'.

Mixing both DNA and protein sequences in one file and then attempting
to process it seems like kind of a bizarre thing to want to do.  If
the alphabet is explicitly specified, isn't there a way to make that
take precedence?
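
A minimal sketch of that precedence rule, in plain Perl with no BioPerl
calls. The names resolve_alphabet and guess_alphabet are hypothetical
helpers invented for illustration, not part of Bio::SeqIO, and the guesser
is only a crude stand-in for whatever the library really does:

```perl
use strict;
use warnings;

# Hypothetical stand-in for an alphabet guesser (assumption: 'dna' when
# at least 85% of the characters are G, A, T or C, otherwise 'protein').
sub guess_alphabet {
    my ($seq) = @_;
    my $gatc = ($seq =~ tr/GATCgatc//);    # counts matches, does not modify $seq
    return ( length($seq) && $gatc / length($seq) >= 0.85 ) ? 'dna' : 'protein';
}

# Hypothetical resolver: an explicitly supplied alphabet always wins;
# guessing is only a fallback when nothing was specified.
sub resolve_alphabet {
    my ($explicit, $seq) = @_;
    return $explicit if defined $explicit;
    return guess_alphabet($seq);
}

print resolve_alphabet('protein', 'XXXXXXXXXX'), "\n";    # prints "protein"
print resolve_alphabet(undef, 'GATTACAGATTACA'), "\n";    # prints "dna"
```

With this ordering a fully masked sequence keeps the alphabet the caller
asked for, and files with no explicit alphabet behave as before.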

> 
>     -hilmar
> 
> On Friday, January 7, 2005, at 03:39  AM, Nathan Haigh wrote:
> 
>> There appears to be an anomaly with Bio::SeqIO::fasta. If the SeqIO
>> object's alphabet is set, next_seq() resets it to undef
>> and then proceeds to guess the alphabet again, so things like the
>> following do not work:
>>
>> my $seq_in  = Bio::SeqIO->new(-format=>$format, -fh => \*DATA);
>>
>> $seq_in->alphabet('protein');
>>
>> Should setting the SeqIO object's alphabet be honoured even if it is 
>> set to the wrong type or the sequences are not of that
>> alphabet?
>>
>>
>>
>> I have a bug fix that allows you to set the alphabet through the
>> SeqIO object, but it doesn't do any sort of checking to see if all
>> the seqs in the object are of the correct type. Essentially, the
>> alphabet is set in one of the following ways:
>>
>> 1) if the SeqIO object is set using e.g. $seq_in->alphabet('dna'); all 
>> the seqs that belong to the $seq_in object obtain their
>> alphabet from the SeqIO object, dna in this case, irrespective of 
>> whether or not it is actually protein.
>>
>> 2) If alphabet has not been set in this way, the first sequence is 
>> used to guess the alphabet of the SeqIO object, from which all
>> the sequences obtain their alphabet.
>>
>>
>>
>> Possible limitations:
>>
>> 1) All seqs in the SeqIO object can only be of the same type; no
>> testing is done to see if this is not the case.
>>
>>
>>
>> Does this sound ok and reasonable?
>>
>> Nathan
>>
>>
>>
>> -----Original Message-----
>> From: Brian Osborne [mailto:brian_osborne at cognia.com]
>> Sent: 06 January 2005 12:25
>> To: nathanhaigh at ukonline.co.uk
>> Subject: RE: SeqIO fails on masked sequences
>>
>>
>>
>> Nathan,
>>
>>
>>
>> The idea is that a sequence with a high proportion of X is more likely 
>> to be DNA than protein. The examples I had in mind are
>> unfinished genomic sequences, and there are countless entries in
>> GenBank/EMBL like this. So, someone wrote in and said that their
>> genomic sequence was being characterized as protein because the fraction
>> of [gatc] was less than 85%; it was mostly X. By contrast, there
>> are no protein sequences with X in them in these public databases, if 
>> I'm not mistaken. So I maintain that in the world of public
>> databases this is the way to go.
>>
>>
>>
>> Now if you venture into the world of sequence analysis it's going to 
>> be a different story, since you'll likely mask protein with X,
>> not N, obviously. May I ask, if this person knows his/her sequence is 
>> protein then why doesn't s/he set its alphabet to "protein"?
>> Or why don't they mask with A or Z or O or something?
>>
>>
>>
>> There'll be problems either way. What is one's reference? Public
>> sequence or the less well-defined set of possible sequences?
>>
>>
>>
>> Brian O.
>>
>> -----Original Message-----
>> From: Nathan Haigh [mailto:nathanhaigh at ukonline.co.uk]
>> Sent: Wednesday, January 05, 2005 7:38 PM
>> To: 'Brian Osborne'
>> Subject: FW: SeqIO fails on masked sequences
>>
>> You committed a change to Bio::PrimarySeq where 'X' was added to the 
>> class of characters that are stripped out of sequences in the
>> _guess_alphabet subroutine. Do you know why sequences containing X 
>> were causing a problem, and why X was added to the class of
>> chars?
>>
>>
>>
>> It's causing a problem for someone who has a sequence that consists
>> entirely of masked chars (i.e. all X's), which should still be
>> "guessable" as protein.
>>
>>
>>
>> Cheers
>>
>> Nathan
>>

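For readers following the quoted discussion, the effect of stripping X
before guessing can be sketched as follows. This is plain Perl written
from the description in the thread (fraction of [gatc] compared against
85%), not the actual code in Bio::PrimarySeq. The function name, the
strip_x option, the 'unknown' return value and the exact comparison at
the 85% boundary are all assumptions made for illustration:

```perl
use strict;
use warnings;

# Hypothetical sketch of the heuristic described in the thread.
sub guess_alphabet {
    my ($seq, %opt) = @_;
    my $str = uc $seq;
    $str =~ s/[^A-Z]//g;                    # drop gaps, digits, whitespace
    $str =~ s/X//g if $opt{strip_x};        # optionally ignore masking chars
    return 'unknown' unless length $str;    # nothing left to judge (assumption)
    my $gatc = ($str =~ tr/GATC//);         # count G, A, T, C
    return $gatc / length($str) >= 0.85 ? 'dna' : 'protein';
}

my $genomic = 'GATCGATC' . ('X' x 40);      # mostly masked genomic sequence
my $masked  = 'X' x 20;                     # fully masked sequence

print guess_alphabet($genomic), "\n";                  # prints "protein"
print guess_alphabet($genomic, strip_x => 1), "\n";    # prints "dna"
print guess_alphabet($masked), "\n";                   # prints "protein"
print guess_alphabet($masked, strip_x => 1), "\n";     # prints "unknown"
```

The two inputs show both sides of the argument: stripping X rescues the
masked genomic sequence, but leaves nothing to guess from when every
residue is X.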

-- 
Wes Barris
E-Mail: Wes.Barris at csiro.au

