[Bioperl-l] RE: SeqIO fails on masked sequences

Sun Jan 9 20:05:13 EST 2005

Nathan Haigh wrote:

>>-----Original Message-----
>>From: Wes Barris [mailto:wes.barris at csiro.au]
>>Sent: 09 January 2005 23:43
>>To: Hilmar Lapp
>>Cc: nathanhaigh at ukonline.co.uk; 'Bioperl list'; 'Brian Osborne'
>>Subject: Re: [Bioperl-l] RE: SeqIO fails on masked sequences
>>
>>Hilmar Lapp wrote:
>>
>>>You should not require by default that all sequences in one file be of
>>>the same type (alphabet). We never have required this, nor documented
>>>that it is a (not enforced) requirement, and so there may be people out
>>>there relying on this 'feature'.
>>
>>Mixing both DNA and protein sequences in one file and then attempting
>>to process it seems like kind of a bizarre thing to want to do.  If
>>the alphabet is explicitly specified, isn't there a way to make that
>>take precedence?
> 
> 
> Why are you then able to set the alphabet of a SeqIO object if, whenever you call next_seq(), it tries to guess the alphabet of the
> sequence anyway? It seems more logical to me that the user can specify the alphabet without worrying about bioperl guessing it and
> getting it wrong, or not setting it at all.

I am guessing that you meant to direct this question to Hilmar because
I agree with you.  If one specifies the alphabet, bioperl should not
subsequently try to guess it.
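
To make the precedence concrete, here is a minimal sketch in plain Perl.
It is not the BioPerl code. guess_alphabet() and resolve_alphabet() are
hypothetical names, and the guessing rule is only my reading of the
heuristic Brian describes further down (strip X, then call it DNA if at
least 85% of what is left is [gatc]). Returning undef for a sequence
that is all X is also just an assumption made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT BioPerl's _guess_alphabet: it only mirrors
# the heuristic described in this thread.
sub guess_alphabet {
    my ($seq) = @_;
    (my $str = $seq) =~ s/[-.?xX]//g;     # drop gap characters and masking X
    my $total = length $str;
    return unless $total;                 # assumption: nothing left, no guess
    my $nuc = ($str =~ tr/gatcGATC//);    # count unambiguous bases
    return $nuc / $total >= 0.85 ? 'dna' : 'protein';
}

# Hypothetical helper: an explicitly set alphabet always wins, and
# guessing happens only when the user specified nothing.
sub resolve_alphabet {
    my ($explicit, $seq) = @_;
    return $explicit if defined $explicit;
    return guess_alphabet($seq);
}

for my $case (['protein', 'XXXXXXXX'], [undef, 'XXXXXXXX'], [undef, 'GATTACAXXX']) {
    my $alphabet = resolve_alphabet(@$case);
    printf "%-10s => %s\n", $case->[1], $alphabet // 'no guess possible';
}
```

With this ordering a fully masked sequence is only a problem when no
alphabet was given, which is the behaviour I would expect.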

> 
> 
>>>    -hilmar
>>>
>>>On Friday, January 7, 2005, at 03:39  AM, Nathan Haigh wrote:
>>>
>>>
>>>>There appears to be an anomaly with Bio::SeqIO::fasta. If the SeqIO
>>>>object's alphabet is set, next_seq() resets it to undef
>>>>and then proceeds to guess the alphabet again, so things like the
>>>>following do not work:
>>>>
>>>>my $seq_in  = Bio::SeqIO->new(-format=>$format, -fh => \*DATA);
>>>>
>>>>$seq_in->alphabet('protein');
>>>>
>>>>Should setting the SeqIO object's alphabet be honoured even if it is
>>>>set to the wrong type or the sequences are not of that
>>>>alphabet?
>>>>
>>>>
>>>>
>>>>I have a bug fix, that allows you to set the alphabet through the
>>>>SeqIO object, but it doesn't do any sort of checking to see if all
>>>>the seqs in the object are of the correct type. Essentially, the
>>>>alphabet is set in one of the following ways:
>>>>
>>>>1) if the SeqIO object is set using e.g. $seq_in->alphabet('dna'); all
>>>>the seqs that belong to the $seq_in object obtain their
>>>>alphabet from the SeqIO object, dna in this case, irrespective of
>>>>whether or not it is actually protein.
>>>>
>>>>2) If alphabet has not been set in this way, the first sequence is
>>>>used to guess the alphabet of the SeqIO object, from which all
>>>>the sequences obtain their alphabet.
>>>>
>>>>
>>>>
>>>>Possible limitations:
>>>>
>>>>1)     all seqs in the SeqIO object can only be of the same type - no
>>>>testing done to see if this is not the case.
>>>>
>>>>
>>>>
>>>>Does this sound ok and reasonable?
>>>>
>>>>Nathan
>>>>
>>>>
>>>>
>>>>-----Original Message-----
>>>>From: Brian Osborne [mailto:brian_osborne at cognia.com]
>>>>Sent: 06 January 2005 12:25
>>>>To: nathanhaigh at ukonline.co.uk
>>>>Subject: RE: SeqIO fails on masked sequences
>>>>
>>>>
>>>>
>>>>Nathan,
>>>>
>>>>
>>>>
>>>>The idea is that a sequence with a high proportion of X is more likely
>>>>to be DNA than protein. The examples I had in mind are
>>>>unfinished genomic sequence, and there are countless entries in
>>>>Genbank/EMBL like this. So, someone wrote in and said that their
>>>>genomic sequence was being characterized as protein since the fraction
>>>>[gatc] was less than 85%, it was mostly X. By contrast, there
>>>>are no protein sequences with X in them in these public databases, if
>>>>I'm not mistaken. So I maintain that in the world of public
>>>>databases this is the way to go.
>>>>
>>>>
>>>>
>>>>Now if you venture into the world of sequence analysis it's going to
>>>>be a different story, since you'll likely mask protein with X,
>>>>not N, obviously. May I ask, if this person knows his/her sequence is
>>>>protein then why doesn't s/he set its alphabet to "protein"?
>>>>Or why don't they mask with A or Z or O or something?
>>>>
>>>>
>>>>
>>>>There'll be problems either way. What is one's reference? Public
>>>>sequence or the less well-defined set of possible sequences?
>>>>
>>>>
>>>>
>>>>Brian O.
>>>>
>>>>-----Original Message-----
>>>>From: Nathan Haigh [mailto:nathanhaigh at ukonline.co.uk]
>>>>Sent: Wednesday, January 05, 2005 7:38 PM
>>>>To: 'Brian Osborne'
>>>>Subject: FW: SeqIO fails on masked sequences
>>>>
>>>>You committed a change to Bio::PrimarySeq where 'X' was added to the
>>>>class of characters that are stripped out of sequences in the
>>>>_guess_alphabet subroutine. Do you know why sequences containing X
>>>>were causing a problem, and why X was added to the class of
>>>>chars?
>>>>
>>>>
>>>>
>>>>It's causing a problem for someone who has a sequence that consists
>>>>entirely of masked chars (i.e. all X's), which should still be
>>>>"guessable" as protein.
>>>>
>>>>
>>>>
>>>>Cheers
>>>>
>>>>Nathan
>>>>
>>>>_______________________________________________
>>>>Bioperl-l mailing list
>>>>Bioperl-l at portal.open-bio.org
>>>>http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>>>
>>>>
>>
>>
>>--
>>Wes Barris
>>E-Mail: Wes.Barris at csiro.au
>>
>>
> 
> 
> 
> 
> 

-- 
Wes Barris
E-Mail: Wes.Barris at csiro.au