[Bioperl-l] RE: SeqIO fails on masked sequences

Hilmar Lapp hlapp at gmx.net
Sat Jan 8 02:37:36 EST 2005


You should not require by default that all sequences in one file be of 
the same type (alphabet). We never have required this, nor documented 
that it is a (not enforced) requirement, and so there may be people out 
there relying on this 'feature'.

	-hilmar

On Friday, January 7, 2005, at 03:39  AM, Nathan Haigh wrote:

> There appears to be an anomaly with Bio::SeqIO::fasta. If the SeqIO 
> object's alphabet is set, next_seq() resets it to undef
> and then proceeds to guess the alphabet again, so things like the 
> following do not work:
>
> my $seq_in  = Bio::SeqIO->new(-format=>$format, -fh => \*DATA);
>
> $seq_in->alphabet('protein');
>
> Should setting the SeqIO object's alphabet be honoured even if it is 
> set to the wrong type or the sequences are not of that
> alphabet?
>
>
>
> I have a bug fix that allows you to set the alphabet through the 
> SeqIO object, but it doesn't do any checking to see whether all
> the seqs in the object are of the correct type. Essentially, the 
> alphabet is set in one of the following ways:
>
> 1) if the SeqIO object is set using e.g. $seq_in->alphabet('dna'); all 
> the seqs that belong to the $seq_in object obtain their
> alphabet from the SeqIO object, dna in this case, irrespective of 
> whether or not it is actually protein.
>
> 2) If alphabet has not been set in this way, the first sequence is 
> used to guess the alphabet of the SeqIO object, from which all
> the sequences obtain their alphabet.
>
>
>
> Possible limitations:
>
> 1)     all seqs in the SeqIO object can only be of the same type - no 
> testing done to see if this is not the case.
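
A minimal sketch of the resolution order discussed above, in plain Perl with no BioPerl dependency. It is not the proposed patch: crude_guess and resolve_alphabets are hypothetical names, and unlike option 2 it guesses per record rather than from the first sequence, so a file with mixed alphabets still works when nothing has been set explicitly.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a real guesser; deliberately crude.
sub crude_guess {
    my ($seq) = @_;
    return $seq =~ /^[ACGTNacgtn]+$/ ? 'dna' : 'protein';
}

# Resolve an alphabet for every record. An explicitly set stream-level
# alphabet wins, right or wrong; otherwise each record is guessed on
# its own instead of inheriting the guess made for the first record.
sub resolve_alphabets {
    my ($explicit, @seqs) = @_;
    return map { defined $explicit ? $explicit : crude_guess($_) } @seqs;
}

my @mixed = ('ACGTACGTNN', 'MKVLAAGIWR');
print join(',', resolve_alphabets(undef,     @mixed)), "\n";  # dna,protein
print join(',', resolve_alphabets('protein', @mixed)), "\n";  # protein,protein
```

The explicit setting is honoured without any check, which matches case 1 above; whether a mismatch should warn is a separate design question.
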
>
>
>
> Does this sound ok and reasonable?
>
> Nathan
>
>
>
> -----Original Message-----
> From: Brian Osborne [mailto:brian_osborne at cognia.com]
> Sent: 06 January 2005 12:25
> To: nathanhaigh at ukonline.co.uk
> Subject: RE: SeqIO fails on masked sequences
>
>
>
> Nathan,
>
>
>
> The idea is that a sequence with a high proportion of X is more likely 
> to be DNA than protein. The examples I had in mind are
> unfinished genomic sequence, and there are countless entries in 
> Genbank/EMBL like this. So, someone wrote in and said that their
> genomic sequence was being characterized as protein since the fraction 
> [gatc] was less than 85%, it was mostly X. By contrast, there
> are no protein sequences with X in them in these public databases, if 
> I'm not mistaken. So I maintain that in the world of public
> databases this is the way to go.
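
To make the trade-off concrete, here is a minimal sketch in plain Perl of a guesser along the lines described above: mask characters are dropped before the fraction of [gatc] is compared with 85%. It is not the code in Bio::PrimarySeq; guess_alphabet_sketch, its strip_x and threshold options, and the set of gap characters are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not the BioPerl implementation: guess an alphabet
# from the fraction of A/C/G/T among the residues left after gap symbols
# (and, optionally, the mask character X) have been removed.
sub guess_alphabet_sketch {
    my ($seq, %opt) = @_;
    my $strip_x   = exists $opt{strip_x}   ? $opt{strip_x}   : 1;
    my $threshold = exists $opt{threshold} ? $opt{threshold} : 0.85;

    (my $str = $seq) =~ s/[-.?]//g;   # drop gap/unknown symbols (assumed set)
    $str =~ s/x//gi if $strip_x;      # drop mask characters

    my $total = length $str;
    return undef if $total == 0;      # nothing left to judge: cannot guess

    my $acgt = ($str =~ tr/ACGTacgt//);   # tr with no replacement only counts
    return $acgt / $total >= $threshold ? 'dna' : 'protein';
}

my $masked_genomic = 'ACGT' . ('X' x 16);
print guess_alphabet_sketch($masked_genomic), "\n";                # dna
print guess_alphabet_sketch($masked_genomic, strip_x => 0), "\n";  # protein
```

With strip_x enabled, a mostly-X genomic sequence comes out as dna, but an all-X sequence leaves nothing to count, so the sketch returns undef rather than picking a side. That is the case where an explicitly set alphabet is the only reliable answer.
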
>
>
>
> Now if you venture into the world of sequence analysis it's going to 
> be a different story, since you'll likely mask protein with X,
> not N, obviously. May I ask, if this person knows his/her sequence is 
> protein then why doesn't s/he set its alphabet to "protein"?
> Or why don't they mask with A or Z or O or something?
>
>
>
> There'll be problems either way. What is one's reference? Public 
> sequence or the less well-defined set of possible sequences?
>
>
>
> Brian O.
>
> -----Original Message-----
> From: Nathan Haigh [mailto:nathanhaigh at ukonline.co.uk]
> Sent: Wednesday, January 05, 2005 7:38 PM
> To: 'Brian Osborne'
> Subject: FW: SeqIO fails on masked sequences
>
> You committed a change to Bio::PrimarySeq where 'X' was added to the 
> class of characters that are stripped out of sequences in the
> _guess_alphabet subroutine. Do you know why sequences containing X 
> were causing a problem, and why X was added to the class of
> chars?
>
>
>
> It's causing a problem for someone who has a sequence that consists 
> entirely of masked chars (i.e. all X's), which should still be
> "guessable" as protein.
>
>
>
> Cheers
>
> Nathan
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at portal.open-bio.org
> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------



