[Bioperl-l] RE: SeqIO fails on masked sequences

Nathan Haigh nathanhaigh at ukonline.co.uk
Fri Jan 7 06:39:16 EST 2005


There appears to be an anomaly with Bio::SeqIO::fasta. If the SeqIO object's alphabet is set, next_seq() resets it to undef
and then proceeds to guess the alphabet again, so things like the following do not work:

my $seq_in  = Bio::SeqIO->new(-format=>$format, -fh => \*DATA);

$seq_in->alphabet('protein');

Should setting the SeqIO object's alphabet be honoured even if it is set to the wrong type or the sequences are not of that
alphabet?

 

I have a bug fix that allows you to set the alphabet through the SeqIO object, but it doesn't do any checking to see whether all
the seqs in the object are of the correct type. Essentially, the alphabet is set in one of the following ways:

1) If the alphabet is set on the SeqIO object, e.g. $seq_in->alphabet('dna');, all the seqs that belong to the $seq_in object obtain
their alphabet from the SeqIO object (dna in this case), irrespective of whether or not they are actually protein.

2) If alphabet has not been set in this way, the first sequence is used to guess the alphabet of the SeqIO object, from which all
the sequences obtain their alphabet.
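
To make the two cases above concrete, here is a minimal sketch in plain Perl. It is not the actual patch and does not use Bio::SeqIO at all: resolve_alphabet, the hash standing in for the stream, and the toy guesser are all hypothetical names made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a sequence stream; this is NOT Bio::SeqIO.
# $guesser is any code ref that maps a sequence string to an alphabet name.
sub resolve_alphabet {
    my ($stream, $seq_string, $guesser) = @_;

    # Case 1: an alphabet set explicitly on the stream is honoured as-is,
    # with no check that the sequence really is of that type.
    return $stream->{alphabet} if defined $stream->{alphabet};

    # Case 2: nothing was set, so guess from this (the first) sequence and
    # remember the result so that later sequences inherit it.
    $stream->{alphabet} = $guesser->($seq_string);
    return $stream->{alphabet};
}

# Toy guesser for the demo only (assumes a non-empty string):
# mostly A/C/G/T means dna, anything else is called protein.
my $toy_guess = sub {
    my ($s) = @_;
    my $acgt = ($s =~ tr/ACGTacgt//);
    return $acgt / length($s) > 0.85 ? 'dna' : 'protein';
};

my %explicit = (alphabet => 'protein');
my %unset    = ();

print resolve_alphabet(\%explicit, 'ACGTACGT', $toy_guess), "\n";  # explicit setting wins
print resolve_alphabet(\%unset,    'ACGTACGT', $toy_guess), "\n";  # guessed from first seq
print resolve_alphabet(\%unset,    'MKVLWAAL', $toy_guess), "\n";  # inherits the first guess
```

The third call shows the limitation listed below: once the first sequence has fixed the alphabet, a later sequence of a different type is not re-examined.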

 

Possible limitations:

1) All seqs in the SeqIO object are assumed to be of the same type; no testing is done to check whether this is actually the case.

 

Does this sound ok and reasonable?

Nathan

 

-----Original Message-----
From: Brian Osborne [mailto:brian_osborne at cognia.com] 
Sent: 06 January 2005 12:25
To: nathanhaigh at ukonline.co.uk
Subject: RE: SeqIO fails on masked sequences

 

Nathan,

 

The idea is that a sequence with a high proportion of X is more likely to be DNA than protein. The examples I had in mind are
unfinished genomic sequence, and there are countless entries in Genbank/EMBL like this. So, someone wrote in and said that their
genomic sequence was being characterized as protein since the fraction [gatc] was less than 85%, it was mostly X. By contrast, there
are no protein sequences with X in them in these public databases, if I'm not mistaken. So I maintain that in the world of public
databases this is the way to go.
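
A rough sketch of the heuristic described above, in plain Perl. This is a simplified illustration and not the code in Bio::PrimarySeq: guess_alphabet and its strip_x option are hypothetical, only the 85% [gatc] threshold comes from this thread, and returning undef when nothing is left to judge is an assumption.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT the BioPerl implementation.
# strip_x => 1 ignores X before counting, as in the change discussed here.
sub guess_alphabet {
    my ($seq, %opt) = @_;
    my $str = $seq;
    $str =~ s/x//gi if $opt{strip_x};     # optionally drop masked positions

    my $total = length $str;
    return undef if $total == 0;          # assumption: nothing left, no guess

    my $acgt = ($str =~ tr/ACGTacgt//);   # count nucleotide letters
    return $acgt / $total > 0.85 ? 'dna' : 'protein';
}

my $masked_genomic = 'ACGTXXXXXXXXXXXXACGT';   # 8 of 20 letters are A/C/G/T

print guess_alphabet($masked_genomic, strip_x => 0), "\n";   # X counted against dna
print guess_alphabet($masked_genomic, strip_x => 1), "\n";   # X ignored

my $all_x = guess_alphabet('XXXXXXXX', strip_x => 1);
print defined $all_x ? $all_x : 'no guess possible', "\n";
```

With X counted, the mostly-masked genomic sequence falls below the threshold and comes out as protein; with X stripped it comes out as dna. The last call shows the other side of the trade-off: a fully masked sequence leaves nothing to base a guess on.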

 

Now if you venture into the world of sequence analysis it's going to be a different story, since you'll likely mask protein with X,
not N, obviously. May I ask, if this person knows his/her sequence is protein then why doesn't s/he set its alphabet to "protein"?
Or why don't they mask with A or Z or O or something?

 

There'll be problems either way. What is one's reference? Public sequence or the less well-defined set of possible sequences?

 

Brian O.

-----Original Message-----
From: Nathan Haigh [mailto:nathanhaigh at ukonline.co.uk]
Sent: Wednesday, January 05, 2005 7:38 PM
To: 'Brian Osborne'
Subject: FW: SeqIO fails on masked sequences

You committed a change to Bio::PrimarySeq where 'X' was added to the class of characters that are stripped out of sequences in the
_guess_alphabet subroutine. Do you know why sequences containing X were causing a problem, and why X was added to the class of
chars?

 

It's causing a problem for someone who has a sequence that consists entirely of masked chars (i.e. all X's), which should still be
"guessable" as protein.

 

Cheers

Nathan









