[Bioperl-l] Swiss-Prot species

Karger, Amir AKarger@CuraGen.com
Thu, 12 Jul 2001 15:54:18 -0400


I was wondering if people have any thoughts about the fact that a Swiss-Prot
sequence can come from multiple species. A number of entries in Swiss-Prot
are shared by Rat and Mouse, for example (and some by even more than two
species).

Just to make things more complicated, it appears that one species is
sometimes listed with two common names.

This means there are two possible screwups, which show up as different
values of $swissThingy->species->common_name. (After the (sub)species name,
read_swissprot_Species matches /\(.*\)/ , which is greedy: it runs from the
first open parenthesis to the last close parenthesis.)
If there are two common names, you get something like

    "Pumpkin) (Winter squash"

If there are two species, you get something like

    "Mouse), and Rattus norvegicus (Rat"

Even if the species handling can't be totally fixed, it seems like in the
interim it would be slightly better to do a non-greedy match, /\(.*?\)/ .
That yields only the first parenthesized name, which may be incomplete but
at least isn't nonsensical.
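
To make the difference concrete, here is a small standalone sketch (not the
actual read_swissprot_Species code). The two input strings and the
common_name() helper are made up for illustration, and I added a capture
group to the patterns so the matched text can be returned.

```perl
use strict;
use warnings;

# Illustrative species strings; the exact text is an assumption.
my @os_lines = (
    'Cucurbita maxima (Pumpkin) (Winter squash).',
    'Mus musculus (Mouse), and Rattus norvegicus (Rat).',
);

# Hypothetical helper: return the text captured between parentheses,
# using either the greedy or the non-greedy pattern.
sub common_name {
    my ($line, $greedy) = @_;
    my $re = $greedy ? qr/\((.*)\)/ : qr/\((.*?)\)/;
    return $line =~ $re ? $1 : undef;
}

for my $line (@os_lines) {
    printf "greedy:     %s\n", common_name($line, 1);
    printf "non-greedy: %s\n", common_name($line, 0);
}
```

With the greedy pattern the capture spans everything between the first "("
and the last ")". With the non-greedy one it stops at the first ")", so you
get "Pumpkin" and "Mouse".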

The problem with trying anything more complicated is that for the
two-common-name problem, you'd have to extend Bio::Species functionality.
And for the two-species problem, you'd have to extend Bio::SeqI
functionality (or maybe Bio::Seq::RichSeq, if you think only RichSeqs will
have >1 species). I guess you would need to change the methods to use
wantarray or something similar, so that they return a list of species in
list context but still a single Bio::Species in scalar context, since all
the old code expects to get back just one Bio::Species from the species
method, for example. Yuck. Anyway, I don't feel like I have enough of a
handle on the biology here (OR bioperl) to know what sort of solution would
be right. But I thought I'd mention it so I can keep the rest of you up at
night :)
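
For what it's worth, the wantarray idea might look roughly like the sketch
below. My::MultiSpeciesSeq is a made-up class, not anything in bioperl, and
plain strings stand in for Bio::Species objects. It is only meant to show
the calling convention, not a proposed implementation.

```perl
use strict;
use warnings;

package My::MultiSpeciesSeq;   # hypothetical class, not part of bioperl

sub new {
    my ($class, @species) = @_;
    return bless { species => [@species] }, $class;
}

# List context: return every species.
# Scalar context: return only the first, as old callers would expect.
sub species {
    my ($self) = @_;
    return wantarray ? @{ $self->{species} } : $self->{species}[0];
}

package main;

my $seq = My::MultiSpeciesSeq->new('Mus musculus', 'Rattus norvegicus');

my $first = $seq->species;   # scalar context
my @all   = $seq->species;   # list context

print "scalar: $first\n";
print "list:   @all\n";
```

One catch: old code that happens to call the method in list context, for
example as an argument to another function, would suddenly get more than
one value back, so this is not completely backward compatible either.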

Amir Karger
CuraGen Corporation