[Bioperl-l] Bio::Species changes

Hilmar Lapp hlapp@gnf.org
Sun, 6 Oct 2002 23:00:17 -0700


I committed a number of changes. These are the changes concerning 
Bio::Species:

- there is now a second way of calling classification:
	$species->classification(\@classif_array, "FORCE");
In set mode, the first parameter can now be an array reference. If 
it is, and a second parameter is present and evaluates to TRUE, no 
name validation whatsoever will be done.
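
For readers who want to see the two calling conventions side by side, here is a minimal sketch. It is not the Bio::Species code that was committed: the package SpeciesSketch, the placeholder validation regex, and the sample names are assumptions made up for illustration.

```perl
use strict;
use warnings;

package SpeciesSketch;    # hypothetical stand-in, NOT Bio::Species itself

sub new { return bless { classification => [] }, shift }

# Get/set. Set mode accepts either a plain list of names (validated)
# or an array reference plus an optional flag that skips validation.
sub classification {
    my ($self, @args) = @_;
    if (@args) {
        my ($names, $force);
        if (ref($args[0]) eq 'ARRAY') {
            ($names, $force) = @args;    # new form: (\@names, $force)
        } else {
            $names = [@args];            # traditional form: list of strings
        }
        unless ($force) {
            for my $name (@$names) {
                # Placeholder rule (assumption); the real validator may differ.
                die "Invalid name '$name'\n"
                    unless $name =~ /^[A-Za-z ]+$/;
            }
        }
        $self->{classification} = [@$names];
    }
    return @{ $self->{classification} };
}

package main;

my $species = SpeciesSketch->new;
# Illustrative values only; the first one fails the placeholder check.
my @classif_array = ('A/Equine/Fontainebleau/76', 'Influenza A virus');

# Without the flag, the array reference is still validated.
my $ok = eval { $species->classification(\@classif_array); 1 };
print "validated call: ", ($ok ? "accepted" : "rejected"), "\n";

# With a true second argument, the names are stored as-is.
$species->classification(\@classif_array, "FORCE");
print join("; ", $species->classification), "\n";
```

Any true value works as the second argument in this sketch; "FORCE" is just a readable choice.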

- new method variant() (get/set)
This will hold the (potentially literal) information regarding the 
variant of the species, like a strain, isolate, variety, etc. I 
modified the swissprot parser to extract this information properly. 
By potentially literal I mean that e.g. swissprot gives this in a 
form like '(strain A/Equine/Fontainebleau/76)' and except for the 
enclosing parentheses the parser will not change this string (i.e., 
'strain' remains there).
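
A minimal sketch of the parenthesis stripping described above. The helper strip_parens is a hypothetical name used only for illustration and is not the code in the swissprot parser.

```perl
use strict;
use warnings;

# Hypothetical helper: remove one pair of enclosing parentheses (and any
# surrounding whitespace), leaving the enclosed text untouched.
sub strip_parens {
    my ($text) = @_;
    $text =~ s/^\s*\((.*)\)\s*$/$1/;
    return $text;
}

# 'strain' stays in the result; only the parentheses go away.
print strip_parens('(strain A/Equine/Fontainebleau/76)'), "\n";
```

A string without enclosing parentheses is returned unchanged, since the substitution does not match.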

I have now loaded 65k sequences of swissprot rel. 40 into biosql. 
There's still some fall-out, for which I'll commit fixes soon. Other 
than that, the results look pretty good so far.

	-hilmar

On Saturday, October 5, 2002, at 07:16 AM, Ewan Birney wrote:

>
>
> On Fri, 4 Oct 2002, Hilmar Lapp wrote:
>
>> I thought I'd take on the smoothest ride first by dumping swissprot 
>> (rel. 40) into biosql. This turned out to be painful, and here's 
>> why and what I did and what else we need to do to relieve some of 
>> the pain.
>>
>> I'm writing this line at the end because it became a long email. 
>> So I thought I better put a summary here.
>>
>> <summary>
>> - Species and classification names choke the parser. I have a 
>> solution to fix the problem once and for all.
>
> great. I approve. I also was doing this.
>
>> - Swissprot entry to species is an n:n relationship. The parser 
>> screwed up the species. I have a short-term fix, but generally 
>> speaking this is a total nightmare.
>
> It is a long-standing gripe I have with swissprot and I doubt
> they are going to change their spots in a hurry. I consider this to
> be "insane" but can't talk swissprot out of this.
>
>> - 'Common name' isn't always a common name, but sometimes a strain 
>> or isolate, which is crucial for identifying the species. I 
>> propose a solution.
>> - Virus classification scheme is not handled properly, and I don't 
>> know how it should be. Need an expert.
>> </summary>
>>
>> Read on to share my pain.
>>
>> 1) Species names do not conform to what we in Bioperl think they 
>> should conform to. There are endless variants with ever new 
>> non-letter characters being used even in species name, especially 
>> for viruses and bacteria. What's really painful about this is that 
>> our name validators throw an exception (Elia, you were so right) 
>> and the parser chokes.
>>
>> I honestly see no point in us trying to keep up with the fancy 
>> names of viruses and bacteria classifications, if in the end we 
>> have to trust the sources anyway. So, I decided to fix this 
>> problem once and for all by doing exactly that: 
>> $species->classification() in addition to the traditional array of 
>> strings will now also accept another form of being called in set 
>> mode: if the first argument is a reference to an array, the second 
>> argument is checked whether it evaluates to true. If it does, no 
>> name validation whatsoever is done. I.e., 'trust the caller.' I 
>> modified the swissprot parser accordingly. I.e., trust swissprot 
>> species and classification names, however weird they may read.
>>
>> It works for me. Does anyone have a problem with me committing 
>> this? I also suggest that we modify the genbank, embl, etc parsers 
>> accordingly.
>>
>
<snip>
>>
>> 3) Identifiability of a species. (Full) Binomial is not enough as it
>> turns out, as for microorganisms different strains and/or isolates get
>> different NCBI_TaxIDs. Also, the term in parentheses on the OS line in
>> these cases does not indicate a common name (which is supposedly
>> redundant with the binomial in terms of identifiability), but the name
>> of the strain or isolate, and is therefore a key part of the
>> species' name (i.e., it's semantically overloaded). I propose the
>> following to fix this.
>>
>> 	- add an attribute variant() to Bio::Species, holding the
>> un-interpreted value in parentheses if it appears not to be the common
>> name. (e.g. 'isolate Gambia', 'PYSG', or 'strain PSG').
>>
>> 	- pass the value in parentheses either to variant() or to
>> common_name(), depending on some magical logic ...
>
>
> Sounds good.
>

--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------