[BioSQL-l] multiple species for a sequence

Sun Mar 2 13:00:50 EST 2008

On Mar 2, 2008, at 11:33 AM, Hilmar Lapp wrote:

> On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:
>
>> I'm looking at a bioperl bug I filed a while back that deals with  
>> multiple species in a sequence file, such as found for AJ428955:
>>
>> ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
>> XX
>> AC   AJ428955;
>> XX
>> DT   09-JUL-2002 (Rel. 72, Created)
>> DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
>> XX
>> DE   Hepatitis GB virus B subgenomic replicon neoRepB
>> XX
>> KW   core-neo fusion protein; core-neo gene; polyprotein.
>> XX
>> OS   Hepatitis GB virus B
>> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
>> XX
>> OS   Encephalomyocarditis virus
>> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
>> OC   Cardiovirus.
>>
>> ...
>>
>> We could probably add support in bioperl fairly easily (Bio::Seq  
>> could just return an array or the first species object based on  
>> context), but would BioSQL support sequences like this?
>
> No it wouldn't. There may only be one species (taxon) per sequence.
>
> There has been a lot of discussion about this in the past mostly  
> driven by the former SwissProt peculiarity of collapsing sequences  
> by sequence identity into a single record. We held out and  
> eventually UniProt dropped this practice.

I'm unsure how often these pop up. Both the EMBL and GenBank parsers
assume one species per record (as does Bio::Seq); the EMBL parser reads
both OS blocks and simply replaces the first with the second:

...
DE   Hepatitis GB virus B subgenomic replicon neoRepB
XX
KW   core-neo fusion protein; core-neo gene; polyprotein.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
XX
RN   [1]
...
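
To illustrate the overwrite described above, here is a minimal sketch in
Python (not the actual BioPerl parser code). The function names and the
trimmed-down record text are assumptions made up for this example; it
contrasts a "last OS line wins" reader with one that keeps every
OS/OC block in file order.

```python
# Trimmed stand-in for the AJ428955 record (only the lines that matter here).
EMBL_TEXT = """\
ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
XX
OS   Hepatitis GB virus B
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
XX
"""


def parse_species_last_wins(text):
    """Single-species reader: each OS line overwrites the previous one."""
    species = None
    for line in text.splitlines():
        if line.startswith("OS   "):
            species = line[5:].strip()
    return species


def parse_species_all(text):
    """Multi-species reader: collect every OS block with its OC lineage."""
    entries = []
    for line in text.splitlines():
        if line.startswith("OS   "):
            entries.append({"name": line[5:].strip(), "lineage": []})
        elif line.startswith("OC   ") and entries:
            # OC lines are ';'-separated and the last one ends with '.'
            body = line[5:].strip().rstrip(".;")
            taxa = [t.strip() for t in body.split(";") if t.strip()]
            entries[-1]["lineage"].extend(taxa)
    return entries


if __name__ == "__main__":
    print(parse_species_last_wins(EMBL_TEXT))
    for entry in parse_species_all(EMBL_TEXT):
        print(entry["name"], "|", "; ".join(entry["lineage"]))
```

With the second reader, a caller that only wants one species can still
take the first entry, which is roughly the "array or first species
object" idea from the original question.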

> I guess we never quite decided what to do about chimeric sequences  
> like the above. Note that the GenBank record gives this differently:
>
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885
>
> Here, there's one taxon (ORGANISM line) reference, but two localized  
> 'source' features in the feature table. (I'm actually not 100% sure  
> what the genbank parser would do with this - i.e., whether the  
> second source feature will override the taxon_id found in the  
> first.) Because seqfeatures (in BioSQL) don't have a link to taxon,  
> you wouldn't be able to hit the sequence by its second (chimeric)  
> taxon if that were your query criteria (though you could store it  
> fine, and if you queried by dbxrefs of features of type 'source',  
> you would find it).

The GenBank parser gets the taxon and tax ID right. I would have
expected it to assign the wrong tax ID to the species object when it
hit the second source feature key, but maybe there's a secondary
check. Both parsers output the source features in the feature table
just fine.
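
The quoted point about querying can be made concrete with a small
sketch using Python's sqlite3 module. The table and column names are
loosely modeled on BioSQL but heavily simplified (the real schema
stores the feature type as a term reference, links bioentry to a
separate taxon table, and has many more columns), and the taxon IDs
1001 and 1002 are placeholders, not real NCBI IDs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified, hypothetical schema: one taxon per bioentry, none per feature.
CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY,
                       accession TEXT, taxon_id INTEGER);
CREATE TABLE seqfeature (seqfeature_id INTEGER PRIMARY KEY,
                         bioentry_id INTEGER, type TEXT);
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY,
                     dbname TEXT, accession TEXT);
CREATE TABLE seqfeature_dbxref (seqfeature_id INTEGER, dbxref_id INTEGER);

-- One chimeric entry: primary taxon 1001, two 'source' features.
INSERT INTO bioentry VALUES (1, 'AJ428955', 1001);
INSERT INTO seqfeature VALUES (10, 1, 'source'), (11, 1, 'source');
INSERT INTO dbxref VALUES (100, 'taxon', '1001'), (101, 'taxon', '1002');
INSERT INTO seqfeature_dbxref VALUES (10, 100), (11, 101);
""")


def by_bioentry_taxon(taxon_id):
    """Find entries by the single taxon attached to the sequence itself."""
    rows = conn.execute(
        "SELECT accession FROM bioentry WHERE taxon_id = ?", (taxon_id,))
    return [r[0] for r in rows]


def by_source_dbxref(taxon_id):
    """Find entries via the taxon dbxrefs of their 'source' features."""
    rows = conn.execute(
        """SELECT DISTINCT b.accession
           FROM bioentry b
           JOIN seqfeature f ON f.bioentry_id = b.bioentry_id
           JOIN seqfeature_dbxref fx ON fx.seqfeature_id = f.seqfeature_id
           JOIN dbxref x ON x.dbxref_id = fx.dbxref_id
           WHERE f.type = 'source' AND x.dbname = 'taxon'
             AND x.accession = ?""", (str(taxon_id),))
    return [r[0] for r in rows]


if __name__ == "__main__":
    print(by_bioentry_taxon(1002))  # second taxon is not on the bioentry
    print(by_source_dbxref(1002))   # but is reachable via source dbxrefs
```

Under these assumptions the second (chimeric) taxon is stored without
loss, but only the feature-level query finds the sequence by it.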

> At the end of the day, BioSQL will evolve (hopefully) quickly to  
> support what the Bio* toolkits support, and will be much slower to  
> change in ways that Bio* wouldn't be able to take advantage of  
> anyway. At least that's my current vision of it, and of course is up  
> for debate as to whether that's a useful vision as much as anything  
> else.
>
> So, as you say, right now BioPerl, and AFAIAA any of the other Bio*  
> toolkits, doesn't support more than one species per sequence, but as  
> soon as that changes, there's a clear need for BioSQL to follow along.
>
> Does that make sense?
>
> 	-hilmar

Yes.  I think we could add in support for multiple species fairly  
easily but I'll probably hold off on anything until after a 1.6  
release (i.e. push it to the next developer series, which gives us  
more time to think on how to implement this in a BioSQL-friendly way).

chris