[BioSQL-l] multiple species for a sequence

Sun Mar 2 17:33:23 UTC 2008

On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:

> I'm looking at a bioperl bug I filed a while back that deals with  
> multiple species in a sequence file, such as found for AJ428955:
>
> ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
> XX
> AC   AJ428955;
> XX
> DT   09-JUL-2002 (Rel. 72, Created)
> DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
> XX
> DE   Hepatitis GB virus B subgenomic replicon neoRepB
> XX
> KW   core-neo fusion protein; core-neo gene; polyprotein.
> XX
> OS   Hepatitis GB virus B
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
> XX
> OS   Encephalomyocarditis virus
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
> OC   Cardiovirus.
>
> ...
>
> We could probably add support in bioperl fairly easily (Bio::Seq  
> could just return an array or the first species object based on  
> context), but would BioSQL support sequences like this?

No, it wouldn't. BioSQL allows only one species (taxon) per sequence.

There has been a lot of discussion about this in the past, mostly
driven by the former SwissProt peculiarity of collapsing sequences by
sequence identity into a single record. We held out, and eventually
UniProt dropped this practice.

I guess we never quite decided what to do about chimeric sequences  
like the above. Note that the GenBank record gives this differently:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885

Here, there's one taxon reference (the ORGANISM line), but two
localized 'source' features in the feature table. (I'm actually not
100% sure what the GenBank parser would do with this, i.e., whether
the second source feature would override the taxon_id found in the
first.) Because seqfeatures in BioSQL don't have a link to taxon, you
wouldn't be able to hit the sequence by its second (chimeric) taxon
if that were your query criterion. You could store it fine, though,
and you would find it if you queried by dbxrefs of features of type
'source'.
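
To make the querying point concrete, here is a minimal sketch in
Python with sqlite3. The tables are a heavily simplified stand-in
and not the actual BioSQL schema: the column layout, the plain-text
'type' column, the taxon dbxref accessions (placeholders, not real
NCBI taxon IDs), and the helper names by_taxon and by_source_dbxref
are all assumptions made for illustration. It also assumes the
ORGANISM line names Hepatitis GB virus B.

```python
import sqlite3

# Simplified stand-in for a few BioSQL-style tables (illustrative only).
SCHEMA = """
CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT,
    taxon_id    INTEGER REFERENCES taxon(taxon_id)  -- exactly one taxon
);
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    bioentry_id   INTEGER REFERENCES bioentry(bioentry_id),
    type          TEXT                              -- no taxon link here
);
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, dbname TEXT, accession TEXT);
CREATE TABLE seqfeature_dbxref (seqfeature_id INTEGER, dbxref_id INTEGER);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)

# One sequence, one taxon on the entry, two 'source' features with dbxrefs.
conn.executemany("INSERT INTO taxon VALUES (?, ?)",
                 [(1, "Hepatitis GB virus B"), (2, "Encephalomyocarditis virus")])
conn.execute("INSERT INTO bioentry VALUES (1, 'AJ428955', 1)")
conn.executemany("INSERT INTO seqfeature VALUES (?, ?, ?)",
                 [(1, 1, "source"), (2, 1, "source")])
# Placeholder accessions, not real NCBI taxon identifiers.
conn.executemany("INSERT INTO dbxref VALUES (?, ?, ?)",
                 [(1, "taxon", "PLACEHOLDER_HGBV"), (2, "taxon", "PLACEHOLDER_EMCV")])
conn.executemany("INSERT INTO seqfeature_dbxref VALUES (?, ?)", [(1, 1), (2, 2)])


def by_taxon(name):
    """Find sequences through the single taxon attached to the entry."""
    rows = conn.execute(
        "SELECT b.accession FROM bioentry b "
        "JOIN taxon t ON t.taxon_id = b.taxon_id WHERE t.name = ?", (name,))
    return [r[0] for r in rows]


def by_source_dbxref(accession):
    """Find sequences through dbxrefs of their 'source' features."""
    rows = conn.execute(
        "SELECT DISTINCT b.accession FROM bioentry b "
        "JOIN seqfeature f ON f.bioentry_id = b.bioentry_id "
        "JOIN seqfeature_dbxref fd ON fd.seqfeature_id = f.seqfeature_id "
        "JOIN dbxref d ON d.dbxref_id = fd.dbxref_id "
        "WHERE f.type = 'source' AND d.dbname = 'taxon' AND d.accession = ?",
        (accession,))
    return [r[0] for r in rows]
```

In this toy setup, by_taxon("Encephalomyocarditis virus") finds
nothing because the entry carries only the first taxon, whereas
by_source_dbxref("PLACEHOLDER_EMCV") does reach AJ428955 through the
second 'source' feature.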

At the end of the day, BioSQL will (hopefully) evolve quickly to
support what the Bio* toolkits support, and will be much slower to
change in ways that Bio* wouldn't be able to take advantage of
anyway. At least that's my current vision of it, and whether that's a
useful vision is of course up for debate.

So, as you say, right now neither BioPerl nor, as far as I am aware,
any of the other Bio* toolkits supports more than one species per
sequence, but as soon as that changes, there's a clear need for
BioSQL to follow along.

Does that make sense?

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================