[BioSQL-l] multiple species for a sequence
Chris Fields
cjfields at uiuc.edu
Sun Mar 2 13:00:50 EST 2008
On Mar 2, 2008, at 11:33 AM, Hilmar Lapp wrote:
> On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:
>
>> I'm looking at a bioperl bug I filed a while back that deals with
>> multiple species in a sequence file, such as found for AJ428955:
>>
>> ID AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
>> XX
>> AC AJ428955;
>> XX
>> DT 09-JUL-2002 (Rel. 72, Created)
>> DT 15-APR-2005 (Rel. 83, Last updated, Version 4)
>> XX
>> DE Hepatitis GB virus B subgenomic replicon neoRepB
>> XX
>> KW core-neo fusion protein; core-neo gene; polyprotein.
>> XX
>> OS Hepatitis GB virus B
>> OC Viruses; ssRNA positive-strand viruses, no DNA stage;
>> Flaviviridae.
>> XX
>> OS Encephalomyocarditis virus
>> OC Viruses; ssRNA positive-strand viruses, no DNA stage;
>> Picornaviridae;
>> OC Cardiovirus.
>>
>> ...
>>
>> We could probably add support in bioperl fairly easily (Bio::Seq
>> could just return an array or the first species object based on
>> context), but would BioSQL support sequences like this?
>
> No it wouldn't. There may only be one species (taxon) per sequence.
>
> There has been a lot of discussion about this in the past mostly
> driven by the former SwissProt peculiarity of collapsing sequences
> by sequence identity into a single record. We held out and
> eventually UniProt dropped this practice.
I'm unsure how often these pop up. Both the EMBL and GenBank parsers
assume one species per sequence (as does Bio::Seq); the EMBL parser
picks up both OS entries and just replaces the first with the second:
...
DE Hepatitis GB virus B subgenomic replicon neoRepB
XX
KW core-neo fusion protein; core-neo gene; polyprotein.
XX
OS Encephalomyocarditis virus
OC Viruses; ssRNA positive-strand viruses, no DNA stage;
Picornaviridae;
OC Cardiovirus.
XX
RN [1]
...
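To make the overwrite concrete, here is a minimal Python sketch of the two behaviors. This is not BioPerl code: the function names are hypothetical, and it assumes every OS line starts a new organism (real EMBL records can also continue one long organism name over several OS lines, which this ignores).

```python
def parse_single_slot(embl_text):
    """Mimic a parser with one species slot: each OS line overwrites the last."""
    species = None
    for line in embl_text.splitlines():
        if line.startswith("OS "):
            species = line[2:].strip()  # earlier value is silently lost
    return species


def parse_all_species(embl_text):
    """Collect every OS line in file order instead of keeping only one."""
    return [
        line[2:].strip()
        for line in embl_text.splitlines()
        if line.startswith("OS ")
    ]


# OS/OC lines taken from the AJ428955 excerpt quoted in this thread.
record = """\
OS   Hepatitis GB virus B
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
"""

print(parse_single_slot(record))
print(parse_all_species(record))
```

With a single slot only the second organism survives, which matches the round-tripped record above; a list keeps both, and "return the first element in scalar context" would preserve today's behavior for existing callers.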
> I guess we never quite decided what to do about chimeric sequences
> like the above. Note that the GenBank record gives this differently:
>
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885
>
> Here, there's one taxon (ORGANISM line) reference, but two localized
> 'source' features in the feature table. (I'm actually not 100% sure
> what the genbank parser would do with this - i.e., whether the
> second source feature will override the taxon_id found in the
> first.) Because seqfeatures (in BioSQL) don't have a link to taxon,
> you wouldn't be able to hit the sequence by its second (chimeric)
> taxon if that were your query criteria (though you could store it
> fine, and if you queried by dbxrefs of features of type 'source',
> you would find it).
The GenBank parser gets the taxon and tax ID correct. I would have
expected it to assign the wrong tax ID to the species object when it
hit the second source feature key, but maybe there's a secondary check.
Both parsers output the source features in the feature table just fine.
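The query limitation Hilmar describes can be sketched with an in-memory SQLite database. The tables below are a heavily simplified, hypothetical subset of the BioSQL schema (the real bioentry.taxon_id points into a separate taxon table rather than holding an NCBI ID directly), the 'taxon' dbname for source-feature dbxrefs is an assumption, and the taxon IDs 1001 and 1002 are made-up placeholders, not the real IDs for these viruses.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, accession TEXT, taxon_id INTEGER);
CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE seqfeature (seqfeature_id INTEGER PRIMARY KEY, bioentry_id INTEGER, type_term_id INTEGER);
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, dbname TEXT, accession TEXT);
CREATE TABLE seqfeature_dbxref (seqfeature_id INTEGER, dbxref_id INTEGER);

-- One sequence, one taxon slot (placeholder ID 1001).
INSERT INTO bioentry VALUES (1, 'AJ428955', 1001);
INSERT INTO term VALUES (1, 'source');
-- Two 'source' features, each cross-referenced to its own placeholder taxon.
INSERT INTO seqfeature VALUES (1, 1, 1), (2, 1, 1);
INSERT INTO dbxref VALUES (1, 'taxon', '1001'), (2, 'taxon', '1002');
INSERT INTO seqfeature_dbxref VALUES (1, 1), (2, 2);
""")


def by_bioentry_taxon(conn, taxon_id):
    """Find sequences through the single taxon link on bioentry."""
    rows = conn.execute(
        "SELECT accession FROM bioentry WHERE taxon_id = ?", (taxon_id,)
    )
    return [r[0] for r in rows]


def by_source_dbxref(conn, taxon_accession):
    """Find sequences through dbxrefs attached to 'source' features."""
    rows = conn.execute(
        """
        SELECT DISTINCT b.accession
        FROM bioentry b
        JOIN seqfeature f ON f.bioentry_id = b.bioentry_id
        JOIN term t ON t.term_id = f.type_term_id
        JOIN seqfeature_dbxref fx ON fx.seqfeature_id = f.seqfeature_id
        JOIN dbxref x ON x.dbxref_id = fx.dbxref_id
        WHERE t.name = 'source' AND x.dbname = 'taxon' AND x.accession = ?
        """,
        (taxon_accession,),
    )
    return [r[0] for r in rows]


print(by_bioentry_taxon(conn, 1002))   # second taxon via bioentry
print(by_source_dbxref(conn, "1002"))  # second taxon via source features
```

In this toy layout the second (chimeric) taxon is invisible to the bioentry lookup but reachable through the source-feature dbxrefs, which is the distinction drawn in the quoted paragraph.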
> At the end of the day, BioSQL will evolve (hopefully) quickly to
> support what the Bio* toolkits support, and will be much slower to
> change in ways that Bio* wouldn't be able to take advantage of
> anyway. At least that's my current vision of it, and of course is up
> for debate as to whether that's a useful vision as much as anything
> else.
>
> So, as you say, right now BioPerl, and AFAIAA any of the other Bio*
> toolkits, doesn't support more than one species per sequence, but as
> soon as that changes, there's a clear need for BioSQL to follow along.
>
> Does that make sense?
>
> -hilmar
Yes. I think we could add support for multiple species fairly
easily, but I'll probably hold off on anything until after a 1.6
release (i.e. push it to the next developer series, which gives us
more time to think about how to implement this in a BioSQL-friendly way).
chris