[BioSQL-l] multiple species for a sequence
Hilmar Lapp
hlapp at gmx.net
Sun Mar 2 17:33:23 UTC 2008
On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:
> I'm looking at a bioperl bug I filed a while back that deals with
> multiple species in a sequence file, such as found for AJ428955:
>
> ID AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
> XX
> AC AJ428955;
> XX
> DT 09-JUL-2002 (Rel. 72, Created)
> DT 15-APR-2005 (Rel. 83, Last updated, Version 4)
> XX
> DE Hepatitis GB virus B subgenomic replicon neoRepB
> XX
> KW core-neo fusion protein; core-neo gene; polyprotein.
> XX
> OS Hepatitis GB virus B
> OC Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
> XX
> OS Encephalomyocarditis virus
> OC Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
> OC Cardiovirus.
>
> ...
>
> We could probably add support in bioperl fairly easily (Bio::Seq
> could just return an array or the first species object based on
> context), but would BioSQL support sequences like this?
No, it wouldn't. There may only be one species (taxon) per sequence.
There has been a lot of discussion about this in the past, mostly
driven by the former SwissProt peculiarity of collapsing identical
sequences from different species into a single record. We held out,
and eventually UniProt dropped this practice.
I guess we never quite decided what to do about chimeric sequences
like the one above. Note that the GenBank record presents this
differently:
http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885
Here, there's one taxon (ORGANISM line) reference, but two localized
'source' features in the feature table. (I'm actually not 100% sure
what the GenBank parser would do with this, i.e., whether the second
source feature would override the taxon_id found in the first.)
Because seqfeatures (in BioSQL) don't have a link to taxon, you
wouldn't be able to find the sequence by its second (chimeric) taxon
if that were your query criterion (though you could store it fine, and
if you queried by dbxrefs of features of type 'source', you would
find it).
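
To make the difference between the two lookups concrete, here is a
rough sketch in Python using sqlite3. It is only an illustration under
several assumptions: the tables are a heavily cut-down subset of the
BioSQL schema (the real tables have more columns and constraints), the
loader is assumed to store each 'source' feature's taxon db_xref as a
dbxref row with dbname 'taxon', and the taxon IDs 1001 and 1002 are
made-up placeholders, not the real NCBI IDs of the two viruses.

```python
import sqlite3

# Cut-down subset of the BioSQL tables involved (assumption: just enough
# columns to run the two queries, not the full schema).
SCHEMA = """
CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, ncbi_taxon_id INTEGER);
CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, accession TEXT,
                       taxon_id INTEGER);
CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE seqfeature (seqfeature_id INTEGER PRIMARY KEY,
                         bioentry_id INTEGER, type_term_id INTEGER);
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, dbname TEXT,
                     accession TEXT);
CREATE TABLE seqfeature_dbxref (seqfeature_id INTEGER, dbxref_id INTEGER);
"""


def build_demo_db():
    """Create an in-memory database holding one chimeric entry."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executescript("""
        -- The bioentry links to exactly one taxon (placeholder ID 1001).
        INSERT INTO taxon VALUES (1, 1001);
        INSERT INTO bioentry VALUES (1, 'AJ428955', 1);
        -- Two 'source' features, each with its own taxon dbxref.
        INSERT INTO term VALUES (1, 'source');
        INSERT INTO seqfeature VALUES (1, 1, 1), (2, 1, 1);
        INSERT INTO dbxref VALUES (1, 'taxon', '1001'), (2, 'taxon', '1002');
        INSERT INTO seqfeature_dbxref VALUES (1, 1), (2, 2);
    """)
    return conn


def by_bioentry_taxon(conn, ncbi_taxon_id):
    """Find entries through the single bioentry -> taxon link."""
    rows = conn.execute(
        """SELECT b.accession
           FROM bioentry b
           JOIN taxon t ON t.taxon_id = b.taxon_id
           WHERE t.ncbi_taxon_id = ?""",
        (ncbi_taxon_id,),
    )
    return [row[0] for row in rows]


def by_source_dbxref(conn, ncbi_taxon_id):
    """Find entries through taxon dbxrefs attached to 'source' features."""
    rows = conn.execute(
        """SELECT DISTINCT b.accession
           FROM bioentry b
           JOIN seqfeature f ON f.bioentry_id = b.bioentry_id
           JOIN term t ON t.term_id = f.type_term_id
           JOIN seqfeature_dbxref fx ON fx.seqfeature_id = f.seqfeature_id
           JOIN dbxref x ON x.dbxref_id = fx.dbxref_id
           WHERE t.name = 'source'
             AND x.dbname = 'taxon'
             AND x.accession = ?""",
        (str(ncbi_taxon_id),),
    )
    return [row[0] for row in rows]
```

With this toy data, by_bioentry_taxon() only finds the entry for the
first taxon, whereas by_source_dbxref() finds it for either one. Whether
a given loader actually populates seqfeature_dbxref this way would need
to be checked against that toolkit.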
At the end of the day, BioSQL will (hopefully) evolve quickly to
support what the Bio* toolkits support, and will be much slower to
change in ways that Bio* wouldn't be able to take advantage of
anyway. At least that's my current vision of it, and of course
whether that's a useful vision is up for debate.
So, as you say, right now neither BioPerl nor, as far as I am aware,
any of the other Bio* toolkits supports more than one species per
sequence, but as soon as that changes, there's a clear need for BioSQL
to follow along.
Does that make sense?
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================