[BioSQL-l] multiple species for a sequence
Mark Schreiber
markjschreiber at gmail.com
Tue Mar 4 21:06:17 EST 2008
BioJava doesn't support multiple taxa per sequence. It's something to
consider though.
Philosophically you really have to wonder about the meaning of species
when you have a chimera : ) Should it not be a hybrid species all on
its own? I wonder what they will do when Craig Venter produces
Craigus ventus...
- Mark
On Mon, Mar 3, 2008 at 2:00 AM, Chris Fields <cjfields at uiuc.edu> wrote:
>
>
> On Mar 2, 2008, at 11:33 AM, Hilmar Lapp wrote:
>
> > On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:
> >
> >> I'm looking at a bioperl bug I filed a while back that deals with
> >> multiple species in a sequence file, such as found for AJ428955:
> >>
> >> ID AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
> >> XX
> >> AC AJ428955;
> >> XX
> >> DT 09-JUL-2002 (Rel. 72, Created)
> >> DT 15-APR-2005 (Rel. 83, Last updated, Version 4)
> >> XX
> >> DE Hepatitis GB virus B subgenomic replicon neoRepB
> >> XX
> >> KW core-neo fusion protein; core-neo gene; polyprotein.
> >> XX
> >> OS Hepatitis GB virus B
> >> OC Viruses; ssRNA positive-strand viruses, no DNA stage;
> >> Flaviviridae.
> >> XX
> >> OS Encephalomyocarditis virus
> >> OC Viruses; ssRNA positive-strand viruses, no DNA stage;
> >> Picornaviridae;
> >> OC Cardiovirus.
> >>
> >> ...
> >>
> >> We could probably add support in bioperl fairly easily (Bio::Seq
> >> could just return an array or the first species object based on
> >> context), but would BioSQL support sequences like this?
> >
> > No it wouldn't. There may only be one species (taxon) per sequence.
> >
> > There has been a lot of discussion about this in the past mostly
> > driven by the former SwissProt peculiarity of collapsing sequences
> > by sequence identity into a single record. We held out and
> > eventually UniProt dropped this practice.
>
> I'm unsure how often these pop up. The behavior of both EMBL and
> GenBank parsers assumes one species (as does Bio::Seq); the embl
> parser picks up both and just replaces the first with the second:
>
> ...
>
> DE Hepatitis GB virus B subgenomic replicon neoRepB
> XX
> KW core-neo fusion protein; core-neo gene; polyprotein.
> XX
>
> OS Encephalomyocarditis virus
> OC Viruses; ssRNA positive-strand viruses, no DNA stage;
> Picornaviridae;
> OC Cardiovirus.
> XX
> RN [1]
> ...
>
>
> > I guess we never quite decided what to do about chimeric sequences
> > like the above. Note that the GenBank record gives this differently:
> >
> > http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885
> >
> > Here, there's one taxon (ORGANISM line) reference, but two localized
> > 'source' features in the feature table. (I'm actually not 100% sure
> > what the genbank parser would do with this - i.e., whether the
> > second source feature will override the taxon_id found in the
> > first.) Because seqfeatures (in BioSQL) don't have a link to taxon,
> > you wouldn't be able to hit the sequence by its second (chimeric)
> > taxon if that were your query criteria (though you could store it
> > fine, and if you queried by dbxrefs of features of type 'source',
> > you would find it).
>
> The genbank parser gets the taxon and tax ID correct; I would think
> when it hit the next source feature key it would assign the wrong tax
> ID to the species object but maybe there's a secondary check. Both
> output the source in feature tables just fine.
>
>
> > At the end of the day, BioSQL will evolve (hopefully) quickly to
> > support what the Bio* toolkits support, and will be much slower to
> > change in ways that Bio* wouldn't be able to take advantage of
> > anyway. At least that's my current vision of it, and of course is up
> > for debate as to whether that's a useful vision as much as anything
> > else.
> >
> > So, as you say, right now BioPerl, and AFAIAA any of the other Bio*
> > toolkits, doesn't support more than one species per sequence, but as
> > soon as that changes, there's a clear need for BioSQL to follow along.
> >
> > Does that make sense?
> >
> > -hilmar
>
> Yes. I think we could add in support for multiple species fairly
> easily but I'll probably hold off on anything until after a 1.6
> release (i.e. push it to the next developer series, which gives us
> more time to think on how to implement this in a BioSQL-friendly way).
>
> chris
>
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
>
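
The overwrite behaviour described in the thread above (the EMBL parser sees both OS lines and keeps only the second) can be illustrated with a small sketch. This is Python, not BioPerl, and it is not the actual Bio::SeqIO code. The function names collect_species and last_species_only are hypothetical, and the record is abbreviated from the AJ428955 entry quoted above. It assumes each organism name fits on a single OS line.

```python
# Abbreviated EMBL-style record based on the AJ428955 excerpt in the thread.
RECORD = """\
ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
XX
OS   Hepatitis GB virus B
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
"""


def last_species_only(embl_text):
    """Mimic a single-species parser: each OS line replaces the previous one."""
    current = None
    for line in embl_text.splitlines():
        code, _, rest = line.partition(" ")
        if code == "OS":
            current = rest.strip()  # earlier organism is silently lost
    return current


def collect_species(embl_text):
    """Hypothetical alternative: keep every OS value in order of appearance."""
    species = []
    for line in embl_text.splitlines():
        code, _, rest = line.partition(" ")
        if code == "OS":
            species.append(rest.strip())
    return species


if __name__ == "__main__":
    print(last_species_only(RECORD))
    print(collect_species(RECORD))
```

Returning a list keeps both organisms of a chimeric record available. A caller that expects a single species could still take the first element, which is close to the context-dependent return suggested for Bio::Seq earlier in the thread.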