[BioSQL-l] multiple species for a sequence

Mark Schreiber markjschreiber at gmail.com
Wed Mar 5 02:06:17 UTC 2008


BioJava doesn't support multiple taxa per sequence.  It's something to
consider though.

Philosophically you really have to wonder about the meaning of species
when you have a chimera : )  Should it not be a hybrid species all on
its own?  I wonder what they will do when Craig Venter produces
Craigus ventus...

- Mark

On Mon, Mar 3, 2008 at 2:00 AM, Chris Fields <cjfields at uiuc.edu> wrote:
>
>
>  On Mar 2, 2008, at 11:33 AM, Hilmar Lapp wrote:
>
>  > On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:
>  >
>  >> I'm looking at a bioperl bug I filed a while back that deals with
>  >> multiple species in a sequence file, such as found for AJ428955:
>  >>
>  >> ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
>  >> XX
>  >> AC   AJ428955;
>  >> XX
>  >> DT   09-JUL-2002 (Rel. 72, Created)
>  >> DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
>  >> XX
>  >> DE   Hepatitis GB virus B subgenomic replicon neoRepB
>  >> XX
>  >> KW   core-neo fusion protein; core-neo gene; polyprotein.
>  >> XX
>  >> OS   Hepatitis GB virus B
>  >> OC   Viruses; ssRNA positive-strand viruses, no DNA stage;
>  >> Flaviviridae.
>  >> XX
>  >> OS   Encephalomyocarditis virus
>  >> OC   Viruses; ssRNA positive-strand viruses, no DNA stage;
>  >> Picornaviridae;
>  >> OC   Cardiovirus.
>  >>
>  >> ...
>  >>
>  >> We could probably add support in bioperl fairly easily (Bio::Seq
>  >> could just return an array or the first species object based on
>  >> context), but would BioSQL support sequences like this?
>  >
>  > No it wouldn't. There may only be one species (taxon) per sequence.
>  >
>  > There has been a lot of discussion about this in the past mostly
>  > driven by the former SwissProt peculiarity of collapsing sequences
>  > by sequence identity into a single record. We held out and
>  > eventually UniProt dropped this practice.
>
>  I'm unsure how often these pop up.  The behavior of both EMBL and
>  GenBank parsers assumes one species (as does Bio::Seq); the embl
>  parser picks up both and just replaces the first with the second:
>
>  ...
>
>  DE   Hepatitis GB virus B subgenomic replicon neoRepB
>  XX
>  KW   core-neo fusion protein; core-neo gene; polyprotein.
>  XX
>
>  OS   Encephalomyocarditis virus
>  OC   Viruses; ssRNA positive-strand viruses, no DNA stage;
>  Picornaviridae;
>  OC   Cardiovirus.
>  XX
>  RN   [1]
>  ...
>
>
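The overwrite described above (the second OS block replacing the first)
could be avoided by collecting organisms in a list. What follows is only
a minimal sketch in Python, not BioPerl's or BioJava's actual parser.
The function name parse_embl_organisms and the SAMPLE text are
hypothetical; the sample lines are re-joined from the record quoted in
this thread. It assumes every OS line starts a new organism, so it does
not handle OS names that continue over several lines.

```python
# Sketch: keep every OS/OC block of an EMBL record instead of the last one.
# Assumption: each "OS" line opens a new organism (no multi-line OS names).

SAMPLE = """\
DE   Hepatitis GB virus B subgenomic replicon neoRepB
XX
OS   Hepatitis GB virus B
OC   Viruses; ssRNA positive-strand viruses, no DNA stage;
OC   Flaviviridae.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
XX
"""


def parse_embl_organisms(lines):
    """Return a list of {'species': ..., 'classification': [...]} dicts."""
    organisms = []
    current = None
    for line in lines:
        # EMBL lines carry a two-letter code, three spaces, then the data.
        code, text = line[:2], line[5:].strip()
        if code == "OS":
            # Append rather than overwrite, so earlier organisms survive.
            current = {"species": text, "classification": []}
            organisms.append(current)
        elif code == "OC" and current is not None:
            # OC data is a ';'-separated lineage ending with a '.'.
            for part in text.split(";"):
                name = part.strip().rstrip(".")
                if name:
                    current["classification"].append(name)
    return organisms


if __name__ == "__main__":
    for organism in parse_embl_organisms(SAMPLE.splitlines()):
        print(organism["species"], "|", "; ".join(organism["classification"]))
```

A caller that only wants one species could still take the first list
element, which is close to the "array or first species by context" idea
mentioned earlier in the thread.
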
>  > I guess we never quite decided what to do about chimeric sequences
>  > like the above. Note that the GenBank record gives this differently:
>  >
>  > http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885
>  >
>  > Here, there's one taxon (ORGANISM line) reference, but two localized
>  > 'source' features in the feature table. (I'm actually not 100% sure
>  > what the genbank parser would do with this - i.e., whether the
>  > second source feature will override the taxon_id found in the
>  > first.) Because seqfeatures (in BioSQL) don't have a link to taxon,
>  > you wouldn't be able to hit the sequence by its second (chimeric)
>  > taxon if that were your query criteria (though you could store it
>  > fine, and if you queried by dbxrefs of features of type 'source',
>  > you would find it).
>
>  The genbank parser gets the taxon and tax ID correct; I would think
>  when it hit the next source feature key it would assign the wrong tax
>  ID to the species object but maybe there's a secondary check.  Both
>  output the source in feature tables just fine.
>
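To make the point about querying concrete: with one taxon per bioentry,
the second organism can only be reached through the dbxrefs of its
'source' feature. The sketch below uses Python's sqlite3 with a heavily
trimmed stand-in for the BioSQL tables. The table and column names are
meant to mirror the BioSQL schema but are simplified here, and storing a
taxon number directly in bioentry.taxon_id is an assumption made for
brevity. The taxon values 1111 and 2222 are placeholders, not real NCBI
taxon IDs, and the dbname 'taxon' is likewise an assumption.

```python
import sqlite3

# Simplified, hypothetical subset of BioSQL-like tables (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY,
                       accession TEXT, taxon_id INTEGER);
CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE seqfeature (seqfeature_id INTEGER PRIMARY KEY,
                         bioentry_id INTEGER, type_term_id INTEGER);
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY,
                     dbname TEXT, accession TEXT);
CREATE TABLE seqfeature_dbxref (seqfeature_id INTEGER, dbxref_id INTEGER);

-- One entry, linked to a single (placeholder) taxon.
INSERT INTO bioentry VALUES (1, 'AJ428955', 1111);
INSERT INTO term VALUES (1, 'source');
-- Two 'source' features, one per organism in the chimera.
INSERT INTO seqfeature VALUES (10, 1, 1), (11, 1, 1);
INSERT INTO dbxref VALUES (100, 'taxon', '1111'), (101, 'taxon', '2222');
INSERT INTO seqfeature_dbxref VALUES (10, 100), (11, 101);
""")


def entries_by_bioentry_taxon(conn, taxon_id):
    """Find entries through the single taxon link on bioentry."""
    rows = conn.execute(
        "SELECT accession FROM bioentry WHERE taxon_id = ?", (taxon_id,)
    ).fetchall()
    return [row[0] for row in rows]


def entries_by_source_taxon(conn, taxon_accession):
    """Find entries through dbxrefs attached to 'source' features."""
    rows = conn.execute(
        """
        SELECT DISTINCT b.accession
        FROM bioentry b
        JOIN seqfeature sf ON sf.bioentry_id = b.bioentry_id
        JOIN term t ON t.term_id = sf.type_term_id
        JOIN seqfeature_dbxref sfx ON sfx.seqfeature_id = sf.seqfeature_id
        JOIN dbxref x ON x.dbxref_id = sfx.dbxref_id
        WHERE t.name = 'source' AND x.dbname = 'taxon' AND x.accession = ?
        """,
        (taxon_accession,),
    ).fetchall()
    return [row[0] for row in rows]
```

In this toy data, looking up the second placeholder taxon through
bioentry finds nothing, while the lookup through the source-feature
dbxrefs returns the entry.
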
>
>  > At the end of the day, BioSQL will evolve (hopefully) quickly to
>  > support what the Bio* toolkits support, and will be much slower to
>  > change in ways that Bio* wouldn't be able to take advantage of
>  > anyway. At least that's my current vision of it, and of course is up
>  > for debate as to whether that's a useful vision as much as anything
>  > else.
>  >
>  > So, as you say, right now BioPerl, and AFAIAA any of the other Bio*
>  > toolkits, doesn't support more than one species per sequence, but as
>  > soon as that changes, there's a clear need for BioSQL to follow along.
>  >
>  > Does that make sense?
>  >
>  >       -hilmar
>
>  Yes.  I think we could add in support for multiple species fairly
>  easily but I'll probably hold off on anything until after a 1.6
>  release (i.e. push it to the next developer series, which gives us
>  more time to think on how to implement this in a BioSQL-friendly way).
>
>  chris
>
>
>  _______________________________________________
>  BioSQL-l mailing list
>  BioSQL-l at lists.open-bio.org
>  http://lists.open-bio.org/mailman/listinfo/biosql-l
>


