From p.j.a.cock at googlemail.com Tue May 24 07:36:57 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 24 May 2011 12:36:57 +0100
Subject: [BioSQL-l] Handling four strand states (as in GFF3) in BioSQL
Message-ID:

Dear all,

This email was triggered from a Biopython discussion about how to
represent the four strand states in GFF3 (+, -, ? and .) in Biopython
(we are using +1, -1, 0 and None). See e.g.

http://lists.open-bio.org/pipermail/biopython/2011-April/007194.html
http://lists.open-bio.org/pipermail/biopython/2011-May/007299.html

The GFF3 spec defines strand as follows, see:
http://www.sequenceontology.org/gff3.shtml

> Column 7: "strand"
> The strand of the feature. + for positive strand (relative to the
> landmark), - for minus strand, and . for features that are not
> stranded. In addition, ? can be used for features whose
> strandedness is relevant, but unknown.

The BioSQL schema uses a tiny int (not null) for the strand, so three
states -1, 0 and +1 are fine - but not a fourth state of not applicable
(which would map nicely to null).

Currently I presume all the BioSQL libraries use 0 in the BioSQL
database for anything other than a +1 or -1 strand, effectively
covering "non-stranded" and "stranded but unknown" in one group.

If we want to extend BioSQL to allow four strand states as in GFF3,
the simplest solution could be to allow null for this column. Then:

GFF3 "+" (forward) becomes +1 in BioSQL
GFF3 "-" (reverse) becomes -1 in BioSQL
GFF3 "?" (stranded but unknown) becomes 0 in BioSQL
GFF3 "." (not stranded) becomes NULL in BioSQL

On the other hand, this fine distinction is of limited utility. e.g.
For storing protein records in BioSQL, we can just continue to use
zero in the database as the feature strand.

Is this worth changing?
Peter
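
For illustration only, the four-state mapping proposed above could be
sketched in Python roughly as follows. The names GFF3_TO_BIOSQL,
gff3_strand_to_biosql and biosql_strand_to_gff3 are hypothetical and are
not part of Biopython or any BioSQL binding. Python's None stands in for
SQL NULL, and the allow_null=False branch assumes the presumed current
behaviour of storing 0 for anything that is not +1 or -1.

```python
from typing import Optional

# Proposed mapping from the email: GFF3 strand symbol -> BioSQL strand value.
# None stands in for SQL NULL (only storable if the column allows NULL).
GFF3_TO_BIOSQL = {"+": 1, "-": -1, "?": 0, ".": None}

# Reverse lookup: BioSQL strand value -> GFF3 strand symbol.
BIOSQL_TO_GFF3 = {value: symbol for symbol, value in GFF3_TO_BIOSQL.items()}


def gff3_strand_to_biosql(symbol: str, allow_null: bool = True) -> Optional[int]:
    """Convert a GFF3 column 7 strand symbol to a BioSQL strand value.

    With allow_null=False (strand column still NOT NULL), "." falls back
    to 0, which merges "not stranded" with "stranded but unknown".
    """
    if symbol not in GFF3_TO_BIOSQL:
        raise ValueError(f"Unknown GFF3 strand symbol: {symbol!r}")
    value = GFF3_TO_BIOSQL[symbol]
    if value is None and not allow_null:
        return 0
    return value


def biosql_strand_to_gff3(value: Optional[int]) -> str:
    """Convert a BioSQL strand value (or None for NULL) to a GFF3 symbol."""
    if value not in BIOSQL_TO_GFF3:
        raise ValueError(f"Unexpected BioSQL strand value: {value!r}")
    return BIOSQL_TO_GFF3[value]
```

With a nullable column the conversion round-trips for all four symbols.
With allow_null=False, a "." written to the database would be read back
as "?", which is exactly the loss of information discussed above.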