[Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL
Hilmar Lapp
hlapp at gmx.net
Sun May 17 11:21:59 EDT 2009
On May 17, 2009, at 8:40 AM, Peter wrote:
> On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>>
>> On May 16, 2009, at 7:28 PM, Peter wrote:
>>>> That could be changed to an XML string:
>>>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <gene_names>
>>>> <gene_name>
>>>> <Name>GC1QBP</Name>
>>>> <Synonyms>HABP1</Synonyms>
>>>> <Synonyms>SF2P32</Synonyms>
>>>> <Synonyms>C1QBP</Synonyms>
>>>> </gene_name>
>>>> </gene_names>
>>>>
>>>> Thinking about this we should attempt to coalesce around a standard
>>>> instead of forcing the other Bio* to a specific format.
>
> [...] Here you have mapped RecName and AltName fields in the DE lines to
> Name and Synonyms (shouldn't that be Synonym singular?).
The example is for the GN lines in SwissProt, not the DE lines.
> [...]
> On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>> Not necessarily. If you have a flat serialization (such as XML) the nested
>> structure isn't needed. Of course that's not a fully normalized relational
>> representation, but if you had one, how often would it be used, how
>> efficient would those queries be (SQL is poor at nested or recursive data
>> structures), and how much pain would it be to write the object-relational
>> mappings?
>
> In this example, searching the database using one of the SwissProt
> AltNames (synonyms), or filtering on the Flags sounds like a
> reasonable request - but this would be very difficult if the data is
> stored inside XML strings.
Actually, no. Modern full-text indexers (inside or outside the
database) can index XML text columns directly, and do it well. In
fact, in the last project for which I built full-text search (on top
of a BioSQL database), I did that by writing a custom XML document to a
separate table for each record I wanted indexed. Oracle's full-text
indexer did the rest. I also built a separate identifier/name/accession
index that pulled all the gene names, symbols, accession numbers,
identifiers, etc. into a single table for indexing.
What I mean is, a fully normalized relational representation,
especially if nested, is often not the most efficient data structure
for searching and filtering.
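To make the single-table name index concrete, here is a rough sketch in
Python. It is an illustration only, not the code from that project: it
uses an in-memory SQLite database as a stand-in for the real database,
the table name_index and the helper functions are hypothetical and not
part of the BioSQL schema, and the XML is the gene_names example quoted
above (minus the XML declaration).

```python
import sqlite3
import xml.etree.ElementTree as ET

# The gene_names example from this thread, stored as a flat XML string.
GENE_NAMES_XML = """<gene_names>
  <gene_name>
    <Name>GC1QBP</Name>
    <Synonyms>HABP1</Synonyms>
    <Synonyms>SF2P32</Synonyms>
    <Synonyms>C1QBP</Synonyms>
  </gene_name>
</gene_names>"""


def extract_names(xml_text):
    """Return (kind, name) pairs for every Name/Synonyms element."""
    root = ET.fromstring(xml_text)
    names = []
    for gene in root.iter("gene_name"):
        for child in gene:
            if child.tag in ("Name", "Synonyms") and child.text:
                names.append((child.tag, child.text.strip()))
    return names


def build_name_index(conn, records):
    """Flatten all names into one indexed table (hypothetical layout)."""
    conn.execute(
        "CREATE TABLE name_index (record_id INTEGER, kind TEXT, name TEXT)"
    )
    conn.execute("CREATE INDEX name_index_name ON name_index (name)")
    for record_id, xml_text in records:
        conn.executemany(
            "INSERT INTO name_index VALUES (?, ?, ?)",
            [(record_id, kind, name) for kind, name in extract_names(xml_text)],
        )
    conn.commit()


def find_records(conn, name):
    """Look up record ids by primary name or synonym."""
    rows = conn.execute(
        "SELECT DISTINCT record_id FROM name_index "
        "WHERE name = ? ORDER BY record_id",
        (name,),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
build_name_index(conn, [(1, GENE_NAMES_XML)])
print(find_records(conn, "HABP1"))
```

The point is that the lookup by synonym is a single indexed query
against the flat table, while the XML string remains the stored form.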
-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================