From biopython at maubp.freeserve.co.uk  Thu May 14 14:20:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 May 2009 19:20:47 +0100
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
Message-ID: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>

Hi,

This is cross-posted between biopython-dev and biosql-l as it regards
parsing the description (DE) lines in SwissProt files and how they are
stored in BioSQL.  This follows from an earlier discussion on
biopython-dev.

Older SwissProt files just had one or two DE lines, and it made sense
to treat this as a simple string mapped onto the description field in
the bioentry table in BioSQL.  This appears to be what happens with
BioPerl 1.5.x and in Biopython (although the details regarding white
space differ).  However, newer SwissProt files have many DE lines with
additional structure.  The example Michiel gave earlier on the
biopython-dev list was:

http://www.uniprot.org/uniprot/Q9XHP0.txt

This has the following DE lines:

DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;

I had to fight with perl to get my old copy of BioPerl working again
(some weak reference thing), but I managed, and then loaded this file
into my test BioSQL database with:

$ perl load_seqdatabase.pl --dbname biosql_test --dbuser root \
    --dbpass XXX --namespace biosql_test --format swiss Q9XHP0.txt

Then I looked at the resulting description in the main bioentry table:

$ mysql --user=root -p biosql_test -e 'SELECT description FROM
bioentry WHERE accession="Q9XHP0";'

This is stored as one huge long string (without the newlines, I'm not
sure if BioPerl strips those in parsing the file, or when loading it
into the database):

RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S
globulin seed storage protein II; AltName: Full=Alpha-globulin;
Contains: RecName: Full=11S globulin seed storage protein 2 acidic
chain; AltName: Full=11S globulin seed storage protein II acidic
chain; Contains: RecName: Full=11S globulin seed storage protein 2
basic chain; AltName: Full=11S globulin seed storage protein II basic
chain; Flags: Precursor;

For Biopython, I emptied the database then did:

>>> from Bio import SeqIO
>>> from BioSQL import BioSeqDatabase
>>> server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd = "XXX", host = "localhost", db="biosql_test")
>>> db = server["biosql-test"] #namespace
>>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss"))
1
>>> server.commit()

As before, I looked in the table with mysql.  Again - this stores the
full description from the DE line, although with the newlines
embedded.  So, Biopython is consistent with my old copy of BioPerl
(1.5.x) if we ignore the white space.
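
To make the "ignore the white space" comparison concrete, here is a
minimal sketch.  The two strings are shortened stand-ins for what each
toolkit stored (not copied from the database), and normalize_whitespace
is a hypothetical helper, not part of Biopython or BioPerl.

```python
def normalize_whitespace(description):
    """Collapse every run of whitespace (including newlines) to one space."""
    return " ".join(description.split())


# Shortened stand-ins for the stored descriptions (assumed, for illustration).
bioperl_style = (
    "RecName: Full=11S globulin seed storage protein 2; "
    "AltName: Full=Alpha-globulin;"
)
biopython_style = (
    "RecName: Full=11S globulin seed storage protein 2;\n"
    "AltName: Full=Alpha-globulin;"
)

# True when the two descriptions differ only in their white space.
same = normalize_whitespace(bioperl_style) == normalize_whitespace(biopython_style)
print(same)
```

Comparing the normalized forms sidesteps the question of which toolkit
strips the newlines, and at which stage.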

However, how does this look in BioPerl 1.6?  If this is the same, are
there any plans to change this?  For Biopython we have discussed
recording most of the DE information under the annotations instead
(keyed off RecName, AltName, Contains, Flags), but I would like to be
consistent with BioPerl+BioSQL.
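
As a rough illustration of what "keyed off RecName, AltName, Contains,
Flags" could look like, here is a sketch of grouping the DE lines above
by key.  The function name and the layout of the result are my
assumptions, not the current Biopython parser.  It only handles the
keys seen in this example; other DE constructs (Includes:, Short=, EC=
continuation lines) are silently ignored.

```python
def parse_de_lines(lines):
    """Group SwissProt DE lines by their leading key.

    Returns a dict with the top-level RecName/AltName/Flags values and a
    "Contains" list holding one dict per contained chain.
    """
    record = {"RecName": [], "AltName": [], "Flags": [], "Contains": []}
    current = record  # where RecName/AltName values are appended
    for line in lines:
        text = line[2:].strip()  # drop the leading "DE" line code
        if not text:
            continue
        key, _, value = text.partition(":")
        value = value.strip().rstrip(";")
        if key == "Contains":
            # Start a new sub-record; later names belong to this chain.
            current = {"RecName": [], "AltName": []}
            record["Contains"].append(current)
        elif key == "Flags":
            record["Flags"].extend(v.strip() for v in value.split(";"))
        elif key in ("RecName", "AltName"):
            current[key].append(value)
    return record


de_lines = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=11S globulin seed storage protein II;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II acidic chain;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II basic chain;",
    "DE   Flags: Precursor;",
]
record = parse_de_lines(de_lines)
print(record["RecName"], record["Flags"], len(record["Contains"]))
```

The "Full=" prefix is kept in the values here; whether to strip it is
part of what would need agreeing between the projects.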

Thanks

Peter

From biopython at maubp.freeserve.co.uk  Sat May 16 07:53:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 12:53:07 +0100
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
Message-ID: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>

Hi all,

You may recall a year ago or so, we talked about how BioPerl and
Biopython used lower case alphabet names ("dna", "rna", "protein")
while BioJava was inconsistent and used upper (or even mixed case).

http://lists.open-bio.org/pipermail/biopython/2007-November/003894.html
http://lists.open-bio.org/pipermail/biojava-l/2007-November/006034.html
http://lists.open-bio.org/pipermail/biosql-l/2008-March/001185.html

You'll notice that thread was split over several mailing lists (and
looking back, I think I missed some posts as I only read the Biopython
and BioSQL lists).

Anyway, this led to the following proposal:

http://www.biosql.org/wiki/Enhancement_Requests#Check_constraint_on_biosequence.alphabet

In Biopython we also use "unknown" for sequences which are not known
to be "dna", "rna", "protein".  I presume this was copying BioPerl.

In a recent bug report (Bug 2829) it was pointed out that we
(Biopython) don't attempt to record nucleotide alphabets in BioSQL
(i.e. a sequence which could be DNA or RNA but we don't know which),
they just get "unknown" as their biosequence.alphabet entry.

Is there any precedent in BioPerl, BioJava or BioRuby for how to
handle this?  If not, I'd like to introduce and agree on "nucleotide"
for this situation.

Peter

From biopython at maubp.freeserve.co.uk  Sat May 16 08:12:01 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 13:12:01 +0100
Subject: [BioSQL-l] BioSQL at BOSC 2009?
Message-ID: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>

Hi,

Will any of the key BioSQL people from the Bio* projects be at BOSC
(and ISMB) this year? http://open-bio.org/wiki/BOSC_2009

There will be several people from Biopython there this year, including
me and Brad Chapman who are both familiar with BioSQL.  This would be
a nice opportunity for further improving BioSQL compatibility between
the Bio* projects - something that has been suggested in the past,
e.g.

http://lists.open-bio.org/pipermail/biopython/2007-November/003893.html
http://lists.open-bio.org/pipermail/biojava-l/2007-November/006037.html

I don't follow the BioPerl, BioJava or BioRuby mailing lists - and I
doubt many of their developers follow the Biopython mailing lists.
So, rather than having any BioSQL compatibility discussions split over
individual Bio* project specific mailing lists, it seems using the
BioSQL mailing list is most appropriate.

I have CC'd a few key people just in case they are not on the BioSQL
mailing list, if I have missed anyone please forward this to them and
ask them to sign up.

Thanks,

Peter

From markjschreiber at gmail.com  Sat May 16 10:58:19 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sat, 16 May 2009 22:58:19 +0800
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <93b45ca50905160755o4e5c9520n55bc5b84774f277a@mail.gmail.com>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
	<93b45ca50905160755o4e5c9520n55bc5b84774f277a@mail.gmail.com>
Message-ID: <93b45ca50905160758j7c9f1d78k9ec49008d10f2e4f@mail.gmail.com>

I don't think you can do this with certainty. If you don't know the source
alphabet then an amino acid sequence could look like dna if it is only using
acgt and some of the ambiguity codes.

If it is a long sequence it will become increasingly unlikely it is amino
acid but never certain.
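
A back-of-envelope sketch of that argument, under a crude assumption
that is mine: residues are drawn independently and uniformly from the
20 standard amino acids.  A, C, G and T are all valid amino acid codes,
so the chance that every residue in a protein of length n falls in that
set is (4/20)^n.  The function name is hypothetical and ambiguity codes
are ignored.

```python
def prob_protein_looks_like_dna(length, dna_like=4, alphabet_size=20):
    """Chance that a random protein of this length uses only A, C, G, T.

    Assumes independent, uniformly distributed residues, which real
    proteins do not follow; treat the result as an order of magnitude.
    """
    return (dna_like / alphabet_size) ** length


for n in (5, 10, 50):
    print(n, prob_protein_looks_like_dna(n))
```

The probability shrinks geometrically with length but never reaches
zero, which is the "increasingly unlikely but never certain" point.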


From hlapp at gmx.net  Sat May 16 11:17:39 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 11:17:39 -0400
Subject: [BioSQL-l] BioSQL at BOSC 2009?
In-Reply-To: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>
References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>
Message-ID: <1BD503B3-D805-4882-87DD-820138792DB2@gmx.net>


On May 16, 2009, at 8:12 AM, Peter wrote:

> Will any of the key BioSQL people from the Bio* projects be at BOSC
> (and ISMB) this year? http://open-bio.org/wiki/BOSC_2009

Yes, I'll be there (though I am not presenting this year).

> [...] This would be a nice opportunity for further improving BioSQL  
> compatibility between the Bio* projects - something that has been  
> suggested in the past,

Indeed, excellent idea. Should we plan for a BoF?

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sat May 16 12:48:40 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 12:48:40 -0400
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
Message-ID: <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>


On May 16, 2009, at 7:53 AM, Peter wrote:

> In a recent bug report (Bug 2829) it was pointed out that we
> (Biopython) don't attempt to record nucleotide alphabets in BioSQL
> (i.e. a sequence which could be DNA or RNA but we don't know which),
> they just get "unknown" as their biosequence.alphabet entry.

I'm assuming that you do know that it's not protein, right? I.e.,  
assigning alphabet "unknown" isn't exactly right.

> Is there any precedent in BioPerl, BioJava or BioRuby for how to
> handle this?  If not, I'd like to introduce and agree on "nucleotide"
> for this situation.


So which letters (symbols) does the "nucleotide" alphabet contain?

Getting back to Mark's question, how do you know that it's either dna  
or rna but not protein? Is the problem that the user can't tell you  
whether it's dna or rna but they know it's not protein, or is it that  
the user doesn't say anything and all you have is the symbols of the  
sequence, which are a, c, g, and t only.

In BioPerl we'll guess the alphabet if the user doesn't say what it  
is, and at present if what we're seeing are the symbols a, c, g, and t  
only, then the guess is dna. If we're seeing u rather than t, we guess  
it's rna. An "unknown" alphabet would be for the user to expressly  
choose.
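
For readers who do not use BioPerl, the rule as described above can be
sketched in Python.  This is only an illustration of the two cases
stated in this message, not BioPerl's actual guessing code; the
function name is made up, and returning None for everything else is an
assumption of the sketch.

```python
def guess_alphabet(sequence):
    """Guess "dna" or "rna" from the symbols alone, as described above.

    Returns None when the symbols fit neither case; what a real toolkit
    does then is not covered by this sketch.
    """
    symbols = set(sequence.lower())
    if not symbols:
        return None
    if symbols <= set("acgt"):
        return "dna"  # only a, c, g, t seen
    if symbols <= set("acgu"):
        return "rna"  # u seen rather than t
    return None


print(guess_alphabet("ACGTTGCA"))
print(guess_alphabet("ACGUUGCA"))
```

Note that in this sketch a sequence with neither t nor u still falls
into the first branch, so it would be recorded as "dna".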

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Sat May 16 16:25:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 21:25:21 +0100
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
	<48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>
Message-ID: <320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com>

Hilmar wrote:
>  I'm assuming that you do know that it's not protein, right?
>  I.e., assigning alphabet "unknown" isn't exactly right.

Yes, if the sequence is using the generic nucleotide alphabet this
means it is NOT protein, and could be DNA or RNA.  So yes,
downgrading a "nucleotide" alphabet to just "unknown" when
storing it in BioSQL (as we do now) is losing information - hence
me starting this thread.

> > Is there any precedent in BioPerl, BioJava or BioRuby for how to
> > handle this?  If not, I'd like to introduce and agree on "nucleotide"
> > for this situation.
>
>  So which letters (symbols) does the "nucleotide" alphabet contain?

Potentially anything - although I would expect the standard (ambiguous)
letters used in RNA or DNA, plus perhaps gap symbols.

> Getting back to Mark's question, how do you know that it's either dna or
> rna but not protein?

We know because the user (or parser) has explicitly used the generic
nucleotide alphabet, this means it is not protein, and is either
DNA or RNA. From the point of loading the sequence into BioSQL,
we don't know or care where the sequence came from - we just get
given the data with a declared alphabet.

> Is the problem that the user can't tell you whether it's dna or
> rna but they know it's not protein, or is it that the user doesn't
> say anything and all you have is the symbols of the sequence,
> which are a, c, g, and t only.

In the situation I'm talking about, either the user has explicitly
picked the alphabet, or perhaps one of our parsers has done so.
This would be because the user doesn't know, or the file format
doesn't specify this information.  This is admittedly a corner
case - generally there will either be T or U entries in the
sequence so DNA or RNA can be deduced unambiguously.

> In BioPerl we'll guess the alphabet if the user doesn't say what it is, and
> at present if what we're seeing are the symbols a, c, g, and t only, then
> the guess is dna. If we're seeing u rather than t, we guess it's rna. An
> "unknown" alphabet would be for the user to expressly choose.

What would BioPerl do with the nucleotide sequence GCGCGCGA?
Presumably you guess, thus record either "dna" or "rna" in BioSQL,
so the issue of wanting to record "nucleotide" never arises.

In python "guessing" is discouraged.  If we have a nucleotide sequence
like GCGCGCGA, this could be DNA or RNA - you can't tell.  Our
nucleotide alphabet covers this situation, although another strong
reason for having it is as a common base class for the RNA and
DNA alphabets.

On 5/16/09, Mark Schreiber <markjschreiber at gmail.com> wrote:
> I don't think you can do this with certainty. If you don't know the source
> alphabet then an amino acid sequence could look like dna if it is only
> using acgt and some of the ambiguity codes.
>
> If it is a long sequence it will become increasingly unlikely it is amino
> acid but never certain.

The python answer is don't guess. If you read in a FASTA file with
Biopython it will by default be given a generic alphabet, unless you
explicitly specify otherwise (and in BioSQL the alphabet will be
stored as "unknown").  i.e. the onus is on the user to be explicit.

Peter

From biopython at maubp.freeserve.co.uk  Sat May 16 17:23:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 22:23:04 +0100
Subject: [BioSQL-l] BioSQL at BOSC 2009?
In-Reply-To: <1BD503B3-D805-4882-87DD-820138792DB2@gmx.net>
References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>
	<1BD503B3-D805-4882-87DD-820138792DB2@gmx.net>
Message-ID: <320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com>

On 5/16/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
>  On May 16, 2009, at 8:12 AM, Peter wrote:
>
> > Will any of the key BioSQL people from the Bio* projects be at BOSC
> > (and ISMB) this year? http://open-bio.org/wiki/BOSC_2009
> >
>
>  Yes, I'll be there (though I am not presenting this year).
>
> > [...] This would be a nice opportunity for further improving BioSQL
> > compatibility between the Bio* projects - something that has been
> > suggested in the past,
>
>  Indeed, excellent idea. Should we plan for a BoF?

If you want to do this as a formal BoF, then sure.

Brad and I (plus other Biopython folk like Tiago and Bartek, who I
believe are not so interested in BioSQL) are already talking about a
Biopython BoF/hackathon session at BOSC. It would be easier if that
didn't overlap with a BioSQL session ;)  (but not impossible - Brad
and I can perhaps split our time?)

I will be staying for all of ISMB, and I think Brad is about for the
Monday and maybe Tuesday, so that might be an alternative for
scheduling.

Peter

From hlapp at gmx.net  Sat May 16 17:57:15 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 17:57:15 -0400
Subject: [BioSQL-l] BioSQL at BOSC 2009?
In-Reply-To: <320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com>
References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>
	<1BD503B3-D805-4882-87DD-820138792DB2@gmx.net>
	<320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com>
Message-ID: <74D4CC78-FC7B-4595-9D24-EB6B3ED43318@gmx.net>


On May 16, 2009, at 5:23 PM, Peter wrote:

> I will be staying for all of ISMB


I am too. Should we doodle something once the program is out?

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sat May 16 18:10:43 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 18:10:43 -0400
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
	<48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>
	<320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com>
Message-ID: <9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net>

I think we'll have to define carefully what we mean by "generic  
nucleotide alphabet". (Normally I hear nucleotide used as the type of  
a sequence, but not its alphabet.)

A nucleotide alphabet in the way you describe it also can't really be  
the "base class" for either a DNA or RNA alphabet, can it? Typically  
in OOP, derived classes expand on a base class, not restrict it. So  
isn't there potential for confusion?

What you are essentially talking about is the case when a sequence  
contains only A, C, and G. In that case, we don't know either that  
it's not protein, do we?

> [...] In python "guessing" is discouraged.  If we have a nucleotide  
> sequence
> like GCGCGCGA, this could be DNA or RNA - you can't tell.

And how do you tell it's nucleotide to begin with?

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sat May 16 18:34:57 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 18:34:57 -0400
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
Message-ID: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>

Don't you love SwissProt (or UniProt as we must call it now I  
suppose). They (understandably) try to squeeze ever more annotation  
into the existing tags, rather than adding new tags.

So, of the following structure:

DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;

really only the first line, with the 'RecName: Full=' removed, is the  
description line as we know it. The rest, I would say, is annotation,  
such as two alternative names, amino acid chains contained in the full  
record (shouldn't this be feature annotation, really? and indeed it is  
- why it needs to be repeated here is beyond me) and their names as  
well as alternative names, and the fact that the sequence is a  
precursor form.

Leaving all this in one string has the advantage that we can round- 
trip it (and there is probably hardly any other way to accomplish  
that), but clearly in terms of semantics this isn't the sequence  
description as we know it anymore.

Does anyone else think too that completely changing the semantics of  
sequence annotation fields is a bad idea? <sigh/>

My inclination from a BioPerl perspective is to extract the part  
following 'RecName: Full=' as the description, and attach the rest as  
annotation. We could in fact use the TagTree class for this. I'm cross- 
posting to BioPerl too to gather what other BioPerl'ers think about  
this.
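
To illustrate the split proposed above (in Python rather than Perl, and
without TagTree), here is a minimal sketch.  It works on the
one-long-string form and assumes that fields are separated by ";" and
that no name contains a ";" itself.  The function name is hypothetical.

```python
def split_description(de_string):
    """Return (description, remaining_fields) from a joined DE string.

    The description is the text after the first top-level
    'RecName: Full='; every other field is returned untouched so it can
    be attached as annotation.
    """
    prefix = "RecName: Full="
    fields = [f.strip() for f in de_string.split(";") if f.strip()]
    description = None
    rest = []
    for field in fields:
        if description is None and field.startswith(prefix):
            description = field[len(prefix):]
        else:
            rest.append(field)
    return description, rest


de_string = (
    "RecName: Full=11S globulin seed storage protein 2; "
    "AltName: Full=11S globulin seed storage protein II; "
    "AltName: Full=Alpha-globulin; "
    "Contains: RecName: Full=11S globulin seed storage protein 2 acidic chain; "
    "Flags: Precursor;"
)
description, rest = split_description(de_string)
print(description)
print(rest)
```

The nested RecName entries start with "Contains:" in the joined string,
so they stay with the annotation rather than being taken as the
description.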

	-hilmar


-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Sat May 16 19:06:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 00:06:41 +0100
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
	<48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>
	<320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com>
	<9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net>
Message-ID: <320fb6e00905161606l5fdb0862mf25a45dad07dac8@mail.gmail.com>

On 5/16/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
>  I think we'll have to define carefully what we mean by "generic nucleotide
> alphabet". (Normally I hear nucleotide used as the type of a sequence, but
> not its alphabet.)

In Biopython the type of a sequence (e.g. DNA, RNA or Protein) is
recorded by an alphabet object (which may also record the expected
range of letters).

>  A nucleotide alphabet in the way you describe it also can't really be the
> "base class" for either a DNA or RNA alphabet, can it? Typically in OOP,
> derived classes expand on a base class, not restrict it. So isn't there
> potential for confusion?

Well, that's how it was done for the Biopython alphabet classes.
I'm simplifying slightly, but at the top level we have a generic
alphabet, which has as children generic protein and generic
nucleotide (which has as its children generic dna and generic
rna).  Each of these then has IUPAC subclasses which are further
restrictions where the valid letters are prescribed.
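
A stripped-down sketch of that hierarchy, using stand-in classes
defined here rather than the real Biopython ones (the names are chosen
to resemble them, but the details are simplified).  The mapping
function and its allow_nucleotide switch are hypothetical; they show
what is stored today versus what the proposed "nucleotide" value would
change.

```python
class Alphabet:
    """Generic alphabet: nothing is known about the sequence type."""


class ProteinAlphabet(Alphabet):
    """Generic protein."""


class NucleotideAlphabet(Alphabet):
    """Generic nucleotide: DNA or RNA, but not protein."""


class DNAAlphabet(NucleotideAlphabet):
    """Generic DNA."""


class RNAAlphabet(NucleotideAlphabet):
    """Generic RNA."""


class IUPACUnambiguousDNA(DNAAlphabet):
    """A further restriction that also fixes the valid letters."""

    letters = "GATC"


def biosql_alphabet(alphabet, allow_nucleotide=False):
    """Map an alphabet object to a biosequence.alphabet string."""
    if isinstance(alphabet, ProteinAlphabet):
        return "protein"
    if isinstance(alphabet, DNAAlphabet):
        return "dna"
    if isinstance(alphabet, RNAAlphabet):
        return "rna"
    if allow_nucleotide and isinstance(alphabet, NucleotideAlphabet):
        return "nucleotide"  # the value proposed in this thread
    return "unknown"


print(biosql_alphabet(IUPACUnambiguousDNA()))
print(biosql_alphabet(NucleotideAlphabet()))
print(biosql_alphabet(NucleotideAlphabet(), allow_nucleotide=True))
```

Here each subclass narrows what a sequence may be, so an IUPAC DNA
alphabet is still a nucleotide alphabet, while a generic nucleotide
alphabet is downgraded to "unknown" unless the extra value is allowed.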

> What you are essentially talking about is the case when a sequence
> contains only A, C, and G. In that case, we don't know either that
> it's not protein, do we?
>
> > [...] In python "guessing" is discouraged.  If we have a nucleotide
> > sequence like GCGCGCGA, this could be DNA or RNA - you can't
> > tell.
>
> And how do you tell it's nucleotide to begin with?

That is the whole point.  When deciding what to record in the
biosequence.alphabet field in BioSQL we (Biopython) can only
go by the alphabet associated with the sequence object.
Whoever created the sequence specified the alphabet based
on meta data, external knowledge, or guessed. If this was
done by a parser, then the file format itself may have
specified the sequence type.

If none of BioPerl, BioJava and BioRuby have an analogous
sequence representation for a nucleotide sequence which
might be DNA or RNA, then perhaps the current situation
with only "protein", "dna", "rna" and "unknown" in the
biosequence.alphabet field in BioSQL is sufficient.

Peter

From biopython at maubp.freeserve.co.uk  Sat May 16 19:14:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 00:14:54 +0100
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
Message-ID: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com>

On 5/16/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> Don't you love SwissProt (or UniProt as we must call it now I suppose).
> They (understandably) try to squeeze ever more annotation into the existing
> tags, rather than adding new tags.
>
>  So, of the following structure:
>
>  DE   RecName: Full=11S globulin seed storage protein 2;
>  DE   AltName: Full=11S globulin seed storage protein II;
>  DE   AltName: Full=Alpha-globulin;
>  DE   Contains:
>  DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
>  DE     AltName: Full=11S globulin seed storage protein II acidic chain;
>  DE   Contains:
>  DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
>  DE     AltName: Full=11S globulin seed storage protein II basic chain;
>  DE   Flags: Precursor;
>
>  really only the first line, with the 'RecName: Full=' removed, is the
> description line as we know it. The rest, I would say, is annotation, such
> as two alternative names, amino acid chains contained in the full record
> (shouldn't this be feature annotation, really? and indeed it is - why it
> needs to be repeated here is beyond me) and their names as well as
> alternative names, and the fact that the sequence is a precursor form.
>
>  Leaving all this in one string has the advantage that we can round-trip it
> (and there is probably hardly any other way to accomplish that), but clearly
> in terms of semantics this isn't the sequence description as we know it
> anymore.
>
>  Does anyone else think too that completely changing the semantics of
> sequence annotation fields is a bad idea? <sigh/>

+1
That's pretty much what I thought on seeing this the first time.

>  My inclination from a BioPerl perspective is to extract the part following
> 'RecName: Full=' as the description, and attach the rest as annotation. We
> could in fact use the TagTree class for this. I'm cross-posting to BioPerl
> too to gather what other BioPerl'ers think about this.

Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x, just
treats the DE lines as one big long string?

Could you translate your idea about the TagTree class into something
concrete with BioSQL tables and fields for me? I'm not familiar with
the TagTree (or Perl).

Over on the Biopython list we'd talked about storing this annotation in
a nested structure.  However, in order to use the BioSQL annotations
mechanisms, I think a simple flat structure is required :(
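
To show what I mean by flat, here is a sketch that turns a nested
structure into (term, value, rank) rows, loosely modelled on the
term/value/rank columns of the bioentry_qualifier_value table.  The
dotted term names such as "Contains.RecName", the shortened example
values and the ranking scheme are all my assumptions for illustration.

```python
def flatten_annotation(nested):
    """Flatten nested DE annotation into (term, value, rank) rows.

    Rank counts from 1 separately for each term name, in input order.
    """
    rows = []
    counters = {}

    def walk(node, prefix):
        for key, values in node.items():
            term = prefix + key
            for value in values:
                if isinstance(value, dict):
                    walk(value, term + ".")  # descend into a sub-record
                else:
                    counters[term] = counters.get(term, 0) + 1
                    rows.append((term, value, counters[term]))

    walk(nested, "")
    return rows


nested = {
    "RecName": ["Full=A"],
    "AltName": ["Full=B", "Full=C"],
    "Contains": [
        {"RecName": ["Full=A acidic chain"]},
        {"RecName": ["Full=A basic chain"]},
    ],
    "Flags": ["Precursor"],
}
for row in flatten_annotation(nested):
    print(row)
```

The catch is that the grouping survives only through the rank order: if
one contained chain had an AltName and another did not, the flat rows
would no longer say which chain the AltName belongs to.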

Peter

From cjfields at illinois.edu  Sat May 16 19:16:05 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Sat, 16 May 2009 18:16:05 -0500
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
Message-ID: <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>


On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote:

> Don't you love SwissProt (or UniProt as we must call it now I  
> suppose). They (understandably) try to squeeze ever more annotation  
> into the existing tags, rather than adding new tags.
>
> So, of the following structure:
>
> DE   RecName: Full=11S globulin seed storage protein 2;
> DE   AltName: Full=11S globulin seed storage protein II;
> DE   AltName: Full=Alpha-globulin;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
> DE     AltName: Full=11S globulin seed storage protein II acidic  
> chain;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
> DE     AltName: Full=11S globulin seed storage protein II basic chain;
> DE   Flags: Precursor;
>
> really only the first line, with the 'RecName: Full=' removed, is  
> the description line as we know it. The rest, I would say, is  
> annotation, such as two alternative names, amino acid chains  
> contained in the full record (shouldn't this be feature annotation,  
> really? and indeed it is - why it needs to be repeated here is  
> beyond me) and their names as well as alternative names, and the  
> fact that the sequence is a precursor form.
>
> Leaving all this in one string has the advantage that we can round- 
> trip it (and there is probably hardly any other way to accomplish  
> that), but clearly in terms of semantics this isn't the sequence  
> description as we know it anymore.
>
> Does anyone else think too that completely changing the semantics of  
> sequence annotation fields is a bad idea? <sigh/>
>
> My inclination from a BioPerl perspective is to extract the part  
> following 'RecName: Full=' as the description, and attach the rest  
> as annotation. We could in fact use the TagTree class for this. I'm  
> cross-posting to BioPerl too to gather what other BioPerl'ers think  
> about this.
>
> 	-hilmar

This is much like the GN issues we've run into before, and we *could*  
set this up using TagTree or similar.  In the latter case of gene name  
the data is stored in a text tree as follows:

gene_names:
   gene_name:
     Name: GC1QBP
     Synonyms: HABP1
     Synonyms: SF2P32
     Synonyms: C1QBP

That could be changed to an XML string:

<?xml version="1.0" encoding="UTF-8"?>
<gene_names>
   <gene_name>
     <Name>GC1QBP</Name>
     <Synonyms>HABP1</Synonyms>
     <Synonyms>SF2P32</Synonyms>
     <Synonyms>C1QBP</Synonyms>
   </gene_name>
</gene_names>

Thinking about this, we should attempt to coalesce around a standard
instead of forcing the other Bio* projects into a specific format.

chris

From biopython at maubp.freeserve.co.uk  Sat May 16 19:28:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 00:28:43 +0100
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
Message-ID: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>

On 5/17/09, Chris Fields <cjfields at illinois.edu> wrote:
>
> On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote:
> > My inclination from a BioPerl perspective is to extract the part following
> > 'RecName: Full=' as the description, and attach the rest as annotation. We
> > could in fact use the TagTree class for this. I'm cross-posting to BioPerl
> > too to gather what other BioPerl'ers think about this.
> >
> >        -hilmar
> >
>
> This is much like the GN issues we've run into before, and we *could* set
> this up using TagTree or similar.  In the latter case of gene name the data
> is stored in a text tree as follows:
>
>  gene_names:
>   gene_name:
>     Name: GC1QBP
>     Synonyms: HABP1
>     Synonyms: SF2P32
>     Synonyms: C1QBP
>
>  That could be changed to an XML string:
>
>  <?xml version="1.0" encoding="UTF-8"?>
>  <gene_names>
>   <gene_name>
>     <Name>GC1QBP</Name>
>     <Synonyms>HABP1</Synonyms>
>     <Synonyms>SF2P32</Synonyms>
>     <Synonyms>C1QBP</Synonyms>
>   </gene_name>
>  </gene_names>
>
> Thinking about this we should attempt to coalesce around a standard instead
> of forcing the other Bio*  to a specific format.

How would you record this in BioSQL?  As an XML string for an annotation value?

Brad has suggested JSON might be useful for this kind of thing (see
also per-letter-annotation discussion).

Peter

From hlapp at gmx.net  Sat May 16 19:37:14 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 19:37:14 -0400
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
Message-ID: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>


On May 16, 2009, at 7:28 PM, Peter wrote:

>> That could be changed to an XML string:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <gene_names>
>>  <gene_name>
>>    <Name>GC1QBP</Name>
>>    <Synonyms>HABP1</Synonyms>
>>    <Synonyms>SF2P32</Synonyms>
>>    <Synonyms>C1QBP</Synonyms>
>>  </gene_name>
>> </gene_names>
>>
>> Thinking about this we should attempt to coalesce around a standard  
>> instead
>> of forcing the other Bio*  to a specific format.
>
> How would you record this in BioSQL?  As an XML string for an  
> annotation value?

Yes. A TagTree object can be serialized to XML, and the XML can be  
stored as the annotation value in BioSQL. As the XML can be read back  
in, it allows full round-tripping.
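
As a rough illustration of the round trip described above (a sketch only,
not the TagTree API): the Python below uses the standard library's
xml.etree.ElementTree and a plain list of dicts as a stand-in for the
nested structure. The function names tree_to_xml and xml_to_tree are
made up for this example; the element names follow the gene name XML
quoted earlier in the thread.

```python
import xml.etree.ElementTree as ET


def tree_to_xml(gene_names):
    """Serialize a list of gene-name dicts to a single XML string."""
    root = ET.Element("gene_names")
    for entry in gene_names:
        node = ET.SubElement(root, "gene_name")
        ET.SubElement(node, "Name").text = entry["Name"]
        for synonym in entry["Synonyms"]:
            ET.SubElement(node, "Synonyms").text = synonym
    # encoding="unicode" returns a str, suitable for a text column
    return ET.tostring(root, encoding="unicode")


def xml_to_tree(xml_string):
    """Parse the XML string back into the same list-of-dicts structure."""
    root = ET.fromstring(xml_string)
    return [
        {
            "Name": node.findtext("Name"),
            "Synonyms": [s.text for s in node.findall("Synonyms")],
        }
        for node in root.findall("gene_name")
    ]


gene_names = [{"Name": "GC1QBP", "Synonyms": ["HABP1", "SF2P32", "C1QBP"]}]
xml_value = tree_to_xml(gene_names)   # this string would be the annotation value
restored = xml_to_tree(xml_value)     # reading it back recovers the structure
```

The point is only that one string value can carry the nested data and be
parsed back without loss; any agreed serialization would work the same way.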

> Brad has suggested JSON might be useful for this kind of thing (see
> also per-letter-annotation discussion).

JSON could be another serialization format, but XML is equally or  
better supported in all languages except JavaScript. Furthermore, you  
could just send the XML to the browser and have an XSLT (either  
directly, or indirectly through JavaScript doing the transformation)  
do the rendering.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sat May 16 19:42:17 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 19:42:17 -0400
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com>
Message-ID: <8CD4EED1-A689-447F-8F6E-8D2204DD4E86@gmx.net>


On May 16, 2009, at 7:14 PM, Peter wrote:

> Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x,
> just treats the DE lines as one big long string?

Yes.

> Could you translate your idea about the TagTree class into something
> concrete with BioSQL tables and fields for me? [...] Over on the
> Biopython list we'd talked about storing this annotation in a nested
> structure.

That's more or less what TagTree is.

>  However, in order to use the BioSQL annotation mechanisms, I think
> a simple flat structure is required :(

Not necessarily. If you have a flat serialization (such as XML) the  
nested structure isn't needed. Of course that's not a fully normalized  
relational representation, but if you had one, how often would it be  
used, how efficient would those queries be (SQL is poor at nested or  
recursive data structures), and how much pain would it be to write the  
object-relational mappings?

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Sun May 17 08:40:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 13:40:47 +0100
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
	<0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
Message-ID: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>

On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
>  On May 16, 2009, at 7:28 PM, Peter wrote:
> > > That could be changed to an XML string:
> > >
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <gene_names>
> > >  <gene_name>
> > >   <Name>GC1QBP</Name>
> > >   <Synonyms>HABP1</Synonyms>
> > >   <Synonyms>SF2P32</Synonyms>
> > >   <Synonyms>C1QBP</Synonyms>
> > >  </gene_name>
> > > </gene_names>
> > >
> > > Thinking about this we should attempt to coalesce around a standard
> > > instead of forcing the other Bio*  to a specific format.

Absolutely - some common standard should be agreed.

Would you envision doing this for other structured fields, inventing a
new mini XML format each time?  That seems open ended and likely to
cause a lot of work keeping all the Bio* projects synchronised.

Here you have mapped RecName and AltName fields in the DE lines to
Name and Synonyms (shouldn't that be Synonym singular?).  I also don't
get why you have used a gene_name entry inside a gene_names list.
Would you hold the contains information and the flags information from
the DE lines in separate XML entries?

I would have gone for something much closer to the original DE line
markup i.e. using the field names UniProt use, RecName and AltName,
rather than mapping these to Name and Synonym.

> > How would you record this in BioSQL?  As an XML string for an annotation
> > value?
>
> Yes. A TagTree object can be serialized to XML, and the XML can be stored
> as the annotation value in BioSQL. As the XML can be read back in, it allows
> full round-tripping.

Assuming you stored all the DE markup, then yes, a round trip back to
the SwissProt file could be possible.  And, depending on the details
of the XML structure used, it would be possible to represent this in a
python structure too.

> > Brad has suggested JSON might be useful for this kind of thing (see
> > also per-letter-annotation discussion).
>
> JSON could be another serialization format, but XML is equally or better
> supported in all languages except JavaScript. Furthermore, you could just
> send the XML to the browser and have an XSLT (either directly, or indirectly
> through JavaScript doing the transformation) do the rendering.

I have no strong preference for either XML or JSON (but would rather
avoid them if they are not really needed).  For other types of
annotation there may be a clearer advantage for one over the other,
e.g. per letter annotation like the secondary structure of a protein
sequence, or the quality scores of a nucleotide contig.

On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
> Not necessarily. If you have a flat serialization (such as XML) the nested
> structure isn't needed. Of course that's not a fully normalized relational
> representation, but if you had one, how often would it be used, how
> efficient would those queries be (SQL is poor at nested or recursive data
> structures), and how much pain would it be to write the object-relational
> mappings?

In this example, searching the database using one of the SwissProt
AltNames (synonyms), or filtering on the Flags sounds like a
reasonable request - but this would be very difficult if the data is
stored inside XML strings.

Of course, because the RecName and AltName entries are top level, we
could just record them as normal - simple strings in the annotations
table.  This seems much nicer.  Likewise the "Flags: Precursor;" line,
i.e. listing the tag/value pairs which could be used in the
bioentry_qualifier_value table:

AltName = "Full=11S globulin seed storage protein II"
AltName = "Full=Alpha-globulin"
Flags = "Precursor"

(the RecName field, "Full=11S globulin seed storage protein 2", could
be used for the bioentry.description instead)

The above are all pretty easy.  We only need to consider nesting (or
something like XML or JSON) for some of the DE information, in the
example discussed the Contains lines.  Even this could be done by
storing each contains entry as a single long string (holding both
the name and synonyms) directly from the DE line itself, something
like this:

Contains = "RecName: Full=11S globulin seed storage protein 2 acidic
chain;\nAltName: Full=11S globulin seed storage protein II acidic
chain;"
Contains = "RecName: Full=11S globulin seed storage protein 2 basic
chain;\nAltName: Full=11S globulin seed storage protein II basic
chain;"
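
To make the flat mapping above concrete, here is a minimal Python sketch
(not existing Biopython or BioPerl code). It assumes top-level DE lines
start with "DE" plus three spaces and nested lines are indented further,
as in the Q9XHP0 example; the function name split_de_lines is invented
for illustration. The first RecName value is returned as the description
and everything else as (tag, value) pairs, with each Contains block kept
as one newline-joined string.

```python
def split_de_lines(lines):
    """Return (description, pairs) from new-style SwissProt DE lines."""
    description = None
    pairs = []        # flat (tag, value) tuples, in file order
    contains = None   # lines of the Contains block currently being read

    def close_contains():
        nonlocal contains
        if contains is not None:
            pairs.append(("Contains", "\n".join(contains)))
            contains = None

    for line in lines:
        if not line.startswith("DE   "):
            continue
        body = line[5:].rstrip()
        nested = body.startswith(" ")   # extra indent marks a nested line
        body = body.strip()
        if body == "Contains:":
            close_contains()
            contains = []
        elif nested and contains is not None:
            contains.append(body)       # kept verbatim, semicolon included
        else:
            close_contains()
            tag, _, value = body.partition(": ")
            value = value.rstrip(";")
            if tag == "RecName" and description is None:
                description = value
            else:
                pairs.append((tag, value))
    close_contains()
    return description, pairs


de_lines = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=11S globulin seed storage protein II;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II acidic chain;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II basic chain;",
    "DE   Flags: Precursor;",
]
description, pairs = split_de_lines(de_lines)
```

Whether the "Full=" prefix should be stripped from the description is left
open here; the sketch keeps the value exactly as it appears in the file.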

Peter

From sanjay.harke at gmail.com  Sun May 17 09:17:14 2009
From: sanjay.harke at gmail.com (Sanjay Harke)
Date: Sun, 17 May 2009 18:47:14 +0530
Subject: [BioSQL-l] BioSQL-l Digest, Vol 62, Issue 3
In-Reply-To: <mailman.15416.1242564062.2782.biosql-l@lists.open-bio.org>
References: <mailman.15416.1242564062.2782.biosql-l@lists.open-bio.org>
Message-ID: <31bb4380905170617k47951f83ia5bed32577a02956@mail.gmail.com>

Dear Peter,

Could you kindly guide me on connecting BioSQL to BioPerl?

sanjay

From hlapp at gmx.net  Sun May 17 10:56:29 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 17 May 2009 10:56:29 -0400
Subject: [BioSQL-l] BioSQL-l Digest, Vol 62, Issue 3
In-Reply-To: <31bb4380905170617k47951f83ia5bed32577a02956@mail.gmail.com>
References: <mailman.15416.1242564062.2782.biosql-l@lists.open-bio.org>
	<31bb4380905170617k47951f83ia5bed32577a02956@mail.gmail.com>
Message-ID: <B83D0C6C-9728-4DBA-A07D-E274E504A795@gmx.net>


http://dx.doi.org/10.1038/npre.2007.1233.1

On May 17, 2009, at 9:17 AM, Sanjay Harke wrote:

> Dear peter,
>
> Kindly guide me for developing the connectivity of BioSql to Bioperl?
>
> sanjay

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sun May 17 11:21:59 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 17 May 2009 11:21:59 -0400
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
	<0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
	<320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
Message-ID: <A8AB4BCB-9CD3-428D-AF10-899AD8055EC7@gmx.net>


On May 17, 2009, at 8:40 AM, Peter wrote:

> On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>>
>> On May 16, 2009, at 7:28 PM, Peter wrote:
>>>> That could be changed to an XML string:
>>>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <gene_names>
>>>> <gene_name>
>>>>  <Name>GC1QBP</Name>
>>>>  <Synonyms>HABP1</Synonyms>
>>>>  <Synonyms>SF2P32</Synonyms>
>>>>  <Synonyms>C1QBP</Synonyms>
>>>> </gene_name>
>>>> </gene_names>
>>>>
>>>> Thinking about this we should attempt to coalesce around a standard
>>>> instead of forcing the other Bio*  to a specific format.
>
> [...] Here you have mapped RecName and AltName fields in the DE  
> lines to
> Name and Synonyms (shouldn't that be Synonym singular?).

The example is for the GN lines in SwissProt, not the DE lines.

> [...]
> On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>> Not necessarily. If you have a flat serialization (such as XML) the  
>> nested
>> structure isn't needed. Of course that's not a fully normalized  
>> relational
>> representation, but if you had one, how often would it be used, how
>> efficient would those queries be (SQL is poor at nested or  
>> recursive data
>> structures), and how much pain would it be to write the object- 
>> relational
>> mappings?
>
> In this example, searching the database using one of the SwissProt
> AltNames (synonyms), or filtering on the Flags sounds like a
> reasonable request - but this would be very difficult if the data is
> stored inside XML strings.

Actually no. Modern full-text indexers (inside or outside the  
database) can index XML text columns right away and very well. In  
fact, for the last project that I built a full-text search for (on top  
of a BioSQL database) I did that by writing custom XML documents to a  
separate table for each record I wanted indexed. Oracle's full text  
indexer did the rest. I also built a separate identifier/name/ 
accession index that pulled all the gene names, symbols, accession  
numbers, identifiers etc into a single table for indexing.

What I mean is, a fully normalized relational representation,
especially if nested, is often not the most efficient data structure
for searching and filtering.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Mon May 18 06:03:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 18 May 2009 11:03:52 +0100
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <320fb6e00905161606l5fdb0862mf25a45dad07dac8@mail.gmail.com>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
	<48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>
	<320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com>
	<9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net>
	<320fb6e00905161606l5fdb0862mf25a45dad07dac8@mail.gmail.com>
Message-ID: <320fb6e00905180303m19d0c6e0hdc22ff550e518c6c@mail.gmail.com>

On Sun, May 17, 2009 at 12:06 AM, Peter wrote:
> If none of BioPerl, BioJava and BioRuby have an analogous
> sequence representation for a nucleotide sequence which
> might be DNA or RNA, then perhaps the current situation
> with only "protein", "dna", "rna" and "unknown" in the
> biosequence.alphabet field in BioSQL is sufficient.

The original Biopython bug reporter (Bug 2829, David Wyllie)
has replied on the bug.  In his case, rather than using the
generic nucleotide alphabet, he can be a bit more explicit
since he does actually know his sequence is DNA, and this
does get recorded in BioSQL fine.

Given the "nucleotide" alphabet is a corner case in Biopython,
and has no analogue in BioPerl, the status quo is fine. i.e.
The biosequence.alphabet field should contain "dna", "rna",
"protein" or "unknown" (in lower case).

Thanks for your thoughts everyone.

Peter

From michael.watson at bbsrc.ac.uk  Mon May 18 08:45:19 2009
From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C))
Date: Mon, 18 May 2009 13:45:19 +0100
Subject: [BioSQL-l] Full text indexing/Searching in MySQL
Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>

Hi

 
Has anyone implemented full text indexing/searching for BioSQL in MySQL,
either using MySQL's full text features or any other solution?

 
Any tips, advice, documentation, code etc available?

 
Thanks

Mick

 
Head of Bioinformatics
Institute for Animal Health
Compton
Berks
RG20 7NN
01635 578411 

 
Please consider the environment and don't print this e-mail unless you
really need to.

The information contained in this message may be confidential or legally
privileged and is intended solely for the addressee. If you have
received this message in error please delete it & notify the originator
immediately.  Unauthorised use, disclosure, copying or alteration of
this message is forbidden & may be unlawful.  The contents of this
e-mail are the views of the sender and do not necessarily represent the
views of the Institute.   This email, and associated attachments, has
been checked locally for viruses but we can accept no responsibility
once it has left our systems.  Communications on Institute computers are
monitored to secure the effective operation of the systems and for other
lawful purposes.

 
The Institute for Animal Health is a company limited by guarantee,
registered in England no. 559784.  

The Institute is also a registered charity, Charity Commissioners
Reference No. 228824

 
From hlapp at gmx.net  Mon May 18 09:24:34 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 18 May 2009 09:24:34 -0400
Subject: [BioSQL-l] Full text indexing/Searching in MySQL
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <DB12094C-A6E1-4F79-B669-848807B96788@gmx.net>

I've done that using Oracle, not MySQL. I assume that's therefore not  
what you want to hear about and hence will shut up :)

	-hilmar

On May 18, 2009, at 8:45 AM, michael watson (IAH-C) wrote:

> Hi
>
>
>
> Has anyone implemented full text indexing/searching for BioSQL in  
> MySQL,
> either using MySQL's full text features or any other solution?
>
>
>
> Any tips, advice, documentation, code etc available?
>
>
>
> Thanks
>
> Mick

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Mon May 18 09:26:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 18 May 2009 14:26:40 +0100
Subject: [BioSQL-l] Full text indexing/Searching in MySQL
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <320fb6e00905180626o4855aa06v6c6ae665885a3fce@mail.gmail.com>

On Mon, May 18, 2009 at 1:45 PM, michael watson (IAH-C)
<michael.watson at bbsrc.ac.uk> wrote:
>
> Hi
>
> Has anyone implemented full text indexing/searching for BioSQL in MySQL,
> either using MySQL's full text features or any other solution?
>
> Any tips, advice, documentation, code etc available?
>
> Thanks
>
> Mick

Hilmar mentioned he has done something like this on this thread,
where he was storing XML strings as annotation values:

http://lists.open-bio.org/pipermail/biosql-l/2009-May/001534.html

(You've probably read that - but just in case, worth mentioning).

Peter

From biopython at maubp.freeserve.co.uk  Mon May 18 09:38:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 18 May 2009 14:38:03 +0100
Subject: [BioSQL-l] [Biopython-dev] SwissProt DE lines and
	bioentry.description field in BioSQL
In-Reply-To: <A8AB4BCB-9CD3-428D-AF10-899AD8055EC7@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
	<0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
	<320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
	<A8AB4BCB-9CD3-428D-AF10-899AD8055EC7@gmx.net>
Message-ID: <320fb6e00905180638q29de63c4if0627eff416c4481@mail.gmail.com>

On Sun, May 17, 2009 at 4:21 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> On May 17, 2009, at 8:40 AM, Peter wrote:
>>
>> [...] Here you have mapped RecName and AltName fields in the DE lines to
>> Name and Synonyms (shouldn't that be Synonym singular?).
>
> The example is for the GN lines in SwissProt, not the DE lines.

Ah, that probably explains some of my confusion.

>> In this example, searching the database using one of the SwissProt
>> AltNames (synonyms), or filtering on the Flags sounds like a
>> reasonable request - but this would be very difficult if the data is
>> stored inside XML strings.
>
> Actually no. Modern full-text indexers (inside or outside the database) can
> index XML text columns right away and very well. In fact, for the last
> project that I built a full-text search for (on top of a BioSQL database) I
> did that by writing custom XML documents to a separate table for each
> record I wanted indexed. Oracle's full text indexer did the rest. I also built a
> separate identifier/name/accession index that pulled all the gene names,
> symbols, accession numbers, identifiers etc into a single table for
> indexing.

OK, when I said searching "would be very difficult if the data is
stored inside XML strings", maybe it wasn't so difficult for you - but
that still sounds complicated!

Sticking with the GN lines and the synonym, if this was stored as a
simple tag/value as usual in BioSQL, I would write my SQL statement to
search the annotation table where the term id was that associated with
a GN synonym, and the annotation value was "HABP1".  Simple.
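
A minimal sketch of that kind of lookup, using Python's sqlite3 module
and heavily cut-down versions of the term and bioentry_qualifier_value
tables (the real BioSQL tables have more columns and constraints). The
term name "gene_synonym", the sample rows and the helper
find_by_annotation are assumptions made up for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE term (
    term_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE bioentry_qualifier_value (
    bioentry_id INTEGER NOT NULL,
    term_id     INTEGER NOT NULL REFERENCES term(term_id),
    value       TEXT,
    rank        INTEGER NOT NULL DEFAULT 0
);
""")
conn.executemany(
    "INSERT INTO term (term_id, name) VALUES (?, ?)",
    [(1, "gene_name"), (2, "gene_synonym")],
)
conn.executemany(
    "INSERT INTO bioentry_qualifier_value VALUES (?, ?, ?, ?)",
    [(1, 1, "GC1QBP", 1), (1, 2, "HABP1", 1), (1, 2, "SF2P32", 2)],
)


def find_by_annotation(conn, term_name, value):
    """Return the bioentry_ids annotated with this exact tag/value pair."""
    rows = conn.execute(
        "SELECT DISTINCT q.bioentry_id "
        "FROM bioentry_qualifier_value q "
        "JOIN term t ON t.term_id = q.term_id "
        "WHERE t.name = ? AND q.value = ? "
        "ORDER BY q.bioentry_id",
        (term_name, value),
    ).fetchall()
    return [row[0] for row in rows]


matches = find_by_annotation(conn, "gene_synonym", "HABP1")
```

With one value per row, the search is an exact match on an ordinary
column, which a regular index on (term_id, value) could support.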

Using the XML approach, are you suggesting you could do a full text
search on the annotation value field, looking for any rows where the
field contains "<Synonyms>HABP1</Synonyms>", where the term id matches
the GN lines' XML string? This sounds simplistic and probably rather
slow - presumably why you resorted to the more complicated indexing
scheme described above?

> What I mean is, a fully normalized relational representation, especially if
> nested, is often not the most efficient data structure for efficient
> searching and filtering.

OK.  But do we really need to worry about complex nested structures
for the SwissProt annotation (or in general)?

Peter

From jimp at compbio.dundee.ac.uk  Mon May 18 10:01:28 2009
From: jimp at compbio.dundee.ac.uk (James Procter)
Date: Mon, 18 May 2009 15:01:28 +0100
Subject: [BioSQL-l] BioSQL at BOSC 2009?
In-Reply-To: <74D4CC78-FC7B-4595-9D24-EB6B3ED43318@gmx.net>
References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>	<1BD503B3-D805-4882-87DD-820138792DB2@gmx.net>	<320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com>
	<74D4CC78-FC7B-4595-9D24-EB6B3ED43318@gmx.net>
Message-ID: <4A116A38.9050705@compbio.dundee.ac.uk>

Hi all.

Hilmar Lapp wrote:
> On May 16, 2009, at 5:23 PM, Peter wrote:
> 
>> I will be staying for all of ISMB
Same here.
> 
> 
> I am too. Should we doodle something once the program is out?
I'll watch out for the URL if you post it to the list!

Jim.

-- 
-------------------------------------------------------------------
J. B. Procter  (ENFIN/VAMSAS)  Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764  http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.

From roy.chaudhuri at gmail.com  Mon May 18 13:37:39 2009
From: roy.chaudhuri at gmail.com (Roy Chaudhuri)
Date: Mon, 18 May 2009 18:37:39 +0100
Subject: [BioSQL-l] Full text indexing/Searching in MySQL
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <4A119CE3.3080208@gmail.com>

Hi Mick,

> Has anyone implemented full text indexing/searching for BioSQL in MySQL,
> either using MySQL's full text features or any other solution?

I've kind of done this. The trouble is that full text is only 
implemented on the non-transactional MyISAM tables, not InnoDB (it has 
long been promised for InnoDB, but no sign yet). My hack solution was to 
parse out the fields I was interested in (feature tags such as gene and 
product) and include them in a separate MyISAM table, cross-referenced 
to BioSQL using seqfeature_id. This involves duplicating data (which is 
a bad thing), but should be okay if database updates are infrequent. I 
mimic atomic changes by building an updated version of the MyISAM table 
separately, then switching to use the new version at the same time as I 
commit the BioSQL updates.
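
The build-then-switch idea can be sketched roughly as follows. This is
only an illustration in Python with SQLite, not the MySQL/MyISAM set-up
described above (in MySQL the switch would presumably be a RENAME TABLE).
The table names feature_search and feature_search_new, its columns and
the function rebuild_search_table are all hypothetical.

```python
import sqlite3


def rebuild_search_table(conn, rows):
    """Build a fresh lookup table on the side, then swap it in."""
    # Step 1: build the replacement while the live table stays untouched.
    conn.execute("DROP TABLE IF EXISTS feature_search_new")
    conn.execute(
        "CREATE TABLE feature_search_new "
        "(seqfeature_id INTEGER, tag TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO feature_search_new VALUES (?, ?, ?)", rows)
    # Step 2: swap old for new inside a single transaction.
    conn.execute("BEGIN")
    conn.execute("DROP TABLE IF EXISTS feature_search")
    conn.execute("ALTER TABLE feature_search_new RENAME TO feature_search")
    conn.execute("COMMIT")


# isolation_level=None lets the code issue BEGIN/COMMIT explicitly
conn = sqlite3.connect(":memory:", isolation_level=None)
rebuild_search_table(conn, [(1, "gene", "VP1")])
rebuild_search_table(conn, [(2, "product", "polyprotein")])
current = conn.execute(
    "SELECT seqfeature_id, tag, value FROM feature_search"
).fetchall()
```

Readers of feature_search see either the old or the new contents, never a
half-built table, which is the property the duplicate-table hack relies on.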

There's also Sphinx (http://www.sphinxsearch.com), which is a plug-in 
that can implement full-text searches in InnoDB, but I haven't 
experimented with that so have no idea how well it works.

Cheers.
Roy.

From holland at eaglegenomics.com  Mon May 18 14:20:52 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Mon, 18 May 2009 19:20:52 +0100
Subject: [BioSQL-l] Full text indexing/Searching in MySQL
In-Reply-To: <4A119CE3.3080208@gmail.com>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>
	<4A119CE3.3080208@gmail.com>
Message-ID: <1242670852.28726.2.camel@buzzybee>

There's also Lucene, which is a Java-based full-text indexer which can
be attached to all kinds of data sources, including MySQL databases:

http://lucene.apache.org/java/docs/

cheers,
Richard

On Mon, 2009-05-18 at 18:37 +0100, Roy Chaudhuri wrote:
> Hi Mick,
> 
> > Has anyone implemented full text indexing/searching for BioSQL in MySQL,
> > either using MySQL's full text features or any other solution?
> 
> I've kind of done this. The trouble is that full text is only 
> implemented on the non-transactional MyISAM tables, not InnoDB (it has 
> long been promised for InnoDB, but no sign yet). My hack solution was to 
> parse out the fields I was interested in (feature tags such as gene and 
> product) and include them in a separate MyISAM table, cross-referenced 
> to BioSQL using seqfeature_id. This involves duplicating data (which is 
> a bad thing), but should be okay if database updates are infrequent. I 
> mimic atomic changes by building an updated version of the MyISAM table 
> separately, then switching to use the new version at the same time as I 
> commit the BioSQL updates.
> 
> There's also Sphinx (http://www.sphinxsearch.com), which is a plug-in 
> that can implement full-text searches in InnoDB, but I haven't 
> experimented with that so have no idea how well it works.
> 
> Cheers.
> Roy.
-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From michael.watson at bbsrc.ac.uk  Tue May 19 04:17:32 2009
From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C))
Date: Tue, 19 May 2009 09:17:32 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>

Hi

I'm using:

biosql-1.0.1
bioperl-db-1.5.2_100
bioperl-1.5.2_102

When I run load_seqdatabase.pl on about 3000 GenBank sequences, I get:

Loading fmd_180509.gbk ...

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
("AY312586S1","32307407","AY312586","Foot-and-mouth disease virus O
isolate O/SKR/2000 S fragment, complete
1,9762)
Duplicate entry 'AY312586-1-1' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","1") FKs (324,3,4)
Duplicate entry '324-3-4-1' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
("AY312586S2","32307408","AY312587","Foot-and-mouth disease virus O
isolate O/SKR/2000 L fragment, complete
1,9762)
Duplicate entry 'AY312587-1-1' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","1") FKs (323,3,4)
Duplicate entry '323-3-4-1' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","2") FKs (323,22,4)
Duplicate entry '323-22-4-2' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","3") FKs (323,15,4)
Duplicate entry '323-15-4-3' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
("AY312588S1","32307403","AY312588","Foot-and-mouth disease virus O
isolate O/SKR/2002 S fragment, complete
1,9762)
Duplicate entry 'AY312588-1-1' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","1") FKs (326,3,4)
Duplicate entry '326-3-4-1' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
("AY312588S2","32307404","AY312589","Foot-and-mouth disease virus O
isolate O/SKR/2002 L fragment, complete
1,9762)
Duplicate entry 'AY312589-1-1' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","1") FKs (325,3,4)
Duplicate entry '325-3-4-1' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","2") FKs (325,22,4)
Duplicate entry '325-22-4-2' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","3") FKs (325,15,4)
Duplicate entry '325-15-4-3' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
("S87919S2","247466","S87923","L [foot-and-mouth disease virus FMDV,
strain CS8, Genomic RNA, 10 nt, segmen
1,9754)
Duplicate entry 'S87923-1-1' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","1") FKs (782,3,4)
Duplicate entry '782-3-4-1' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","2") FKs (782,13,4)
Duplicate entry '782-13-4-2' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
("S87919S1","247464","S87919","L [foot-and-mouth disease virus FMDV,
strain CS8, Genomic RNA, 35 nt, segmen
1,9754)
Duplicate entry 'S87919-1-1' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","1") FKs (781,3,4)
Duplicate entry '781-3-4-1' for key 2
---------------------------------------------------

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values
were ("","Direct Submission","Submitted (12-AUG-2004) National Center
for Biotechnology Information, NIH,
C-E8D3CBBD80002FA1","1","8170","") FKs (<NULL>)
Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
---------------------------------------------------
Could not store NC_011452:
------------- EXCEPTION  -------------
MSG: create: object (Bio::Annotation::Reference) failed to insert or to
be found by unique key
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SeqAdaptor.pm:224
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK (eval) load_seqdatabase.pl:622
STACK toplevel load_seqdatabase.pl:604

--------------------------------------

 at load_seqdatabase.pl line 635

Any clues?

Thanks

Mick

Head of Bioinformatics
Institute for Animal Health
Compton
Berks
RG20 7NN
01635 578411 

 
 
From biopython at maubp.freeserve.co.uk  Tue May 19 05:31:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 May 2009 10:31:05 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <320fb6e00905190231t79ac1dc9j49585929e9b5304a@mail.gmail.com>

On Tue, May 19, 2009 at 9:17 AM, michael watson (IAH-C)
<michael.watson at bbsrc.ac.uk> wrote:
>
> Hi
>
> I'm using:
>
> biosql-1.0.1
> bioperl-db-1.5.2_100
> bioperl-1.5.2_102
>
> When I run load_seqdatabase.pl on about 3000 GenBank sequences,
> I get:
>
> Loading fmd_180509.gbk ...
> ...
> ---------------------------------------------------
>
> Could not store NC_011452:
>
> ------------- EXCEPTION  -------------
>
> MSG: create: object (Bio::Annotation::Reference) failed to insert or to
> be found by unique key
>
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
> /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
>
> ...
>
> STACK Bio::DB::Persistent::PersistentObject::store
> /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
>
> STACK (eval) load_seqdatabase.pl:622
>
> STACK toplevel load_seqdatabase.pl:604
>
> --------------------------------------
>
>  at load_seqdatabase.pl line 635
>
> Any clues?

You got a lot of warnings about feature keys (which I am guessing are
from different GenBank entries), but the failure seems to come from
the annotation in NC_011452.

Try downloading just NC_011452 in GenBank format, and testing that:
http://www.ncbi.nlm.nih.gov/nuccore/NC_011452

I would expect that to fail in the same way, and you would at least
have isolated the issue to a smaller test case. If it works, then
maybe the copy of NC_011452 in your file is corrupted somehow - check
for differences.

Peter


From hlapp at gmx.net  Tue May 19 08:25:25 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 19 May 2009 08:25:25 -0400
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>


On May 19, 2009, at 4:17 AM, michael watson (IAH-C) wrote:

> [...]
> -------------------- WARNING ---------------------
>
> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values  
> were
> ("AY312586S1","32307407","AY312586","Foot-and-mouth disease virus O
> isolate O/SKR/2000 S fragment, complete
>
> 1,9762)
>
> Duplicate entry 'AY312586-1-1' for key 2
>
> ---------------------------------------------------

This suggests that a sequence with the above accession or GI number  
was already in the database, or occurs in the file twice.

If this situation is possible, you will have to pass the --lookup (or  
--flatlookup) flag to the script, and specify how you want updates to  
take place when they are necessary (options --noupdate, --remove, and  
--mergeobjs).
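
A quick way to tell whether the file itself contains the same record
twice is to count the ACCESSION lines before loading. The following is
only a minimal Python sketch, not part of load_seqdatabase.pl: the
function name duplicate_accessions and the toy records are made up for
illustration, and it assumes the primary accession is the first token
after the ACCESSION keyword.

```python
from collections import Counter


def duplicate_accessions(genbank_text):
    """Return the accessions that occur in more than one record.

    Hypothetical helper: it only looks at ACCESSION lines and ignores
    versions, so it is a pre-load sanity check rather than a parser.
    """
    counts = Counter()
    for line in genbank_text.splitlines():
        # The ACCESSION keyword starts in the first column of the line.
        if line.startswith("ACCESSION"):
            fields = line.split()
            if len(fields) > 1:
                counts[fields[1]] += 1
    return sorted(acc for acc, n in counts.items() if n > 1)


# Toy input: three stub records, one accession repeated.
toy_file = """\
LOCUS       AY312586S1
ACCESSION   AY312586
//
LOCUS       AY312586S2
ACCESSION   AY312587
//
LOCUS       AY312586S1
ACCESSION   AY312586
//
"""

print(duplicate_accessions(toy_file))
```

An empty result would suggest the duplicates are already in the
database rather than in the file.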

> -------------------- WARNING ---------------------
>
> MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
> values were ("","1") FKs (324,3,4)
>
> Duplicate entry '324-3-4-1' for key 2
> ---------------------------------------------------

I suspect that 324 is the primary key of the sequence record that  
raised the duplicate entry warning above. Can you check that?

If the insert is turned into an update, these warnings should go away  
too.

> [...]
> -------------------- WARNING ---------------------
>
> MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
> values were ("","1") FKs (323,3,4)
>
> Duplicate entry '323-3-4-1' for key 2
>
> ---------------------------------------------------

Similar to before, except 323 is probably the primary key for AY312587.

> [...]
> -------------------- WARNING ---------------------
>
> MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
> values were ("","1") FKs (325,3,4)
>
> Duplicate entry '325-3-4-1' for key 2
>
> ---------------------------------------------------

And if the order of messages is preserved correctly, 325 would be the  
primary key of AY312589.

> [...]
> -------------------- WARNING ---------------------
>
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed,  
> values
> were ("","Direct Submission","Submitted (12-AUG-2004) National Center
> for Biotechnology Information, NIH,
>
> C-E8D3CBBD80002FA1","1","8170","") FKs (<NULL>)
>
> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
>
> ---------------------------------------------------

This one is odd. Can you check which existing entry you have with  
reference.crc = 'CRC-E8D3CBBD80002FA1'?

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From michael.watson at bbsrc.ac.uk  Wed May 20 05:52:13 2009
From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C))
Date: Wed, 20 May 2009 10:52:13 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
	<68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>

Hi Guys

Ok, the warnings were due to duplicate sequences - I had downloaded a
stream using Bio::DB::GenBank and I guess I assumed that would mean only
unique entries were sent back.  Using "--flatlookup --remove" gets rid
of the warnings.

Now for NC_003992.gbk...

To answer Hilmar's question:
mysql> select * from reference where crc = "CRC-E8D3CBBD80002FA1";
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
| reference_id | dbxref_id | location                                                                                            | title             | authors | crc                  |
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
|          152 |      NULL | Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA | Direct Submission | NULL    | CRC-E8D3CBBD80002FA1 |
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+

And when I run load_seqdatabase.pl on NC_003992.gbk alone I still get:

perl load_seqdatabase.pl --host localhost --dbname fmd_biosql --format
genbank --dbuser removed --dbpass removed --flatlookup --remove
NC_003992.gbk

Loading NC_003992.gbk ...

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values
were ("","Direct Submission","Submitted (12-AUG-2004) National Center
for Biotechnology Information, NIH, Bethesda, MD 20894,
USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs (<NULL>)
Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
---------------------------------------------------
Could not store NC_003992: 
------------- EXCEPTION  -------------
MSG: create: object (Bio::Annotation::Reference) failed to insert or to
be found by unique key
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SeqAdaptor.pm:224
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK (eval) load_seqdatabase.pl:622
STACK toplevel load_seqdatabase.pl:604

--------------------------------------

 at load_seqdatabase.pl line 635

And I still have:

mysql> select * from reference where crc = "CRC-E8D3CBBD80002FA1";
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
| reference_id | dbxref_id | location                                                                                            | title             | authors | crc                  |
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
|          152 |      NULL | Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA | Direct Submission | NULL    | CRC-E8D3CBBD80002FA1 |
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
1 row in set (0.01 sec)

Could this be because bases 1 to 8203 of the sequence have three
references, and the crc is created on the first and then duplicated on
the second, thus causing a problem?

Cheers
Mick


From biopython at maubp.freeserve.co.uk  Wed May 20 06:59:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 20 May 2009 11:59:19 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
	<68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>

On Wed, May 20, 2009 at 10:52 AM, michael watson (IAH-C)
<michael.watson at bbsrc.ac.uk> wrote:
>
> Hi Guys
>
> Ok, the warnings were due to duplicate sequences - I had downloaded a
> stream using Bio::DB::GenBank and I guess I assumed that would mean only
> unique entries were sent back.  Using "--flatlookup --remove" gets rid
> of the warnings.

Great - easy :)

> Now for NC_003992.gbk...
>
> To answer Hilmar's question:
> ...
> And when I run load_seqdatabase.pl on NC_003992.gbk alone I still get:
>
> perl load_seqdatabase.pl --host localhost --dbname fmd_biosql --format
> genbank --dbuser removed --dbpass removed --flatlookup --remove
> NC_003992.gbk
>
> Loading NC_003992.gbk ...
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values
> were ("","Direct Submission","Submitted (12-AUG-2004) National Center
> for Biotechnology Information, NIH, Bethesda, MD 20894,
> USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs (<NULL>)
> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
> ---------------------------------------------------
> Could not store NC_003992:
> ------------- EXCEPTION  -------------
> MSG: create: object (Bio::Annotation::Reference) failed to insert or to
> be found by unique key
> ...

I would guess that the problem is that this rather generic reference
in NC_003992 is repeated exactly in another genome (causing the CRC
collision):

CONSRTM   NCBI Genome Project
TITLE     Direct Submission
JOURNAL   Submitted (12-AUG-2004) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA

See http://www.ncbi.nlm.nih.gov/nuccore/NC_011452

i.e. Could there be another direct submission by the NCBI on that date
in your collection?  You could search the database looking for that
CRC and trace it back to a bioentry, or just try grep for "JOURNAL
Submitted (12-AUG-2004) National Center for Biotechnology" on your
GenBank files. e.g. Something like this SQL statement might be
interesting:

SELECT bioentry.accession, reference.title FROM bioentry,
bioentry_reference, reference WHERE
bioentry.bioentry_id=bioentry_reference.bioentry_id AND
bioentry_reference.reference_id=reference.reference_id AND
reference.crc="CRC-E8D3CBBD80002FA1";
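
To try the shape of that query without touching a real database,
here is a rough Python sketch using an in-memory SQLite database. It
assumes a heavily simplified stand-in for the three BioSQL tables
(only the columns used in the join), and the rows are toy data: the
two NC_ accessions are borrowed from this thread and TOY00001 is made
up. The helper name accessions_for_crc is hypothetical.

```python
import sqlite3

# Simplified stand-in for the BioSQL tables; only the joined columns.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, accession TEXT);
    CREATE TABLE reference (reference_id INTEGER PRIMARY KEY,
                            title TEXT, crc TEXT UNIQUE);
    CREATE TABLE bioentry_reference (bioentry_id INTEGER, reference_id INTEGER);
    """
)
conn.executemany(
    "INSERT INTO bioentry VALUES (?, ?)",
    [(1, "NC_011450"), (2, "NC_011451"), (3, "TOY00001")],
)
conn.executemany(
    "INSERT INTO reference VALUES (?, ?, ?)",
    [(152, "Direct Submission", "CRC-E8D3CBBD80002FA1"),
     (153, "Some other paper", "CRC-0000000000000000")],
)
conn.executemany(
    "INSERT INTO bioentry_reference VALUES (?, ?)",
    [(1, 152), (2, 152), (3, 153)],
)


def accessions_for_crc(connection, crc):
    """List the accessions of all bioentries linked to a reference CRC."""
    rows = connection.execute(
        """
        SELECT bioentry.accession
        FROM bioentry
        JOIN bioentry_reference
          ON bioentry.bioentry_id = bioentry_reference.bioentry_id
        JOIN reference
          ON bioentry_reference.reference_id = reference.reference_id
        WHERE reference.crc = ?
        ORDER BY bioentry.accession
        """,
        (crc,),
    )
    return [row[0] for row in rows]


print(accessions_for_crc(conn, "CRC-E8D3CBBD80002FA1"))
```

More than one accession coming back for a single CRC means the
reference row is shared between records.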

Peter


From michael.watson at bbsrc.ac.uk  Wed May 20 07:25:52 2009
From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C))
Date: Wed, 20 May 2009 12:25:52 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
	<68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
	<320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>
Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>

We have a winner :)

NC_003992, NC_011452, NC_011451, NC_011450 all share at least one reference.

Would changing --flatlookup to --lookup change the behaviour so it checks for an existing reference before trying to insert the duplicate?

The answer is no :( (see below).

I guess this may need some coding then!

Thanks!
Mick

perl load_seqdatabase.pl --host localhost --dbname fmd_biosql --format genbank --dbuser removed --dbpass removed --lookup --remove NC_003992.gbk 
Loading NC_003992.gbk ...

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were ("","Direct Submission","Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs (<NULL>)
Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
---------------------------------------------------
Could not store NC_003992: 
------------- EXCEPTION  -------------
MSG: create: object (Bio::Annotation::Reference) failed to insert or to be found by unique key
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SeqAdaptor.pm:224
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK (eval) load_seqdatabase.pl:622
STACK toplevel load_seqdatabase.pl:604

--------------------------------------

 at load_seqdatabase.pl line 635


From biopython at maubp.freeserve.co.uk  Wed May 20 07:34:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 20 May 2009 12:34:51 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
	<68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
	<320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <320fb6e00905200434x3e1c7978ue1c58382f7478354@mail.gmail.com>

On Wed, May 20, 2009 at 12:25 PM, michael watson (IAH-C)
<michael.watson at bbsrc.ac.uk> wrote:
>
> We have a winner :)
>
> NC_003992, NC_011452, NC_011451, NC_011450 all share
> at least one reference.
>
> Would changing --flatlookup to --lookup change the behaviour
> so it checks for an existing reference before trying to insert the
> duplicate?
>
> The answer is no :( (see below).
>
> I guess this may need some coding then!

My crude idea for a simple ad-hoc solution would be to remove these
pointless references from the records, before loading them into
BioSQL.

One way would be to edit the four GenBank files by hand (e.g. to
remove the reference or make them unique). You might also do this in a
BioPerl script that loads the records, edits the references, and then
puts them in the database. Personally I use Python not Perl, so I
can't tell you how you might do that with BioPerl.
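
On the Python side, a rough sketch of the idea could look like this.
Biopython keeps the parsed references in record.annotations["references"]
as objects with title and pubmed_id attributes; the helper name, the
choice to keep anything that has a PubMed ID, and the stand-in record
below are my own assumptions for illustration only.

```python
from types import SimpleNamespace


def drop_generic_references(record, generic_titles=("Direct Submission",)):
    """Remove references with a generic title and no PubMed ID.

    Works on any object with an ``annotations`` dict holding a
    "references" list, which is how Biopython SeqRecord objects look.
    Returns the number of references removed.
    """
    refs = record.annotations.get("references", [])
    kept = [ref for ref in refs
            if ref.title not in generic_titles or ref.pubmed_id]
    record.annotations["references"] = kept
    return len(refs) - len(kept)


def make_example_record():
    """Hypothetical stand-in for a parsed GenBank record (not real data)."""
    return SimpleNamespace(
        id="EXAMPLE_1",
        annotations={"references": [
            SimpleNamespace(title="Direct Submission", pubmed_id=""),
            SimpleNamespace(title="A made-up paper title", pubmed_id="12345"),
        ]},
    )
```

With real files you would presumably call this on each record coming
out of SeqIO.parse(handle, "genbank") before handing the records to
db.load(...).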

Hilmar may be able to comment from a BioPerl/BioSQL point of view -
clearly CRC collisions of this nature will happen again in future.

Peter

From holland at eaglegenomics.com  Wed May 20 07:44:58 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 20 May 2009 12:44:58 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <320fb6e00905200434x3e1c7978ue1c58382f7478354@mail.gmail.com>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
	<68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
	<320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>
	<320fb6e00905200434x3e1c7978ue1c58382f7478354@mail.gmail.com>
Message-ID: <1242819898.18348.1.camel@buzzybee>

Although unlikely, it is entirely possible for two completely different
references to share the same CRC. Hence the CRC shouldn't really be used
as an indicator of uniqueness, although it is still useful as a hash
for indexing and quick lookup.
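
To illustrate the distinction, here is a minimal sketch in Python of a
checksum used only to pick a bucket, with the full record compared
before deciding two references are the same. The class name is made up,
and zlib.crc32 is just a convenient stand-in; it is not the checksum
bioperl-db computes.

```python
import zlib
from collections import defaultdict


class ReferenceIndex:
    """Index references by checksum, but decide equality on the full record."""

    def __init__(self, checksum=None):
        # The checksum function is pluggable so collisions can be simulated.
        self._checksum = checksum or (
            lambda text: zlib.crc32(text.encode("utf-8")))
        self._buckets = defaultdict(list)

    def add(self, ref):
        """Store ``ref`` (a tuple of strings) unless an identical one exists.

        Returns True if the reference was newly stored, False if an
        identical reference was already present.
        """
        bucket = self._buckets[self._checksum("|".join(ref))]
        if ref in bucket:          # full comparison, not just the checksum
            return False
        bucket.append(ref)         # same checksum, different content: keep both
        return True

    def __len__(self):
        return sum(len(bucket) for bucket in self._buckets.values())
```

Passing checksum=lambda text: 0 forces every reference into the same
bucket, which shows that distinct references still survive a collision.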

cheers,
Richard

On Wed, 2009-05-20 at 12:34 +0100, Peter wrote:
> On Wed, May 20, 2009 at 12:25 PM, michael watson (IAH-C)
> <michael.watson at bbsrc.ac.uk> wrote:
> >
> > We have a winner :)
> >
> > NC_003992, NC_011452, NC_011451, NC_011450 all share
> > at least one reference.
> >
> > Would changing --flatlookup to --lookup change the behaviour
> > so it checks for an existing reference before trying to insert the
> > duplicate?
> >
> > The answer is no :( (see below).
> >
> > I guess this may need some coding then!
> 
> My crude idea for a simple ad-hoc solution would be to remove these
> pointless references from the records, before loading them into
> BioSQL.
> 
> One way would be to edit the four GenBank files by hand (e.g. to
> remove the reference or make them unique). You might also do this in a
> BioPerl script that loads the records, edits the references, and then
> puts them in the database. Personally I use Python not Perl, so I
> can't tell you how you might do that with BioPerl.
> 
> Hilmar may be able to comment from a BioPerl/BioSQL point of view -
> clearly CRC collisions of this nature will happen again in future.
> 
> Peter
-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From hlapp at gmx.net  Wed May 20 11:10:20 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 20 May 2009 11:10:20 -0400
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
	<68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
	<320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <0212C167-7618-4761-A191-C6CE4B41EC2A@gmx.net>

Indeed changing the lookup will have no effect since deletion of  
bioentries doesn't cascade to references (only to bioentry-to- 
reference associations).

What I don't understand yet is how you get the CRC clash. Normally  
this kind of situation arises when the first occurrence of a reference  
has no PMID and the second one does. The second occurrence is then  
looked up by its PMID, the lookup fails (b/c the first occurrence  
didn't come with a PMID), and the reference is erroneously deemed  
"new". The resulting insert then fails with a CRC clash.
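
As a toy model of that sequence of events (this is not the bioperl-db
code, just a Python sketch with made-up names and a dict standing in
for the reference table):

```python
class DuplicateKey(Exception):
    """Raised when an insert would violate the unique CRC key."""


class ToyReferenceTable:
    """Stand-in for a table with unique keys on PMID (if any) and CRC."""

    def __init__(self):
        self.by_pmid = {}
        self.by_crc = {}

    def insert(self, ref):
        if ref["crc"] in self.by_crc:
            raise DuplicateKey(ref["crc"])
        self.by_crc[ref["crc"]] = ref
        if ref["pmid"]:
            self.by_pmid[ref["pmid"]] = ref

    def store(self, ref):
        """Look up by PMID when there is one, otherwise by CRC.

        If the lookup finds nothing the reference is treated as new and
        inserted, which is where the clash can occur.
        """
        if ref["pmid"]:
            found = self.by_pmid.get(ref["pmid"])
        else:
            found = self.by_crc.get(ref["crc"])
        if found is None:
            self.insert(ref)
            return "inserted"
        return "found"
```

Storing a reference without a PMID and then the same reference with a
PMID raises DuplicateKey in this model, whereas storing it twice
without a PMID is simply found the second time.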

However, there is no PMID nor any other identifier here, so I'll have  
to look into the code to find out why the second occurrence is either  
not looked up before an insert is attempted, or if it is looked up,  
why the lookup fails to find the record stored earlier.

	-hilmar

On May 20, 2009, at 7:25 AM, michael watson (IAH-C) wrote:

> We have a winner :)
>
> NC_003992, NC_011452, NC_011451, NC_011450 all share at least one  
> reference.
>
> Would changing --flatlookup to --lookup change the behaviour so it  
> checks for an existing reference before trying to insert the  
> duplicate?
>
> The answer is no :( (see below).
>
> I guess this may need some coding then!
>
> Thanks!
> Mick
>
> perl load_seqdatabase.pl --host localhost --dbname fmd_biosql -- 
> format genbank --dbuser removed --dbpass removed --lookup --remove  
> NC_003992.gbk
> Loading NC_003992.gbk ...
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed,  
> values were ("","Direct Submission","Submitted (12-AUG-2004)  
> National Center for Biotechnology Information, NIH, Bethesda, MD  
> 20894, USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs (<NULL>)
> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
> ---------------------------------------------------
> Could not store NC_003992:
> ------------- EXCEPTION  -------------
> MSG: create: object (Bio::Annotation::Reference) failed to insert or  
> to be found by unique key
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> BioSQL/BasePersistenceAdaptor.pm:206
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> BioSQL/BasePersistenceAdaptor.pm:251
> STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> Persistent/PersistentObject.pm:271
> STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children / 
> usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/ 
> Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> BioSQL/BasePersistenceAdaptor.pm:214
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> BioSQL/BasePersistenceAdaptor.pm:251
> STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> Persistent/PersistentObject.pm:271
> STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> BioSQL/SeqAdaptor.pm:224
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> BioSQL/BasePersistenceAdaptor.pm:214
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> BioSQL/BasePersistenceAdaptor.pm:251
> STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> Persistent/PersistentObject.pm:271
> STACK (eval) load_seqdatabase.pl:622
> STACK toplevel load_seqdatabase.pl:604
>
> --------------------------------------
>
> at load_seqdatabase.pl line 635
>
> -----Original Message-----
> From: p.j.a.cock at googlemail.com [mailto:p.j.a.cock at googlemail.com]  
> On Behalf Of Peter
> Sent: 20 May 2009 11:59
> To: michael watson (IAH-C)
> Cc: Hilmar Lapp; biosql-l at lists.open-bio.org
> Subject: Re: [BioSQL-l] load_seqdatabase.pl warnings and errors
>
> On Wed, May 20, 2009 at 10:52 AM, michael watson (IAH-C)
> <michael.watson at bbsrc.ac.uk> wrote:
>>
>> Hi Guys
>>
>> Ok, the warnings were due to duplicate sequences - I had downloaded a
>> stream using Bio::DB::GenBank and I guess I assumed that would mean  
>> only
>> unique entries were sent back.  Using "--flatlookup --remove" gets  
>> rid
>> of the warnings.
>
> Great - easy :)
>
>> Now for NC_003992.gbk...
>>
>> To answer Hilmar's question:
>> ...
>> And when I run load_seqdatabase.pl on NC_003992.gbk alone I still  
>> get:
>>
>> perl load_seqdatabase.pl --host localhost --dbname fmd_biosql -- 
>> format
>> genbank --dbuser removed --dbpass removed --flatlookup --remove
>> NC_003992.gbk
>>
>> Loading NC_003992.gbk ...
>>
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed,  
>> values
>> were ("","Direct Submission","Submitted (12-AUG-2004) National Center
>> for Biotechnology Information, NIH, Bethesda, MD 20894,
>> USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs (<NULL>)
>> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
>> ---------------------------------------------------
>> Could not store NC_003992:
>> ------------- EXCEPTION  -------------
>> MSG: create: object (Bio::Annotation::Reference) failed to insert  
>> or to
>> be found by unique key
>> ...
>
> I would guess that the problem is this rather generic reference in
> NC_003992 may be repeated exactly in another genome (causing the CRC
> collision):
>
> CONSRTM   NCBI Genome Project
> TITLE     Direct Submission
> JOURNAL   Submitted (12-AUG-2004) National Center for Biotechnology
> Information, NIH, Bethesda, MD 20894, USA
>
> See http://www.ncbi.nlm.nih.gov/nuccore/NC_011452
>
> i.e. Could there be another direct submission by the NCBI on that date
> in your collection?  You could search the database looking for that
> CRC and trace it back to a bioentry, or just try grep for "JOURNAL
> Submitted (12-AUG-2004) National Center for Biotechnology" on your
> GenBank files. e.g. Something like this SQL statement might be
> interesting:
>
> SELECT bioentry.accession, reference.title FROM bioentry,
> bioentry_reference, reference WHERE
> bioentry.bioentry_id=bioentry_reference.bioentry_id AND
> bioentry_reference.reference_id=reference.reference_id AND
> reference.crc="CRC-E8D3CBBD80002FA1";
>
> Peter

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Fri May 22 08:27:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 13:27:06 +0100
Subject: [BioSQL-l] RULES in BioSQL PostgreSQL schema
Message-ID: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>

Hi all,

This is a continuation of a thread / bug report from Biopython (Bug 2833)
where attempting to import duplicate entries into BioSQL did not raise an
error on PostgreSQL (but does on MySQL). Cymon traced this to the
RULES present in the schema to help bioperl-db.

On Fri, May 22, 2009 at 3:05 AM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> On May 21, 2009, at 6:52 PM, Cymon Cox wrote:
>
>> [...]
>>
>> Hi Andrea,
>>
>> The problem appears to be related to the BioSQL schema/PostgreSQL.
>>
>> As you indicated, adding a duplicate entry to bioentry returns an "INSERT 0
>> 0" and doesn't throw an IntegrityError, which is what the code is looking
>> for and presumably what MySQL throws.
>>
>> The reason it doesn't throw an error is because of one (or both) of the
>> RULES in the schema:
>
> Indeed, I'd almost forgotten. The rules are there mostly as a remnant from
> earlier versions of PostgreSQL to support transactional loading the way
> bioperl-db (the object-relational mapping for BioPerl) is optimized. You
> probably don't need them anywhere else.
>
>         -hilmar
>
> <gory-details>
> Bioperl-db is optimized such that entities that very likely don't exist yet
> in the database are attempted for insert right away. If the insert fails due
> to a unique key violation, the record is looked up (and then expected to be
> found). In Oracle and MySQL you can do this and the transaction remains
> healthy; i.e., you can commit the transaction later and all statements
> except those that failed will be committed. In PostgreSQL any failed
> statement dooms the entire transaction, and the only way out is a rollback.
> In this case, if you want the loading of one sequence record as one
> transaction, failing to insert a single feature record will doom the entire
> sequence load and you would need to start over with the sequence. To fix
> this, I wrote the rules, which in essence do do the lookups for PostgreSQL
> that the bioperl-db code would otherwise avoid, and on insert do nothing if
> the record is found, which results in zero rows affected when you would
> expect one (which is what bioperl-db cues off of and then triggers a
> lookup).
> The right way to do this nowadays is to use nested transactions, which
> PostgreSQL has supported since v8.0.x, but I haven't gotten around to
> implementing support for that in Bioperl-db.
> </gory-details>
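
For what it's worth, the insert-first, look-up-on-failure pattern with
nested transactions might look roughly like the Python sketch below. It
uses the built-in sqlite3 module purely as a stand-in so it runs
anywhere; PostgreSQL 8.0+ accepts the same SAVEPOINT, ROLLBACK TO
SAVEPOINT and RELEASE SAVEPOINT statements. The one-column bioentry
table and the function names are simplifications of mine, not the real
schema or any bioperl-db API.

```python
import sqlite3


def insert_or_lookup(conn, accession):
    """Try the insert first; on a unique-key violation undo only that
    statement and look the existing row up instead.

    Returns (bioentry_id, was_inserted).
    """
    cur = conn.cursor()
    cur.execute("SAVEPOINT try_insert")
    try:
        cur.execute("INSERT INTO bioentry (accession) VALUES (?)", (accession,))
        new_id = cur.lastrowid
    except sqlite3.IntegrityError:
        # Roll back just the failed insert; the outer transaction survives.
        cur.execute("ROLLBACK TO SAVEPOINT try_insert")
        cur.execute("RELEASE SAVEPOINT try_insert")
        cur.execute("SELECT bioentry_id FROM bioentry WHERE accession = ?",
                    (accession,))
        return cur.fetchone()[0], False
    cur.execute("RELEASE SAVEPOINT try_insert")
    return new_id, True


def demo():
    """Load three accessions (one duplicated) inside a single transaction."""
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute("CREATE TABLE bioentry ("
                 "bioentry_id INTEGER PRIMARY KEY, "
                 "accession TEXT UNIQUE NOT NULL)")
    conn.execute("BEGIN")
    results = [insert_or_lookup(conn, acc)
               for acc in ("ACC_A", "ACC_A", "ACC_B")]
    conn.execute("COMMIT")
    count = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]
    return results, count
```

The point of the savepoint is that the duplicate insert no longer dooms
the whole sequence load, so no RULES would be needed to swallow it.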

Hilmar,

It seems for Biopython to work properly with BioSQL on PostgreSQL
these bioentry rules should be removed from the schema (as the
comments in the schema do suggest). Obviously doing this would
break any installation also using the current version of bioperl-db.

Do the RULES affect BioJava or BioRuby using BioSQL on
PostgreSQL?

Are you happy to remove these RULES in BioSQL v1.0.x (after
making the outlined transactional changes in bioperl-db)?

Thanks,

Peter


From hlapp at gmx.net  Fri May 22 11:03:11 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 22 May 2009 11:03:11 -0400
Subject: [BioSQL-l] RULES in BioSQL PostgreSQL schema
In-Reply-To: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
Message-ID: <CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>


On May 22, 2009, at 8:27 AM, Peter wrote:

> Are you happy to remove these RULES in BioSQL v1.0.x (after
> making the outlined transactional changes in bioperl-db)?

In principle yes. It would also mean dropping support for PostgreSQL  
v7.x, but I would hope that that's a non-issue.

But if anyone here is still using and relying on PostgreSQL v7.x (or  
earlier?) do let us know, please.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Fri May 22 11:57:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 16:57:38 +0100
Subject: [BioSQL-l] [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
	<CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
Message-ID: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>

On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> On May 22, 2009, at 8:27 AM, Peter wrote:
>
>> Are you happy to remove these RULES in BioSQL v1.0.x (after
>> making the outlined transactional changes in bioperl-db)?
>
> In principle yes. It would also mean dropping support for PostgreSQL v7.x,
> but I would hope that that's a non-issue.
>
> But if anyone here is still using and relying on PostgreSQL v7.x (or
> earlier?) do let us know, please.

Great.

In the meantime could you add a big warning about this issue to the
INSTALL notes for PostgreSQL (i.e. recommend removing the RULES
section if not using bioperl-db)?
http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL

Peter

From hlapp at gmx.net  Fri May 22 14:20:58 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 22 May 2009 14:20:58 -0400
Subject: [BioSQL-l] [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
	<CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
	<320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>
Message-ID: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net>

Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar

On May 22, 2009, at 11:57 AM, Peter wrote:

> On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>>
>> On May 22, 2009, at 8:27 AM, Peter wrote:
>>
>>> Are you happy to remove these RULES in BioSQL v1.0.x (after
>>> making the outlined transactional changes in bioperl-db)?
>>
>> In principle yes. It would also mean dropping support for  
>> PostgreSQL v7.x,
>> but I would hope that that's a non-issue.
>>
>> But if anyone here is still using and relying on PostgreSQL v7.x (or
>> earlier?) do let us know, please.
>
> Great.
>
> In the meantime could you add a big warning about this issue to the
> INSTALL notes for PostgreSQL (i.e. recommend removing the RULES
> section if not using bioperl-db)?
> http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL
>
> Peter

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Fri May 22 18:46:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 23:46:54 +0100
Subject: [BioSQL-l] [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
	<CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
	<320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>
	<410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net>
Message-ID: <320fb6e00905221546i26edc7a2u2a02fb0d01c374ea@mail.gmail.com>

On 5/22/09, Hilmar Lapp <hlapp at gmx.net> wrote:
> Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar

I've filed Bug 2839, hopefully this is what you had in mind:
http://bugzilla.open-bio.org/show_bug.cgi?id=2839

Peter

From michael.watson at bbsrc.ac.uk  Wed May 27 08:50:45 2009
From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C))
Date: Wed, 27 May 2009 13:50:45 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <0212C167-7618-4761-A191-C6CE4B41EC2A@gmx.net>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
	<68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
	<320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>
	<0212C167-7618-4761-A191-C6CE4B41EC2A@gmx.net>
Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E27A82@iahce2ksrv1.iah.bbsrc.ac.uk>

Hi Hilmar

I tried to dig around in the code, but quite frankly I quickly got lost.
What is clear is that the existing reference is not being found in the
cache nor the database, and therefore a unique key violation occurs when
the code tries to insert the object.

I'm pretty stuffed on this project until I can get this sorted out.

If someone tells me where to look I can try and sort out why this
happens, but at the moment (for me) it's like looking for a needle in a
haystack.

Thanks in advance

Mick

-----Original Message-----
From: Hilmar Lapp [mailto:hlapp at gmx.net] 
Sent: 20 May 2009 16:10
To: michael watson (IAH-C)
Cc: Peter; biosql-l at lists.open-bio.org
Subject: Re: [BioSQL-l] load_seqdatabase.pl warnings and errors

Indeed changing the lookup will have no effect since deletion of  
bioentries doesn't cascade to references (only to bioentry-to- 
reference associations).

What I don't understand yet is how you get the CRC clash. Normally  
this kind of situation can happen if the first occurrence does not and  
the second does have PMID, by which it will be looked up, lookup fails  
(b/c the first occurrence didn't come with PMID), resulting in an  
insert of the erroneously deemed "new" reference, which then fails  
with a CRC clash.

However, there is no PMID nor any other identifier here, so I'll have  
to look into the code to find out why the second occurrence is either  
not looked up before an insert is attempted, or if it is looked up,  
why the lookup fails to find the record stored earlier.

	-hilmar

On May 20, 2009, at 7:25 AM, michael watson (IAH-C) wrote:

> We have a winner :)
>
> NC_003992, NC_011452, NC_011451, NC_011450 all share at least one  
> reference.
>
> Would changing --flatlookup to --lookup change the behaviour so it  
> checks for an existing reference before trying to insert the  
> duplicate?
>
> The answer is no :( (see below).
>
> I guess this may need some coding then!
>
> Thanks!
> Mick
>
> perl load_seqdatabase.pl --host localhost --dbname fmd_biosql -- 
> format genbank --dbuser removed --dbpass removed --lookup --remove  
> NC_003992.gbk
> Loading NC_003992.gbk ...
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed,  
> values were ("","Direct Submission","Submitted (12-AUG-2004)  
> National Center for Biotechnology Information, NIH, Bethesda, MD  
> 20894, USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs (<NULL>)
> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
> ---------------------------------------------------
> Could not store NC_003992:
> ------------- EXCEPTION  -------------
> MSG: create: object (Bio::Annotation::Reference) failed to insert or  
> to be found by unique key
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> BioSQL/BasePersistenceAdaptor.pm:206
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> BioSQL/BasePersistenceAdaptor.pm:251
> STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> Persistent/PersistentObject.pm:271
> STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children / 
> usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/ 
> Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> BioSQL/BasePersistenceAdaptor.pm:214
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> BioSQL/BasePersistenceAdaptor.pm:251
> STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> Persistent/PersistentObject.pm:271
> STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> BioSQL/SeqAdaptor.pm:224
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> BioSQL/BasePersistenceAdaptor.pm:214
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> BioSQL/BasePersistenceAdaptor.pm:251
> STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ 
> bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ 
> Persistent/PersistentObject.pm:271
> STACK (eval) load_seqdatabase.pl:622
> STACK toplevel load_seqdatabase.pl:604
>
> --------------------------------------
>
> at load_seqdatabase.pl line 635
>
> -----Original Message-----
> From: p.j.a.cock at googlemail.com [mailto:p.j.a.cock at googlemail.com]  
> On Behalf Of Peter
> Sent: 20 May 2009 11:59
> To: michael watson (IAH-C)
> Cc: Hilmar Lapp; biosql-l at lists.open-bio.org
> Subject: Re: [BioSQL-l] load_seqdatabase.pl warnings and errors
>
> On Wed, May 20, 2009 at 10:52 AM, michael watson (IAH-C)
> <michael.watson at bbsrc.ac.uk> wrote:
>>
>> Hi Guys
>>
>> Ok, the warnings were due to duplicate sequences - I had downloaded a
>> stream using Bio::DB::GenBank and I guess I assumed that would mean  
>> only
>> unique entries were sent back.  Using "--flatlookup --remove" gets  
>> rid
>> of the warnings.
>
> Great - easy :)
>
>> Now for NC_003992.gbk...
>>
>> To answer Hilmar's question:
>> ...
>> And when I run load_seqdatabase.pl on NC_003992.gbk alone I still  
>> get:
>>
>> perl load_seqdatabase.pl --host localhost --dbname fmd_biosql -- 
>> format
>> genbank --dbuser removed --dbpass removed --flatlookup --remove
>> NC_003992.gbk
>>
>> Loading NC_003992.gbk ...
>>
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed,  
>> values
>> were ("","Direct Submission","Submitted (12-AUG-2004) National Center
>> for Biotechnology Information, NIH, Bethesda, MD 20894,
>> USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs (<NULL>)
>> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
>> ---------------------------------------------------
>> Could not store NC_003992:
>> ------------- EXCEPTION  -------------
>> MSG: create: object (Bio::Annotation::Reference) failed to insert  
>> or to
>> be found by unique key
>> ...
>
> I would guess that the problem is this rather generic reference in
> NC_003992 may be repeated exactly in another genome (causing the CRC
> collision):
>
> CONSRTM   NCBI Genome Project
> TITLE     Direct Submission
> JOURNAL   Submitted (12-AUG-2004) National Center for Biotechnology
> Information, NIH, Bethesda, MD 20894, USA
>
> See http://www.ncbi.nlm.nih.gov/nuccore/NC_011452
>
> i.e. Could there be another direct submission by the NCBI on that date
> in your collection?  You could search the database looking for that
> CRC and trace it back to a bioentry, or just try grep for "JOURNAL
> Submitted (12-AUG-2004) National Center for Biotechnology" on your
> GenBank files. e.g. Something like this SQL statement might be
> interesting:
>
> SELECT bioentry.accession, reference.title FROM bioentry,
> bioentry_reference, reference WHERE
> bioentry.bioentry_id=bioentry_reference.bioentry_id AND
> bioentry_reference.reference_id=reference.reference_id AND
> reference.crc="CRC-E8D3CBBD80002FA1";
>
> Peter

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Thu May 14 18:20:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 May 2009 19:20:47 +0100
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
Message-ID: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>

Hi,

This is cross-posted between biopython-dev and biosql-l as it regards
parsing the description (DE) lines in SwissProt files and how they are
stored in BioSQL.  This follows from an earlier discussion on
biopython-dev.

Older SwissProt files just had one or two DE lines, and it made sense
to treat this as a simple string mapped onto the description field in
the bioentry table in BioSQL.  This appears to be what happens with
BioPerl 1.5.x and in Biopython (although the details regarding white
space differ).  However, newer SwissProt files have many DE lines with
additional structure.  The example Michiel gave earlier on the
biopython-dev list was:

http://www.uniprot.org/uniprot/Q9XHP0.txt

This has the following DE lines:

DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;

I had to fight with perl to get my old copy of BioPerl working again
(some weak reference thing), but I managed, and then loaded this file
into my test BioSQL database with:

$ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass
XXX --namespace biosql_test --format swiss Q9XHP0.txt

Then I looked at the resulting description in the main bioentry table:

$ mysql --user=root -p biosql_test -e 'SELECT description FROM
bioentry WHERE accession="Q9XHP0";'

This is stored as one huge long string (without the newlines, I'm not
sure if BioPerl strips those in parsing the file, or when loading it
into the database):

RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S
globulin seed storage protein II; AltName: Full=Alpha-globulin;
Contains: RecName: Full=11S globulin seed storage protein 2 acidic
chain; AltName: Full=11S globulin seed storage protein II acidic
chain; Contains: RecName: Full=11S globulin seed storage protein 2
basic chain; AltName: Full=11S globulin seed storage protein II basic
chain; Flags: Precursor;

For Biopython, I emptied the database then did:

>>> from Bio import SeqIO
>>> from BioSQL import BioSeqDatabase
>>> server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd = "XXX", host = "localhost", db="biosql_test")
>>> db = server["biosql-test"] #namespace
>>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss"))
1
>>> server.commit()

As before, I looked in the table with mysql.  Again - this stores the
full description from the DE line, although with the newlines
embedded.  So, Biopython is consistent with my old copy of BioPerl
(1.5.x) if we ignore the white space.

However, how does this look in BioPerl 1.6?  If this is the same, are
there any plans to change this?  For Biopython we have discussed
recording most of the DE information under the annotations instead
(keyed off RecName, AltName, Contains, Flags), but I would like to be
consistent with BioPerl+BioSQL.
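
To make the "keyed off RecName, AltName, Contains, Flags" idea a bit
more concrete, here is a minimal Python sketch that turns the DE lines
from the Q9XHP0 example into a nested dictionary. It is not Biopython's
parser: the function name and output layout are assumptions of mine,
and it only handles the layout shown above (continuation lines without
a "Key:" prefix are simply skipped).

```python
EXAMPLE_DE_LINES = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=11S globulin seed storage protein II;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II acidic chain;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II basic chain;",
    "DE   Flags: Precursor;",
]


def parse_de_lines(lines):
    """Group DE lines by key, nesting each Contains/Includes block."""
    top = {}
    current = None  # the Contains/Includes block being filled, if any
    for line in lines:
        if not line.startswith("DE"):
            continue
        body = line[2:]
        indent = len(body) - len(body.lstrip(" "))
        text = body.strip().rstrip(";")
        if text in ("Contains:", "Includes:"):
            current = {}
            top.setdefault(text[:-1], []).append(current)
            continue
        key, sep, value = text.partition(":")
        if not sep:
            continue  # no "Key:" prefix; ignored in this sketch
        value = value.strip()
        if value.startswith("Full="):
            value = value[len("Full="):]
        # Lines indented deeper than the usual three spaces belong to the
        # most recent Contains/Includes block; everything else is top level.
        target = current if (indent > 3 and current is not None) else top
        target.setdefault(key, []).append(value)
    return top
```

A structure like this could go under the annotations while the plain
description string stays in bioentry.description, which would keep the
BioSQL side compatible with what BioPerl stores today.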

Thanks

Peter


From biopython at maubp.freeserve.co.uk  Sat May 16 11:53:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 12:53:07 +0100
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
Message-ID: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>

Hi all,

You may recall a year ago or so, we talked about how BioPerl and
Biopython used lower case alphabet names ("dna", "rna", "protein")
while BioJava was inconsistent and used upper (or even mixed case).

http://lists.open-bio.org/pipermail/biopython/2007-November/003894.html
http://lists.open-bio.org/pipermail/biojava-l/2007-November/006034.html
http://lists.open-bio.org/pipermail/biosql-l/2008-March/001185.html

You'll notice that thread was split over several mailing lists (and
looking back, I think I missed some posts as I only read the Biopython
and BioSQL lists).

Anyway, this led to the following proposal:

http://www.biosql.org/wiki/Enhancement_Requests#Check_constraint_on_biosequence.alphabet

In Biopython we also use "unknown" for sequences which are not known
to be "dna", "rna", "protein".  I presume this was copying BioPerl.

In a recent bug report (Bug 2829) it was pointed out that we
(Biopython) don't attempt to record nucleotide alphabets in BioSQL
(i.e. a sequence which could be DNA or RNA but we don't know which),
they just get "unknown" as their biosequence.alphabet entry.

Is there any precedent in BioPerl, BioJava or BioRuby for how to
handle this?  If not, I'd like to introduce and agree on "nucleotide"
for this situation.
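
As a sketch of what agreeing on the names would mean in client code,
here is a small Python helper that normalises an alphabet name to lower
case and checks it against the allowed values. The function and
constant names are hypothetical, and "nucleotide" is only accepted when
explicitly enabled since it is a proposal at this point, not an agreed
value.

```python
KNOWN_ALPHABETS = ("dna", "rna", "protein", "unknown")
PROPOSED_ALPHABETS = ("nucleotide",)  # under discussion, not yet agreed


def normalise_alphabet(name, allow_proposed=False):
    """Return the lower-case alphabet name, or raise ValueError if the
    name is not one of the allowed values."""
    allowed = KNOWN_ALPHABETS
    if allow_proposed:
        allowed = allowed + PROPOSED_ALPHABETS
    value = name.strip().lower()
    if value not in allowed:
        raise ValueError("unexpected alphabet name: %r" % name)
    return value
```

This mirrors what a check constraint on biosequence.alphabet would
enforce on the database side.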

Peter


From biopython at maubp.freeserve.co.uk  Sat May 16 12:12:01 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 13:12:01 +0100
Subject: [BioSQL-l] BioSQL at BOSC 2009?
Message-ID: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>

Hi,

Will any of the key BioSQL people from the Bio* projects be at BOSC
(and ISMB) this year? http://open-bio.org/wiki/BOSC_2009

There will be several people from Biopython there this year, including
me and Brad Chapman who are both familiar with BioSQL.  This would be
a nice opportunity for further improving BioSQL compatibility between
the Bio* projects - something that has been suggested in the past,
e.g.

http://lists.open-bio.org/pipermail/biopython/2007-November/003893.html
http://lists.open-bio.org/pipermail/biojava-l/2007-November/006037.html

I don't follow the BioPerl, BioJava or BioRuby mailing lists - and I
doubt many of their developers follow the Biopython mailing lists.
So, rather than having any BioSQL compatibility discussions split over
individual Bio* project specific mailing lists, it seems using the
BioSQL mailing list is most appropriate.

I have CC'd a few key people just in case they are not on the BioSQL
mailing list, if I have missed anyone please forward this to them and
ask them to sign up.

Thanks,

Peter


From markjschreiber at gmail.com  Sat May 16 14:58:19 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sat, 16 May 2009 22:58:19 +0800
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <93b45ca50905160755o4e5c9520n55bc5b84774f277a@mail.gmail.com>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
	<93b45ca50905160755o4e5c9520n55bc5b84774f277a@mail.gmail.com>
Message-ID: <93b45ca50905160758j7c9f1d78k9ec49008d10f2e4f@mail.gmail.com>

I don't think you can do this with certainty. If you don't know the source
alphabet then an amino acid sequence could look like dna if it is only using
acgt and some of the ambiguity codes.

If it is a long sequence it will become increasingly unlikely it is amino
acid but never certain.


From hlapp at gmx.net  Sat May 16 15:17:39 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 11:17:39 -0400
Subject: [BioSQL-l] BioSQL at BOSC 2009?
In-Reply-To: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>
References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>
Message-ID: <1BD503B3-D805-4882-87DD-820138792DB2@gmx.net>


On May 16, 2009, at 8:12 AM, Peter wrote:

> Will any of the key BioSQL people from the Bio* projects be at BOSC
> (and ISMB) this year? http://open-bio.org/wiki/BOSC_2009

Yes, I'll be there (though I am not presenting this year).

> [...] This would be a nice opportunity for further improving BioSQL  
> compatibility between the Bio* projects - something that has been  
> suggested in the past,

Indeed, excellent idea. Should we plan for a BoF?

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sat May 16 16:48:40 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 12:48:40 -0400
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
Message-ID: <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>


On May 16, 2009, at 7:53 AM, Peter wrote:

> In a recent bug report (Bug 2829) it was pointed out that we
> (Biopython) don't attempt to record nucleotide alphabets in BioSQL
> (i.e. a sequence which could be DNA or RNA but we don't know which),
> they just get "unknown" as their biosequence.alphabet entry.

I'm assuming that you do know that it's not protein, right? I.e.,  
assigning alphabet "unknown" isn't exactly right.

> Is there any precedent in BioPerl, BioJava or BioRuby for how to
> handle this?  If not, I'd like to introduce and agree on "nucleotide"
> for this situation.


So which letters (symbols) does the "nucleotide" alphabet contain?

Getting back to Mark's question, how do you know that it's either dna  
or rna but not protein? Is the problem that the user can't tell you  
whether it's dna or rna but they know it's not protein, or is it that  
the user doesn't say anything and all you have is the symbols of the  
sequence, which are a, c, g, and t only?

In BioPerl we'll guess the alphabet if the user doesn't say what it  
is, and at present if what we're seeing are the symbols a, c, g, and t  
only, then the guess is dna. If we're seeing u rather than t, we guess  
it's rna. An "unknown" alphabet would be for the user to expressly  
choose.
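
A rough sketch of the kind of guessing described above, written in Python
for illustration.  This is not BioPerl's actual code; the function name and
the handling of gaps and "n" are assumptions made for the example.

```python
def guess_alphabet(seq):
    """Guess "dna", "rna" or "protein" from the letters alone.

    Returns None for an empty sequence. Gap characters are ignored and
    "n" is tolerated in nucleotide sequences (both are assumptions).
    """
    letters = set(seq.lower()) - set("-.* ")
    if not letters:
        return None
    if letters <= set("acgtn"):
        # Only a, c, g, t (or a subset such as a, c, g): guess DNA.
        return "dna"
    if letters <= set("acgun"):
        # u present instead of t: guess RNA.
        return "rna"
    return "protein"
```

Note that a sequence with neither t nor u, such as GCGCGCGA, is guessed as
"dna" by this rule, and a short peptide made up only of A, C, G and T
residues would be guessed as "dna" as well.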

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Sat May 16 20:25:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 21:25:21 +0100
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
	<48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>
Message-ID: <320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com>

Hilmar wrote:
>  I'm assuming that you do know that it's not protein, right?
>  I.e., assigning alphabet "unknown" isn't exactly right.

Yes, if the sequence is using the generic nucleotide alphabet this
means it is NOT protein, and could be DNA or RNA.  So yes,
downgrading a "nucleotide" alphabet to just "unknown" when
storing it in BioSQL (as we do now) is losing information - hence
me starting this thread.

> > Is there any precedent in BioPerl, BioJava or BioRuby for how to
> > handle this?  If not, I'd like to introduce and agree on "nucleotide"
> > for this situation.
>
>  So which letters (symbols) does the "nucleotide" alphabet contain?

Potentially anything - although I would expect the standard (ambiguous)
letters used in RNA or DNA, plus perhaps gap symbols.

> Getting back to Mark's question, how do you know that it's either dna or
> rna but not protein?

We know because the user (or parser) has explicitly used the generic
nucleotide alphabet, this means it is not protein, and is either
DNA or RNA. From the point of loading the sequence into BioSQL,
we don't know or care where the sequence came from - we just get
given the data with a declared alphabet.

> Is the problem that the user can't tell you whether it's dna or
> rna but they know it's not protein, or is it that the user doesn't
> say anything and all you have is the symbols of the sequence,
> which are a, c, g, and t only.

In the situation I'm talking about, either the user has explicitly
picked the alphabet, or perhaps one of our parsers has done so.
This would be because the user doesn't know, or the file format
doesn't specify this information.  This is admittedly a corner
case - generally there will be either T or U entries in the
sequence so DNA or RNA can be deduced unambiguously.

> In BioPerl we'll guess the alphabet if the user doesn't say what it is, and
> at present if what we're seeing are the symbols a, c, g, and t only, then
> the guess is dna. If we're seeing u rather than t, we guess it's rna. An
> "unknown" alphabet would be for the user to expressly choose.

What would BioPerl do with the nucleotide sequence GCGCGCGA?
Presumably you guess, thus record either "dna" or "rna" in BioSQL,
so the issue of wanting to record "nucleotide" never arises.

In python "guessing" is discouraged.  If we have a nucleotide sequence
like GCGCGCGA, this could be DNA or RNA - you can't tell.  Our
nucleotide alphabet covers this situation, although another strong
reason for having it is as a common base class for the RNA and
DNA alphabets.

On 5/16/09, Mark Schreiber <markjschreiber at gmail.com> wrote:
> I don't think you can do this with certainty. If you don't know the source
> alphabet then an amino acid sequence could look like dna if it is only
> using acgt and some of the ambiguity codes.
>
> If it is a long sequence it will become increasingly unlikely it is amino
> acid but never certain.

The python answer is don't guess. If you read in a FASTA file with
Biopython it will by default be given a generic alphabet, unless you
explicitly specify otherwise (and in BioSQL the alphabet will be
stored as "unknown").  i.e. the onus is on the user to be explicit.

Peter


From biopython at maubp.freeserve.co.uk  Sat May 16 21:23:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 22:23:04 +0100
Subject: [BioSQL-l] BioSQL at BOSC 2009?
In-Reply-To: <1BD503B3-D805-4882-87DD-820138792DB2@gmx.net>
References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>
	<1BD503B3-D805-4882-87DD-820138792DB2@gmx.net>
Message-ID: <320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com>

On 5/16/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
>  On May 16, 2009, at 8:12 AM, Peter wrote:
>
> > Will any of the key BioSQL people from the Bio* projects be at BOSC
> > (and ISMB) this year? http://open-bio.org/wiki/BOSC_2009
> >
>
>  Yes, I'll be there (though I am not presenting this year).
>
> > [...] This would be a nice opportunity for further improving BioSQL
> > compatibility between the Bio* projects - something that has been
> > suggested in the past,
>
>  Indeed, excellent idea. Should we plan for a BoF?

If you want to do this as a formal BoF, then sure.

Brad and I (plus other Biopython folk like Tiago and Bartek, who I
believe are not so interested in BioSQL) are already talking about a
Biopython BoF/hackathon session at BOSC. It would be easier if that
didn't overlap with a BioSQL session ;)  (but not impossible - Brad
and I can perhaps split our time?)

I will be staying for all of ISMB, and I think Brad is about for the
Monday and maybe Tuesday, so that might be an alternative for
scheduling.

Peter


From hlapp at gmx.net  Sat May 16 21:57:15 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 17:57:15 -0400
Subject: [BioSQL-l] BioSQL at BOSC 2009?
In-Reply-To: <320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com>
References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>
	<1BD503B3-D805-4882-87DD-820138792DB2@gmx.net>
	<320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com>
Message-ID: <74D4CC78-FC7B-4595-9D24-EB6B3ED43318@gmx.net>


On May 16, 2009, at 5:23 PM, Peter wrote:

> I will be staying for all of ISMB


I am too. Should we doodle something once the program is out?

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sat May 16 22:10:43 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 18:10:43 -0400
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
	<48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>
	<320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com>
Message-ID: <9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net>

I think we'll have to define carefully what we mean by "generic  
nucleotide alphabet". (Normally I hear nucleotide used as the type of  
a sequence, but not its alphabet.)

A nucleotide alphabet in the way you describe it also can't really be  
the "base class" for either a DNA or RNA alphabet, can it? Typically  
in OOP, derived classes expand on a base class, not restrict it. So  
isn't there potential for confusion?

What you are essentially talking about is the case when a sequence  
contains only A, C, and G. In that case, we don't know either that  
it's not protein, do we?

> [...] In python "guessing" is discouraged.  If we have a nucleotide
> sequence like GCGCGCGA, this could be DNA or RNA - you can't tell.

And how do you tell it's nucleotide to begin with?

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sat May 16 22:34:57 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 18:34:57 -0400
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
Message-ID: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>

Don't you love SwissProt (or UniProt as we must call it now I  
suppose). They (understandably) try to squeeze ever more annotation  
into the existing tags, rather than adding new tags.

So, of the following structure:

DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;

really only the first line, with the 'RecName: Full=' removed, is the  
description line as we know it. The rest, I would say, is annotation,  
such as two alternative names, amino acid chains contained in the full  
record (shouldn't this be feature annotation, really? and indeed it is  
- why it needs to be repeated here is beyond me) and their names as  
well as alternative names, and the fact that the sequence is a  
precursor form.

Leaving all this in one string has the advantage that we can round- 
trip it (and there is probably hardly any other way to accomplish  
that), but clearly in terms of semantics this isn't the sequence  
description as we know it anymore.

Does anyone else think too that completely changing the semantics of  
sequence annotation fields is a bad idea? <sigh/>

My inclination from a BioPerl perspective is to extract the part  
following 'RecName: Full=' as the description, and attach the rest as  
annotation. We could in fact use the TagTree class for this. I'm cross- 
posting to BioPerl too to gather what other BioPerl'ers think about  
this.
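
As an illustration of that split, here is a minimal Python sketch that takes
the text after the first top-level 'RecName: Full=' as the description and
keeps everything else as flat (key, value) annotation pairs.  The function
name, the "Contains[n]/" key prefix and the reliance on indentation to detect
nested lines are assumptions made for this example, not an existing BioPerl
or Biopython API.

```python
EXAMPLE_DE = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=11S globulin seed storage protein II;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II acidic chain;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II basic chain;",
    "DE   Flags: Precursor;",
]


def split_de_lines(de_lines):
    """Return (description, annotations) from structured DE lines."""
    description = None
    annotations = []   # flat list of (key, value) pairs
    block_name = None  # name of the latest block header, e.g. "Contains"
    block_count = 0    # how many block headers have been seen so far
    for line in de_lines:
        body = line[2:] if line.startswith("DE") else line
        indent = len(body) - len(body.lstrip(" "))
        text = body.strip().rstrip(";")
        if not text:
            continue
        if text.endswith(":"):
            # A block header such as "Contains:" with nothing after it.
            block_name = text[:-1]
            block_count += 1
            continue
        key, _, value = text.partition(": ")
        if indent > 3 and block_name is not None:
            # Deeper indentation: this line belongs to the latest block.
            key = "%s[%d]/%s" % (block_name, block_count, key)
        if description is None and key == "RecName" and value.startswith("Full="):
            description = value[len("Full="):]
        else:
            annotations.append((key, value))
    return description, annotations
```

For the example entry this gives the description "11S globulin seed storage
protein 2" and seven annotation pairs, the last being ("Flags", "Precursor").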

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Sat May 16 23:06:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 00:06:41 +0100
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
	<48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>
	<320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com>
	<9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net>
Message-ID: <320fb6e00905161606l5fdb0862mf25a45dad07dac8@mail.gmail.com>

On 5/16/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
>  I think we'll have to define carefully what we mean by "generic nucleotide
> alphabet". (Normally I hear nucleotide used as the type of a sequence, but
> not its alphabet.)

In Biopython the type of a sequence (e.g. DNA, RNA or Protein) is
recorded by an alphabet object (which may also record the expected
range of letters).

>  A nucleotide alphabet in the way you describe it also can't really be the
> "base class" for either a DNA or RNA alphabet, can it? Typically in OOP,
> derived classes expand on a base class, not restrict it. So isn't there
> potential for confusion?

Well, that's how it was done for the Biopython alphabet classes.
I'm simplifying slightly, but at the top level we have a generic
alphabet, which has as children generic protein and generic
nucleotide (which has as its children generic dna and generic
rna).  Each of these then has IUPAC subclasses which are further
restrictions where the valid letters are prescribed.
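
A simplified sketch of that hierarchy, together with one way of mapping an
alphabet object onto the biosequence.alphabet string.  The classes below are
self-contained stand-ins written for this example rather than imports from
Biopython, and the allow_nucleotide switch is hypothetical: it only shows
where a "nucleotide" value would fit if it were agreed.

```python
class Alphabet(object):
    """Generic alphabet: no statement about sequence type or letters."""
    letters = None


class ProteinAlphabet(Alphabet):
    pass


class NucleotideAlphabet(Alphabet):
    """Not protein, but could be DNA or RNA."""


class DNAAlphabet(NucleotideAlphabet):
    pass


class RNAAlphabet(NucleotideAlphabet):
    pass


class UnambiguousDNA(DNAAlphabet):
    """Example of a further restriction with a fixed set of letters."""
    letters = "GATC"


def biosql_alphabet(alphabet, allow_nucleotide=False):
    """Map an alphabet object to a biosequence.alphabet string."""
    if isinstance(alphabet, ProteinAlphabet):
        return "protein"
    if isinstance(alphabet, DNAAlphabet):
        return "dna"
    if isinstance(alphabet, RNAAlphabet):
        return "rna"
    if allow_nucleotide and isinstance(alphabet, NucleotideAlphabet):
        return "nucleotide"
    # Generic alphabets, and generic nucleotide when not allowed above.
    return "unknown"
```

Here each subclass is a more specific sequence type than its parent, so an
isinstance check against the parent still holds for the child.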

> What you are essentially talking about is the case when a sequence
> contains only A, C, and G. In that case, we don't know either that
> it's not protein, do we?
>
> > [...] In python "guessing" is discouraged.  If we have a nucleotide
> > sequence like GCGCGCGA, this could be DNA or RNA - you can't
> > tell.
>
> And how do you tell it's nucleotide to begin with?

That is the whole point.  When deciding what to record in the
biosequence.alphabet field in BioSQL we (Biopython) can only
go by the alphabet associated with the sequence object.
Whoever created the sequence specified the alphabet based
on meta data, external knowledge, or guessed. If this was
done by a parser, then the file format itself may have
specified the sequence type.

If none of BioPerl, BioJava and BioRuby have an analogous
sequence representation for a nucleotide sequence which
might be DNA or RNA, then perhaps the current situation
with only "protein", "dna", "rna" and "unknown" in the
biosequence.alphabet field in BioSQL is sufficient.

Peter


From biopython at maubp.freeserve.co.uk  Sat May 16 23:14:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 00:14:54 +0100
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
Message-ID: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com>

On 5/16/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> Don't you love SwissProt (or UniProt as we must call it now I suppose).
> They (understandably) try to squeeze ever more annotation into the existing
> tags, rather than adding new tags.
>
>  So, of the following structure:
>
>  DE   RecName: Full=11S globulin seed storage protein 2;
>  DE   AltName: Full=11S globulin seed storage protein II;
>  DE   AltName: Full=Alpha-globulin;
>  DE   Contains:
>  DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
>  DE     AltName: Full=11S globulin seed storage protein II acidic chain;
>  DE   Contains:
>  DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
>  DE     AltName: Full=11S globulin seed storage protein II basic chain;
>  DE   Flags: Precursor;
>
>  really only the first line, with the 'RecName: Full=' removed, is the
> description line as we know it. The rest, I would say, is annotation, such
> as two alternative names, amino acid chains contained in the full record
> (shouldn't this be feature annotation, really? and indeed it is - why it
> needs to be repeated here is beyond me) and their names as well as
> alternative names, and the fact that the sequence is a precursor form.
>
>  Leaving all this in one string has the advantage that we can round-trip it
> (and there is probably hardly any other way to accomplish that), but clearly
> in terms of semantics this isn't the sequence description as we know it
> anymore.
>
>  Does anyone else think too that completely changing the semantics of
> sequence annotation fields is a bad idea? <sigh/>

+1
That's pretty much what I thought on seeing this the first time.

>  My inclination from a BioPerl perspective is to extract the part following
> 'RecName: Full=' as the description, and attach the rest as annotation. We
> could in fact use the TagTree class for this. I'm cross-posting to BioPerl
> too to gather what other BioPerl'ers think about this.

Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x, just
treats the DE lines as one big long string?

Could you translate your idea about the TagTree class into something
concrete with BioSQL tables and fields for me? I'm not familiar with
the TagTree (or Perl).

Over on the Biopython list we'd talked about storing this annotation in
a nested structure.  However, in order to use the BioSQL annotations
mechanisms, I think a simple flat structure is required :(

Peter


From cjfields at illinois.edu  Sat May 16 23:16:05 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Sat, 16 May 2009 18:16:05 -0500
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
Message-ID: <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>


On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote:

> Don't you love SwissProt (or UniProt as we must call it now I  
> suppose). They (understandably) try to squeeze ever more annotation  
> into the existing tags, rather than adding new tags.
>
> So, of the following structure:
>
> DE   RecName: Full=11S globulin seed storage protein 2;
> DE   AltName: Full=11S globulin seed storage protein II;
> DE   AltName: Full=Alpha-globulin;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
> DE     AltName: Full=11S globulin seed storage protein II acidic chain;
> DE   Contains:
> DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
> DE     AltName: Full=11S globulin seed storage protein II basic chain;
> DE   Flags: Precursor;
>
> really only the first line, with the 'RecName: Full=' removed, is  
> the description line as we know it. The rest, I would say, is  
> annotation, such as two alternative names, amino acid chains  
> contained in the full record (shouldn't this be feature annotation,  
> really? and indeed it is - why it needs to be repeated here is  
> beyond me) and their names as well as alternative names, and the  
> fact that the sequence is a precursor form.
>
> Leaving all this in one string has the advantage that we can round- 
> trip it (and there is probably hardly any other way to accomplish  
> that), but clearly in terms of semantics this isn't the sequence  
> description as we know it anymore.
>
> Does anyone else think too that completely changing the semantics of  
> sequence annotation fields is a bad idea? <sigh/>
>
> My inclination from a BioPerl perspective is to extract the part  
> following 'RecName: Full=' as the description, and attach the rest  
> as annotation. We could in fact use the TagTree class for this. I'm  
> cross-posting to BioPerl too to gather what other BioPerl'ers think  
> about this.
>
> 	-hilmar

This is much like the GN issues we've run into before, and we *could*  
set this up using TagTree or similar.  In the latter case of gene name  
the data is stored in a text tree as follows:

gene_names:
   gene_name:
     Name: GC1QBP
     Synonyms: HABP1
     Synonyms: SF2P32
     Synonyms: C1QBP

That could be changed to an XML string:

<?xml version="1.0" encoding="UTF-8"?>
<gene_names>
   <gene_name>
     <Name>GC1QBP</Name>
     <Synonyms>HABP1</Synonyms>
     <Synonyms>SF2P32</Synonyms>
     <Synonyms>C1QBP</Synonyms>
   </gene_name>
</gene_names>

Thinking about this we should attempt to coalesce around a standard  
instead of forcing the other Bio*  to a specific format.

chris


From biopython at maubp.freeserve.co.uk  Sat May 16 23:28:43 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 00:28:43 +0100
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
Message-ID: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>

On 5/17/09, Chris Fields <cjfields at illinois.edu> wrote:
>
> On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote:
> > My inclination from a BioPerl perspective is to extract the part following
> > 'RecName: Full=' as the description, and attach the rest as annotation. We
> > could in fact use the TagTree class for this. I'm cross-posting to BioPerl
> > too to gather what other BioPerl'ers think about this.
> >
> >        -hilmar
> >
>
> This is much like the GN issues we've run into before, and we *could* set
> this up using TagTree or similar.  In the latter case of gene name the data
> is stored in a text tree as follows:
>
>  gene_names:
>   gene_name:
>     Name: GC1QBP
>     Synonyms: HABP1
>     Synonyms: SF2P32
>     Synonyms: C1QBP
>
>  That could be changed to an XML string:
>
>  <?xml version="1.0" encoding="UTF-8"?>
>  <gene_names>
>   <gene_name>
>     <Name>GC1QBP</Name>
>     <Synonyms>HABP1</Synonyms>
>     <Synonyms>SF2P32</Synonyms>
>     <Synonyms>C1QBP</Synonyms>
>   </gene_name>
>  </gene_names>
>
> Thinking about this we should attempt to coalesce around a standard instead
> of forcing the other Bio*  to a specific format.

How would you record this in BioSQL?  As an XML string for an annotation value?

Brad has suggested JSON might be useful for this kind of thing (see
also per-letter-annotation discussion).
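
For comparison, a minimal Python sketch of what a JSON annotation value could
look like for the gene name example, using only the standard library.  The
layout of the dictionary and the two helper names are assumptions for
illustration; nothing here is an agreed format.

```python
import json

# The gene name example from this thread as nested Python data.
GENE_NAMES = {
    "gene_names": [
        {"Name": "GC1QBP", "Synonyms": ["HABP1", "SF2P32", "C1QBP"]},
    ]
}


def to_annotation_value(data):
    """Serialize nested data to a compact, key-sorted JSON string."""
    return json.dumps(data, sort_keys=True, separators=(",", ":"))


def from_annotation_value(text):
    """Rebuild the nested data from the stored string."""
    return json.loads(text)
```

The resulting string could be stored as a single annotation value and parsed
back into the same nested structure when the record is retrieved.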

Peter


From hlapp at gmx.net  Sat May 16 23:37:14 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 19:37:14 -0400
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
Message-ID: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>


On May 16, 2009, at 7:28 PM, Peter wrote:

>> That could be changed to an XML string:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <gene_names>
>>  <gene_name>
>>    <Name>GC1QBP</Name>
>>    <Synonyms>HABP1</Synonyms>
>>    <Synonyms>SF2P32</Synonyms>
>>    <Synonyms>C1QBP</Synonyms>
>>  </gene_name>
>> </gene_names>
>>
>> Thinking about this we should attempt to coalesce around a standard
>> instead of forcing the other Bio* to a specific format.
>
> How would you record this in BioSQL?  As an XML string for an  
> annotation value?

Yes. A TagTree object can be serialized to XML, and the XML can be  
stored as the annotation value in BioSQL. As the XML can be read back  
in, it allows full round-tripping.
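
The round trip can be sketched in Python with the standard library's
xml.etree.ElementTree.  This is not TagTree itself and makes no claim about
the exact XML TagTree emits; the (tag, value) tuple representation and the
function names are assumptions for the example.

```python
import xml.etree.ElementTree as ET

# A node is (tag, value), where value is a string or a list of child nodes.
GENE_NAMES = ("gene_names", [
    ("gene_name", [
        ("Name", "GC1QBP"),
        ("Synonyms", "HABP1"),
        ("Synonyms", "SF2P32"),
        ("Synonyms", "C1QBP"),
    ]),
])


def build(node):
    """Turn a (tag, value) node into an Element, recursing into children."""
    tag, value = node
    elem = ET.Element(tag)
    if isinstance(value, list):
        for child in value:
            elem.append(build(child))
    else:
        elem.text = value
    return elem


def unbuild(elem):
    """Turn an Element back into a (tag, value) node."""
    children = list(elem)
    if children:
        return (elem.tag, [unbuild(child) for child in children])
    return (elem.tag, elem.text or "")


def serialize(node):
    """Nested node -> XML string suitable for storing as one value."""
    return ET.tostring(build(node), encoding="unicode")


def deserialize(text):
    """XML string -> nested node."""
    return unbuild(ET.fromstring(text))
```

One limitation of this simple version is that a node with an empty list of
children comes back as a node with an empty string value.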

> Brad has suggested JSON might be useful for this kind of thing (see
> also per-letter-annotation discussion).

JSON could be another serialization format, but XML is equally or  
better supported in all languages except JavaScript. Furthermore, you  
could just send the XML to the browser and have an XSLT (either  
directly, or indirectly through JavaScript doing the transformation)  
do the rendering.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sat May 16 23:42:17 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 19:42:17 -0400
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com>
Message-ID: <8CD4EED1-A689-447F-8F6E-8D2204DD4E86@gmx.net>


On May 16, 2009, at 7:14 PM, Peter wrote:

> Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x
> just treats the DE lines as one big long string?

Yes.

> Could you translate your idea about the TagTree class into something
> concrete with BioSQL tables and fields for me? [...] Over on the
> Biopython list we'd talked about storing this annotation in a nested
> structure.

That's more or less what TagTree is.

>  However, in order to use the BioSQL annotations mechanisms, I think  
> a simple flat structure is required :(

Not necessarily. If you have a flat serialization (such as XML) the  
nested structure isn't needed. Of course that's not a fully normalized  
relational representation, but if you had one, how often would it be  
used, how efficient would those queries be (SQL is poor at nested or  
recursive data structures), and how much pain would it be to write the  
object-relational mappings?

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Sun May 17 12:40:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 13:40:47 +0100
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
	<0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
Message-ID: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>

On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>
>  On May 16, 2009, at 7:28 PM, Peter wrote:
> > > That could be changed to an XML string:
> > >
> > > <?xml version="1.0" encoding="UTF-8"?>
> > > <gene_names>
> > >  <gene_name>
> > >   <Name>GC1QBP</Name>
> > >   <Synonyms>HABP1</Synonyms>
> > >   <Synonyms>SF2P32</Synonyms>
> > >   <Synonyms>C1QBP</Synonyms>
> > >  </gene_name>
> > > </gene_names>
> > >
> > > Thinking about this we should attempt to coalesce around a standard
> > > instead of forcing the other Bio*  to a specific format.

Absolutely - some common standard should be agreed.

Would you envision doing this for other structured fields, inventing a
new mini XML format each time?  That seems open ended and likely to
cause a lot of work keeping all the Bio* projects synchronised.

Here you have mapped RecName and AltName fields in the DE lines to
Name and Synonyms (shouldn't that be Synonym singular?).  I also don't
get why you have used a gene_name entry inside a gene_names list.
Would you hold the contains information and the flags information from
the DE lines in separate XML entries?

I would have gone for something much closer to the original DE line
markup i.e. using the field names UniProt use, RecName and AltName,
rather than mapping these to Name and Synonym.

> > How would you record this in BioSQL?  As an XML string for an annotation
> > value?
>
> Yes. A TagTree object can be serialized to XML, and the XML can be stored
> as the annotation value in BioSQL. As the XML can be read back in, it allows
> full round-tripping.

Assuming you stored all the DE markup, then yes, a round trip back to
the SwissProt file could be possible.  And, depending on the details
of the XML structure used, it would be possible to represent this in a
python structure too.

> > Brad has suggested JSON might be useful for this kind of thing (see
> > also per-letter-annotation discussion).
>
> JSON could be another serialization format, but XML is equally or better
> supported in all languages except JavaScript. Furthermore, you could just
> send the XML to the browser and have an XSLT (either directly, or indirectly
> through JavaScript doing the transformation) do the rendering.

I have no strong preference for either XML or JSON (but would rather
avoid them if they are not really needed).  For other types of
annotation there may be a clearer advantage for one over the other,
e.g. per letter annotation like the secondary structure of a protein
sequence, or the quality scores of a nucleotide contig.

On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
> Not necessarily. If you have a flat serialization (such as XML) the nested
> structure isn't needed. Of course that's not a fully normalized relational
> representation, but if you had one, how often would it be used, how
> efficient would those queries be (SQL is poor at nested or recursive data
> structures), and how much pain would it be to write the object-relational
> mappings?

In this example, searching the database using one of the SwissProt
AltNames (synonyms), or filtering on the Flags sounds like a
reasonable request - but this would be very difficult if the data is
stored inside XML strings.

Of course, because the RecName and AltName entries are top level, we
could just record them as normal - simple strings in the annotations
table.  This seems much nicer.  Likewise the "Flags: Precursor;" line,
i.e. listing the tag/value pairs which could be used in the
bioentry_qualifier_value table:

AltName = "Full=11S globulin seed storage protein II"
AltName = "Full=Alpha-globulin"
Flags = "Precursor"

(the RecName field, "Full=11S globulin seed storage protein 2", could
be used for the bioentry.description instead)
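
A rough Python sketch of pulling out those top-level tag/value pairs.
The function name top_level_de_pairs is hypothetical, and the parsing
assumes the five-character "DE   " prefix with nested lines indented by
two further spaces, as in the Q9XHP0 example:

```python
DE_LINES = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=11S globulin seed storage protein II;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II acidic chain;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II basic chain;",
    "DE   Flags: Precursor;",
]


def top_level_de_pairs(de_lines):
    """Return (tag, value) pairs for the top-level DE lines only."""
    pairs = []
    for line in de_lines:
        body = line[5:]  # drop the "DE" line code and its padding
        if body.startswith(" "):
            continue  # indented line: part of a Contains block
        tag, _, value = body.partition(":")
        value = value.strip().rstrip(";")
        if value:  # a bare "Contains:" line carries no value
            pairs.append((tag, value))
    return pairs


pairs = top_level_de_pairs(DE_LINES)
# First RecName goes to bioentry.description, the rest become annotations.
description = next(value for tag, value in pairs if tag == "RecName")
annotations = [(tag, value) for tag, value in pairs if tag != "RecName"]
```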

The above are all pretty easy.  We only need to consider nesting (or
something like XML or JSON) for some of the DE information, in the
example discussed the Contains lines.  Even this could be done by
storing each contains entry as a single long string (holding both
the name and synonyms) directly from the DE line itself, something
like this:

Contains = "RecName: Full=11S globulin seed storage protein 2 acidic
chain;\nAltName: Full=11S globulin seed storage protein II acidic
chain;"
Contains = "RecName: Full=11S globulin seed storage protein 2 basic
chain;\nAltName: Full=11S globulin seed storage protein II basic
chain;"
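
A sketch of how those Contains strings could be collected in Python.
The helper name contains_values is hypothetical, and it assumes nested
lines are simply the indented lines following a "Contains:" line:

```python
DE_LINES = [
    "DE   RecName: Full=11S globulin seed storage protein 2;",
    "DE   AltName: Full=11S globulin seed storage protein II;",
    "DE   AltName: Full=Alpha-globulin;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II acidic chain;",
    "DE   Contains:",
    "DE     RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "DE     AltName: Full=11S globulin seed storage protein II basic chain;",
    "DE   Flags: Precursor;",
]


def contains_values(de_lines):
    """Return one newline-joined string per Contains block."""
    blocks = []
    current = None  # lines of the Contains block being read, if any
    for line in de_lines:
        body = line[5:]  # drop the "DE" line code and its padding
        if body.startswith("Contains:"):
            current = []
            blocks.append(current)
        elif body.startswith(" ") and current is not None:
            current.append(body.strip())
        else:
            current = None  # back at top level, e.g. the Flags line
    return ["\n".join(block) for block in blocks]


values = contains_values(DE_LINES)
```

Each element of values could then be stored as one "Contains" annotation
value.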

Peter


From sanjay.harke at gmail.com  Sun May 17 13:17:14 2009
From: sanjay.harke at gmail.com (Sanjay Harke)
Date: Sun, 17 May 2009 18:47:14 +0530
Subject: [BioSQL-l] BioSQL-l Digest, Vol 62, Issue 3
In-Reply-To: <mailman.15416.1242564062.2782.biosql-l@lists.open-bio.org>
References: <mailman.15416.1242564062.2782.biosql-l@lists.open-bio.org>
Message-ID: <31bb4380905170617k47951f83ia5bed32577a02956@mail.gmail.com>

Dear Peter,

Could you kindly guide me on how to connect BioSQL to BioPerl?

sanjay


From hlapp at gmx.net  Sun May 17 14:56:29 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 17 May 2009 10:56:29 -0400
Subject: [BioSQL-l] BioSQL-l Digest, Vol 62, Issue 3
In-Reply-To: <31bb4380905170617k47951f83ia5bed32577a02956@mail.gmail.com>
References: <mailman.15416.1242564062.2782.biosql-l@lists.open-bio.org>
	<31bb4380905170617k47951f83ia5bed32577a02956@mail.gmail.com>
Message-ID: <B83D0C6C-9728-4DBA-A07D-E274E504A795@gmx.net>


http://dx.doi.org/10.1038/npre.2007.1233.1

On May 17, 2009, at 9:17 AM, Sanjay Harke wrote:

> Dear peter,
>
> Kindly guide me for developing the connectivity of BioSql to Bioperl?
>
> sanjay

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sun May 17 15:21:59 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 17 May 2009 11:21:59 -0400
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in
	BioSQL
In-Reply-To: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
	<0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
	<320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
Message-ID: <A8AB4BCB-9CD3-428D-AF10-899AD8055EC7@gmx.net>


On May 17, 2009, at 8:40 AM, Peter wrote:

> On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>>
>> On May 16, 2009, at 7:28 PM, Peter wrote:
>>>> That could be changed to an XML string:
>>>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <gene_names>
>>>> <gene_name>
>>>>  <Name>GC1QBP</Name>
>>>>  <Synonyms>HABP1</Synonyms>
>>>>  <Synonyms>SF2P32</Synonyms>
>>>>  <Synonyms>C1QBP</Synonyms>
>>>> </gene_name>
>>>> </gene_names>
>>>>
>>>> Thinking about this we should attempt to coalesce around a standard
>>>> instead of forcing the other Bio*  to a specific format.
>
> [...] Here you have mapped RecName and AltName fields in the DE  
> lines to
> Name and Synonyms (shouldn't that be Synonym singular?).

The example is for the GN lines in SwissProt, not the DE lines.

> [...]
> On 5/17/09, Hilmar Lapp <hlapp at gmx.net> wrote:
>> Not necessarily. If you have a flat serialization (such as XML) the  
>> nested
>> structure isn't needed. Of course that's not a fully normalized  
>> relational
>> representation, but if you had one, how often would it be used, how
>> efficient would those queries be (SQL is poor at nested or  
>> recursive data
>> structures), and how much pain would it be to write the object- 
>> relational
>> mappings?
>
> In this example, searching the database using one of the SwissProt
> AltNames (synonyms), or filtering on the Flags sounds like a
> reasonable request - but this would be very difficult if the data is
> stored inside XML strings.

Actually no. Modern full-text indexers (inside or outside the  
database) can index XML text columns right away and very well. In  
fact, for the last project that I built a full-text search for (on top  
of a BioSQL database) I did that by writing custom XML documents to a  
separate table for each record I wanted indexed. Oracle's full text  
indexer did the rest. I also built a separate identifier/name/ 
accession index that pulled all the gene names, symbols, accession  
numbers, identifiers etc into a single table for indexing.

What I mean is, a fully normalized relational representation,  
especially if nested, is often not the most efficient data structure  
for efficient searching and filtering.
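
A small sketch of the separate name index idea, using Python's sqlite3
module as a stand-in. The annotation and name_index tables are
hypothetical simplifications, not the real BioSQL schema, and an
ordinary index replaces Oracle's full-text indexer here:

```python
import sqlite3
import xml.etree.ElementTree as ET

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified tables for illustration only.
conn.execute("CREATE TABLE annotation (bioentry_id INTEGER, value TEXT)")
conn.execute("CREATE TABLE name_index (bioentry_id INTEGER, name TEXT)")
conn.execute("CREATE INDEX name_index_name ON name_index (name)")

xml_value = (
    "<gene_names><gene_name><Name>GC1QBP</Name>"
    "<Synonyms>HABP1</Synonyms><Synonyms>SF2P32</Synonyms>"
    "<Synonyms>C1QBP</Synonyms></gene_name></gene_names>"
)
conn.execute("INSERT INTO annotation VALUES (?, ?)", (1, xml_value))


def rebuild_name_index(conn):
    """Pull every name and synonym out of the XML into one flat table."""
    conn.execute("DELETE FROM name_index")
    rows = conn.execute("SELECT bioentry_id, value FROM annotation").fetchall()
    for bioentry_id, value in rows:
        for element in ET.fromstring(value).iter():
            if element.tag in ("Name", "Synonyms") and element.text:
                conn.execute(
                    "INSERT INTO name_index VALUES (?, ?)",
                    (bioentry_id, element.text),
                )
    conn.commit()


rebuild_name_index(conn)
hits = conn.execute(
    "SELECT bioentry_id FROM name_index WHERE name = ?", ("HABP1",)
).fetchall()
```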

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Mon May 18 10:03:52 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 18 May 2009 11:03:52 +0100
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <320fb6e00905161606l5fdb0862mf25a45dad07dac8@mail.gmail.com>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
	<48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>
	<320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com>
	<9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net>
	<320fb6e00905161606l5fdb0862mf25a45dad07dac8@mail.gmail.com>
Message-ID: <320fb6e00905180303m19d0c6e0hdc22ff550e518c6c@mail.gmail.com>

On Sun, May 17, 2009 at 12:06 AM, Peter wrote:
> If none of BioPerl, BioJava and BioRuby have an analogous
> sequence representation for a nucleotide sequence which
> might be DNA or RNA, then perhaps the current situation
> with only "protein", "dna", "rna" and "unknown" in the
> biosequence.alphabet field in BioSQL is sufficient.

The original Biopython bug reporter (Bug 2829, David Wyllie)
has replied on the bug.  In his case, rather than using the
generic nucleotide alphabet, he can be a bit more explicit
since he does actually know his sequence is DNA, and this
does get recorded in BioSQL fine.

Given the "nucleotide" alphabet is a corner case in Biopython,
and has no analogue in BioPerl, the status quo is fine. i.e.
The biosequence.alphabet field should contain "dna", "rna",
"protein" or "unknown" (in lower case).
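
As a sketch, a loader could normalise the value before writing it. The
function name is hypothetical, and falling back to "unknown" for
anything else (such as a generic nucleotide alphabet) is an assumption
of this example, not agreed behaviour:

```python
ALLOWED_ALPHABETS = ("dna", "rna", "protein", "unknown")


def biosql_alphabet(name):
    """Return a lower case value suitable for biosequence.alphabet."""
    value = name.strip().lower()
    # Assumption: anything outside the four agreed values is "unknown".
    return value if value in ALLOWED_ALPHABETS else "unknown"
```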

Thanks for your thoughts everyone.

Peter


From michael.watson at bbsrc.ac.uk  Mon May 18 12:45:19 2009
From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C))
Date: Mon, 18 May 2009 13:45:19 +0100
Subject: [BioSQL-l] Full text indexing/Searching in MySQL
Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>

Hi

 
Has anyone implemented full text indexing/searching for BioSQL in MySQL,
either using MySQL's full text features or any other solution?

 
Any tips, advice, documentation, code etc available?

 
Thanks

Mick

 
Head of Bioinformatics
Institute for Animal Health
Compton
Berks
RG20 7NN
01635 578411 

 
Please consider the environment and don't print this e-mail unless you
really need to.

The information contained in this message may be confidential or legally
privileged and is intended solely for the addressee. If you have
received this message in error please delete it & notify the originator
immediately.  Unauthorised use, disclosure, copying or alteration of
this message is forbidden & may be unlawful.  The contents of this
e-mail are the views of the sender and do not necessarily represent the
views of the Institute.   This email, and associated attachments, has
been checked locally for viruses but we can accept no responsibility
once it has left our systems.  Communications on Institute computers are
monitored to secure the effective operation of the systems and for other
lawful purposes.

 
The Institute for Animal Health is a company limited by guarantee,
registered in England no. 559784.  

The Institute is also a registered charity, Charity Commissioners
Reference No. 228824

 
From hlapp at gmx.net  Mon May 18 13:24:34 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 18 May 2009 09:24:34 -0400
Subject: [BioSQL-l] Full text indexing/Searching in MySQL
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <DB12094C-A6E1-4F79-B669-848807B96788@gmx.net>

I've done that using Oracle, not MySQL. I assume that's therefore not  
what you want to hear about and hence will shut up :)

	-hilmar

On May 18, 2009, at 8:45 AM, michael watson (IAH-C) wrote:

> Hi
>
>
>
> Has anyone implemented full text indexing/searching for BioSQL in  
> MySQL,
> either using MySQL's full text features or any other solution?
>
>
>
> Any tips, advice, documentation, code etc available?
>
>
>
> Thanks
>
> Mick

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Mon May 18 13:26:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 18 May 2009 14:26:40 +0100
Subject: [BioSQL-l] Full text indexing/Searching in MySQL
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <320fb6e00905180626o4855aa06v6c6ae665885a3fce@mail.gmail.com>

On Mon, May 18, 2009 at 1:45 PM, michael watson (IAH-C)
<michael.watson at bbsrc.ac.uk> wrote:
>
> Hi
>
> Has anyone implemented full text indexing/searching for BioSQL in MySQL,
> either using MySQL's full text features or any other solution?
>
> Any tips, advice, documentation, code etc available?
>
> Thanks
>
> Mick

Hilmar mentioned he has done something like this on this thread,
where he was storing XML strings as annotation values:

http://lists.open-bio.org/pipermail/biosql-l/2009-May/001534.html

(You've probably read that - but just in case, worth mentioning).

Peter


From biopython at maubp.freeserve.co.uk  Mon May 18 13:38:03 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 18 May 2009 14:38:03 +0100
Subject: [BioSQL-l] [Biopython-dev] SwissProt DE lines and
	bioentry.description field in BioSQL
In-Reply-To: <A8AB4BCB-9CD3-428D-AF10-899AD8055EC7@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
	<074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
	<071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>
	<320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com>
	<0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net>
	<320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com>
	<A8AB4BCB-9CD3-428D-AF10-899AD8055EC7@gmx.net>
Message-ID: <320fb6e00905180638q29de63c4if0627eff416c4481@mail.gmail.com>

On Sun, May 17, 2009 at 4:21 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> On May 17, 2009, at 8:40 AM, Peter wrote:
>>
>> [...] Here you have mapped RecName and AltName fields in the DE lines to
>> Name and Synonyms (shouldn't that be Synonym singular?).
>
> The example is for the GN lines in SwissProt, not the DE lines.

Ah, that probably explains some of my confusion.

>> In this example, searching the database using one of the SwissProt
>> AltNames (synonyms), or filtering on the Flags sounds like a
>> reasonable request - but this would be very difficult if the data is
>> stored inside XML strings.
>
> Actually no. Modern full-text indexers (inside or outside the database) can
> index XML text columns right away and very well. In fact, for the last
> project that I built a full-text search for (on top of a BioSQL database) I
> did that by writing custom XML documents to a separate table for each
> record I wanted indexed. Oracle's full text indexer did the rest. I also built a
> separate identifier/name/accession index that pulled all the gene names,
> symbols, accession numbers, identifiers etc into a single table for
> indexing.

OK, when I said searching "would be very difficult if the data is
stored inside XML strings", maybe it wasn't so difficult for you - but
that still sounds complicated!

Sticking with the GN lines and the synonym, if this was stored as a
simple tag/value as usual in BioSQL, I would write my SQL statement to
search the annotation table where the term id was that associated with
a GN synonym, and the annotation value was "HABP1".  Simple.
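
A sketch of that query, run against SQLite from Python. The two tables
are cut down to the columns needed here, and the term name
"gene_synonym" is a made-up placeholder for whatever term the loader
would actually use:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified subset of the BioSQL tables, for illustration only.
conn.executescript(
    """
    CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bioentry_qualifier_value (
        bioentry_id INTEGER, term_id INTEGER, value TEXT);
    INSERT INTO term VALUES (1, 'gene_synonym');
    INSERT INTO bioentry_qualifier_value VALUES (42, 1, 'HABP1');
    INSERT INTO bioentry_qualifier_value VALUES (42, 1, 'SF2P32');
    INSERT INTO bioentry_qualifier_value VALUES (43, 1, 'OTHER');
    """
)

# Exact match on the term and the annotation value.
query = """
    SELECT q.bioentry_id
    FROM bioentry_qualifier_value AS q
    JOIN term AS t ON t.term_id = q.term_id
    WHERE t.name = ? AND q.value = ?
"""
rows = conn.execute(query, ("gene_synonym", "HABP1")).fetchall()
```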

Using the XML approach, are you suggesting you could do a full text
search on the annotation value field, looking for any rows where the
field contains "<Synonyms>HABP1</Synonyms>", where the term id matches
the GN lines' XML string? This sounds simplistic and probably rather
slow - presumably why you resorted to the more complicated indexing
scheme described above?
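
For illustration, the naive substring search might look like the
following SQLite sketch in Python. The table is a simplified stand-in
and term id 7 is arbitrary. A pattern with a leading wildcard generally
cannot use an ordinary index, so the database is likely to scan every
candidate row:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the annotation table, for illustration only.
conn.execute(
    "CREATE TABLE bioentry_qualifier_value "
    "(bioentry_id INTEGER, term_id INTEGER, value TEXT)"
)
xml_value = (
    "<gene_names><gene_name><Name>GC1QBP</Name>"
    "<Synonyms>HABP1</Synonyms><Synonyms>SF2P32</Synonyms>"
    "</gene_name></gene_names>"
)
conn.execute(
    "INSERT INTO bioentry_qualifier_value VALUES (?, ?, ?)", (42, 7, xml_value)
)

# Substring match against the serialized XML.
pattern = "%<Synonyms>HABP1</Synonyms>%"
rows = conn.execute(
    "SELECT bioentry_id FROM bioentry_qualifier_value "
    "WHERE term_id = ? AND value LIKE ?",
    (7, pattern),
).fetchall()
```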

> What I mean is, a fully normalized relational representation, especially if
> nested, is often not the most efficient data structure for efficient
> searching and filtering.

OK.  But do we really need to worry about complex nested structures
for the SwissProt annotation (or in general)?

Peter


From jimp at compbio.dundee.ac.uk  Mon May 18 14:01:28 2009
From: jimp at compbio.dundee.ac.uk (James Procter)
Date: Mon, 18 May 2009 15:01:28 +0100
Subject: [BioSQL-l] BioSQL at BOSC 2009?
In-Reply-To: <74D4CC78-FC7B-4595-9D24-EB6B3ED43318@gmx.net>
References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>	<1BD503B3-D805-4882-87DD-820138792DB2@gmx.net>	<320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com>
	<74D4CC78-FC7B-4595-9D24-EB6B3ED43318@gmx.net>
Message-ID: <4A116A38.9050705@compbio.dundee.ac.uk>

Hi all.

Hilmar Lapp wrote:
> On May 16, 2009, at 5:23 PM, Peter wrote:
> 
>> I will be staying for all of ISMB
Same here.
> 
> 
> I am too. Should we doodle something once the program is out?
I'll watch out for the URL if you post it to the list!

Jim.

-- 
-------------------------------------------------------------------
J. B. Procter  (ENFIN/VAMSAS)  Barton Bioinformatics Research Group
Phone/Fax:+44(0)1382 388734/345764  http://www.compbio.dundee.ac.uk
The University of Dundee is a Scottish Registered Charity, No. SC015096.


From roy.chaudhuri at gmail.com  Mon May 18 17:37:39 2009
From: roy.chaudhuri at gmail.com (Roy Chaudhuri)
Date: Mon, 18 May 2009 18:37:39 +0100
Subject: [BioSQL-l] Full text indexing/Searching in MySQL
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <4A119CE3.3080208@gmail.com>

Hi Mick,

> Has anyone implemented full text indexing/searching for BioSQL in MySQL,
> either using MySQL's full text features or any other solution?

I've kind of done this. The trouble is that full text is only 
implemented on the non-transactional MyISAM tables, not InnoDB (it has 
long been promised for InnoDB, but no sign yet). My hack solution was to 
parse out the fields I was interested in (feature tags such as gene and 
product) and include them in a separate MyISAM table, cross-referenced 
to BioSQL using seqfeature_id. This involves duplicating data (which is 
a bad thing), but should be okay if database updates are infrequent. I 
mimic atomic changes by building an updated version of the MyISAM table 
separately, then switching to use the new version at the same time as I 
commit the BioSQL updates.
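
The build-then-switch step can be sketched as follows, with SQLite via
Python standing in for MySQL. The feature_text table and the
swap_in_new_index helper are hypothetical names, and the swap here
relies on SQLite allowing DROP and RENAME inside one transaction:

```python
import sqlite3

# isolation_level=None lets the code issue BEGIN/COMMIT explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE feature_text (seqfeature_id INTEGER, tag TEXT, value TEXT)"
)
conn.execute("INSERT INTO feature_text VALUES (1, 'gene', 'old_name')")


def swap_in_new_index(conn, rows):
    """Build a fresh copy of the search table, then switch to it."""
    # Step 1: build the replacement separately; readers still see the old one.
    conn.execute("DROP TABLE IF EXISTS feature_text_new")
    conn.execute(
        "CREATE TABLE feature_text_new "
        "(seqfeature_id INTEGER, tag TEXT, value TEXT)"
    )
    conn.executemany("INSERT INTO feature_text_new VALUES (?, ?, ?)", rows)
    # Step 2: swap the tables in a single transaction.
    conn.execute("BEGIN")
    try:
        conn.execute("DROP TABLE feature_text")
        conn.execute("ALTER TABLE feature_text_new RENAME TO feature_text")
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise


swap_in_new_index(conn, [(1, "gene", "new_name"), (2, "product", "polyprotein")])
values = [
    row[0]
    for row in conn.execute(
        "SELECT value FROM feature_text ORDER BY seqfeature_id"
    )
]
```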

There's also Sphinx (http://www.sphinxsearch.com), which is a plug-in 
that can implement full-text searches in InnoDB, but I haven't 
experimented with that so have no idea how well it works.

Cheers.
Roy.


From holland at eaglegenomics.com  Mon May 18 18:20:52 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Mon, 18 May 2009 19:20:52 +0100
Subject: [BioSQL-l] Full text indexing/Searching in MySQL
In-Reply-To: <4A119CE3.3080208@gmail.com>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk>
	<4A119CE3.3080208@gmail.com>
Message-ID: <1242670852.28726.2.camel@buzzybee>

There's also Lucene, which is a Java-based full-text indexer which can
be attached to all kinds of data sources, including MySQL databases:

http://lucene.apache.org/java/docs/

cheers,
Richard

On Mon, 2009-05-18 at 18:37 +0100, Roy Chaudhuri wrote:
> Hi Mick,
> 
> > Has anyone implemented full text indexing/searching for BioSQL in MySQL,
> > either using MySQL's full text features or any other solution?
> 
> I've kind of done this. The trouble is that full text is only 
> implemented on the non-transactional MyISAM tables, not InnoDB (it has 
> long been promised for InnoDB, but no sign yet). My hack solution was to 
> parse out the fields I was interested in (feature tags such as gene and 
> product) and include them in a separate MyISAM table, cross-referenced 
> to BioSQL using seqfeature_id. This involves duplicating data (which is 
> a bad thing), but should be okay if database updates are infrequent. I 
> mimic atomic changes by building an updated version of the MyISAM table 
> separately, then switching to use the new version at the same time as I 
> commit the BioSQL updates.
> 
> There's also Sphinx (http://www.sphinxsearch.com), which is a plug-in 
> that can implement full-text searches in InnoDB, but I haven't 
> experimented with that so have no idea how well it works.
> 
> Cheers.
> Roy.
-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From michael.watson at bbsrc.ac.uk  Tue May 19 08:17:32 2009
From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C))
Date: Tue, 19 May 2009 09:17:32 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>

Hi

 
I'm using:

 
biosql-1.0.1

bioperl-db-1.5.2_100

bioperl-1.5.2_102

 
When I run load_seqdatabase.pl on about 3000 GenBank sequences, I get:

 
Loading fmd_180509.gbk ...

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
("AY312586S1","32307407","AY312586","Foot-and-mouth disease virus O
isolate O/SKR/2000 S fragment, complete

1,9762)

Duplicate entry 'AY312586-1-1' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","1") FKs (324,3,4)

Duplicate entry '324-3-4-1' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
("AY312586S2","32307408","AY312587","Foot-and-mouth disease virus O
isolate O/SKR/2000 L fragment, complete

1,9762)

Duplicate entry 'AY312587-1-1' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","1") FKs (323,3,4)

Duplicate entry '323-3-4-1' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","2") FKs (323,22,4)

Duplicate entry '323-22-4-2' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","3") FKs (323,15,4)

Duplicate entry '323-15-4-3' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
("AY312588S1","32307403","AY312588","Foot-and-mouth disease virus O
isolate O/SKR/2002 S fragment, complete

1,9762)

Duplicate entry 'AY312588-1-1' for key 2

 
---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","1") FKs (326,3,4)

Duplicate entry '326-3-4-1' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
("AY312588S2","32307404","AY312589","Foot-and-mouth disease virus O
isolate O/SKR/2002 L fragment, complete

1,9762)

Duplicate entry 'AY312589-1-1' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","1") FKs (325,3,4)

Duplicate entry '325-3-4-1' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","2") FKs (325,22,4)

Duplicate entry '325-22-4-2' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","3") FKs (325,15,4)

Duplicate entry '325-15-4-3' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
("S87919S2","247466","S87923","L [foot-and-mouth disease virus FMDV,
strain CS8, Genomic RNA, 10 nt, segmen

1,9754)

Duplicate entry 'S87923-1-1' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","1") FKs (782,3,4)

Duplicate entry '782-3-4-1' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","2") FKs (782,13,4)

Duplicate entry '782-13-4-2' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
("S87919S1","247464","S87919","L [foot-and-mouth disease virus FMDV,
strain CS8, Genomic RNA, 35 nt, segmen

1,9754)

Duplicate entry 'S87919-1-1' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
values were ("","1") FKs (781,3,4)

Duplicate entry '781-3-4-1' for key 2

---------------------------------------------------

 
-------------------- WARNING ---------------------

MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values
were ("","Direct Submission","Submitted (12-AUG-2004) National Center
for Biotechnology Information, NIH, 

C-E8D3CBBD80002FA1","1","8170","") FKs (<NULL>)

Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3

---------------------------------------------------

Could not store NC_011452:

------------- EXCEPTION  -------------

MSG: create: object (Bio::Annotation::Reference) failed to insert or to
be found by unique key

STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SeqAdaptor.pm:224
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK (eval) load_seqdatabase.pl:622
STACK toplevel load_seqdatabase.pl:604

 
--------------------------------------

 
 at load_seqdatabase.pl line 635

 
Any clues?

 
Thanks

Mick

 
Head of Bioinformatics
Institute for Animal Health
Compton
Berks
RG20 7NN
01635 578411 

 

 
From biopython at maubp.freeserve.co.uk  Tue May 19 09:31:05 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 19 May 2009 10:31:05 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <320fb6e00905190231t79ac1dc9j49585929e9b5304a@mail.gmail.com>

On Tue, May 19, 2009 at 9:17 AM, michael watson (IAH-C)
<michael.watson at bbsrc.ac.uk> wrote:
>
> Hi
>
> I'm using:
>
> biosql-1.0.1
> bioperl-db-1.5.2_100
> bioperl-1.5.2_102
>
> When I run load_seqdatabase.pl on about 3000 GenBank sequences,
> I get:
>
> Loading fmd_180509.gbk ...
> ...
> ---------------------------------------------------
>
> Could not store NC_011452:
>
> ------------- EXCEPTION  -------------
>
> MSG: create: object (Bio::Annotation::Reference) failed to insert or to
> be found by unique key
>
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
>
> ...
>
> STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
>
> STACK (eval) load_seqdatabase.pl:622
>
> STACK toplevel load_seqdatabase.pl:604
>
> --------------------------------------
>
>  at load_seqdatabase.pl line 635
>
> Any clues?

You got a lot of warnings about duplicate feature keys (which I am
guessing are from different GenBank entries), but the failure itself
seems to come from the annotation in NC_011452.

Try downloading just NC_011452 in GenBank format, and testing that:
http://www.ncbi.nlm.nih.gov/nuccore/NC_011452

I would expect that to fail in the same way, and you would at least
have isolated the issue to a smaller test case. If it works, then
maybe the copy of NC_011452 in your file is corrupted somehow - check
for differences.
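
If the copy inside the big file needs to be pulled out for comparison,
something along these lines might do. It is only a sketch in plain
Python: it assumes records end with a "//" line and that the accession
is the first token after the ACCESSION keyword. The function names and
the two-record sample are made up for illustration.

```python
def split_genbank_records(text):
    """Yield each record of a multi-record GenBank flat file as a string."""
    record = []
    for line in text.splitlines(keepends=True):
        record.append(line)
        # A line consisting only of "//" terminates a GenBank record.
        if line.strip() == "//":
            yield "".join(record)
            record = []


def accession_of(record):
    """Return the first accession on the ACCESSION line, or None."""
    for line in record.splitlines():
        if line.startswith("ACCESSION"):
            fields = line.split()
            return fields[1] if len(fields) > 1 else None
    return None


def extract_record(text, accession):
    """Return the first record with the given accession, or None."""
    for record in split_genbank_records(text):
        if accession_of(record) == accession:
            return record
    return None


# Tiny made-up two-record file, only to exercise the functions.
SAMPLE = (
    "LOCUS       AAA00001  10 bp\n"
    "ACCESSION   AAA00001\n"
    "ORIGIN\n"
    "        1 acgtacgtac\n"
    "//\n"
    "LOCUS       BBB00002  10 bp\n"
    "ACCESSION   BBB00002\n"
    "ORIGIN\n"
    "        1 ttttgggccc\n"
    "//\n"
)

single = extract_record(SAMPLE, "BBB00002")
```

Writing the extracted text to its own file gives a one-record test case
that can be diffed against a fresh download.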

Peter


From hlapp at gmx.net  Tue May 19 12:25:25 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 19 May 2009 08:25:25 -0400
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>


On May 19, 2009, at 4:17 AM, michael watson (IAH-C) wrote:

> [...]
> -------------------- WARNING ---------------------
>
> MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values  
> were
> ("AY312586S1","32307407","AY312586","Foot-and-mouth disease virus O
> isolate O/SKR/2000 S fragment, complete
>
> 1,9762)
>
> Duplicate entry 'AY312586-1-1' for key 2
>
> ---------------------------------------------------

This suggests that a sequence with the above accession or GI number  
was already in the database, or occurs in the file twice.

If this situation is possible, you will have to pass the --lookup (or  
--flatlookup) flag to the script, and specify how you want updates to  
take place when they are necessary (options --noupdate, --remove, and  
--mergeobjs).
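
To check whether the file itself contains repeats before loading it, a
quick count of ACCESSION lines may be enough. The sketch below is plain
Python with a made-up function name and sample; it compares accessions
only and ignores versions, which is a simplification.

```python
from collections import Counter


def duplicate_accessions(text):
    """Return {accession: count} for accessions that occur more than once."""
    counts = Counter()
    for line in text.splitlines():
        # ACCESSION is a top-level keyword, so it starts in the first column.
        if line.startswith("ACCESSION"):
            fields = line.split()
            if len(fields) > 1:
                counts[fields[1]] += 1
    return {acc: n for acc, n in counts.items() if n > 1}


# Made-up sample in which one accession appears twice.
SAMPLE = (
    "LOCUS       AAA00001\nACCESSION   AAA00001\n//\n"
    "LOCUS       BBB00002\nACCESSION   BBB00002\n//\n"
    "LOCUS       AAA00001\nACCESSION   AAA00001\n//\n"
)

dupes = duplicate_accessions(SAMPLE)
```

An empty result suggests the duplicates were already in the database
rather than in the file.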

> -------------------- WARNING ---------------------
>
> MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
> values were ("","1") FKs (324,3,4)
>
> Duplicate entry '324-3-4-1' for key 2
> ---------------------------------------------------

I suspect that 324 is the primary key of the sequence record that  
raised the duplicate entry warning above. Can you check that?

If the insert is turned into an update, these warnings should go away  
too.

> [...]
> -------------------- WARNING ---------------------
>
> MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
> values were ("","1") FKs (323,3,4)
>
> Duplicate entry '323-3-4-1' for key 2
>
> ---------------------------------------------------

Similar to before, except 323 is probably the primary key for AY312587.

> [...]
> -------------------- WARNING ---------------------
>
> MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed,
> values were ("","1") FKs (325,3,4)
>
> Duplicate entry '325-3-4-1' for key 2
>
> ---------------------------------------------------

And if the order of messages is preserved correctly, 325 would be the  
primary key of AY312589.

> [...]
> -------------------- WARNING ---------------------
>
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed,  
> values
> were ("","Direct Submission","Submitted (12-AUG-2004) National Center
> for Biotechnology Information, NIH,
>
> C-E8D3CBBD80002FA1","1","8170","") FKs (<NULL>)
>
> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
>
> ---------------------------------------------------

This one is odd. Can you check which existing entry you have with  
reference.crc = 'CRC-E8D3CBBD80002FA1'?

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From michael.watson at bbsrc.ac.uk  Wed May 20 09:52:13 2009
From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C))
Date: Wed, 20 May 2009 10:52:13 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
	<68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>

Hi Guys

Ok, the warnings were due to duplicate sequences - I had downloaded a
stream using Bio::DB::GenBank and I guess I assumed that would mean only
unique entries were sent back.  Using "--flatlookup --remove" gets rid
of the warnings.

Now for NC_003992.gbk...

To answer Hilmar's question:
mysql> select * from reference where crc = "CRC-E8D3CBBD80002FA1";
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
| reference_id | dbxref_id | location                                                                                            | title             | authors | crc                  |
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
|          152 |      NULL | Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA | Direct Submission | NULL    | CRC-E8D3CBBD80002FA1 |
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+

And when I run load_seqdatabase.pl on NC_003992.gbk alone I still get:

perl load_seqdatabase.pl --host localhost --dbname fmd_biosql --format
genbank --dbuser removed --dbpass removed --flatlookup --remove
NC_003992.gbk

Loading NC_003992.gbk ...

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values
were ("","Direct Submission","Submitted (12-AUG-2004) National Center
for Biotechnology Information, NIH, Bethesda, MD 20894,
USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs (<NULL>)
Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
---------------------------------------------------
Could not store NC_003992: 
------------- EXCEPTION  -------------
MSG: create: object (Bio::Annotation::Reference) failed to insert or to
be found by unique key
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SeqAdaptor.pm:224
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK (eval) load_seqdatabase.pl:622
STACK toplevel load_seqdatabase.pl:604

--------------------------------------

 at load_seqdatabase.pl line 635

And I still have:

mysql> select * from reference where crc = "CRC-E8D3CBBD80002FA1";
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
| reference_id | dbxref_id | location                                                                                            | title             | authors | crc                  |
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
|          152 |      NULL | Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA | Direct Submission | NULL    | CRC-E8D3CBBD80002FA1 |
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
1 row in set (0.01 sec)

Could this be because bases 1 to 8203 of the sequence have three
references, and the crc is created on the first and then duplicated on
the second, thus causing a problem?

Cheers
Mick



From biopython at maubp.freeserve.co.uk  Wed May 20 10:59:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 20 May 2009 11:59:19 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
	<68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>

On Wed, May 20, 2009 at 10:52 AM, michael watson (IAH-C)
<michael.watson at bbsrc.ac.uk> wrote:
>
> Hi Guys
>
> Ok, the warnings were due to duplicate sequences - I had downloaded a
> stream using Bio::DB::GenBank and I guess I assumed that would mean only
> unique entries were sent back.  Using "--flatlookup --remove" gets rid
> of the warnings.

Great - easy :)

> Now for NC_003992.gbk...
>
> To answer Hilmar's question:
> ...
> And when I run load_seqdatabase.pl on NC_003992.gbk alone I still get:
>
> perl load_seqdatabase.pl --host localhost --dbname fmd_biosql --format
> genbank --dbuser removed --dbpass removed --flatlookup --remove
> NC_003992.gbk
>
> Loading NC_003992.gbk ...
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values
> were ("","Direct Submission","Submitted (12-AUG-2004) National Center
> for Biotechnology Information, NIH, Bethesda, MD 20894,
> USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs (<NULL>)
> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
> ---------------------------------------------------
> Could not store NC_003992:
> ------------- EXCEPTION  -------------
> MSG: create: object (Bio::Annotation::Reference) failed to insert or to
> be found by unique key
> ...

I would guess that the problem is that this rather generic reference
in NC_003992 is repeated exactly in another genome (causing the CRC
collision):

CONSRTM   NCBI Genome Project
TITLE     Direct Submission
JOURNAL   Submitted (12-AUG-2004) National Center for Biotechnology
Information, NIH, Bethesda, MD 20894, USA

See http://www.ncbi.nlm.nih.gov/nuccore/NC_011452

i.e. could there be another direct submission by the NCBI on that date
in your collection?  You could search the database for that CRC and
trace it back to a bioentry, or just grep your GenBank files for
"Submitted (12-AUG-2004) National Center for Biotechnology".
Something like this SQL statement might be interesting:

SELECT bioentry.accession, reference.title FROM bioentry,
bioentry_reference, reference WHERE
bioentry.bioentry_id=bioentry_reference.bioentry_id AND
bioentry_reference.reference_id=reference.reference_id AND
reference.crc="CRC-E8D3CBBD80002FA1";
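
To try the shape of that query without touching the real database, here
is a rough sketch using Python's sqlite3 module. The three tables are
cut-down stand-ins that only model the columns the query uses, the rows
and CRC values are made up, and the join is written with explicit JOIN
clauses, which should be equivalent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified stand-ins for the three tables; only the columns the
    -- query touches are modelled here.
    CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, accession TEXT);
    CREATE TABLE reference (
        reference_id INTEGER PRIMARY KEY, title TEXT, crc TEXT UNIQUE
    );
    CREATE TABLE bioentry_reference (bioentry_id INTEGER, reference_id INTEGER);

    -- Made-up rows: two entries share one reference, a third does not.
    INSERT INTO bioentry VALUES (1, 'ACC_A'), (2, 'ACC_B'), (3, 'ACC_C');
    INSERT INTO reference VALUES
        (10, 'Direct Submission', 'CRC-0000000000000001'),
        (11, 'Some other paper', 'CRC-0000000000000002');
    INSERT INTO bioentry_reference VALUES (1, 10), (2, 10), (3, 11);
    """
)

QUERY = """
SELECT bioentry.accession, reference.title
FROM bioentry
JOIN bioentry_reference
  ON bioentry.bioentry_id = bioentry_reference.bioentry_id
JOIN reference
  ON bioentry_reference.reference_id = reference.reference_id
WHERE reference.crc = ?
ORDER BY bioentry.accession
"""

# Every bioentry linked to the reference with this CRC.
rows = conn.execute(QUERY, ("CRC-0000000000000001",)).fetchall()
```

More than one accession in the result means the reference is shared
between entries.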

Peter


From michael.watson at bbsrc.ac.uk  Wed May 20 11:25:52 2009
From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C))
Date: Wed, 20 May 2009 12:25:52 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
	<68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
	<320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>
Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>

We have a winner :)

NC_003992, NC_011452, NC_011451, NC_011450 all share at least one reference.

Would changing --flatlookup to --lookup change the behaviour so it checks for an existing reference before trying to insert the duplicate?

The answer is no :( (see below).

I guess this may need some coding then!

Thanks!
Mick

perl load_seqdatabase.pl --host localhost --dbname fmd_biosql --format genbank --dbuser removed --dbpass removed --lookup --remove NC_003992.gbk 
Loading NC_003992.gbk ...

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were ("","Direct Submission","Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs (<NULL>)
Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
---------------------------------------------------
Could not store NC_003992: 
------------- EXCEPTION  -------------
MSG: create: object (Bio::Annotation::Reference) failed to insert or to be found by unique key
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SeqAdaptor.pm:224
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK (eval) load_seqdatabase.pl:622
STACK toplevel load_seqdatabase.pl:604

--------------------------------------

 at load_seqdatabase.pl line 635



From biopython at maubp.freeserve.co.uk  Wed May 20 11:34:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 20 May 2009 12:34:51 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
	<68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
	<320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <320fb6e00905200434x3e1c7978ue1c58382f7478354@mail.gmail.com>

On Wed, May 20, 2009 at 12:25 PM, michael watson (IAH-C)
<michael.watson at bbsrc.ac.uk> wrote:
>
> We have a winner :)
>
> NC_003992, NC_011452, NC_011451, NC_011450 all share
> at least one reference.
>
> Would changing --flatlookup to --lookup change the behaviour
> so it checks for an existing reference before trying to insert the
> duplicate?
>
> The answer is no :( (see below).
>
> I guess this may need some coding then!

My crude idea for a simple ad-hoc solution would be to remove these
pointless references from the records, before loading them into
BioSQL.

One way would be to edit the four GenBank files by hand (e.g. to
remove the reference or make them unique). You might also do this in a
BioPerl script that loads the records, edits the references, and then
puts them in the database. Personally I use Python not Perl, so I
can't tell you how you might do that with BioPerl.
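
For what it's worth, the crude text-level version of this does not need
any Bio* library at all. A minimal sketch in plain Python, assuming the
usual GenBank layout where REFERENCE starts in the first column and its
sub-keywords and continuation lines are indented (the function name and
sample record are made up):

```python
def strip_references(record_text):
    """Return GenBank record text with every REFERENCE block removed."""
    kept = []
    in_reference = False
    for line in record_text.splitlines(keepends=True):
        if line.startswith("REFERENCE"):
            in_reference = True
            continue
        # Sub-keywords (AUTHORS, TITLE, JOURNAL, ...) and continuation
        # lines are indented, so a block ends at the next unindented line.
        if in_reference and line[:1].isspace():
            continue
        in_reference = False
        kept.append(line)
    return "".join(kept)


# Made-up record carrying the generic reference discussed above.
SAMPLE = (
    "LOCUS       AAA00001  10 bp\n"
    "ACCESSION   AAA00001\n"
    "REFERENCE   1  (bases 1 to 10)\n"
    "  CONSRTM   NCBI Genome Project\n"
    "  TITLE     Direct Submission\n"
    "  JOURNAL   Submitted (12-AUG-2004) National Center for Biotechnology\n"
    "            Information, NIH, Bethesda, MD 20894, USA\n"
    "FEATURES             Location/Qualifiers\n"
    "ORIGIN\n"
    "        1 acgtacgtac\n"
    "//\n"
)

cleaned = strip_references(SAMPLE)
```

Note that this drops every reference in the record, not just the shared
one, so it only suits cases where the references are not needed in the
database.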

Hilmar may be able to comment from a BioPerl/BioSQL point of view -
clearly CRC collisions of this nature will happen again in future.

Peter


From holland at eaglegenomics.com  Wed May 20 11:44:58 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 20 May 2009 12:44:58 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <320fb6e00905200434x3e1c7978ue1c58382f7478354@mail.gmail.com>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
	<68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
	<320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>
	<320fb6e00905200434x3e1c7978ue1c58382f7478354@mail.gmail.com>
Message-ID: <1242819898.18348.1.camel@buzzybee>

Although unlikely, it is entirely possible for two completely
different references to share the same CRC. Hence the CRC shouldn't
really be used as an indicator of uniqueness, although it is still
useful as a hash for indexing and quick lookup.
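
As a rough illustration of using the checksum for lookup only, here is
a small Python sketch. It is not how BioSQL or bioperl-db do it: the
class is hypothetical, references are modelled as plain tuples of
strings, and zlib.crc32 (32-bit) stands in for whatever checksum is
actually stored.

```python
import zlib
from collections import defaultdict


class ReferenceIndex:
    """Find stored references by checksum, confirm by full comparison."""

    def __init__(self):
        self._buckets = defaultdict(list)

    @staticmethod
    def checksum(reference):
        # zlib.crc32 is a 32-bit stand-in, not the checksum BioSQL stores.
        return zlib.crc32("|".join(reference).encode("utf-8"))

    def find(self, reference):
        """Return the stored reference equal to this one, or None."""
        for candidate in self._buckets.get(self.checksum(reference), []):
            if candidate == reference:  # the checksum only narrows the search
                return candidate
        return None

    def add(self, reference):
        """Store the reference unless an equal one is already present."""
        existing = self.find(reference)
        if existing is not None:
            return existing
        self._buckets[self.checksum(reference)].append(reference)
        return reference


LOCATION = (
    "Submitted (12-AUG-2004) National Center for Biotechnology "
    "Information, NIH, Bethesda, MD 20894, USA"
)
# Tuples are (authors, title, location).
ref = ("", "Direct Submission", LOCATION)
same = ("", "Direct Submission", LOCATION)

index = ReferenceIndex()
stored = index.add(ref)
again = index.add(same)  # equal content, so the stored one is reused
```

Two different references that happen to collide would simply share a
bucket and both be kept, because equality is decided on the full
fields rather than on the checksum.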

cheers,
Richard

-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From hlapp at gmx.net  Wed May 20 15:10:20 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 20 May 2009 11:10:20 -0400
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
	<68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
	<320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <0212C167-7618-4761-A191-C6CE4B41EC2A@gmx.net>

Indeed changing the lookup will have no effect since deletion of
bioentries doesn't cascade to references (only to
bioentry-to-reference associations).

What I don't understand yet is how you get the CRC clash. Normally
this kind of situation arises when the first occurrence of a reference
has no PMID and the second does. The second is then looked up by its
PMID, the lookup fails (because the first occurrence was stored
without one), and the reference is erroneously deemed "new". The
resulting insert then fails with a CRC clash.

However, there is no PMID nor any other identifier here, so I'll have  
to look into the code to find out why the second occurrence is either  
not looked up before an insert is attempted, or if it is looked up,  
why the lookup fails to find the record stored earlier.
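
For readers following along, the "normal" PMID scenario can be
reproduced in a toy model. The Python/sqlite3 sketch below is not
bioperl-db code: the table is a cut-down stand-in, both function names
and all values are made up, and the CRC fallback is just one possible
remedy. It does not explain the case in this thread, where no PMID is
involved.

```python
import sqlite3


def make_db():
    conn = sqlite3.connect(":memory:")
    # Toy table: the real schema has more columns and constraints.
    conn.execute(
        "CREATE TABLE reference ("
        "reference_id INTEGER PRIMARY KEY, pmid TEXT, title TEXT, "
        "crc TEXT UNIQUE)"
    )
    return conn


def insert_reference(conn, pmid, title, crc):
    cur = conn.execute(
        "INSERT INTO reference (pmid, title, crc) VALUES (?, ?, ?)",
        (pmid, title, crc),
    )
    return cur.lastrowid


def store_by_pmid_only(conn, pmid, title, crc):
    """Look up by PMID only, then insert if nothing was found."""
    if pmid is not None:
        row = conn.execute(
            "SELECT reference_id FROM reference WHERE pmid = ?", (pmid,)
        ).fetchone()
        if row:
            return row[0]
    return insert_reference(conn, pmid, title, crc)


def store_with_crc_fallback(conn, pmid, title, crc):
    """Also try the CRC before deciding that the reference is new."""
    row = conn.execute(
        "SELECT reference_id FROM reference WHERE pmid = ? OR crc = ?",
        (pmid, crc),
    ).fetchone()
    if row:
        return row[0]
    return insert_reference(conn, pmid, title, crc)


conn = make_db()
# First occurrence arrives without a PMID.
first_id = store_by_pmid_only(conn, None, "Example title", "CRC-EXAMPLE")

# Second occurrence has a PMID: the PMID lookup misses, the insert clashes.
try:
    store_by_pmid_only(conn, "0000000", "Example title", "CRC-EXAMPLE")
    clashed = False
except sqlite3.IntegrityError:
    clashed = True

# With the fallback, the stored row is found and reused instead.
reused_id = store_with_crc_fallback(conn, "0000000", "Example title", "CRC-EXAMPLE")
```

In this model the clash disappears once the lookup also considers the
CRC, at the cost of treating any CRC match as the same reference.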

	-hilmar

On May 20, 2009, at 7:25 AM, michael watson (IAH-C) wrote:

> We have a winner :)
>
> NC_003992, NC_011452, NC_011451, NC_011450 all share at least one  
> reference.
>
> Would changing --flatlookup to --lookup change the behaviour so it  
> checks for an existing reference before trying to insert the  
> duplicate?
>
> The answer is no :( (see below).
>
> I guess this may need some coding then!
>
> Thanks!
> Mick
>
> perl load_seqdatabase.pl --host localhost --dbname fmd_biosql -- 
> format genbank --dbuser removed --dbpass removed --lookup --remove  
> NC_003992.gbk
> Loading NC_003992.gbk ...
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed,  
> values were ("","Direct Submission","Submitted (12-AUG-2004)  
> National Center for Biotechnology Information, NIH, Bethesda, MD  
> 20894, USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs (<NULL>)
> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
> ---------------------------------------------------
> Could not store NC_003992:
> ------------- EXCEPTION  -------------
> MSG: create: object (Bio::Annotation::Reference) failed to insert or  
> to be found by unique key
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
> STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
> STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
> STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
> STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SeqAdaptor.pm:224
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
> STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
> STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
> STACK (eval) load_seqdatabase.pl:622
> STACK toplevel load_seqdatabase.pl:604
>
> --------------------------------------
>
> at load_seqdatabase.pl line 635
>
> -----Original Message-----
> From: p.j.a.cock at googlemail.com [mailto:p.j.a.cock at googlemail.com]  
> On Behalf Of Peter
> Sent: 20 May 2009 11:59
> To: michael watson (IAH-C)
> Cc: Hilmar Lapp; biosql-l at lists.open-bio.org
> Subject: Re: [BioSQL-l] load_seqdatabase.pl warnings and errors
>
> On Wed, May 20, 2009 at 10:52 AM, michael watson (IAH-C)
> <michael.watson at bbsrc.ac.uk> wrote:
>>
>> Hi Guys
>>
>> Ok, the warnings were due to duplicate sequences - I had downloaded a
>> stream using Bio::DB::GenBank and I guess I assumed that would mean only
>> unique entries were sent back.  Using "--flatlookup --remove" gets rid
>> of the warnings.
>
> Great - easy :)
>
>> Now for NC_003992.gbk...
>>
>> To answer Hilmar's question:
>> ...
>> And when I run load_seqdatabase.pl on NC_003992.gbk alone I still get:
>>
>> perl load_seqdatabase.pl --host localhost --dbname fmd_biosql
>> --format genbank --dbuser removed --dbpass removed --flatlookup --remove
>> NC_003992.gbk
>>
>> Loading NC_003992.gbk ...
>>
>> -------------------- WARNING ---------------------
>> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values
>> were ("","Direct Submission","Submitted (12-AUG-2004) National Center
>> for Biotechnology Information, NIH, Bethesda, MD 20894,
>> USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs (<NULL>)
>> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
>> ---------------------------------------------------
>> Could not store NC_003992:
>> ------------- EXCEPTION  -------------
>> MSG: create: object (Bio::Annotation::Reference) failed to insert or to
>> be found by unique key
>> ...
>
> I would guess that the problem is this rather generic reference in
> NC_003992 may be repeated exactly in another genome (causing the CRC
> collision):
>
> CONSRTM   NCBI Genome Project
> TITLE     Direct Submission
> JOURNAL   Submitted (12-AUG-2004) National Center for Biotechnology
> Information, NIH, Bethesda, MD 20894, USA
>
> See http://www.ncbi.nlm.nih.gov/nuccore/NC_011452
>
> i.e. Could there be another direct submission by the NCBI on that date
> in your collection?  You could search the database looking for that
> CRC and trace it back to a bioentry, or just try grep for "JOURNAL
> Submitted (12-AUG-2004) National Center for Biotechnology" on your
> GenBank files. e.g. Something like this SQL statement might be
> interesting:
>
> SELECT bioentry.accession, reference.title FROM bioentry,
> bioentry_reference, reference WHERE
> bioentry.bioentry_id=bioentry_reference.bioentry_id AND
> bioentry_reference.reference_id=reference.reference_id AND
> reference.crc="CRC-E8D3CBBD80002FA1";
>
> Peter

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Fri May 22 12:27:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 13:27:06 +0100
Subject: [BioSQL-l] RULES in BioSQL PostgreSQL schema
Message-ID: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>

Hi all,

This is a continuation of a thread / bug report from Biopython (Bug 2833)
where attempting to import duplicate entries into BioSQL did not raise an
error on PostgreSQL (but does on MySQL). Cymon traced this to the
RULES present in the schema to help bioperl-db.

On Fri, May 22, 2009 at 3:05 AM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> On May 21, 2009, at 6:52 PM, Cymon Cox wrote:
>
>> [...]
>>
>> Hi Andrea,
>>
>> The problem appears to be related to the BioSQL schema/PostgreSQL.
>>
>> As you indicated, adding a duplicate entry to bioentry returns "INSERT 0
>> 0" and doesn't throw an IntegrityError, which is what the code is looking
>> for and presumably what MySQL throws.
>>
>> The reason it doesn't throw an error is because of one (or both) of the
>> RULES in the schema:
>
> Indeed, I'd almost forgotten. The rules are there mostly as a remnant from
> earlier versions of PostgreSQL to support transactional loading the way
> bioperl-db (the object-relational mapping for BioPerl) is optimized. You
> probably don't need them anywhere else.
>
>         -hilmar
>
> <gory-details>
> Bioperl-db is optimized such that entities that very likely don't exist yet
> in the database are attempted for insert right away. If the insert fails due
> to a unique key violation, the record is looked up (and then expected to be
> found). In Oracle and MySQL you can do this and the transaction remains
> healthy; i.e., you can commit the transaction later and all statements
> except those that failed will be committed. In PostgreSQL any failed
> statement dooms the entire transaction, and the only way out is a rollback.
> In this case, if you want the loading of one sequence record as one
> transaction, failing to insert a single feature record will doom the entire
> sequence load and you would need to start over with the sequence. To fix
> this, I wrote the rules, which in essence do do the lookups for PostgreSQL
> that the bioperl-db code would otherwise avoid, and on insert do nothing if
> the record is found, which results in zero rows affected when you would
> expect one (which is what bioperl-db cues off of and then triggers a
> lookup).
> The right way to do this nowadays is to use nested transactions, which
> PostgreSQL has supported since v8.0.x, but I haven't gotten around to
> implementing support for that in Bioperl-db.
> </gory-details>
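
The insert-first, look-up-on-failure pattern with a nested transaction can
be sketched as follows. This is only an illustration in Python, not
bioperl-db code: it assumes SAVEPOINT as the nested-transaction mechanism,
uses a cut-down hypothetical bioentry table, and runs against in-memory
SQLite. SQLite does not doom the outer transaction on a failed statement the
way PostgreSQL does, so the sketch shows the control flow only.

```python
import sqlite3

# isolation_level=None puts the sqlite3 module in autocommit mode, so
# BEGIN, SAVEPOINT and COMMIT can be issued explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE bioentry ("
    " bioentry_id INTEGER PRIMARY KEY,"
    " accession TEXT NOT NULL UNIQUE)"
)


def insert_or_lookup(conn, accession):
    """Try the insert first. On a unique key violation, roll back only
    to the savepoint and look the existing record up instead.
    Returns (bioentry_id, was_inserted)."""
    conn.execute("SAVEPOINT try_insert")
    try:
        cur = conn.execute(
            "INSERT INTO bioentry (accession) VALUES (?)", (accession,)
        )
        conn.execute("RELEASE SAVEPOINT try_insert")
        return cur.lastrowid, True
    except sqlite3.IntegrityError:
        # Undo just the failed statement; the outer transaction survives.
        conn.execute("ROLLBACK TO SAVEPOINT try_insert")
        conn.execute("RELEASE SAVEPOINT try_insert")
        row = conn.execute(
            "SELECT bioentry_id FROM bioentry WHERE accession = ?",
            (accession,),
        ).fetchone()
        return row[0], False


# One outer transaction for the whole load, as with one sequence record.
conn.execute("BEGIN")
first = insert_or_lookup(conn, "NC_003992")
second = insert_or_lookup(conn, "NC_003992")  # duplicate, found by lookup
third = insert_or_lookup(conn, "NC_011452")
conn.execute("COMMIT")

count = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]
```

With this shape the duplicate surfaces as a real error inside the savepoint,
so no rule is needed to turn it into a silent no-op.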

Hilmar,

It seems for Biopython to work properly with BioSQL on PostgreSQL
these bioentry rules should be removed from the schema (as the
comments in the schema do suggest). Obviously doing this would
break any installation also using the current version of bioperl-db.
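
As a rough illustration of why the rules matter to a client that relies on
an exception, here is a small Python sketch. It uses in-memory SQLite with a
hypothetical cut-down bioentry table, and SQLite's INSERT OR IGNORE is only
an assumed stand-in for a rule that does nothing on a duplicate; it is not
the actual BioSQL rule or the Biopython loader code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical cut-down bioentry table with a unique accession.
conn.execute(
    "CREATE TABLE bioentry ("
    " bioentry_id INTEGER PRIMARY KEY,"
    " accession TEXT NOT NULL UNIQUE)"
)
conn.execute("INSERT INTO bioentry (accession) VALUES ('NC_003992')")

# Without any rule: the duplicate insert raises, so the caller notices.
try:
    conn.execute("INSERT INTO bioentry (accession) VALUES ('NC_003992')")
    raised = False
except sqlite3.IntegrityError:
    raised = True

# With a "do nothing" behaviour: no exception, zero rows affected.
cur = conn.execute(
    "INSERT OR IGNORE INTO bioentry (accession) VALUES ('NC_003992')"
)
ignored_rowcount = cur.rowcount

total = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]
```

A client that only watches for the exception sees nothing wrong in the
second case; it would have to check the affected row count instead.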

Do the RULES affect BioJava or BioRuby using BioSQL on
PostgreSQL?

Are you happy to remove these RULES in BioSQL v1.0.x (after
making the outlined transactional changes in bioperl-db)?

Thanks,

Peter


From hlapp at gmx.net  Fri May 22 15:03:11 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 22 May 2009 11:03:11 -0400
Subject: [BioSQL-l] RULES in BioSQL PostgreSQL schema
In-Reply-To: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
Message-ID: <CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>


On May 22, 2009, at 8:27 AM, Peter wrote:

> Are you happy to remove these RULES in BioSQL v1.0.x (after
> making the outlined transactional changes in bioperl-db)?

In principle yes. It would also mean dropping support for PostgreSQL  
v7.x, but I would hope that that's a non-issue.

But if anyone here is still using and relying on PostgreSQL v7.x (or  
earlier?) do let us know, please.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Fri May 22 15:57:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 16:57:38 +0100
Subject: [BioSQL-l] [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
	<CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
Message-ID: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>

On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>
> On May 22, 2009, at 8:27 AM, Peter wrote:
>
>> Are you happy to remove these RULES in BioSQL v1.0.x (after
>> making the outlined transactional changes in bioperl-db)?
>
> In principle yes. It would also mean dropping support for PostgreSQL v7.x,
> but I would hope that that's a non-issue.
>
> But if anyone here is still using and relying on PostgreSQL v7.x (or
> earlier?) do let us know, please.

Great.

In the meantime could you add a big warning about this issue to the
INSTALL notes for PostgreSQL (i.e. recommend removing the RULES
section if not using bioperl-db)?
http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL

Peter


From hlapp at gmx.net  Fri May 22 18:20:58 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 22 May 2009 14:20:58 -0400
Subject: [BioSQL-l] [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
	<CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
	<320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>
Message-ID: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net>

Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar

On May 22, 2009, at 11:57 AM, Peter wrote:

> On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>>
>> On May 22, 2009, at 8:27 AM, Peter wrote:
>>
>>> Are you happy to remove these RULES in BioSQL v1.0.x (after
>>> making the outlined transactional changes in bioperl-db)?
>>
>> In principle yes. It would also mean dropping support for PostgreSQL
>> v7.x, but I would hope that that's a non-issue.
>>
>> But if anyone here is still using and relying on PostgreSQL v7.x (or
>> earlier?) do let us know, please.
>
> Great.
>
> In the meantime could you add a big warning about this issue to the
> INSTALL notes for PostgreSQL (i.e. recommend removing the RULES
> section if not using bioperl-db)?
> http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL
>
> Peter

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Fri May 22 22:46:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 23:46:54 +0100
Subject: [BioSQL-l] [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
	<CFCA6516-FFEA-4D8A-9E51-BCBAFC0F67A7@gmx.net>
	<320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>
	<410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net>
Message-ID: <320fb6e00905221546i26edc7a2u2a02fb0d01c374ea@mail.gmail.com>

On 5/22/09, Hilmar Lapp <hlapp at gmx.net> wrote:
> Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar

I've filed Bug 2839, hopefully this is what you had in mind:
http://bugzilla.open-bio.org/show_bug.cgi?id=2839

Peter


From michael.watson at bbsrc.ac.uk  Wed May 27 12:50:45 2009
From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C))
Date: Wed, 27 May 2009 13:50:45 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <0212C167-7618-4761-A191-C6CE4B41EC2A@gmx.net>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk>
	<68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
	<320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>
	<8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>
	<0212C167-7618-4761-A191-C6CE4B41EC2A@gmx.net>
Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E27A82@iahce2ksrv1.iah.bbsrc.ac.uk>

Hi Hilmar

I tried to dig around in the code, but quite frankly I quickly got lost.
What is clear is that the existing reference is not being found in the
cache nor the database, and therefore a unique key violation occurs when
the code tries to insert the object.

I'm pretty stuffed on this project until I can get this sorted out.

If someone tells me where to look I can try and sort out why this
happens, but at the moment (for me) it's like looking for a needle in a
haystack.
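
One place to start might be checking which reference row already holds the
CRC and whether any bioentry still points at it, since deleting bioentries
does not cascade to references. The sketch below is in Python against
in-memory SQLite; the cut-down tables reuse the table and column names from
Peter's query, while the rows and the second CRC value are made-up toy data.
The LEFT JOINs keep references that no bioentry links to, which an inner
join would hide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Cut-down, hypothetical versions of the three tables, with toy rows.
conn.executescript(
    """
    CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, accession TEXT);
    CREATE TABLE reference (
        reference_id INTEGER PRIMARY KEY, title TEXT, crc TEXT UNIQUE);
    CREATE TABLE bioentry_reference (
        bioentry_id INTEGER, reference_id INTEGER);
    INSERT INTO bioentry VALUES (1, 'NC_011452');
    INSERT INTO reference
        VALUES (1, 'Direct Submission', 'CRC-E8D3CBBD80002FA1');
    INSERT INTO reference
        VALUES (2, 'Direct Submission', 'CRC-HYPOTHETICAL-ORPHAN');
    INSERT INTO bioentry_reference VALUES (1, 1);
    """
)

QUERY = """
SELECT reference.reference_id, reference.crc, bioentry.accession
FROM reference
LEFT JOIN bioentry_reference
       ON bioentry_reference.reference_id = reference.reference_id
LEFT JOIN bioentry
       ON bioentry.bioentry_id = bioentry_reference.bioentry_id
WHERE reference.crc = ?
"""


def who_uses(conn, crc):
    """Return (reference_id, crc, accession) rows for a CRC; accession is
    None when the reference exists but no bioentry links to it."""
    return conn.execute(QUERY, (crc,)).fetchall()


linked = who_uses(conn, "CRC-E8D3CBBD80002FA1")
orphan = who_uses(conn, "CRC-HYPOTHETICAL-ORPHAN")
```

A row with no accession would mean the reference is in the table but no
longer attached to any bioentry.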

Thanks in advance

Mick
