From biopython at maubp.freeserve.co.uk Thu May 14 14:20:47 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 14 May 2009 19:20:47 +0100
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL
Message-ID: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>

Hi,

This is cross-posted between biopython-dev and biosql-l as it regards
parsing the description (DE) lines in SwissProt files and how they are
stored in BioSQL. This follows from an earlier discussion on
biopython-dev.

Older SwissProt files just had one or two DE lines, and it made sense
to treat this as a simple string mapped onto the description field in
the bioentry table in BioSQL. This appears to be what happens with
BioPerl 1.5.x and in Biopython (although the details regarding white
space differ). However, newer SwissProt files have many DE lines with
additional structure. The example Michiel gave earlier on the
biopython-dev list was:

http://www.uniprot.org/uniprot/Q9XHP0.txt

This has the following DE lines:

DE RecName: Full=11S globulin seed storage protein 2;
DE AltName: Full=11S globulin seed storage protein II;
DE AltName: Full=Alpha-globulin;
DE Contains:
DE RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE AltName: Full=11S globulin seed storage protein II acidic chain;
DE Contains:
DE RecName: Full=11S globulin seed storage protein 2 basic chain;
DE AltName: Full=11S globulin seed storage protein II basic chain;
DE Flags: Precursor;

I had to fight with perl to get my old copy of BioPerl working again
(some weak reference thing), but I managed, and then loaded this file
into my test BioSQL database with:

$ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass XXX --namespace biosql_test --format swiss Q9XHP0.txt

Then I looked at the resulting description in the main bioentry table:

$ mysql --user=root -p biosql_test -e 'SELECT description FROM bioentry WHERE accession="Q9XHP0";'

This is stored as one huge long string
(without the newlines, I'm not sure if BioPerl strips those in parsing
the file, or when loading it into the database):

RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S
globulin seed storage protein II; AltName: Full=Alpha-globulin;
Contains: RecName: Full=11S globulin seed storage protein 2 acidic
chain; AltName: Full=11S globulin seed storage protein II acidic
chain; Contains: RecName: Full=11S globulin seed storage protein 2
basic chain; AltName: Full=11S globulin seed storage protein II basic
chain; Flags: Precursor;

For Biopython, I emptied the database then did:

>>> from Bio import SeqIO
>>> from BioSQL import BioSeqDatabase
>>> server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd = "XXX", host = "localhost", db="biosql_test")
>>> db = server["biosql-test"] #namespace
>>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss"))
1
>>> server.commit()

As before, I looked in the table with mysql. Again - this stores the
full description from the DE line, although with the newlines
embedded. So, Biopython is consistent with my old copy of BioPerl
(1.5.x) if we ignore the white space.

However, how does this look in BioPerl 1.6? If this is the same, are
there any plans to change this? For Biopython we have discussed
recording most of the DE information under the annotations instead
(keyed off RecName, AltName, Contains, Flags), but I would like to be
consistent with BioPerl+BioSQL.

Thanks

Peter

From biopython at maubp.freeserve.co.uk Sat May 16 07:53:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 12:53:07 +0100
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
Message-ID: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>

Hi all,

You may recall a year ago or so, we talked about how BioPerl and
Biopython used lower case alphabet names ("dna", "rna", "protein")
while BioJava was inconsistent and used upper (or even mixed case).
http://lists.open-bio.org/pipermail/biopython/2007-November/003894.html
http://lists.open-bio.org/pipermail/biojava-l/2007-November/006034.html
http://lists.open-bio.org/pipermail/biosql-l/2008-March/001185.html

You'll notice that thread was split over several mailing lists (and
looking back, I think I missed some posts as I only read the Biopython
and BioSQL lists). Anyway, this led to the following proposal:

http://www.biosql.org/wiki/Enhancement_Requests#Check_constraint_on_biosequence.alphabet

In Biopython we also use "unknown" for sequences which are not known
to be "dna", "rna" or "protein". I presume this was copying BioPerl.

In a recent bug report (Bug 2829) it was pointed out that we
(Biopython) don't attempt to record nucleotide alphabets in BioSQL
(i.e. a sequence which could be DNA or RNA but we don't know which);
they just get "unknown" as their biosequence.alphabet entry.

Is there any precedent in BioPerl, BioJava or BioRuby for how to
handle this? If not, I'd like to introduce and agree on "nucleotide"
for this situation.

Peter

From biopython at maubp.freeserve.co.uk Sat May 16 08:12:01 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 13:12:01 +0100
Subject: [BioSQL-l] BioSQL at BOSC 2009?
Message-ID: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>

Hi,

Will any of the key BioSQL people from the Bio* projects be at BOSC
(and ISMB) this year? http://open-bio.org/wiki/BOSC_2009

There will be several people from Biopython there this year, including
me and Brad Chapman who are both familiar with BioSQL. This would be a
nice opportunity for further improving BioSQL compatibility between
the Bio* projects - something that has been suggested in the past, e.g.
http://lists.open-bio.org/pipermail/biopython/2007-November/003893.html
http://lists.open-bio.org/pipermail/biojava-l/2007-November/006037.html

I don't follow the BioPerl, BioJava or BioRuby mailing lists - and I
doubt many of their developers follow the Biopython mailing lists. So,
rather than having any BioSQL compatibility discussions split over
individual Bio* project specific mailing lists, it seems using the
BioSQL mailing list is most appropriate. I have CC'd a few key people
just in case they are not on the BioSQL mailing list; if I have missed
anyone please forward this to them and ask them to sign up.

Thanks,

Peter

From markjschreiber at gmail.com Sat May 16 10:58:19 2009
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sat, 16 May 2009 22:58:19 +0800
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <93b45ca50905160755o4e5c9520n55bc5b84774f277a@mail.gmail.com>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com> <93b45ca50905160755o4e5c9520n55bc5b84774f277a@mail.gmail.com>
Message-ID: <93b45ca50905160758j7c9f1d78k9ec49008d10f2e4f@mail.gmail.com>

I don't think you can do this with certainty. If you don't know the
source alphabet then an amino acid sequence could look like dna if it
is only using acgt and some of the ambiguity codes.

If it is a long sequence it will become increasingly unlikely it is
amino acid but never certain.

On 16 May 2009, 7:54 PM, "Peter" wrote:

Hi all,

You may recall a year ago or so, we talked about how BioPerl and
Biopython used lower case alphabet names ("dna", "rna", "protein")
while BioJava was inconsistent and used upper (or even mixed case).
http://lists.open-bio.org/pipermail/biopython/2007-November/003894.html
http://lists.open-bio.org/pipermail/biojava-l/2007-November/006034.html
http://lists.open-bio.org/pipermail/biosql-l/2008-March/001185.html

You'll notice that thread was split over several mailing lists (and
looking back, I think I missed some posts as I only read the Biopython
and BioSQL lists). Anyway, this led to the following proposal:

http://www.biosql.org/wiki/Enhancement_Requests#Check_constraint_on_biosequence.alphabet

In Biopython we also use "unknown" for sequences which are not known
to be "dna", "rna" or "protein". I presume this was copying BioPerl.

In a recent bug report (Bug 2829) it was pointed out that we
(Biopython) don't attempt to record nucleotide alphabets in BioSQL
(i.e. a sequence which could be DNA or RNA but we don't know which);
they just get "unknown" as their biosequence.alphabet entry.

Is there any precedent in BioPerl, BioJava or BioRuby for how to
handle this? If not, I'd like to introduce and agree on "nucleotide"
for this situation.

Peter
_______________________________________________
BioSQL-l mailing list
BioSQL-l at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biosql-l

From hlapp at gmx.net Sat May 16 11:17:39 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 11:17:39 -0400
Subject: [BioSQL-l] BioSQL at BOSC 2009?
In-Reply-To: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>
References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com>
Message-ID: <1BD503B3-D805-4882-87DD-820138792DB2@gmx.net>

On May 16, 2009, at 8:12 AM, Peter wrote:

> Will any of the key BioSQL people from the Bio* projects be at BOSC
> (and ISMB) this year? http://open-bio.org/wiki/BOSC_2009

Yes, I'll be there (though I am not presenting this year).

> [...] This would be a nice opportunity for further improving BioSQL
> compatibility between the Bio* projects - something that has been
> suggested in the past,

Indeed, excellent idea.
Should we plan for a BoF?

	-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Sat May 16 12:48:40 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 12:48:40 -0400
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>
Message-ID: <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>

On May 16, 2009, at 7:53 AM, Peter wrote:

> In a recent bug report (Bug 2829) it was pointed out that we
> (Biopython) don't attempt to record nucleotide alphabets in BioSQL
> (i.e. a sequence which could be DNA or RNA but we don't know which),
> they just get "unknown" as their biosequence.alphabet entry.

I'm assuming that you do know that it's not protein, right? I.e.,
assigning alphabet "unknown" isn't exactly right.

> Is there any precedent in BioPerl, BioJava or BioRuby for how to
> handle this? If not, I'd like to introduce and agree on "nucleotide"
> for this situation.

So which letters (symbols) does the "nucleotide" alphabet contain?

Getting back to Mark's question, how do you know that it's either dna
or rna but not protein? Is the problem that the user can't tell you
whether it's dna or rna but they know it's not protein, or is it that
the user doesn't say anything and all you have is the symbols of the
sequence, which are a, c, g, and t only.

In BioPerl we'll guess the alphabet if the user doesn't say what it
is, and at present if what we're seeing are the symbols a, c, g, and t
only, then the guess is dna. If we're seeing u rather than t, we guess
it's rna. An "unknown" alphabet would be for the user to expressly
choose.
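
The guessing rule described above can be illustrated with a few lines of Python. This is only a sketch of the rule as stated in this message (a, c, g, t only gives dna; u rather than t gives rna); `guess_alphabet` is a hypothetical helper, not BioPerl's or Biopython's actual code, and returning `None` when neither rule applies is an assumption made here for illustration.

```python
def guess_alphabet(sequence):
    """Guess "dna" or "rna" from the symbols alone; None means no guess.

    Hypothetical helper: it encodes only the two rules quoted in the
    thread and says nothing about protein or ambiguity codes.
    """
    symbols = set(sequence.lower())
    if not symbols:
        return None
    if symbols <= set("acgt"):
        # Only a, c, g and/or t were seen, so the guess is DNA.
        return "dna"
    if symbols <= set("acgu"):
        # A u appears where a t would otherwise be, so the guess is RNA.
        return "rna"
    return None


# A sequence with neither t nor u still gets guessed as DNA.
print(guess_alphabet("GCGCGCGA"))
# A short peptide made only of G, A, T and C residues is also guessed as DNA.
print(guess_alphabet("GATTACA"))
```

The second call shows the caveat raised earlier in the thread: symbols alone cannot rule out a protein sequence, they only make it less likely as the sequence gets longer.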

	-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Sat May 16 16:25:21 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 21:25:21 +0100
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com> <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net>
Message-ID: <320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com>

Hilmar wrote:
> I'm assuming that you do know that it's not protein, right?
> I.e., assigning alphabet "unknown" isn't exactly right.

Yes, if the sequence is using the generic nucleotide alphabet this
means it is NOT protein, and could be DNA or RNA. So yes, downgrading
a "nucleotide" alphabet to just "unknown" when storing it in BioSQL
(as we do now) is losing information - hence me starting this thread.

> > Is there any precedent in BioPerl, BioJava or BioRuby for how to
> > handle this? If not, I'd like to introduce and agree on "nucleotide"
> > for this situation.
>
> So which letters (symbols) does the "nucleotide" alphabet contain?

Potentially anything - although I would expect the standard
(ambiguous) letters used in RNA or DNA, plus perhaps gap symbols.

> Getting back to Mark's question, how do you know that it's either dna or
> rna but not protein?

We know because the user (or parser) has explicitly used the generic
nucleotide alphabet; this means it is not protein, and is either DNA
or RNA. From the point of view of loading the sequence into BioSQL, we
don't know or care where the sequence came from - we just get given
the data with a declared alphabet.

> Is the problem that the user can't tell you whether it's dna or
> rna but they know it's not protein, or is it that the user doesn't
> say anything and all you have is the symbols of the sequence,
> which are a, c, g, and t only.

In the situation I'm talking about, either the user has explicitly
picked the alphabet, or perhaps one of our parsers has done so. This
would be because the user doesn't know, or the file format doesn't
specify this information. This is admittedly a corner case - generally
there will be either T or U entries in the sequence so DNA or RNA can
be deduced unambiguously.

> In BioPerl we'll guess the alphabet if the user doesn't say what it is, and
> at present if what we're seeing are the symbols a, c, g, and t only, then
> the guess is dna. If we're seeing u rather than t, we guess it's rna. An
> "unknown" alphabet would be for the user to expressly choose.

What would BioPerl do with the nucleotide sequence GCGCGCGA?
Presumably you guess, thus record either "dna" or "rna" in BioSQL, so
the issue of wanting to record "nucleotide" never arises.

In Python "guessing" is discouraged. If we have a nucleotide sequence
like GCGCGCGA, this could be DNA or RNA - you can't tell. Our
nucleotide alphabet covers this situation, although another strong
reason for having it is as a common base class for the RNA and DNA
alphabets.

On 5/16/09, Mark Schreiber wrote:
> I don't think you can do this with certainty. If you don't know the source
> alphabet then an amino acid sequence could look like dna if it is only
> using acgt and some of the ambiguity codes.
>
> If it is a long sequence it will become increasingly unlikely it is amino
> acid but never certain.

The Python answer is don't guess. If you read in a FASTA file with
Biopython it will by default be given a generic alphabet, unless you
explicitly specify otherwise (and in BioSQL the alphabet will be
stored as "unknown"). I.e. the onus is on the user to be explicit.

Peter

From biopython at maubp.freeserve.co.uk Sat May 16 17:23:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 22:23:04 +0100
Subject: [BioSQL-l] BioSQL at BOSC 2009?
In-Reply-To: <1BD503B3-D805-4882-87DD-820138792DB2@gmx.net>
References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com> <1BD503B3-D805-4882-87DD-820138792DB2@gmx.net>
Message-ID: <320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com>

On 5/16/09, Hilmar Lapp wrote:
>
> On May 16, 2009, at 8:12 AM, Peter wrote:
>
> > Will any of the key BioSQL people from the Bio* projects be at BOSC
> > (and ISMB) this year? http://open-bio.org/wiki/BOSC_2009
> >
>
> Yes, I'll be there (though I am not presenting this year).
>
> > [...] This would be a nice opportunity for further improving BioSQL
> > compatibility between the Bio* projects - something that has been
> > suggested in the past,
>
> Indeed, excellent idea. Should we plan for a BoF?

If you want to do this as a formal BoF, then sure.

Brad and I (plus other Biopython folk like Tiago and Bartek, who I
believe are not so interested in BioSQL) are already talking about a
Biopython BoF/hackathon session at BOSC. It would be easier if that
didn't overlap with a BioSQL session ;) (but not impossible - Brad and
I can perhaps split our time?)

I will be staying for all of ISMB, and I think Brad is about for the
Monday and maybe Tuesday, so that might be an alternative for
scheduling.

Peter

From hlapp at gmx.net Sat May 16 17:57:15 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 17:57:15 -0400
Subject: [BioSQL-l] BioSQL at BOSC 2009?
In-Reply-To: <320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com>
References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com> <1BD503B3-D805-4882-87DD-820138792DB2@gmx.net> <320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com>
Message-ID: <74D4CC78-FC7B-4595-9D24-EB6B3ED43318@gmx.net>

On May 16, 2009, at 5:23 PM, Peter wrote:

> I will be staying for all of ISMB

I am too. Should we doodle something once the program is out?

	-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Sat May 16 18:10:43 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 18:10:43 -0400
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com> <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net> <320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com>
Message-ID: <9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net>

I think we'll have to define carefully what we mean by "generic
nucleotide alphabet". (Normally I hear nucleotide used as the type of
a sequence, but not its alphabet.)

A nucleotide alphabet in the way you describe it also can't really be
the "base class" for either a DNA or RNA alphabet, can it? Typically
in OOP, derived classes expand on a base class, not restrict it. So
isn't there potential for confusion?

What you are essentially talking about is the case when a sequence
contains only A, C, and G. In that case, we don't know either that
it's not protein, do we?

> [...] In python "guessing" is discouraged. If we have a nucleotide
> sequence
> like GCGCGCGA, this could be DNA or RNA - you can't tell.

And how do you tell it's nucleotide to begin with?

	-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Sat May 16 18:34:57 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 16 May 2009 18:34:57 -0400
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL
In-Reply-To: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com>
Message-ID: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>

Don't you love SwissProt (or UniProt as we must call it now I
suppose). They (understandably) try to squeeze ever more annotation
into the existing tags, rather than adding new tags.

So, of the following structure:

DE RecName: Full=11S globulin seed storage protein 2;
DE AltName: Full=11S globulin seed storage protein II;
DE AltName: Full=Alpha-globulin;
DE Contains:
DE RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE AltName: Full=11S globulin seed storage protein II acidic chain;
DE Contains:
DE RecName: Full=11S globulin seed storage protein 2 basic chain;
DE AltName: Full=11S globulin seed storage protein II basic chain;
DE Flags: Precursor;

really only the first line, with the 'RecName: Full=' removed, is the
description line as we know it. The rest, I would say, is annotation,
such as two alternative names, amino acid chains contained in the full
record (shouldn't this be feature annotation, really? and indeed it is
- why it needs to be repeated here is beyond me) and their names as
well as alternative names, and the fact that the sequence is a
precursor form.

Leaving all this in one string has the advantage that we can
round-trip it (and there is probably hardly any other way to
accomplish that), but clearly in terms of semantics this isn't the
sequence description as we know it anymore.
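
The split described above (first 'RecName: Full=' value as the description, everything else as annotation) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not BioPerl's or Biopython's parser: `split_de_lines` and `DE_LINES` are hypothetical names, the annotations are kept as a flat, ordered list of (key, value) pairs, and the nesting implied by 'Contains:' is deliberately not modelled.

```python
# The DE lines of Q9XHP0 as quoted in this thread.
DE_LINES = [
    "DE RecName: Full=11S globulin seed storage protein 2;",
    "DE AltName: Full=11S globulin seed storage protein II;",
    "DE AltName: Full=Alpha-globulin;",
    "DE Contains:",
    "DE RecName: Full=11S globulin seed storage protein 2 acidic chain;",
    "DE AltName: Full=11S globulin seed storage protein II acidic chain;",
    "DE Contains:",
    "DE RecName: Full=11S globulin seed storage protein 2 basic chain;",
    "DE AltName: Full=11S globulin seed storage protein II basic chain;",
    "DE Flags: Precursor;",
]


def split_de_lines(lines):
    """Return (description, annotations) for a block of DE lines.

    Hypothetical helper: the first "RecName: Full=..." value becomes the
    description; every other line becomes a (key, value) pair, in order.
    """
    description = None
    annotations = []
    for line in lines:
        # Drop the two-letter line code and surrounding white space.
        text = line[2:].strip() if line.startswith("DE") else line.strip()
        if not text:
            continue
        key, _, value = text.partition(":")
        value = value.strip().rstrip(";")
        if description is None and key == "RecName" and value.startswith("Full="):
            description = value[len("Full="):]
        else:
            annotations.append((key, value))
    return description, annotations


description, annotations = split_de_lines(DE_LINES)
print(description)
for key, value in annotations:
    print(key, "->", value)
```

A flat list keeps the original order, so the one-string form could still be rebuilt from it, but which RecName belongs to which 'Contains:' block is only implicit in that order.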

Does anyone else think too that completely changing the semantics of
sequence annotation fields is a bad idea?

My inclination from a BioPerl perspective is to extract the part
following 'RecName: Full=' as the description, and attach the rest as
annotation. We could in fact use the TagTree class for this. I'm
cross-posting to BioPerl too to gather what other BioPerl'ers think
about this.

	-hilmar

On May 14, 2009, at 2:20 PM, Peter wrote:

> Hi,
>
> This is cross-posted between biopython-dev and biosql-l as it regards
> parsing the description (DE) lines in SwissProt files and how they are
> stored in BioSQL. This follows from an earlier discussion on
> biopython-dev
>
> Older SwissProt files just had one or two DE lines, and it made sense
> to treat this as a simple string mapped onto the description field in
> the bioentry table in BioSQL. This appears to be what happens with
> BioPerl 1.5.x and in Biopython (although the details regarding white
> space differ). However, newer SwissProt files have many DE lines with
> additional structure.
> The example Michiel gave earlier on the
> biopython-dev list was:
>
> http://www.uniprot.org/uniprot/Q9XHP0.txt
>
> This has the following DE lines:
>
> DE RecName: Full=11S globulin seed storage protein 2;
> DE AltName: Full=11S globulin seed storage protein II;
> DE AltName: Full=Alpha-globulin;
> DE Contains:
> DE RecName: Full=11S globulin seed storage protein 2 acidic chain;
> DE AltName: Full=11S globulin seed storage protein II acidic chain;
> DE Contains:
> DE RecName: Full=11S globulin seed storage protein 2 basic chain;
> DE AltName: Full=11S globulin seed storage protein II basic chain;
> DE Flags: Precursor;
>
> I had to fight with perl to get my old copy of BioPerl working again
> (some weak reference thing), but I managed, and then loaded this file
> into my test BioSQL database with:
>
> $ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass XXX --namespace biosql_test --format swiss Q9XHP0.txt
>
> Then I looked at the resulting description in the main bioentry table:
>
> $ mysql --user=root -p biosql_test -e 'SELECT description FROM bioentry WHERE accession="Q9XHP0";'
>
> This is stored as one huge long string (without the newlines, I'm not
> sure if BioPerl strips those in parsing the file, or when loading it
> into the database):
>
> RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S
> globulin seed storage protein II; AltName: Full=Alpha-globulin;
> Contains: RecName: Full=11S globulin seed storage protein 2 acidic
> chain; AltName: Full=11S globulin seed storage protein II acidic
> chain; Contains: RecName: Full=11S globulin seed storage protein 2
> basic chain; AltName: Full=11S globulin seed storage protein II basic
> chain; Flags: Precursor;
>
> For Biopython, I emptied the database then did:
>
> >>> from Bio import SeqIO
> >>> from BioSQL import BioSeqDatabase
> >>> server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd = "XXX", host = "localhost", db="biosql_test")
> >>> db =
server["biosql-test"] #namespace
> >>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss"))
> 1
> >>> server.commit()
>
> As before, I looked in the table with mysql. Again - this stores the
> full description from the DE line, although with the newlines
> embedded. So, Biopython is consistent with my old copy of BioPerl
> (1.5.x) if we ignore the white space.
>
> However, how does this look in BioPerl 1.6? If this is the same, are
> there any plans to change this? For Biopython we have discussed
> recording most of the DE information under the annotations instead
> (keyed off RecName, AltName, Contains, Flags), but I would like to be
> consistent with BioPerl+BioSQL.
>
> Thanks
>
> Peter

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Sat May 16 19:06:41 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 00:06:41 +0100
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
In-Reply-To: <9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net>
References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com> <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net> <320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com> <9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net>
Message-ID: <320fb6e00905161606l5fdb0862mf25a45dad07dac8@mail.gmail.com>

On 5/16/09, Hilmar Lapp wrote:
>
> I think we'll have to define carefully what we mean by "generic nucleotide
> alphabet". (Normally I hear nucleotide used as the type of a sequence, but
> not its alphabet.)

In Biopython the type of a sequence (e.g. DNA, RNA or Protein) is
recorded by an alphabet object (which may also record the expected
range of letters).

> A nucleotide alphabet in the way you describe it also can't really be the
> "base class" for either a DNA or RNA alphabet, can it? Typically in OOP,
> derived classes expand on a base class, not restrict it. So isn't there
> potential for confusion?

Well, that's how it was done for the Biopython alphabet classes. I'm
simplifying slightly, but at the top level we have a generic alphabet,
which has as children generic protein and generic nucleotide (which
has as its children generic dna and generic rna). Each of these then
has IUPAC subclasses which are further restrictions where the valid
letters are prescribed.

> What you are essentially talking about is the case when a sequence
> contains only A, C, and G. In that case, we don't know either that
> it's not protein, do we?
>
> > [...] In python "guessing" is discouraged. If we have a nucleotide
> > sequence like GCGCGCGA, this could be DNA or RNA - you can't
> > tell.
>
> And how do you tell it's nucleotide to begin with?

That is the whole point. When deciding what to record in the
biosequence.alphabet field in BioSQL we (Biopython) can only go by the
alphabet associated with the sequence object. Whoever created the
sequence specified the alphabet based on meta data, external
knowledge, or a guess. If this was done by a parser, then the file
format itself may have specified the sequence type.

If none of BioPerl, BioJava and BioRuby have an analogous sequence
representation for a nucleotide sequence which might be DNA or RNA,
then perhaps the current situation with only "protein", "dna", "rna"
and "unknown" in the biosequence.alphabet field in BioSQL is
sufficient.
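
The hierarchy described above, and the way a declared alphabet ends up as a biosequence.alphabet string, can be sketched as follows. The classes are hypothetical stand-ins written for this illustration (they are not imported from Biopython and the IUPAC subclasses are omitted), and the `allow_nucleotide` switch is an assumption that models the proposed "nucleotide" value next to the current fall-back to "unknown".

```python
class Alphabet:
    """Generic alphabet: nothing is known about the sequence type."""


class ProteinAlphabet(Alphabet):
    """Generic protein."""


class NucleotideAlphabet(Alphabet):
    """Generic nucleotide: DNA or RNA, but not known which."""


class DNAAlphabet(NucleotideAlphabet):
    """Generic DNA."""


class RNAAlphabet(NucleotideAlphabet):
    """Generic RNA."""


def biosql_alphabet(alphabet, allow_nucleotide=False):
    """Map a declared alphabet object to a biosequence.alphabet string.

    Hypothetical helper: no symbols are inspected, so nothing is guessed.
    The most specific classes are tested first.
    """
    if isinstance(alphabet, DNAAlphabet):
        return "dna"
    if isinstance(alphabet, RNAAlphabet):
        return "rna"
    if isinstance(alphabet, ProteinAlphabet):
        return "protein"
    if allow_nucleotide and isinstance(alphabet, NucleotideAlphabet):
        # Proposed value: keeps the "not protein" information.
        return "nucleotide"
    return "unknown"


# Current behaviour as described in the thread: the information is lost.
print(biosql_alphabet(NucleotideAlphabet()))
# With the proposed extra value it would be kept.
print(biosql_alphabet(NucleotideAlphabet(), allow_nucleotide=True))
```

Because the mapping only looks at the declared class, the question of how the sequence came to be declared nucleotide stays with whoever (user or parser) created the sequence object.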

Peter

From biopython at maubp.freeserve.co.uk Sat May 16 19:14:54 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 17 May 2009 00:14:54 +0100
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL
In-Reply-To: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
Message-ID: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com>

On 5/16/09, Hilmar Lapp wrote:
>
> Don't you love SwissProt (or UniProt as we must call it now I suppose).
> They (understandably) try to squeeze ever more annotation into the existing
> tags, rather than adding new tags.
>
> So, of the following structure:
>
> DE RecName: Full=11S globulin seed storage protein 2;
> DE AltName: Full=11S globulin seed storage protein II;
> DE AltName: Full=Alpha-globulin;
> DE Contains:
> DE RecName: Full=11S globulin seed storage protein 2 acidic chain;
> DE AltName: Full=11S globulin seed storage protein II acidic chain;
> DE Contains:
> DE RecName: Full=11S globulin seed storage protein 2 basic chain;
> DE AltName: Full=11S globulin seed storage protein II basic chain;
> DE Flags: Precursor;
>
> really only the first line, with the 'RecName: Full=' removed, is the
> description line as we know it. The rest, I would say, is annotation, such
> as two alternative names, amino acid chains contained in the full record
> (shouldn't this be feature annotation, really? and indeed it is - why it
> needs to be repeated here is beyond me) and their names as well as
> alternative names, and the fact that the sequence is a precursor form.
>
> Leaving all this in one string has the advantage that we can round-trip it
> (and there is probably hardly any other way to accomplish that), but clearly
> in terms of semantics this isn't the sequence description as we know it
> anymore.
>
> Does anyone else think too that completely changing the semantics of
> sequence annotation fields is a bad idea?

+1

That's pretty much what I thought on seeing this the first time.

> My inclination from a BioPerl perspective is to extract the part following
> 'RecName: Full=' as the description, and attach the rest as annotation. We
> could in fact use the TagTree class for this. I'm cross-posting to BioPerl
> too to gather what other BioPerl'ers think about this.

Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x,
just treats the DE lines as one big long string?

Could you translate your idea about the TagTree class into something
concrete with BioSQL tables and fields for me? I'm not familiar with
the TagTree (or Perl).

Over on the Biopython list we'd talked about storing this annotation
in a nested structure. However, in order to use the BioSQL annotations
mechanisms, I think a simple flat structure is required :(

Peter

From cjfields at illinois.edu Sat May 16 19:16:05 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Sat, 16 May 2009 18:16:05 -0500
Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL
In-Reply-To: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net>
Message-ID: <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu>

On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote:

> Don't you love SwissProt (or UniProt as we must call it now I
> suppose). They (understandably) try to squeeze ever more annotation
> into the existing tags, rather than adding new tags.
>
> So, of the following structure:
>
> DE RecName: Full=11S globulin seed storage protein 2;
> DE AltName: Full=11S globulin seed storage protein II;
> DE AltName: Full=Alpha-globulin;
> DE Contains:
> DE RecName: Full=11S globulin seed storage protein 2 acidic chain;
> DE AltName: Full=11S globulin seed storage protein II acidic chain;
> DE Contains:
> DE RecName: Full=11S globulin seed storage protein 2 basic chain;
> DE AltName: Full=11S globulin seed storage protein II basic chain;
> DE Flags: Precursor;
>
> really only the first line, with the 'RecName: Full=' removed, is
> the description line as we know it. The rest, I would say, is
> annotation, such as two alternative names, amino acid chains
> contained in the full record (shouldn't this be feature annotation,
> really? and indeed it is - why it needs to be repeated here is
> beyond me) and their names as well as alternative names, and the
> fact that the sequence is a precursor form.
>
> Leaving all this in one string has the advantage that we can
> round-trip it (and there is probably hardly any other way to
> accomplish that), but clearly in terms of semantics this isn't the
> sequence description as we know it anymore.
>
> Does anyone else think too that completely changing the semantics of
> sequence annotation fields is a bad idea?
>
> My inclination from a BioPerl perspective is to extract the part
> following 'RecName: Full=' as the description, and attach the rest
> as annotation. We could in fact use the TagTree class for this. I'm
> cross-posting to BioPerl too to gather what other BioPerl'ers think
> about this.
>
> -hilmar

This is much like the GN issues we've run into before, and we *could*
set this up using TagTree or similar.
In the latter case of gene name the data is stored in a text tree as follows: gene_names: gene_name: Name: GC1QBP Synonyms: HABP1 Synonyms: SF2P32 Synonyms: C1QBP That could be changed to an XML string: GC1QBP HABP1 SF2P32 C1QBP Thinking about this we should attempt to coalesce around a standard instead of forcing the other Bio* to a specific format. chris From biopython at maubp.freeserve.co.uk Sat May 16 19:28:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 May 2009 00:28:43 +0100 Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> Message-ID: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> On 5/17/09, Chris Fields wrote: > > On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote: > > My inclination from a BioPerl perspective is to extract the part following > > 'RecName: Full=' as the description, and attach the rest as annotation. We > > could in fact use the TagTree class for this. I'm cross-posting to BioPerl > > too to gather what other BioPerl'ers think about this. > > > > -hilmar > > > > This is much like the GN issues we've run into before, and we *could* set > this up using TagTree or similar. In the latter case of gene name the data > is stored in a text tree as follows: > > gene_names: > gene_name: > Name: GC1QBP > Synonyms: HABP1 > Synonyms: SF2P32 > Synonyms: C1QBP > > That could be changed to an XML string: > > > > > GC1QBP > HABP1 > SF2P32 > C1QBP > > > > Thinking about this we should attempt to coalesce around a standard instead > of forcing the other Bio* to a specific format. How would you record this in BioSQL? As an XML string for an annotation value? 
Brad has suggested JSON might be useful for this kind of thing (see also per-letter-annotation discussion). Peter From hlapp at gmx.net Sat May 16 19:37:14 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 19:37:14 -0400 Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> Message-ID: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> On May 16, 2009, at 7:28 PM, Peter wrote: >> That could be changed to an XML string: >> >> >> >> >> GC1QBP >> HABP1 >> SF2P32 >> C1QBP >> >> >> >> Thinking about this we should attempt to coalesce around a standard >> instead >> of forcing the other Bio* to a specific format. > > How would you record this in BioSQL? As an XML string for an > annotation value? Yes. A TagTree object can be serialized to XML, and the XML can be stored as the annotation value in BioSQL. As the XML can be read back in, it allows full round-tripping. > Brad has suggested JSON might be useful for this kind of thing (see > also per-letter-annotation discussion). JSON could be another serialization format, but XML is equally or better supported in all languages except JavaScript. Furthermore, you could just send the XML to the browser and have an XSLT (either directly, or indirectly through JavaScript doing the transformation) do the rendering. 
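A minimal sketch of the round trip described above, in Python rather than with BioPerl's TagTree class. The element names (gene_names, gene_name, Name, Synonyms) are an assumption copied from Chris's text tree, since the XML tags themselves did not survive in the archived message, and both function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def gene_names_to_xml(name, synonyms):
    """Serialize one gene name and its synonyms to an XML string.

    Element names mirror the text tree quoted earlier; they are an
    assumption, not a confirmed BioPerl/TagTree output format.
    """
    root = ET.Element("gene_names")
    gene = ET.SubElement(root, "gene_name")
    ET.SubElement(gene, "Name").text = name
    for synonym in synonyms:
        ET.SubElement(gene, "Synonyms").text = synonym
    return ET.tostring(root, encoding="unicode")


def xml_to_gene_names(xml_text):
    """Read the XML string back into (name, [synonyms])."""
    gene = ET.fromstring(xml_text).find("gene_name")
    name = gene.findtext("Name")
    synonyms = [element.text for element in gene.findall("Synonyms")]
    return name, synonyms


# The string in xml_value is what would be stored as a single annotation value.
xml_value = gene_names_to_xml("GC1QBP", ["HABP1", "SF2P32", "C1QBP"])
roundtrip = xml_to_gene_names(xml_value)
```

Because the whole tree lives in one string, the database only ever sees a flat tag/value pair; the nesting is recovered by whichever Bio* project reads the value back.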
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sat May 16 19:42:17 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 19:42:17 -0400 Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com> Message-ID: <8CD4EED1-A689-447F-8F6E-8D2204DD4E86@gmx.net> On May 16, 2009, at 7:14 PM, Peter wrote: > Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x > just > treats the DE lines as only big long string? Yes. > Could you translate your idea about the TagTree class into something > concrete with BioSQL tables and fields for me? [...] Over on the > Biopython list we'd talked about storing this annotation in a nested > structured. That's more or less what TagTree is. > However, in order to use the BioSQL annotations mechanisms, I think > a simple flat structure is required :( Not necessarily. If you have a flat serialization (such as XML) the nested structure isn't needed. Of course that's not a fully normalized relational representation, but if you had one, how often would it be used, how efficient would those queries be (SQL is poor at nested or recursive data structures), and how much pain would it be to write the object-relational mappings? 
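To make the "flat serialization" idea concrete for the DE lines, here is a rough Python sketch under stated assumptions. It takes the first RecName as the description, keeps the other top-level entries as plain tag/value pairs, and serializes each Contains block into one string. It uses JSON instead of XML purely for brevity (a swap, not what was proposed above), it assumes that deeper indentation marks the children of a Contains line, and parse_de is a hypothetical helper, not a Biopython or BioPerl function.

```python
import json

# Example DE block from the thread; the exact column layout is assumed.
DE_LINES = """\
DE   RecName: Full=11S globulin seed storage protein 2;
DE   AltName: Full=11S globulin seed storage protein II;
DE   AltName: Full=Alpha-globulin;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE     AltName: Full=11S globulin seed storage protein II acidic chain;
DE   Contains:
DE     RecName: Full=11S globulin seed storage protein 2 basic chain;
DE     AltName: Full=11S globulin seed storage protein II basic chain;
DE   Flags: Precursor;
"""


def parse_de(text):
    """Return (description, flat list of (tag, value) pairs)."""
    description = None
    top_level = []   # simple tag/value pairs
    contains = []    # one list of (tag, value) per Contains block
    for line in text.splitlines():
        if not line.startswith("DE"):
            continue
        body = line[2:]
        indent = len(body) - len(body.lstrip(" "))
        tag, _, value = body.strip().rstrip(";").partition(":")
        value = value.strip()
        if tag == "Contains":
            contains.append([])                 # start a new nested block
        elif indent > 3 and contains:
            contains[-1].append((tag, value))   # child of the last Contains
        elif tag == "RecName" and description is None:
            # Drop the leading "Full=" to get the classic description.
            description = value[5:] if value.startswith("Full=") else value
        else:
            top_level.append((tag, value))
    # Each nested block becomes one flat value holding a JSON string.
    flat = top_level + [("Contains", json.dumps(block)) for block in contains]
    return description, flat


description, pairs = parse_de(DE_LINES)
```

The nested part stays opaque to SQL, which is the trade-off under discussion: simple tags remain directly queryable, while the Contains values need to be decoded by the client.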
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Sun May 17 08:40:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 May 2009 13:40:47 +0100 Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> Message-ID: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> On 5/17/09, Hilmar Lapp wrote: > > On May 16, 2009, at 7:28 PM, Peter wrote: > > > That could be changed to an XML string: > > > > > > > > > > > > > > > GC1QBP > > > HABP1 > > > SF2P32 > > > C1QBP > > > > > > > > > > > > Thinking about this we should attempt to coalesce around a standard > > > instead of forcing the other Bio* to a specific format. Absolutely - some common standard should be agreed. Would you envision doing this for other structured fields, inventing a new mini XML format each time? That seems open ended and likely to cause a lot of work keeping all the Bio* project synchronised. Here you have mapped RecName and AltName fields in the DE lines to Name and Synonyms (shouldn't that be Synonym singular?). I also don't get why you have used a gene_name entry inside a gene_names list. Would you hold the contains information and the flags information from the DE lines in separate XML entries? I would have gone for something much closer to the original DE line markup i.e. using the field names UniProt use, RecName and AltName, rather than mapping these to Name and Synonym. > > How would you record this in BioSQL? 
As an XML string for an annotation > > value? > > Yes. A TagTree object can be serialized to XML, and the XML can be stored > as the annotation value in BioSQL. As the XML can be read back in, it allows > full round-tripping. Assuming you stored all the DE markup, then yes, a round trip back to the SwissProt file could be possible. And, depending on the details of the XML structure used, it would be possible to represent this in a python structure too. > > Brad has suggested JSON might be useful for this kind of thing (see > > also per-letter-annotation discussion). > > JSON could be another serialization format, but XML is equally or better > supported in all languages except JavaScript. Furthermore, you could just > send the XML to the browser and have an XSLT (either directly, or indirectly > through JavaScript doing the transformation) do the rendering. I have no strong preference for either XML or JSON (but would rather avoid them if they are not really needed). For other types of annotation there may be a clearer advantage for one over the other, e.g. per letter annotation like the secondary structure of a protein sequence, or the quality scores of a nucleotide contig. On 5/17/09, Hilmar Lapp wrote: > Not necessarily. If you have a flat serialization (such as XML) the nested > structure isn't needed. Of course that's not a fully normalized relational > representation, but if you had one, how often would it be used, how > efficient would those queries be (SQL is poor at nested or recursive data > structures), and how much pain would it be to write the object-relational > mappings? In this example, searching the database using one of the SwissProt AltNames (synonyms), or filtering on the Flags sounds like a reasonable request - but this would be very difficult if the data is stored inside XML strings. Of course, because the RecName and AltName entries are top level, we could just record them as normal - simple strings in the annotations table. 
This seems much nicer. Likewise for the "Flags: Precursor;" line, i.e. listing the tag/value pairs which could be used in the bioentry_qualifier_value table: AltName = "Full=11S globulin seed storage protein II" AltName = "Full=Alpha-globulin" Flags = "Precursor" (the RecName field, "Full=11S globulin seed storage protein 2", could be used for the bioentry.description instead) The above are all pretty easy. We only need to consider nesting (or something like XML or JSON) for some of the DE information; in the example discussed, the Contains lines. Even this could be done by storing each contains entry as a single long string (holding both the name and synonyms) directly from the DE line itself, something like this: Contains = "RecName: Full=11S globulin seed storage protein 2 acidic chain;\nAltName: Full=11S globulin seed storage protein II acidic chain;" Contains = "RecName: Full=11S globulin seed storage protein 2 basic chain;\nAltName: Full=11S globulin seed storage protein II basic chain;" Peter From sanjay.harke at gmail.com Sun May 17 09:17:14 2009 From: sanjay.harke at gmail.com (Sanjay Harke) Date: Sun, 17 May 2009 18:47:14 +0530 Subject: [BioSQL-l] BioSQL-l Digest, Vol 62, Issue 3 In-Reply-To: References: Message-ID: <31bb4380905170617k47951f83ia5bed32577a02956@mail.gmail.com> Dear peter, Kindly guide me for developing the connectivity of BioSql to Bioperl? sanjay From hlapp at gmx.net Sun May 17 10:56:29 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 17 May 2009 10:56:29 -0400 Subject: [BioSQL-l] BioSQL-l Digest, Vol 62, Issue 3 In-Reply-To: <31bb4380905170617k47951f83ia5bed32577a02956@mail.gmail.com> References: <31bb4380905170617k47951f83ia5bed32577a02956@mail.gmail.com> Message-ID: http://dx.doi.org/10.1038/npre.2007.1233.1 On May 17, 2009, at 9:17 AM, Sanjay Harke wrote: > Dear peter, > > Kindly guide me for developing the connectivity of BioSql to Bioperl?
> > sanjay > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun May 17 11:21:59 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 17 May 2009 11:21:59 -0400 Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> Message-ID: On May 17, 2009, at 8:40 AM, Peter wrote: > On 5/17/09, Hilmar Lapp wrote: >> >> On May 16, 2009, at 7:28 PM, Peter wrote: >>>> That could be changed to an XML string: >>>> >>>> >>>> >>>> >>>> GC1QBP >>>> HABP1 >>>> SF2P32 >>>> C1QBP >>>> >>>> >>>> >>>> Thinking about this we should attempt to coalesce around a standard >>>> instead of forcing the other Bio* to a specific format. > > [...] Here you have mapped RecName and AltName fields in the DE > lines to > Name and Synonyms (shouldn't that be Synonym singular?). The example is for the GN lines in SwissProt, not the DE lines. > [...] > On 5/17/09, Hilmar Lapp wrote: >> Not necessarily. If you have a flat serialization (such as XML) the >> nested >> structure isn't needed. 
Of course that's not a fully normalized >> relational >> representation, but if you had one, how often would it be used, how >> efficient would those queries be (SQL is poor at nested or >> recursive data >> structures), and how much pain would it be to write the object- >> relational >> mappings? > > In this example, searching the database using one of the SwissProt > AltNames (synonyms), or filtering on the Flags sounds like a > reasonable request - but this would be very difficult if the data is > stored inside XML strings. Actually no. Modern full-text indexers (inside or outside the database) can index XML text columns right away and very well. In fact, for the last project that I built a full-text search for (on top of a BioSQL database) I did that by writing custom XML documents to a separate table for each record I wanted indexed. Oracle's full text indexer did the rest. I also built a separate identifier/name/ accession index that pulled all the gene names, symbols, accession numbers, identifiers etc into a single table for indexing. What I mean is, a fully normalized relational representation, especially if nested, is often not the most efficient data structure for efficient searching and filtering. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Mon May 18 06:03:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 18 May 2009 11:03:52 +0100 Subject: [BioSQL-l] Recording "nucleotide" in the sequence table? 
In-Reply-To: <320fb6e00905161606l5fdb0862mf25a45dad07dac8@mail.gmail.com> References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com> <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net> <320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com> <9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net> <320fb6e00905161606l5fdb0862mf25a45dad07dac8@mail.gmail.com> Message-ID: <320fb6e00905180303m19d0c6e0hdc22ff550e518c6c@mail.gmail.com> On Sun, May 17, 2009 at 12:06 AM, Peter wrote: > If none of BioPerl, BioJava and BioRuby have an analogous > sequence representation for a nucleotide sequence which > might be DNA or RNA, then perhaps the current situation > with only "protein", "dna", "rna" and "unknown" in the > biosequence.alphabet field in BioSQL is sufficient. The original Biopython bug reporter (Bug 2829, David Wyllie) has replied on the bug. In his case, rather than using the generic nucleotide alphabet, he can be a bit more explicit since he does actually know his sequence is DNA, and this does get recorded in BioSQL fine. Given the "nucleotide" alphabet is a corner case in Biopython, and has no analogue in BioPerl, the status quo is fine. i.e. The biosequence.alphabet field should contain "dna", "rna", "protein" or "unknown" (in lower case). Thanks for your thoughts everyone. Peter From michael.watson at bbsrc.ac.uk Mon May 18 08:45:19 2009 From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C)) Date: Mon, 18 May 2009 13:45:19 +0100 Subject: [BioSQL-l] Full text indexing/Searching in MySQL Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> Hi Has anyone implemented full text indexing/searching for BioSQL in MySQL, either using MySQL's full text features or any other solution? Any tips, advice, documentation, code etc available? Thanks Mick Head of Bioinformatics Institute for Animal Health Compton Berks RG20 7NN 01635 578411 Please consider the environment and don't print this e-mail unless you really need to. 
The information contained in this message may be confidential or legally privileged and is intended solely for the addressee. If you have received this message in error please delete it & notify the originator immediately. Unauthorised use, disclosure, copying or alteration of this message is forbidden & may be unlawful. The contents of this e-mail are the views of the sender and do not necessarily represent the views of the Institute. This email, and associated attachments, has been checked locally for viruses but we can accept no responsibility once it has left our systems. Communications on Institute computers are monitored to secure the effective operation of the systems and for other lawful purposes. The Institute for Animal Health is a company limited by guarantee, registered in England no. 559784. The Institute is also a registered charity, Charity Commissioners Reference No. 228824 From hlapp at gmx.net Mon May 18 09:24:34 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 18 May 2009 09:24:34 -0400 Subject: [BioSQL-l] Full text indexing/Searching in MySQL In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> Message-ID: I've done that using Oracle, not MySQL. I assume that's therefore not what you want to hear about and hence will shut up :) -hilmar On May 18, 2009, at 8:45 AM, michael watson (IAH-C) wrote: > Hi > > Has anyone implemented full text indexing/searching for BioSQL in MySQL, > either using MySQL's full text features or any other solution? > > Any tips, advice, documentation, code etc available? > > Thanks > > Mick -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : ===========================================================
From biopython at maubp.freeserve.co.uk Mon May 18 09:26:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 18 May 2009 14:26:40 +0100 Subject: [BioSQL-l] Full text indexing/Searching in MySQL In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> Message-ID: <320fb6e00905180626o4855aa06v6c6ae665885a3fce@mail.gmail.com> On Mon, May 18, 2009 at 1:45 PM, michael watson (IAH-C) wrote: > > Hi > > Has anyone implemented full text indexing/searching for BioSQL in MySQL, > either using MySQL's full text features or any other solution?
> > Any tips, advice, documentation, code etc available? > > Thanks > > Mick Hilmar mentioned he has done something like this on this thread, where he was storing XML strings as annotation values: http://lists.open-bio.org/pipermail/biosql-l/2009-May/001534.html (You've probably read that - but just in case, worth mentioning). Peter From biopython at maubp.freeserve.co.uk Mon May 18 09:38:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 18 May 2009 14:38:03 +0100 Subject: [BioSQL-l] [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> Message-ID: <320fb6e00905180638q29de63c4if0627eff416c4481@mail.gmail.com> On Sun, May 17, 2009 at 4:21 PM, Hilmar Lapp wrote: > > On May 17, 2009, at 8:40 AM, Peter wrote: >> >> [...] Here you have mapped RecName and AltName fields in the DE lines to >> Name and Synonyms (shouldn't that be Synonym singular?). > > The example is for the GN lines in SwissProt, not the DE lines. Ah, that probably explains some of my confusion. >> In this example, searching the database using one of the SwissProt >> AltNames (synonyms), or filtering on the Flags sounds like a >> reasonable request - but this would be very difficult if the data is >> stored inside XML strings. > > Actually no. Modern full-text indexers (inside or outside the database) can > index XML text columns right away and very well. In fact, for the last > project that I built a full-text search for (on top of a BioSQL database) I > did that by writing custom XML documents to a separate table for each > record I wanted indexed. Oracle's full text indexer did the rest. 
I also built a > separate identifier/name/accession index that pulled all the gene names, > symbols, accession numbers, identifiers etc into a single table for > indexing. OK, when I said searching "would be very difficult if the data is stored inside XML strings", maybe it wasn't so difficult for you - but that still sounds complicated! Sticking with the GN lines and the synonym, if this was stored as a simple tag/value as usual in BioSQL, I would write my SQL statement to search the annotation table where the term id was that associated with a GN synonym, and the annotation value was "HABP1". Simple. Using the XML approach, are you suggesting you could do a full text search on the annotation value field, looking for any rows where the field contains "HABP1", where the term id matches the GN lines' XML string? This sounds simplistic and probably rather slow - presumably why you resorted to the more complicated indexing scheme described above? > What I mean is, a fully normalized relational representation, especially if > nested, is often not the most efficient data structure for efficient > searching and filtering. OK. But do we really need to worry about complex nested structures for the SwissProt annotation (or in general)? Peter From jimp at compbio.dundee.ac.uk Mon May 18 10:01:28 2009 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Mon, 18 May 2009 15:01:28 +0100 Subject: [BioSQL-l] BioSQL at BOSC 2009? In-Reply-To: <74D4CC78-FC7B-4595-9D24-EB6B3ED43318@gmx.net> References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com> <1BD503B3-D805-4882-87DD-820138792DB2@gmx.net> <320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com> <74D4CC78-FC7B-4595-9D24-EB6B3ED43318@gmx.net> Message-ID: <4A116A38.9050705@compbio.dundee.ac.uk> Hi all. Hilmar Lapp wrote: > On May 16, 2009, at 5:23 PM, Peter wrote: > >> I will be staying for all of ISMB Same here. > > > I am too. Should we doodle something once the program is out? 
I'll watch out for the URL if you post it to the list! Jim. -- ------------------------------------------------------------------- J. B. Procter (ENFIN/VAMSAS) Barton Bioinformatics Research Group Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk The University of Dundee is a Scottish Registered Charity, No. SC015096. From roy.chaudhuri at gmail.com Mon May 18 13:37:39 2009 From: roy.chaudhuri at gmail.com (Roy Chaudhuri) Date: Mon, 18 May 2009 18:37:39 +0100 Subject: [BioSQL-l] Full text indexing/Searching in MySQL In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> Message-ID: <4A119CE3.3080208@gmail.com> Hi Mick, > Has anyone implemented full text indexing/searching for BioSQL in MySQL, > either using MySQL's full text features or any other solution? I've kind of done this. The trouble is that full text is only implemented on the non-transactional MyISAM tables, not InnoDB (it has long been promised for InnoDB, but no sign yet). My hack solution was to parse out the fields I was interested in (feature tags such as gene and product) and include them in a separate MyISAM table, cross-referenced to BioSQL using seqfeature_id. This involves duplicating data (which is a bad thing), but should be okay if database updates are infrequent. I mimic atomic changes by building an updated version of the MyISAM table separately, then switching to use the new version at the same time as I commit the BioSQL updates. There's also Sphinx (http://www.sphinxsearch.com), which is a plug-in that can implement full-text searches in InnoDB, but I haven't experimented with that so have no idea how well it works. Cheers. Roy. 
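A rough sketch of the side-table approach Roy describes, assuming a much simplified layout. It uses Python's built-in sqlite3 module and a plain LIKE match as stand-ins for MySQL and its MyISAM FULLTEXT index, so it only illustrates the "rebuild separately, then switch" pattern, not real full-text ranking. The tables qualifiers, feature_search and feature_search_new, the example rows, and both function names are hypothetical; only seqfeature_id is taken from the message above.

```python
import sqlite3


def rebuild_search_table(conn):
    """Build a fresh copy of the search table, then swap it in."""
    conn.execute("DROP TABLE IF EXISTS feature_search_new")
    conn.execute(
        "CREATE TABLE feature_search_new "
        "(seqfeature_id INTEGER, tag TEXT, value TEXT)"
    )
    # Copy only the tags worth searching (duplicated data, as noted above).
    conn.execute(
        "INSERT INTO feature_search_new "
        "SELECT seqfeature_id, tag, value FROM qualifiers "
        "WHERE tag IN ('gene', 'product')"
    )
    # Switch to the new version and commit together with any other updates.
    conn.execute("DROP TABLE IF EXISTS feature_search")
    conn.execute("ALTER TABLE feature_search_new RENAME TO feature_search")
    conn.commit()


def search(conn, word):
    """Return seqfeature_id values whose indexed text contains word."""
    rows = conn.execute(
        "SELECT DISTINCT seqfeature_id FROM feature_search "
        "WHERE value LIKE ? ORDER BY seqfeature_id",
        ("%" + word + "%",),
    )
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
# Simplified stand-in for the BioSQL feature qualifier data.
conn.execute(
    "CREATE TABLE qualifiers (seqfeature_id INTEGER, tag TEXT, value TEXT)"
)
conn.executemany(
    "INSERT INTO qualifiers VALUES (?, ?, ?)",
    [
        (1, "gene", "VP1"),
        (1, "product", "capsid protein VP1"),
        (2, "product", "RNA polymerase 3D"),
        (2, "note", "capsid mention that is not indexed"),
    ],
)
rebuild_search_table(conn)
hits = search(conn, "capsid")
```

The hits can then be joined back to the main tables on seqfeature_id, which is the cross-reference mentioned above.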
From holland at eaglegenomics.com Mon May 18 14:20:52 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Mon, 18 May 2009 19:20:52 +0100 Subject: [BioSQL-l] Full text indexing/Searching in MySQL In-Reply-To: <4A119CE3.3080208@gmail.com> References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> <4A119CE3.3080208@gmail.com> Message-ID: <1242670852.28726.2.camel@buzzybee> There's also Lucene, which is a Java-based full-text indexer which can be attached to all kinds of data sources, including MySQL databases: http://lucene.apache.org/java/docs/ cheers, Richard On Mon, 2009-05-18 at 18:37 +0100, Roy Chaudhuri wrote: > Hi Mick, > > > Has anyone implemented full text indexing/searching for BioSQL in MySQL, > > either using MySQL's full text features or any other solution? > > I've kind of done this. The trouble is that full text is only > implemented on the non-transactional MyISAM tables, not InnoDB (it has > long been promised for InnoDB, but no sign yet). My hack solution was to > parse out the fields I was interested in (feature tags such as gene and > product) and include them in a separate MyISAM table, cross-referenced > to BioSQL using seqfeature_id. This involves duplicating data (which is > a bad thing), but should be okay if database updates are infrequent. I > mimic atomic changes by building an updated version of the MyISAM table > separately, then switching to use the new version at the same time as I > commit the BioSQL updates. > > There's also Sphinx (http://www.sphinxsearch.com), which is a plug-in > that can implement full-text searches in InnoDB, but I haven't > experimented with that so have no idea how well it works. > > Cheers. > Roy. 
> _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From michael.watson at bbsrc.ac.uk Tue May 19 04:17:32 2009 From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C)) Date: Tue, 19 May 2009 09:17:32 +0100 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> Hi I'm using: biosql-1.0.1 bioperl-db-1.5.2_100 bioperl-1.5.2_102 When I run load_seqdatabase.pl on about 3000 GenBank sequences, I get: Loading fmd_180509.gbk ... -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were ("AY312586S1","32307407","AY312586","Foot-and-mouth disease virus O isolate O/SKR/2000 S fragment, complete 1,9762) Duplicate entry 'AY312586-1-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","1") FKs (324,3,4) Duplicate entry '324-3-4-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were ("AY312586S2","32307408","AY312587","Foot-and-mouth disease virus O isolate O/SKR/2000 L fragment, complete 1,9762) Duplicate entry 'AY312587-1-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","1") FKs (323,3,4) Duplicate entry '323-3-4-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert 
in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","2") FKs (323,22,4) Duplicate entry '323-22-4-2' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","3") FKs (323,15,4) Duplicate entry '323-15-4-3' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were ("AY312588S1","32307403","AY312588","Foot-and-mouth disease virus O isolate O/SKR/2002 S fragment, complete 1,9762) Duplicate entry 'AY312588-1-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","1") FKs (326,3,4) Duplicate entry '326-3-4-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were ("AY312588S2","32307404","AY312589","Foot-and-mouth disease virus O isolate O/SKR/2002 L fragment, complete 1,9762) Duplicate entry 'AY312589-1-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","1") FKs (325,3,4) Duplicate entry '325-3-4-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","2") FKs (325,22,4) Duplicate entry '325-22-4-2' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","3") FKs (325,15,4) Duplicate entry 
'325-15-4-3' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were ("S87919S2","247466","S87923","L [foot-and-mouth disease virus FMDV, strain CS8, Genomic RNA, 10 nt, segmen 1,9754) Duplicate entry 'S87923-1-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","1") FKs (782,3,4) Duplicate entry '782-3-4-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","2") FKs (782,13,4) Duplicate entry '782-13-4-2' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were ("S87919S1","247464","S87919","L [foot-and-mouth disease virus FMDV, strain CS8, Genomic RNA, 35 nt, segmen 1,9754) Duplicate entry 'S87919-1-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","1") FKs (781,3,4) Duplicate entry '781-3-4-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were ("","Direct Submission","Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, C-E8D3CBBD80002FA1","1","8170","") FKs () Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 --------------------------------------------------- Could not store NC_011452: ------------- EXCEPTION ------------- MSG: create: object (Bio::Annotation::Reference) failed to insert or to be found by unique key 
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SeqAdaptor.pm:224
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK (eval) load_seqdatabase.pl:622
STACK toplevel load_seqdatabase.pl:604
--------------------------------------
at load_seqdatabase.pl line 635

Any clues?
Thanks Mick Head of Bioinformatics Institute for Animal Health Compton Berks RG20 7NN 01635 578411
From biopython at maubp.freeserve.co.uk Tue May 19 05:31:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 19 May 2009 10:31:05 +0100 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> Message-ID: <320fb6e00905190231t79ac1dc9j49585929e9b5304a@mail.gmail.com> On Tue, May 19, 2009 at 9:17 AM, michael watson (IAH-C) wrote: > > Hi > > I'm using: > > biosql-1.0.1 > bioperl-db-1.5.2_100 > bioperl-1.5.2_102 > > When I run load_seqdatabase.pl on about 3000 GenBank sequences, > I get: > > Loading fmd_180509.gbk ... > ...
> --------------------------------------------------- > > Could not store NC_011452: > > ------------- EXCEPTION ------------- > > MSG: create: object (Bio::Annotation::Reference) failed to insert or to > be found by unique key > > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create > /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/D > B/BioSQL/BasePersistenceAdaptor.pm:206 > > ... > > STACK Bio::DB::Persistent::PersistentObject::store > /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/D > B/Persistent/PersistentObject.pm:271 > > STACK (eval) load_seqdatabase.pl:622 > > STACK toplevel load_seqdatabase.pl:604 > > -------------------------------------- > > at load_seqdatabase.pl line 635 > > Any clues? You got a lot of warnings about feature keys (which I am guessing are from different GenBank entries), but the failure seems to be related to the annotation in NC_011452. Try downloading just NC_011452 in GenBank format, and testing that: http://www.ncbi.nlm.nih.gov/nuccore/NC_011452 I would expect that to fail in the same way, and you would at least have isolated the issue to a smaller test case. If it works, then maybe the copy of NC_011452 in your file is corrupted somehow - check for differences. Peter From hlapp at gmx.net Tue May 19 08:25:25 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 19 May 2009 08:25:25 -0400 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> Message-ID: <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> On May 19, 2009, at 4:17 AM, michael watson (IAH-C) wrote: > [...]
> -------------------- WARNING --------------------- > > MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values > were > ("AY312586S1","32307407","AY312586","Foot-and-mouth disease virus O > isolate O/SKR/2000 S fragment, complete > > 1,9762) > > Duplicate entry 'AY312586-1-1' for key 2 > > --------------------------------------------------- This suggests that a sequence with the above accession or GI number was already in the database, or occurs in the file twice. If this situation is possible, you will have to pass the --lookup (or --flatlookup) flag to the script, and specify how you want updates to take place when they are necessary (options --noupdate, --remove, and --mergeobjs). > -------------------- WARNING --------------------- > > MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, > values were ("","1") FKs (324,3,4) > > Duplicate entry '324-3-4-1' for key 2 > --------------------------------------------------- I suspect that 324 is the primary key of the sequence record that raised the duplicate entry warning above. Can you check that? If the insert is turned into an update, these warnings should go away too. > [...] > -------------------- WARNING --------------------- > > MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, > values were ("","1") FKs (323,3,4) > > Duplicate entry '323-3-4-1' for key 2 > > --------------------------------------------------- Similar to before, except 323 is probably the primary key for AY312587. > [...] > -------------------- WARNING --------------------- > > MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, > values were ("","1") FKs (325,3,4) > > Duplicate entry '325-3-4-1' for key 2 > > --------------------------------------------------- And if the order of messages is preserved correctly, 325 would be the primary key of AY312589. > [...] 
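One way to run the check asked for above is to map the leading foreign-key values from the SeqFeatureAdaptor warnings back to accessions in the bioentry table. The following is only an illustrative sketch: it uses an in-memory SQLite table as a stand-in for the real MySQL database, the helper name `accessions_for_ids` is made up, and the rows it inserts are placeholders that echo the guesses above, not real query results.

```python
import sqlite3


def accessions_for_ids(conn, bioentry_ids):
    """Return {bioentry_id: accession} for the given primary keys."""
    ids = list(bioentry_ids)
    if not ids:
        return {}  # "IN ()" is not valid SQL, so short-circuit
    placeholders = ",".join("?" for _ in ids)
    query = (
        "SELECT bioentry_id, accession FROM bioentry "
        f"WHERE bioentry_id IN ({placeholders})"
    )
    return dict(conn.execute(query, ids).fetchall())


# Stand-in database: only the two columns needed here, with placeholder rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, accession TEXT)"
)
conn.executemany(
    "INSERT INTO bioentry VALUES (?, ?)",
    [(323, "AY312587"), (324, "AY312586"), (325, "AY312589")],
)

# 323, 324 and 325 are the first FK values quoted in the warnings.
print(accessions_for_ids(conn, [323, 324, 325]))
```

Against a real BioSQL instance the same SELECT would be issued through a MySQL driver, where the parameter placeholder is typically `%s` rather than `?`.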
> -------------------- WARNING --------------------- > > MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, > values > were ("","Direct Submission","Submitted (12-AUG-2004) National Center > for Biotechnology Information, NIH, > > C-E8D3CBBD80002FA1","1","8170","") FKs () > > Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 > > --------------------------------------------------- This one is odd. Can you check which existing entry you have with reference.crc = 'CRC-E8D3CBBD80002FA1'? -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From michael.watson at bbsrc.ac.uk Wed May 20 05:52:13 2009 From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C)) Date: Wed, 20 May 2009 10:52:13 +0100 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk> Hi Guys Ok, the warnings were due to duplicate sequences - I had downloaded a stream using Bio::DB::GenBank and I guess I assumed that would mean only unique entries were sent back. Using "--flatlookup --remove" gets rid of the warnings. Now for NC_003992.gbk... 
To answer Hilmar's question:

mysql> select * from reference where crc = "CRC-E8D3CBBD80002FA1";
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
| reference_id | dbxref_id | location                                                                                            | title             | authors | crc                  |
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
|          152 |      NULL | Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA | Direct Submission | NULL    | CRC-E8D3CBBD80002FA1 |
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+

And when I run load_seqdatabase.pl on NC_003992.gbk alone I still get:

perl load_seqdatabase.pl --host localhost --dbname fmd_biosql --format genbank --dbuser removed --dbpass removed --flatlookup --remove NC_003992.gbk

Loading NC_003992.gbk ...
-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were ("","Direct Submission","Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs ()
Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
---------------------------------------------------
Could not store NC_003992:
------------- EXCEPTION -------------
MSG: create: object (Bio::Annotation::Reference) failed to insert or to be found by unique key
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SeqAdaptor.pm:224
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK (eval) load_seqdatabase.pl:622
STACK toplevel load_seqdatabase.pl:604
--------------------------------------
at load_seqdatabase.pl line 635

And I still have:

mysql> select * from reference where crc = "CRC-E8D3CBBD80002FA1";
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
| reference_id | dbxref_id | location                                                                                            | title             | authors | crc                  |
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
|          152 |      NULL | Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA | Direct Submission | NULL    | CRC-E8D3CBBD80002FA1 |
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
1 row in set (0.01 sec)

Could this be because bases 1 to 8203 of the sequence have three references, and the crc is created on the first and then duplicated on the second, thus causing a problem?

Cheers
Mick

-----Original Message-----
From: Hilmar Lapp [mailto:hlapp at gmx.net]
Sent: 19 May 2009 13:25
To: michael watson (IAH-C)
Cc: biosql-l at lists.open-bio.org
Subject: Re: [BioSQL-l] load_seqdatabase.pl warnings and errors

On May 19, 2009, at 4:17 AM, michael watson (IAH-C) wrote:
> [...]
From biopython at maubp.freeserve.co.uk Wed May 20 06:59:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 20 May 2009 11:59:19 +0100 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk> Message-ID: <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com> On Wed, May 20, 2009 at 10:52 AM, michael watson (IAH-C) wrote: > > Hi Guys > > Ok, the warnings were due to duplicate sequences - I had downloaded a > stream using Bio::DB::GenBank and I guess I assumed that would mean only > unique entries were sent back. Using "--flatlookup --remove" gets rid > of the warnings. Great - easy :) > Now for NC_003992.gbk... > > To answer Hilmar's question: > ... > And when I run load_seqdatabase.pl on NC_003992.gbk alone I still get: > > perl load_seqdatabase.pl --host localhost --dbname fmd_biosql --format > genbank --dbuser removed --dbpass removed --flatlookup --remove > NC_003992.gbk > > Loading NC_003992.gbk ...
> > -------------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values > were ("","Direct Submission","Submitted (12-AUG-2004) National Center > for Biotechnology Information, NIH, Bethesda, MD 20894, > USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs () > Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 > --------------------------------------------------- > Could not store NC_003992: > ------------- EXCEPTION ------------- > MSG: create: object (Bio::Annotation::Reference) failed to insert or to > be found by unique key > ... I would guess that the problem is that this rather generic reference in NC_003992 may be repeated exactly in another genome (causing the CRC collision): CONSRTM NCBI Genome Project TITLE Direct Submission JOURNAL Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA See http://www.ncbi.nlm.nih.gov/nuccore/NC_011452 i.e. Could there be another direct submission by the NCBI on that date in your collection? You could search the database looking for that CRC and trace it back to a bioentry, or just try grep for "JOURNAL Submitted (12-AUG-2004) National Center for Biotechnology" on your GenBank files. e.g.
Something like this SQL statement might be interesting: SELECT bioentry.accession, reference.title FROM bioentry, bioentry_reference, reference WHERE bioentry.bioentry_id=bioentry_reference.bioentry_id AND bioentry_reference.reference_id=reference.reference_id AND reference.crc="CRC-E8D3CBBD80002FA1"; Peter From michael.watson at bbsrc.ac.uk Wed May 20 07:25:52 2009 From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C)) Date: Wed, 20 May 2009 12:25:52 +0100 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk> <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com> Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk> We have a winner :) NC_003992, NC_011452, NC_011451, NC_011450 all share at least one reference. Would changing --flatlookup to --lookup change the behaviour so it checks for an existing reference before trying to insert the duplicate? The answer is no :( (see below). I guess this may need some coding then! Thanks! Mick perl load_seqdatabase.pl --host localhost --dbname fmd_biosql --format genbank --dbuser removed --dbpass removed --lookup --remove NC_003992.gbk Loading NC_003992.gbk ... 
-------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were ("","Direct Submission","Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs () Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 --------------------------------------------------- Could not store NC_003992: ------------- EXCEPTION ------------- MSG: create: object (Bio::Annotation::Reference) failed to insert or to be found by unique key STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271 STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271 STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SeqAdaptor.pm:224 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create 
/usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271 STACK (eval) load_seqdatabase.pl:622 STACK toplevel load_seqdatabase.pl:604 -------------------------------------- at load_seqdatabase.pl line 635
From biopython at maubp.freeserve.co.uk Wed May 20 07:34:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 20 May 2009 12:34:51 +0100 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk> <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com> <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk> Message-ID: <320fb6e00905200434x3e1c7978ue1c58382f7478354@mail.gmail.com> On Wed, May 20, 2009 at 12:25 PM, michael watson (IAH-C) wrote: > > We have a winner :) > > NC_003992, NC_011452, NC_011451, NC_011450 all share > at least one reference. > > Would changing --flatlookup to --lookup change the behaviour > so it checks for an existing reference before trying to insert the > duplicate? > > The answer is no :( (see below). > > I guess this may need some coding then! My crude idea for a simple ad-hoc solution would be to remove these pointless references from the records, before loading them into BioSQL. One way would be to edit the four GenBank files by hand (e.g. to remove the reference or make them unique). You might also do this in a BioPerl script that loads the records, edits the references, and then puts them in the database. Personally I use Python not Perl, so I can't tell you how you might do that with BioPerl.
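As a rough illustration of the "remove the reference before loading" idea, here is a minimal plain-Python sketch that filters the GenBank text directly instead of going through BioPerl or Biopython. It assumes the usual flat-file layout (a REFERENCE keyword in column one, with its AUTHORS/CONSRTM/TITLE/JOURNAL lines indented underneath); the function name `drop_direct_submissions` is hypothetical, and the remaining references are not renumbered.

```python
def drop_direct_submissions(genbank_text):
    """Return the GenBank text without 'Direct Submission' REFERENCE blocks."""
    kept = []
    block = []  # lines of the REFERENCE block currently being collected

    def flush():
        # Keep the collected block unless its TITLE is "Direct Submission".
        is_direct = any(
            line.split() == ["TITLE", "Direct", "Submission"] for line in block
        )
        if not is_direct:
            kept.extend(block)
        block.clear()

    for line in genbank_text.splitlines(keepends=True):
        if line.startswith("REFERENCE"):
            flush()  # a new REFERENCE ends the previous one
            block.append(line)
        elif block and line.startswith(" "):
            block.append(line)  # indented line belonging to the open block
        else:
            flush()  # any other top-level keyword closes the block
            kept.append(line)
    flush()
    return "".join(kept)
```

The filtered text could then be written to a new file and passed to load_seqdatabase.pl as before. Dropping the block loses the submission details for those records, so editing the references to make them unique may be preferable if that information matters.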
Hilmar may be able to comment from a BioPerl/BioSQL point of view - clearly CRC collisions of this nature will happen again in future. Peter
From holland at eaglegenomics.com Wed May 20 07:44:58 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 20 May 2009 12:44:58 +0100 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <320fb6e00905200434x3e1c7978ue1c58382f7478354@mail.gmail.com> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk> <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com> <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk> <320fb6e00905200434x3e1c7978ue1c58382f7478354@mail.gmail.com> Message-ID: <1242819898.18348.1.camel@buzzybee> Theoretically, although unlikely, it is statistically entirely possible for two completely different references to share the same CRC. Hence the CRC shouldn't really be used as an indicator of uniqueness, although it is still useful as a hashing function for indexing and quick lookup. cheers, Richard -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/
From hlapp at gmx.net Wed May 20 11:10:20 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 20 May 2009 11:10:20 -0400 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk> <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com> <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk> Message-ID: <0212C167-7618-4761-A191-C6CE4B41EC2A@gmx.net> Indeed changing the lookup will have no effect since deletion of bioentries doesn't cascade to references (only to bioentry-to-reference associations). What I don't understand yet is how you get the CRC clash. Normally this kind of situation can happen if the first occurrence does not and the second does have PMID, by which it will be looked up, lookup fails (b/c the first occurrence didn't come with PMID), resulting in an insert of the erroneously deemed "new" reference, which then fails with a CRC clash.
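The usual failure mode described here can be reproduced in miniature. The sketch below is a toy model, not the bioperl-db code: it uses an in-memory SQLite table with a hypothetical `pmid` column (real BioSQL links a PMID through `dbxref_id`), a made-up `store_reference` helper, and it assumes the CRC is computed from the reference content only, so that it is the same with or without a PMID.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the reference table, with two unique keys.
conn.execute(
    "CREATE TABLE reference ("
    "reference_id INTEGER PRIMARY KEY, "
    "pmid TEXT UNIQUE, title TEXT, crc TEXT UNIQUE)"
)


def store_reference(pmid, title, crc):
    """Look up by PMID when one is given, otherwise by CRC; insert if not found."""
    if pmid is not None:
        sql, key = "SELECT reference_id FROM reference WHERE pmid = ?", pmid
    else:
        sql, key = "SELECT reference_id FROM reference WHERE crc = ?", crc
    found = conn.execute(sql, (key,)).fetchone()
    if found:
        return found[0]
    cur = conn.execute(
        "INSERT INTO reference (pmid, title, crc) VALUES (?, ?, ?)",
        (pmid, title, crc),
    )
    return cur.lastrowid


# First occurrence arrives without a PMID and is stored under its CRC.
first_id = store_reference(None, "Some paper", "CRC-EXAMPLE")

# Second occurrence has a PMID: the PMID lookup misses, so an insert is
# attempted and the unique constraint on crc rejects it.
try:
    store_reference("12345", "Some paper", "CRC-EXAMPLE")
    clash = False
except sqlite3.IntegrityError:
    clash = True
```

In this model a reference that never carries a PMID is always found again through its CRC, which is why the clash reported in this thread does not fit the usual pattern.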
However, there is no PMID nor any other identifier here, so I'll have to look into the code to find out why the second occurrence is either not looked up before an insert is attempted, or if it is looked up, why the lookup fails to find the record stored earlier. -hilmar On May 20, 2009, at 7:25 AM, michael watson (IAH-C) wrote: > We have a winner :) > > NC_003992, NC_011452, NC_011451, NC_011450 all share at least one > reference. > > Would changing --flatlookup to --lookup change the behaviour so it > checks for an existing reference before trying to insert the > duplicate? > > The answer is no :( (see below). > > I guess this may need some coding then! > > Thanks! > Mick > > perl load_seqdatabase.pl --host localhost --dbname fmd_biosql -- > format genbank --dbuser removed --dbpass removed --lookup --remove > NC_003992.gbk > Loading NC_003992.gbk ... > > -------------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, > values were ("","Direct Submission","Submitted (12-AUG-2004) > National Center for Biotechnology Information, NIH, Bethesda, MD > 20894, USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs () > Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 > --------------------------------------------------- > Could not store NC_003992: > ------------- EXCEPTION ------------- > MSG: create: object (Bio::Annotation::Reference) failed to insert or > to be found by unique key > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:206 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:251 > STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > Persistent/PersistentObject.pm:271 > STACK 
Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children / > usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/ > Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:214 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:251 > STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > Persistent/PersistentObject.pm:271 > STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/SeqAdaptor.pm:224 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:214 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:251 > STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > Persistent/PersistentObject.pm:271 > STACK (eval) load_seqdatabase.pl:622 > STACK toplevel load_seqdatabase.pl:604 > > -------------------------------------- > > at load_seqdatabase.pl line 635 > > -----Original Message----- > From: p.j.a.cock at googlemail.com [mailto:p.j.a.cock at googlemail.com] > On Behalf Of Peter > Sent: 20 May 2009 11:59 > To: michael watson (IAH-C) > Cc: Hilmar Lapp; biosql-l at lists.open-bio.org > Subject: Re: [BioSQL-l] load_seqdatabase.pl warnings and errors > > On Wed, May 20, 2009 at 10:52 AM, michael watson (IAH-C) > wrote: >> >> Hi Guys >> >> Ok, the warnings were due to duplicate sequences - I had downloaded a >> stream using 
Bio::DB::GenBank and I guess I assumed that would mean >> only >> unique entries were sent back. Using "--flatlookup --remove" gets >> rid >> of the warnings. > > Great - easy :) > >> Now for NC_003992.gbk... >> >> To answer Hilmar's question: >> ... >> And when I run load_seqdatabase.pl on NC_003992.gbk alone I still >> get: >> >> perl load_seqdatabase.pl --host localhost --dbname fmd_biosql -- >> format >> genbank --dbuser removed --dbpass removed --flatlookup --remove >> NC_003992.gbk >> >> Loading NC_003992.gbk ... >> >> -------------------- WARNING --------------------- >> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, >> values >> were ("","Direct Submission","Submitted (12-AUG-2004) National Center >> for Biotechnology Information, NIH, Bethesda, MD 20894, >> USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs () >> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 >> --------------------------------------------------- >> Could not store NC_003992: >> ------------- EXCEPTION ------------- >> MSG: create: object (Bio::Annotation::Reference) failed to insert >> or to >> be found by unique key >> ... > > I would guess that the problem is this rather generic reference in > NC_003992 may be repeated exactly in another genome (causing the CRC > collision): > > CONSRTM NCBI Genome Project > TITLE Direct Submission > JOURNAL Submitted (12-AUG-2004) National Center for Biotechnology > Information, NIH, Bethesda, MD 20894, USA > > See http://www.ncbi.nlm.nih.gov/nuccore/NC_011452 > > i.e. Could there be another direct submission by the NCBI on that date > in your collection? You could search the database looking for that > CRC and trace it back to a bioentry, or just try grep for "JOURNAL > Submitted (12-AUG-2004) National Center for Biotechnology" on your > GenBank files. e.g. 
Something like this SQL statement might be > interesting: > > SELECT bioentry.accession, reference.title FROM bioentry, > bioentry_reference, reference WHERE > bioentry.bioentry_id=bioentry_reference.bioentry_id AND > bioentry_reference.reference_id=reference.reference_id AND > reference.crc="CRC-E8D3CBBD80002FA1"; > > Peter -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Fri May 22 08:27:06 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 22 May 2009 13:27:06 +0100 Subject: [BioSQL-l] RULES in BioSQL PostgreSQL schema Message-ID: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> Hi all, This is a continuation of a thread / bug report from Biopython (Bug 2833) where attempting to import duplicate entries into BioSQL did not raise an error on PostgreSQL (but does on MySQL). Cymon traced this to the RULES present in the schema to help bioperl-db. On Fri, May 22, 2009 at 3:05 AM, Hilmar Lapp wrote: > > On May 21, 2009, at 6:52 PM, Cymon Cox wrote: > >> [...] >> >> Hi Andrea, >> >> The problem appears to be related to the BioSQL schema/PostgreSQL. >> >> As you indicated, adding a duplicate entry to bioentry returns a "INSERT 0 >> 0" and doesn't throw an IntegrityError which is what the code is looking >> for and presumably what MySQL throws. >> >> The reason it doesn't throw an error is because of one (or both) of the >> RULES in the schema: > > Indeed, I'd almost forgotten. The rules are there mostly as a remnant from > earlier versions of PostgreSQL to support transactional loading the way > bioperl-db (the object-relational mapping for BioPerl) is optimized. You > probably don't need them anywhere else. > > -hilmar > > > Bioperl-db is optimized such that entities that very likely don't exist yet > in the database are attempted for insert right away.
If the insert fails due > to a unique key violation, the record is looked up (and then expected to be > found). In Oracle and MySQL you can do this and the transaction remains > healthy; i.e., you can commit the transaction later and all statements > except those that failed will be committed. In PostgreSQL any failed > statement dooms the entire transaction, and the only way out is a rollback. > In this case, if you want the loading of one sequence record as one > transaction, failing to insert a single feature record will doom the entire > sequence load and you would need to start over with the sequence. To fix > this, I wrote the rules, which in essence do do the lookups for PostgreSQL > that the bioperl-db code would otherwise avoid, and on insert do nothing if > the record is found, which results in zero rows affected when you would > expect one (which is what bioperl-db cues off of and then triggers a > lookup). > The right way to do this meanwhile is to use nested transactions, which > PostgreSQL supports since v8.0.x, but I haven't gotten around to implement > support for that in Bioperl-db. > Hilmar, It seems for Biopython to work properly with BioSQL on PostgreSQL these bioentry rules should be removed from the schema (as the comments in the schema do suggest). Obviously doing this would break any installation also using the current version of bioperl-db. Do the RULES affect BioJava or BioRuby using BioSQL on PostgreSQL? Are you happy to remove these RULES in BioSQL v1.0.x (after making the outlined transactional changes in bioperl-db)? 
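A minimal sketch of the nested-transaction approach described above: attempt the insert inside a savepoint and, on a unique-key violation, roll back only to the savepoint and look the record up, so the enclosing transaction stays usable. It uses Python's standard sqlite3 module purely as a stand-in (PostgreSQL accepts the same SAVEPOINT, ROLLBACK TO SAVEPOINT and RELEASE SAVEPOINT statements); the two-column bioentry table and the insert_or_lookup helper are hypothetical simplifications, not the BioSQL schema or the bioperl-db code.

```python
import sqlite3


def insert_or_lookup(conn, accession):
    """Insert optimistically inside a savepoint; on a unique-key clash, undo
    only the savepoint and look up the existing row instead.

    Returns (bioentry_id, created). Hypothetical helper for illustration.
    """
    cur = conn.cursor()
    cur.execute("SAVEPOINT try_insert")
    try:
        cur.execute("INSERT INTO bioentry (accession) VALUES (?)", (accession,))
        row_id, created = cur.lastrowid, True
    except sqlite3.IntegrityError:
        # Discard just the failed statement; earlier work in the outer
        # transaction is untouched.
        cur.execute("ROLLBACK TO SAVEPOINT try_insert")
        cur.execute(
            "SELECT bioentry_id FROM bioentry WHERE accession = ?", (accession,)
        )
        row_id, created = cur.fetchone()[0], False
    cur.execute("RELEASE SAVEPOINT try_insert")
    return row_id, created


def demo():
    # isolation_level=None: the sqlite3 module issues no implicit BEGIN/COMMIT,
    # so the transaction boundaries below are exactly the ones written out.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE bioentry ("
        "bioentry_id INTEGER PRIMARY KEY, accession TEXT NOT NULL UNIQUE)"
    )
    conn.execute("BEGIN")
    first = insert_or_lookup(conn, "NC_003992")
    second = insert_or_lookup(conn, "NC_011452")
    repeat = insert_or_lookup(conn, "NC_003992")  # duplicate: falls back to lookup
    conn.execute("COMMIT")  # the outer transaction is still committable
    count = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]
    conn.close()
    return first, second, repeat, count


if __name__ == "__main__":
    print(demo())
```

On PostgreSQL the failed INSERT would otherwise leave the whole transaction aborted; rolling back to the savepoint is what keeps the later statements and the final COMMIT possible without the RULES.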
Thanks, Peter From hlapp at gmx.net Fri May 22 11:03:11 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 22 May 2009 11:03:11 -0400 Subject: [BioSQL-l] RULES in BioSQL PostgreSQL schema In-Reply-To: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> Message-ID: On May 22, 2009, at 8:27 AM, Peter wrote: > Are you happy to remove these RULES in BioSQL v1.0.x (after > making the outlined transactional changes in bioperl-db)? In principle yes. It would also mean dropping support for PostgreSQL v7.x, but I would hope that that's a non-issue. But if anyone here is still using and relying on PostgreSQL v7.x (or earlier?) do let us know, please. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Fri May 22 11:57:38 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 22 May 2009 16:57:38 +0100 Subject: [BioSQL-l] [Biopython-dev] RULES in BioSQL PostgreSQL schema In-Reply-To: References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> Message-ID: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com> On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp wrote: > > On May 22, 2009, at 8:27 AM, Peter wrote: > >> Are you happy to remove these RULES in BioSQL v1.0.x (after >> making the outlined transactional changes in bioperl-db)? > > In principle yes. It would also mean dropping support for PostgreSQL v7.x, > but I would hope that that's a non-issue. > > But if anyone here is still using and relying on PostgreSQL v7.x (or > earlier?) do let us know, please. Great. In the meantime could you add a big warning about this issue to the INSTALL notes for PostgreSQL (i.e. recommend removing the RULES section if not using bioperl-db)?
http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL Peter From hlapp at gmx.net Fri May 22 14:20:58 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 22 May 2009 14:20:58 -0400 Subject: [BioSQL-l] [Biopython-dev] RULES in BioSQL PostgreSQL schema In-Reply-To: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com> References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com> Message-ID: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net> Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar On May 22, 2009, at 11:57 AM, Peter wrote: > On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp wrote: >> >> On May 22, 2009, at 8:27 AM, Peter wrote: >> >>> Are you happy to remove these RULES in BioSQL v1.0.x (after >>> making the outlined transactional changes in bioperl-db)? >> >> In principle yes. It would also mean dropping support for >> PostgreSQL v7.x, >> but I would hope that that's a non-issue. >> >> But if anyone here is still using and relying on PostgreSQL v7.x (or >> earlier?) do let us know, please. > > Great. > > In the meantime could you add a big warning about this issue to the > INSTALL notes for PostgreSQL (i.e. recommend removing the RULES > section if not using bioperl-db)?
> http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL > > Peter -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Fri May 22 18:46:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 22 May 2009 23:46:54 +0100 Subject: [BioSQL-l] [Biopython-dev] RULES in BioSQL PostgreSQL schema In-Reply-To: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net> References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com> <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net> Message-ID: <320fb6e00905221546i26edc7a2u2a02fb0d01c374ea@mail.gmail.com> On 5/22/09, Hilmar Lapp wrote: > Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar I've filed Bug 2839, hopefully this is what you had in mind: http://bugzilla.open-bio.org/show_bug.cgi?id=2839 Peter From michael.watson at bbsrc.ac.uk Wed May 27 08:50:45 2009 From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C)) Date: Wed, 27 May 2009 13:50:45 +0100 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <0212C167-7618-4761-A191-C6CE4B41EC2A@gmx.net> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk> <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com> <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk> <0212C167-7618-4761-A191-C6CE4B41EC2A@gmx.net> Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E27A82@iahce2ksrv1.iah.bbsrc.ac.uk> Hi Hilmar I tried to dig around in the code, but quite frankly I quickly got lost. 
What is clear is that the existing reference is not being found in the cache nor the database, and therefore a unique key violation occurs when the code tries to insert the object. I'm pretty stuffed on this project until I can get this sorted out. If someone tells me where to look I can try and sort out why this happens, but at the moment (for me) it's like looking for a needle in a haystack. Thanks in advance Mick -----Original Message----- From: Hilmar Lapp [mailto:hlapp at gmx.net] Sent: 20 May 2009 16:10 To: michael watson (IAH-C) Cc: Peter; biosql-l at lists.open-bio.org Subject: Re: [BioSQL-l] load_seqdatabase.pl warnings and errors Indeed changing the lookup will have no effect since deletion of bioentries doesn't cascade to references (only to bioentry-to- reference associations). What I don't understand yet is how you get the CRC clash. Normally this kind of situation can happen if the first occurrence does not and the second does have PMID, by which it will be looked up, lookup fails (b/c the first occurrence didn't come with PMID), resulting in an insert of the erroneously deemed "new" reference, which then fails with a CRC clash. However, there is no PMID nor any other identifier here, so I'll have to look into the code to find out why the second occurrence is either not looked up before an insert is attempted, or if it is looked up, why the lookup fails to find the record stored earlier. -hilmar On May 20, 2009, at 7:25 AM, michael watson (IAH-C) wrote: > We have a winner :) > > NC_003992, NC_011452, NC_011451, NC_011450 all share at least one > reference. > > Would changing --flatlookup to --lookup change the behaviour so it > checks for an existing reference before trying to insert the > duplicate? > > The answer is no :( (see below). > > I guess this may need some coding then! > > Thanks! 
> Mick > > perl load_seqdatabase.pl --host localhost --dbname fmd_biosql -- > format genbank --dbuser removed --dbpass removed --lookup --remove > NC_003992.gbk > Loading NC_003992.gbk ... > > -------------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, > values were ("","Direct Submission","Submitted (12-AUG-2004) > National Center for Biotechnology Information, NIH, Bethesda, MD > 20894, USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs () > Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 > --------------------------------------------------- > Could not store NC_003992: > ------------- EXCEPTION ------------- > MSG: create: object (Bio::Annotation::Reference) failed to insert or > to be found by unique key > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:206 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:251 > STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > Persistent/PersistentObject.pm:271 > STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children / > usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/ > Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:214 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:251 > STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > Persistent/PersistentObject.pm:271 > STACK 
Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/SeqAdaptor.pm:224 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:214 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:251 > STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > Persistent/PersistentObject.pm:271 > STACK (eval) load_seqdatabase.pl:622 > STACK toplevel load_seqdatabase.pl:604 > > -------------------------------------- > > at load_seqdatabase.pl line 635 > > -----Original Message----- > From: p.j.a.cock at googlemail.com [mailto:p.j.a.cock at googlemail.com] > On Behalf Of Peter > Sent: 20 May 2009 11:59 > To: michael watson (IAH-C) > Cc: Hilmar Lapp; biosql-l at lists.open-bio.org > Subject: Re: [BioSQL-l] load_seqdatabase.pl warnings and errors > > On Wed, May 20, 2009 at 10:52 AM, michael watson (IAH-C) > wrote: >> >> Hi Guys >> >> Ok, the warnings were due to duplicate sequences - I had downloaded a >> stream using Bio::DB::GenBank and I guess I assumed that would mean >> only >> unique entries were sent back. Using "--flatlookup --remove" gets >> rid >> of the warnings. > > Great - easy :) > >> Now for NC_003992.gbk... >> >> To answer Hilmar's question: >> ... >> And when I run load_seqdatabase.pl on NC_003992.gbk alone I still >> get: >> >> perl load_seqdatabase.pl --host localhost --dbname fmd_biosql -- >> format >> genbank --dbuser removed --dbpass removed --flatlookup --remove >> NC_003992.gbk >> >> Loading NC_003992.gbk ... 
>> >> -------------------- WARNING --------------------- >> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, >> values >> were ("","Direct Submission","Submitted (12-AUG-2004) National Center >> for Biotechnology Information, NIH, Bethesda, MD 20894, >> USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs () >> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 >> --------------------------------------------------- >> Could not store NC_003992: >> ------------- EXCEPTION ------------- >> MSG: create: object (Bio::Annotation::Reference) failed to insert >> or to >> be found by unique key >> ... > > I would guess that the problem is this rather generic reference in > NC_003992 may be repeated exactly in another genome (causing the CRC > collision): > > CONSRTM NCBI Genome Project > TITLE Direct Submission > JOURNAL Submitted (12-AUG-2004) National Center for Biotechnology > Information, NIH, Bethesda, MD 20894, USA > > See http://www.ncbi.nlm.nih.gov/nuccore/NC_011452 > > i.e. Could there be another direct submission by the NCBI on that date > in your collection? You could search the database looking for that > CRC and trace it back to a bioentry, or just try grep for "JOURNAL > Submitted (12-AUG-2004) National Center for Biotechnology" on your > GenBank files. e.g. 
Something like this SQL statement might be > interesting: > > SELECT bioentry.accession, reference.title FROM bioentry, > bioentry_reference, reference WHERE > bioentry.bioentry_id=bioentry_reference.bioentry_id AND > bioentry_reference.reference_id=reference.reference_id AND > reference.crc="CRC-E8D3CBBD80002FA1"; > > Peter -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Thu May 14 18:20:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 14 May 2009 19:20:47 +0100 Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL Message-ID: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> Hi, This is cross-posted between biopython-dev and biosql-l as it regards parsing the description (DE) lines in SwissProt files and how they are stored in BioSQL. This follows from an earlier discussion on biopython-dev. Older SwissProt files just had one or two DE lines, and it made sense to treat this as a simple string mapped onto the description field in the bioentry table in BioSQL. This appears to be what happens with BioPerl 1.5.x and in Biopython (although the details regarding white space differ). However, newer SwissProt files have many DE lines with additional structure.
The example Michiel gave earlier on the biopython-dev list was:

http://www.uniprot.org/uniprot/Q9XHP0.txt

This has the following DE lines:

DE RecName: Full=11S globulin seed storage protein 2;
DE AltName: Full=11S globulin seed storage protein II;
DE AltName: Full=Alpha-globulin;
DE Contains:
DE RecName: Full=11S globulin seed storage protein 2 acidic chain;
DE AltName: Full=11S globulin seed storage protein II acidic chain;
DE Contains:
DE RecName: Full=11S globulin seed storage protein 2 basic chain;
DE AltName: Full=11S globulin seed storage protein II basic chain;
DE Flags: Precursor;

I had to fight with perl to get my old copy of BioPerl working again (some weak reference thing), but I managed, and then loaded this file into my test BioSQL database with:

$ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass XXX --namespace biosql_test --format swiss Q9XHP0.txt

Then I looked at the resulting description in the main bioentry table:

$ mysql --user=root -p biosql_test -e 'SELECT description FROM bioentry WHERE accession="Q9XHP0";'

This is stored as one huge long string (without the newlines, I'm not sure if BioPerl strips those in parsing the file, or when loading it into the database):

RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S globulin seed storage protein II; AltName: Full=Alpha-globulin; Contains: RecName: Full=11S globulin seed storage protein 2 acidic chain; AltName: Full=11S globulin seed storage protein II acidic chain; Contains: RecName: Full=11S globulin seed storage protein 2 basic chain; AltName: Full=11S globulin seed storage protein II basic chain; Flags: Precursor;

For Biopython, I emptied the database then did:

>>> from Bio import SeqIO
>>> from BioSQL import BioSeqDatabase
>>> server = BioSeqDatabase.open_database(driver="MySQLdb", user="root", passwd = "XXX", host = "localhost", db="biosql_test")
>>> db = server["biosql-test"] #namespace
>>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss"))
1
>>> server.commit()

As before, I looked in the table with mysql. Again - this stores the full description from the DE line, although with the newlines embedded.

So, Biopython is consistent with my old copy of BioPerl (1.5.x) if we ignore the white space. However, how does this look in BioPerl 1.6? If this is the same, are there any plans to change this? For Biopython we have discussed recording most of the DE information under the annotations instead (keyed off RecName, AltName, Contains, Flags), but I would like to be consistent with BioPerl+BioSQL.

Thanks

Peter

From biopython at maubp.freeserve.co.uk Sat May 16 11:53:07 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 16 May 2009 12:53:07 +0100
Subject: [BioSQL-l] Recording "nucleotide" in the sequence table?
Message-ID: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com>

Hi all,

You may recall a year ago or so, we talked about how BioPerl and Biopython used lower case alphabet names ("dna", "rna", "protein") while BioJava was inconsistent and used upper (or even mixed case).

http://lists.open-bio.org/pipermail/biopython/2007-November/003894.html
http://lists.open-bio.org/pipermail/biojava-l/2007-November/006034.html
http://lists.open-bio.org/pipermail/biosql-l/2008-March/001185.html

You'll notice that thread was split over several mailing lists (and looking back, I think I missed some posts as I only read the Biopython and BioSQL lists). Anyway, this led to the following proposal:

http://www.biosql.org/wiki/Enhancement_Requests#Check_constraint_on_biosequence.alphabet

In Biopython we also use "unknown" for sequences which are not known to be "dna", "rna", "protein". I presume this was copying BioPerl.

In a recent bug report (Bug 2829) it was pointed out that we (Biopython) don't attempt to record nucleotide alphabets in BioSQL (i.e. a sequence which could be DNA or RNA but we don't know which), they just get "unknown" as their biosequence.alphabet entry.
Is there any precedent in BioPerl, BioJava or BioRuby for how to handle this? If not, I'd like to introduce and agree on "nucleotide" for this situation. Peter From biopython at maubp.freeserve.co.uk Sat May 16 12:12:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 16 May 2009 13:12:01 +0100 Subject: [BioSQL-l] BioSQL at BOSC 2009? Message-ID: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com> Hi, Will any of the key BioSQL people from the Bio* projects be at BOSC (and ISMB) this year? http://open-bio.org/wiki/BOSC_2009 There will be several people from Biopython there this year, including me and Brad Chapman who are both familiar with BioSQL. This would be a nice opportunity for further improving BioSQL compatibility between the Bio* projects - something that has been suggested in the past, e.g. http://lists.open-bio.org/pipermail/biopython/2007-November/003893.html http://lists.open-bio.org/pipermail/biojava-l/2007-November/006037.html I don't follow the BioPerl, BioJava or BioRuby mailing lists - and I doubt many of their developers follow the Biopython mailing lists. So, rather than having any BioSQL compatibility discussions split over individual Bio* project specific mailing lists, it seems using the BioSQL mailing list is most appropriate. I have CC'd a few key people just in case they are not on the BioSQL mailing list, if I have missed anyone please forward this to them and ask them to sign up. Thanks, Peter From markjschreiber at gmail.com Sat May 16 14:58:19 2009 From: markjschreiber at gmail.com (Mark Schreiber) Date: Sat, 16 May 2009 22:58:19 +0800 Subject: [BioSQL-l] Recording "nucleotide" in the sequence table? 
In-Reply-To: <93b45ca50905160755o4e5c9520n55bc5b84774f277a@mail.gmail.com> References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com> <93b45ca50905160755o4e5c9520n55bc5b84774f277a@mail.gmail.com> Message-ID: <93b45ca50905160758j7c9f1d78k9ec49008d10f2e4f@mail.gmail.com> I don't think you can do this with certainty. If you don't know the source alphabet then an amino acid sequence could look like dna if it is only using acgt and some of the ambiguity codes. If it is a long sequence it will become increasingly unlikely that it is amino acid, but never certain. On 16 May 2009, 7:54 PM, "Peter" wrote: Hi all, You may recall a year ago or so, we talked about how BioPerl and Biopython used lower case alphabet names ("dna", "rna", "protein") while BioJava was inconsistent and used upper (or even mixed case). http://lists.open-bio.org/pipermail/biopython/2007-November/003894.html http://lists.open-bio.org/pipermail/biojava-l/2007-November/006034.html http://lists.open-bio.org/pipermail/biosql-l/2008-March/001185.html You'll notice that thread was split over several mailing lists (and looking back, I think I missed some posts as I only read the Biopython and BioSQL lists). Anyway, this led to the following proposal: http://www.biosql.org/wiki/Enhancement_Requests#Check_constraint_on_biosequence.alphabet In Biopython we also use "unknown" for sequences which are not known to be "dna", "rna", "protein". I presume this was copying BioPerl. In a recent bug report (Bug 2829) it was pointed out that we (Biopython) don't attempt to record nucleotide alphabets in BioSQL (i.e. a sequence which could be DNA or RNA but we don't know which), they just get "unknown" as their biosequence.alphabet entry. Is there any precedent in BioPerl, BioJava or BioRuby for how to handle this? If not, I'd like to introduce and agree on "nucleotide" for this situation.
Peter _______________________________________________ BioSQL-l mailing list BioSQL-l at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biosql-l From hlapp at gmx.net Sat May 16 15:17:39 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 11:17:39 -0400 Subject: [BioSQL-l] BioSQL at BOSC 2009? In-Reply-To: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com> References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com> Message-ID: <1BD503B3-D805-4882-87DD-820138792DB2@gmx.net> On May 16, 2009, at 8:12 AM, Peter wrote: > Will any of the key BioSQL people from the Bio* projects be at BOSC > (and ISMB) this year? http://open-bio.org/wiki/BOSC_2009 Yes, I'll be there (though I am not presenting this year). > [...] This would be a nice opportunity for further improving BioSQL > compatibility between the Bio* projects - something that has been > suggested in the past, Indeed, excellent idea. Should we plan for a BoF? -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sat May 16 16:48:40 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 12:48:40 -0400 Subject: [BioSQL-l] Recording "nucleotide" in the sequence table? In-Reply-To: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com> References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com> Message-ID: <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net> On May 16, 2009, at 7:53 AM, Peter wrote: > In a recent bug report (Bug 2829) it was pointed out that we > (Biopython) don't attempt to record nucleotide alphabets in BioSQL > (i.e. a sequence which could be DNA or RNA but we don't know which), > they just get "unknown" as their biosequence.alphabet entry. I'm assuming that you do know that it's not protein, right? I.e., assigning alphabet "unknown" isn't exactly right. 
> Is there any precedent in BioPerl, BioJava or BioRuby for how to > handle this? If not, I'd like to introduce and agree on "nucleotide" > for this situation. So which letters (symbols) does the "nucleotide" alphabet contain? Getting back to Mark's question, how do you know that it's either dna or rna but not protein? Is the problem that the user can't tell you whether it's dna or rna but they know it's not protein, or is it that the user doesn't say anything and all you have is the symbols of the sequence, which are a, c, g, and t only. In BioPerl we'll guess the alphabet if the user doesn't say what it is, and at present if what we're seeing are the symbols a, c, g, and t only, then the guess is dna. If we're seeing u rather than t, we guess it's rna. An "unknown" alphabet would be for the user to expressly choose. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Sat May 16 20:25:21 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 16 May 2009 21:25:21 +0100 Subject: [BioSQL-l] Recording "nucleotide" in the sequence table? In-Reply-To: <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net> References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com> <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net> Message-ID: <320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com> Hilmar wrote: > I'm assuming that you do know that it's not protein, right? > I.e., assigning alphabet "unknown" isn't exactly right. Yes, if the sequence is using the generic nucleotide alphabet this means it is NOT protein, and could be DNA or RNA. So yes, downgrading a "nucleotide" alphabet to just "unknown" when storing it in BioSQL (as we do now) is losing information - hence me starting this thread. > > Is there any precedent in BioPerl, BioJava or BioRuby for how to > > handle this? 
If not, I'd like to introduce and agree on "nucleotide" > > for this situation. > > So which letters (symbols) does the "nucleotide" alphabet contain? Potentially anything - although I would expect the standard (ambiguous) letters used in RNA or DNA, plus perhaps gap symbols. > Getting back to Mark's question, how do you know that it's either dna or > rna but not protein? We know because the user (or parser) has explicitly used the generic nucleotide alphabet; this means it is not protein, and is either DNA or RNA. From the point of view of loading the sequence into BioSQL, we don't know or care where the sequence came from - we just get given the data with a declared alphabet. > Is the problem that the user can't tell you whether it's dna or > rna but they know it's not protein, or is it that the user doesn't > say anything and all you have is the symbols of the sequence, > which are a, c, g, and t only. In the situation I'm talking about, either the user has explicitly picked the alphabet, or perhaps one of our parsers has done so. This would be because the user doesn't know, or the file format doesn't specify this information. This is admittedly a corner case - generally there will either be T or U entries in the sequence so DNA or RNA can be deduced unambiguously. > In BioPerl we'll guess the alphabet if the user doesn't say what it is, and > at present if what we're seeing are the symbols a, c, g, and t only, then > the guess is dna. If we're seeing u rather than t, we guess it's rna. An > "unknown" alphabet would be for the user to expressly choose. What would BioPerl do with the nucleotide sequence GCGCGCGA? Presumably you guess, thus record either "dna" or "rna" in BioSQL, so the issue of wanting to record "nucleotide" never arises. In Python "guessing" is discouraged. If we have a nucleotide sequence like GCGCGCGA, this could be DNA or RNA - you can't tell.
Our nucleotide alphabet covers this situation, although another strong reason for having it is as a common base class for the RNA and DNA alphabets. On 5/16/09, Mark Schreiber wrote: > I don't think you can do this with certainty. If you don't know the source > alphabet then an amino acid sequence could look like dna if it is only > using acgt and some of the ambiguity codes. > > If it is a long sequence it will become increasingly unlikely it is amino > acid but never certain. The Python answer is: don't guess. If you read in a FASTA file with Biopython it will by default be given a generic alphabet, unless you explicitly specify otherwise (and in BioSQL the alphabet will be stored as "unknown"). I.e. the onus is on the user to be explicit. Peter From biopython at maubp.freeserve.co.uk Sat May 16 21:23:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 16 May 2009 22:23:04 +0100 Subject: [BioSQL-l] BioSQL at BOSC 2009? In-Reply-To: <1BD503B3-D805-4882-87DD-820138792DB2@gmx.net> References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com> <1BD503B3-D805-4882-87DD-820138792DB2@gmx.net> Message-ID: <320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com> On 5/16/09, Hilmar Lapp wrote: > > On May 16, 2009, at 8:12 AM, Peter wrote: > > > Will any of the key BioSQL people from the Bio* projects be at BOSC > > (and ISMB) this year? http://open-bio.org/wiki/BOSC_2009 > > > > Yes, I'll be there (though I am not presenting this year). > > > [...] This would be a nice opportunity for further improving BioSQL > > compatibility between the Bio* projects - something that has been > > suggested in the past, > > Indeed, excellent idea. Should we plan for a BoF? If you want to do this as a formal BoF, then sure. Brad and I (plus other Biopython folk like Tiago and Bartek, who I believe are not so interested in BioSQL) are already talking about a Biopython BoF/hackathon session at BOSC.
It would be easier if that didn't overlap with a BioSQL session ;) (but not impossible - Brad and I can perhaps split our time?) I will be staying for all of ISMB, and I think Brad is about for the Monday and maybe Tuesday, so that might be an alternative for scheduling. Peter From hlapp at gmx.net Sat May 16 21:57:15 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 17:57:15 -0400 Subject: [BioSQL-l] BioSQL at BOSC 2009? In-Reply-To: <320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com> References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com> <1BD503B3-D805-4882-87DD-820138792DB2@gmx.net> <320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com> Message-ID: <74D4CC78-FC7B-4595-9D24-EB6B3ED43318@gmx.net> On May 16, 2009, at 5:23 PM, Peter wrote: > I will be staying for all of ISMB I am too. Should we doodle something once the program is out? -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sat May 16 22:10:43 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 18:10:43 -0400 Subject: [BioSQL-l] Recording "nucleotide" in the sequence table? In-Reply-To: <320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com> References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com> <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net> <320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com> Message-ID: <9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net> I think we'll have to define carefully what we mean by "generic nucleotide alphabet". (Normally I hear nucleotide used as the type of a sequence, but not its alphabet.) A nucleotide alphabet in the way you describe it also can't really be the "base class" for either a DNA or RNA alphabet, can it? Typically in OOP, derived classes expand on a base class, not restrict it. 
So isn't there potential for confusion? What you are essentially talking about is the case when a sequence contains only A, C, and G. In that case, we don't know either that it's not protein, do we? > [...] In python "guessing" is discouraged. If we have a nucleotide > sequence > like GCGCGCGA, this could be DNA or RNA - you can't tell. And how do you tell it's nucleotide to begin with? -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sat May 16 22:34:57 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 18:34:57 -0400 Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> Message-ID: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> Don't you love SwissProt (or UniProt as we must call it now I suppose). They (understandably) try to squeeze ever more annotation into the existing tags, rather than adding new tags. So, of the following structure: DE RecName: Full=11S globulin seed storage protein 2; DE AltName: Full=11S globulin seed storage protein II; DE AltName: Full=Alpha-globulin; DE Contains: DE RecName: Full=11S globulin seed storage protein 2 acidic chain; DE AltName: Full=11S globulin seed storage protein II acidic chain; DE Contains: DE RecName: Full=11S globulin seed storage protein 2 basic chain; DE AltName: Full=11S globulin seed storage protein II basic chain; DE Flags: Precursor; really only the first line, with the 'RecName: Full=' removed, is the description line as we know it. The rest, I would say, is annotation, such as two alternative names, amino acid chains contained in the full record (shouldn't this be feature annotation, really? 
and indeed it is - why it needs to be repeated here is beyond me) and their names as well as alternative names, and the fact that the sequence is a precursor form. Leaving all this in one string has the advantage that we can round-trip it (and there is probably hardly any other way to accomplish that), but clearly in terms of semantics this isn't the sequence description as we know it anymore. Does anyone else think too that completely changing the semantics of sequence annotation fields is a bad idea? My inclination from a BioPerl perspective is to extract the part following 'RecName: Full=' as the description, and attach the rest as annotation. We could in fact use the TagTree class for this. I'm cross-posting to BioPerl too to gather what other BioPerl'ers think about this. -hilmar On May 14, 2009, at 2:20 PM, Peter wrote: > Hi, > > This is cross-posted between biopython-dev and biosql-l as it regards > parsing the description (DE) lines in SwissProt files and how they are > stored in BioSQL. This follows from an earlier discussion on > biopython-dev > > Older SwissProt files just had one or two DE lines, and it made sense > to treat this as a simple string mapped onto the description field in > the bioentry table in BioSQL. This appears to what happens with > BioPerl 1.5.x and in Biopython (although the details regarding white > space differ). However, newer SwissProt files have many DE lines with > additional structure.
The example Michiel gave earlier on the > biopython-dev list was: > > http://www.uniprot.org/uniprot/Q9XHP0.txt > > This has the following DE lines: > > DE RecName: Full=11S globulin seed storage protein 2; > DE AltName: Full=11S globulin seed storage protein II; > DE AltName: Full=Alpha-globulin; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 acidic chain; > DE AltName: Full=11S globulin seed storage protein II acidic > chain; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 basic chain; > DE AltName: Full=11S globulin seed storage protein II basic chain; > DE Flags: Precursor; > > I had to fight with perl to get my old copy of BioPerl working again > (some week reference thing), but I managed, and then loaded this file > into my test BioSQL database with: > > $ perl load_seqdatabase.pl --dbname biosql_test --dbuser root --dbpass > XXX --namespace biosql_test --format swiss Q9XHP0.txt > > Then I looked at the resulting description in the main bioentry table: > > $ mysql --user=root -p biosql_test -e 'SELECT description FROM > bioentry WHERE accession="Q9XHP0";' > > This is stored as one huge long string (without the newlines, I'm not > sure if BioPerl strips those in parsing the file, or when loading it > into the database): > > RecName: Full=11S globulin seed storage protein 2; AltName: Full=11S > globulin seed storage protein II; AltName: Full=Alpha-globulin; > Contains: RecName: Full=11S globulin seed storage protein 2 acidic > chain; AltName: Full=11S globulin seed storage protein II acidic > chain; Contains: RecName: Full=11S globulin seed storage protein 2 > basic chain; AltName: Full=11S globulin seed storage protein II basic > chain; Flags: Precursor; > > For Biopython, I emptied the database then did: > >>>> from Bio import SeqIO >>>> from BioSQL import BioSeqDatabase >>>> server = BioSeqDatabase.open_database(driver="MySQLdb", >>>> user="root", passwd = "XXX", host = "localhost", db="biosql_test") >>>> db = 
server["biosql-test"] #namespace >>>> db.load(SeqIO.parse(open("Q9XHP0.txt"), "swiss")) > 1 >>>> server.commit() > > As before, I looked in the table with mysql. Again - this stores the > full description from the DE line, although with the newlines > embedded. So, Biopython is consistent with my old copy of BioPerl > (1.5.x) if we ignore the white space. > > However, how does this look in BioPerl 1.6? If this is the same, are > there any plans to change this? For Biopython we have discussed > recording most of the DE information under the annotations instead > (keyed off RecName, AltName, Contains, Flags), but I would like to be > consistent with BioPerl+BioSQL. > > Thanks > > Peter > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Sat May 16 23:06:41 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 May 2009 00:06:41 +0100 Subject: [BioSQL-l] Recording "nucleotide" in the sequence table? In-Reply-To: <9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net> References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com> <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net> <320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com> <9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net> Message-ID: <320fb6e00905161606l5fdb0862mf25a45dad07dac8@mail.gmail.com> On 5/16/09, Hilmar Lapp wrote: > > I think we'll have to define carefully what we mean by "generic nucleotide > alphabet". (Normally I hear nucleotide used as the type of a sequence, but > not its alphabet.) In Biopython the type of a sequence (e.g. DNA, RNA or Protein) is recorded by an alphabet object (which may also record the expected range of letters). 
> A nucleotide alphabet in the way you describe it also can't really be the > "base class" for either a DNA or RNA alphabet, can it? Typically in OOP, > derived classes expand on a base class, not restrict it. So isn't there > potential for confusion? Well, that's how it was done for the Biopython alphabet classes. I'm simplifying slightly, but at the top level we have a generic alphabet, which has as children generic protein and generic nucleotide (which has as its children generic dna and generic rna). Each of these then has IUPAC subclasses which are further restrictions where the valid letters are prescribed. > What you are essentially talking about is the case when a sequence > contains only A, C, and G. In that case, we don't know either that > it's not protein, do we? > > > [...] In python "guessing" is discouraged. If we have a nucleotide > > sequence like GCGCGCGA, this could be DNA or RNA - you can't > > tell. > > And how do you tell it's nucleotide to begin with? That is the whole point. When deciding what to record in the biosequence.alphabet field in BioSQL we (Biopython) can only go by the alphabet associated with the sequence object. Whoever created the sequence specified the alphabet based on metadata, external knowledge, or a guess. If this was done by a parser, then the file format itself may have specified the sequence type. If none of BioPerl, BioJava and BioRuby have an analogous sequence representation for a nucleotide sequence which might be DNA or RNA, then perhaps the current situation with only "protein", "dna", "rna" and "unknown" in the biosequence.alphabet field in BioSQL is sufficient.
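To make the hierarchy and the current mapping concrete, here is a minimal, self-contained sketch. The class names are illustrative stand-ins for the alphabet classes described above (they are not the real Biopython classes), and biosql_alphabet is a hypothetical helper name; the only values it returns are the four strings currently used in biosequence.alphabet.

```python
# Illustrative stand-ins for the hierarchy described above; these are NOT
# the real Biopython alphabet classes, just a self-contained sketch.
class GenericAlphabet:
    letters = None  # None means "no restriction on the letters"


class GenericProtein(GenericAlphabet):
    pass


class GenericNucleotide(GenericAlphabet):
    pass


class GenericDNA(GenericNucleotide):
    pass


class GenericRNA(GenericNucleotide):
    pass


class IUPACUnambiguousDNA(GenericDNA):
    letters = "GATC"  # a subclass further restricts the valid letters


def biosql_alphabet(alphabet):
    """Hypothetical helper: map an alphabet object to biosequence.alphabet.

    Only the declared alphabet is consulted; the sequence letters are never
    inspected, so nothing is guessed.
    """
    if isinstance(alphabet, GenericDNA):
        return "dna"
    if isinstance(alphabet, GenericRNA):
        return "rna"
    if isinstance(alphabet, GenericProtein):
        return "protein"
    # A generic nucleotide alphabet (DNA or RNA, but not protein) has no
    # dedicated value, so it is downgraded to "unknown" along with the
    # fully generic alphabet.
    return "unknown"
```

With this mapping an IUPAC DNA alphabet is stored as "dna", while a generic nucleotide alphabet ends up as "unknown", which is the loss of information this thread is about.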
Peter From biopython at maubp.freeserve.co.uk Sat May 16 23:14:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 May 2009 00:14:54 +0100 Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> Message-ID: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com> On 5/16/09, Hilmar Lapp wrote: > > Don't you love SwissProt (or UniProt as we must call it now I suppose). > They (understandably) try to squeeze ever more annotation into the existing > tags, rather than adding new tags. > > So, of the following structure: > > DE RecName: Full=11S globulin seed storage protein 2; > DE AltName: Full=11S globulin seed storage protein II; > DE AltName: Full=Alpha-globulin; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 acidic chain; > DE AltName: Full=11S globulin seed storage protein II acidic chain; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 basic chain; > DE AltName: Full=11S globulin seed storage protein II basic chain; > DE Flags: Precursor; > > really only the first line, with the 'RecName: Full=' removed, is the > description line as we know it. The rest, I would say, is annotation, such > as two alternative names, amino acid chains contained in the full record > (shouldn't this be feature annotation, really? and indeed it is - why it > needs to be repeated here is beyond me) and their names as well as > alternative names, and the fact that the sequence is a precursor form. > > Leaving all this in one string has the advantage that we can round-trip it > (and there is probably hardly any other way to accomplish that), but clearly > in terms of semantics this isn't the sequence description as we know it > anymore. 
> > Does anyone else think too that completely changing the semantics of > sequence annotation fields is a bad idea? +1 That's pretty much what I thought on seeing this the first time. > My inclination from a BioPerl perspective is to extract the part following > 'RecName: Full=' as the description, and attach the rest as annotation. We > could in fact use the TagTree class for this. I'm cross-posting to BioPerl > too to gather what other BioPerl'ers think about this. Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x, just treats the DE lines as one big long string? Could you translate your idea about the TagTree class into something concrete with BioSQL tables and fields for me? I'm not familiar with the TagTree (or Perl). Over on the Biopython list we'd talked about storing this annotation in a nested structure. However, in order to use the BioSQL annotation mechanisms, I think a simple flat structure is required :( Peter From cjfields at illinois.edu Sat May 16 23:16:05 2009 From: cjfields at illinois.edu (Chris Fields) Date: Sat, 16 May 2009 18:16:05 -0500 Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> Message-ID: <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote: > Don't you love SwissProt (or UniProt as we must call it now I > suppose). They (understandably) try to squeeze ever more annotation > into the existing tags, rather than adding new tags.
> > So, of the following structure: > > DE RecName: Full=11S globulin seed storage protein 2; > DE AltName: Full=11S globulin seed storage protein II; > DE AltName: Full=Alpha-globulin; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 acidic chain; > DE AltName: Full=11S globulin seed storage protein II acidic > chain; > DE Contains: > DE RecName: Full=11S globulin seed storage protein 2 basic chain; > DE AltName: Full=11S globulin seed storage protein II basic chain; > DE Flags: Precursor; > > really only the first line, with the 'RecName: Full=' removed, is > the description line as we know it. The rest, I would say, is > annotation, such as two alternative names, amino acid chains > contained in the full record (shouldn't this be feature annotation, > really? and indeed it is - why it needs to be repeated here is > beyond me) and their names as well as alternative names, and the > fact that the sequence is a precursor form. > > Leaving all this in one string has the advantage that we can round- > trip it (and there is probably hardly any other way to accomplish > that), but clearly in terms of semantics this isn't the sequence > description as we know it anymore. > > Does anyone else think too that completely changing the semantics of > sequence annotation fields is a bad idea? > > My inclination from a BioPerl perspective is to extract the part > following 'RecName: Full=' as the description, and attach the rest > as annotation. We could in fact use the TagTree class for this. I'm > cross-posting to BioPerl too to gather what other BioPerl'ers think > about this. > > -hilmar This is much like the GN issues we've run into before, and we *could* set this up using TagTree or similar. 
In the latter case of gene name the data is stored in a text tree as follows: gene_names: gene_name: Name: GC1QBP Synonyms: HABP1 Synonyms: SF2P32 Synonyms: C1QBP That could be changed to an XML string: GC1QBP HABP1 SF2P32 C1QBP Thinking about this we should attempt to coalesce around a standard instead of forcing the other Bio* to a specific format. chris From biopython at maubp.freeserve.co.uk Sat May 16 23:28:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 May 2009 00:28:43 +0100 Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> Message-ID: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> On 5/17/09, Chris Fields wrote: > > On May 16, 2009, at 5:34 PM, Hilmar Lapp wrote: > > My inclination from a BioPerl perspective is to extract the part following > > 'RecName: Full=' as the description, and attach the rest as annotation. We > > could in fact use the TagTree class for this. I'm cross-posting to BioPerl > > too to gather what other BioPerl'ers think about this. > > > > -hilmar > > > > This is much like the GN issues we've run into before, and we *could* set > this up using TagTree or similar. In the latter case of gene name the data > is stored in a text tree as follows: > > gene_names: > gene_name: > Name: GC1QBP > Synonyms: HABP1 > Synonyms: SF2P32 > Synonyms: C1QBP > > That could be changed to an XML string: > > > > > GC1QBP > HABP1 > SF2P32 > C1QBP > > > > Thinking about this we should attempt to coalesce around a standard instead > of forcing the other Bio* to a specific format. How would you record this in BioSQL? As an XML string for an annotation value? 
Brad has suggested JSON might be useful for this kind of thing (see also per-letter-annotation discussion). Peter From hlapp at gmx.net Sat May 16 23:37:14 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 19:37:14 -0400 Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> Message-ID: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> On May 16, 2009, at 7:28 PM, Peter wrote: >> That could be changed to an XML string: >> >> >> >> >> GC1QBP >> HABP1 >> SF2P32 >> C1QBP >> >> >> >> Thinking about this we should attempt to coalesce around a standard >> instead >> of forcing the other Bio* to a specific format. > > How would you record this in BioSQL? As an XML string for an > annotation value? Yes. A TagTree object can be serialized to XML, and the XML can be stored as the annotation value in BioSQL. As the XML can be read back in, it allows full round-tripping. > Brad has suggested JSON might be useful for this kind of thing (see > also per-letter-annotation discussion). JSON could be another serialization format, but XML is equally or better supported in all languages except JavaScript. Furthermore, you could just send the XML to the browser and have an XSLT (either directly, or indirectly through JavaScript doing the transformation) do the rendering. 
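The round trip described above (nested structure, to an XML string stored as the annotation value, and back again) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library's xml.etree.ElementTree rather than BioPerl's TagTree, and the nested (tag, value) representation, the element names and the helper names to_xml and from_xml are made up for the example; they are not TagTree's actual serialization format.

```python
import xml.etree.ElementTree as ET

# Nested annotation as (tag, value) pairs, where value is either a string
# or a list of further pairs. Repeated tags (e.g. Synonyms) keep their order.
gene_names = ("gene_names", [
    ("gene_name", [
        ("Name", "GC1QBP"),
        ("Synonyms", "HABP1"),
        ("Synonyms", "SF2P32"),
        ("Synonyms", "C1QBP"),
    ]),
])


def to_element(node):
    """Convert one (tag, value) pair into an XML element, recursively."""
    tag, value = node
    elem = ET.Element(tag)
    if isinstance(value, str):
        elem.text = value
    else:
        for child in value:
            elem.append(to_element(child))
    return elem


def to_xml(node):
    """Serialize the nested structure to a single flat XML string."""
    return ET.tostring(to_element(node), encoding="unicode")


def from_element(elem):
    """Convert an XML element back into a (tag, value) pair, recursively."""
    children = list(elem)
    if children:
        return (elem.tag, [from_element(child) for child in children])
    return (elem.tag, elem.text or "")


def from_xml(text):
    """Parse the XML string back into the nested structure."""
    return from_element(ET.fromstring(text))


# The string below is what would be stored as the annotation value.
xml_value = to_xml(gene_names)
restored = from_xml(xml_value)
```

Because the whole tree travels as one string, the database only ever sees a flat tag/value pair, while the nesting survives the trip in and out.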
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sat May 16 23:42:17 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 16 May 2009 19:42:17 -0400 Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <320fb6e00905161614p5a2d964fs6de110bb915ed066@mail.gmail.com> Message-ID: <8CD4EED1-A689-447F-8F6E-8D2204DD4E86@gmx.net> On May 16, 2009, at 7:14 PM, Peter wrote: > Am I right to infer that currently BioPerl 1.6.x, like BioPerl 1.5.x > just > treats the DE lines as only big long string? Yes. > Could you translate your idea about the TagTree class into something > concrete with BioSQL tables and fields for me? [...] Over on the > Biopython list we'd talked about storing this annotation in a nested > structured. That's more or less what TagTree is. > However, in order to use the BioSQL annotations mechanisms, I think > a simple flat structure is required :( Not necessarily. If you have a flat serialization (such as XML) the nested structure isn't needed. Of course that's not a fully normalized relational representation, but if you had one, how often would it be used, how efficient would those queries be (SQL is poor at nested or recursive data structures), and how much pain would it be to write the object-relational mappings? 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Sun May 17 12:40:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 17 May 2009 13:40:47 +0100 Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> Message-ID: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> On 5/17/09, Hilmar Lapp wrote: > > On May 16, 2009, at 7:28 PM, Peter wrote: > > > That could be changed to an XML string: > > > > > > > > > > > > > > > GC1QBP > > > HABP1 > > > SF2P32 > > > C1QBP > > > > > > > > > > > > Thinking about this we should attempt to coalesce around a standard > > > instead of forcing the other Bio* to a specific format. Absolutely - some common standard should be agreed. Would you envision doing this for other structured fields, inventing a new mini XML format each time? That seems open-ended and likely to cause a lot of work keeping all the Bio* projects synchronised. Here you have mapped RecName and AltName fields in the DE lines to Name and Synonyms (shouldn't that be Synonym singular?). I also don't get why you have used a gene_name entry inside a gene_names list. Would you hold the contains information and the flags information from the DE lines in separate XML entries? I would have gone for something much closer to the original DE line markup, i.e. using the field names UniProt use, RecName and AltName, rather than mapping these to Name and Synonym. > > How would you record this in BioSQL?
As an XML string for an annotation > > value? > > Yes. A TagTree object can be serialized to XML, and the XML can be stored > as the annotation value in BioSQL. As the XML can be read back in, it allows > full round-tripping. Assuming you stored all the DE markup, then yes, a round trip back to the SwissProt file could be possible. And, depending on the details of the XML structure used, it would be possible to represent this in a python structure too. > > Brad has suggested JSON might be useful for this kind of thing (see > > also per-letter-annotation discussion). > > JSON could be another serialization format, but XML is equally or better > supported in all languages except JavaScript. Furthermore, you could just > send the XML to the browser and have an XSLT (either directly, or indirectly > through JavaScript doing the transformation) do the rendering. I have no strong preference for either XML or JSON (but would rather avoid them if they are not really needed). For other types of annotation there may be a clearer advantage for one over the other, e.g. per letter annotation like the secondary structure of a protein sequence, or the quality scores of a nucleotide contig. On 5/17/09, Hilmar Lapp wrote: > Not necessarily. If you have a flat serialization (such as XML) the nested > structure isn't needed. Of course that's not a fully normalized relational > representation, but if you had one, how often would it be used, how > efficient would those queries be (SQL is poor at nested or recursive data > structures), and how much pain would it be to write the object-relational > mappings? In this example, searching the database using one of the SwissProt AltNames (synonyms), or filtering on the Flags sounds like a reasonable request - but this would be very difficult if the data is stored inside XML strings. Of course, because the RecName and AltName entries are top level, we could just record them as normal - simple strings in the annotations table. 
This seems much nicer. Likewise the "Flags: Precursor;" line, i.e. listing the tag/value pairs which could be used in the bioentry_qualifier_value table: AltName = "Full=11S globulin seed storage protein II" AltName = "Full=Alpha-globulin" Flags = "Precursor" (the RecName field, "Full=11S globulin seed storage protein 2", could be used for the bioentry.description instead) The above are all pretty easy. We only need to consider nesting (or something like XML or JSON) for some of the DE information, in the example discussed, the Contains lines. Even this could be done by storing each Contains entry as a single long string (holding both the name and synonyms) directly from the DE line itself, something like this: Contains = "RecName: Full=11S globulin seed storage protein 2 acidic chain;\nAltName: Full=11S globulin seed storage protein II acidic chain;" Contains = "RecName: Full=11S globulin seed storage protein 2 basic chain;\nAltName: Full=11S globulin seed storage protein II basic chain;" Peter From sanjay.harke at gmail.com Sun May 17 13:17:14 2009 From: sanjay.harke at gmail.com (Sanjay Harke) Date: Sun, 17 May 2009 18:47:14 +0530 Subject: [BioSQL-l] BioSQL-l Digest, Vol 62, Issue 3 In-Reply-To: References: Message-ID: <31bb4380905170617k47951f83ia5bed32577a02956@mail.gmail.com> Dear peter, Kindly guide me for developing the connectivity of BioSql to Bioperl? sanjay From hlapp at gmx.net Sun May 17 14:56:29 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 17 May 2009 10:56:29 -0400 Subject: [BioSQL-l] BioSQL-l Digest, Vol 62, Issue 3 In-Reply-To: <31bb4380905170617k47951f83ia5bed32577a02956@mail.gmail.com> References: <31bb4380905170617k47951f83ia5bed32577a02956@mail.gmail.com> Message-ID: http://dx.doi.org/10.1038/npre.2007.1233.1 On May 17, 2009, at 9:17 AM, Sanjay Harke wrote: > Dear peter, > > Kindly guide me for developing the connectivity of BioSql to Bioperl?
> > sanjay > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun May 17 15:21:59 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 17 May 2009 11:21:59 -0400 Subject: [BioSQL-l] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> Message-ID: On May 17, 2009, at 8:40 AM, Peter wrote: > On 5/17/09, Hilmar Lapp wrote: >> >> On May 16, 2009, at 7:28 PM, Peter wrote: >>>> That could be changed to an XML string: >>>> >>>> >>>> >>>> >>>> GC1QBP >>>> HABP1 >>>> SF2P32 >>>> C1QBP >>>> >>>> >>>> >>>> Thinking about this we should attempt to coalesce around a standard >>>> instead of forcing the other Bio* to a specific format. > > [...] Here you have mapped RecName and AltName fields in the DE > lines to > Name and Synonyms (shouldn't that be Synonym singular?). The example is for the GN lines in SwissProt, not the DE lines. > [...] > On 5/17/09, Hilmar Lapp wrote: >> Not necessarily. If you have a flat serialization (such as XML) the >> nested >> structure isn't needed. 
Of course that's not a fully normalized >> relational >> representation, but if you had one, how often would it be used, how >> efficient would those queries be (SQL is poor at nested or >> recursive data >> structures), and how much pain would it be to write the object- >> relational >> mappings? > > In this example, searching the database using one of the SwissProt > AltNames (synonyms), or filtering on the Flags sounds like a > reasonable request - but this would be very difficult if the data is > stored inside XML strings. Actually no. Modern full-text indexers (inside or outside the database) can index XML text columns right away and very well. In fact, for the last project that I built a full-text search for (on top of a BioSQL database) I did that by writing custom XML documents to a separate table for each record I wanted indexed. Oracle's full text indexer did the rest. I also built a separate identifier/name/ accession index that pulled all the gene names, symbols, accession numbers, identifiers etc into a single table for indexing. What I mean is, a fully normalized relational representation, especially if nested, is often not the most efficient data structure for efficient searching and filtering. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Mon May 18 10:03:52 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 18 May 2009 11:03:52 +0100 Subject: [BioSQL-l] Recording "nucleotide" in the sequence table? 
In-Reply-To: <320fb6e00905161606l5fdb0862mf25a45dad07dac8@mail.gmail.com> References: <320fb6e00905160453q545cecf7he913b4bb71e56223@mail.gmail.com> <48684763-5657-40F3-BE75-31E8CDE42613@gmx.net> <320fb6e00905161325m497882ddge853e1525acdfede@mail.gmail.com> <9A3473A7-2125-486C-BAAE-27820FF19D8D@gmx.net> <320fb6e00905161606l5fdb0862mf25a45dad07dac8@mail.gmail.com> Message-ID: <320fb6e00905180303m19d0c6e0hdc22ff550e518c6c@mail.gmail.com> On Sun, May 17, 2009 at 12:06 AM, Peter wrote: > If none of BioPerl, BioJava and BioRuby have an analogous > sequence representation for a nucleotide sequence which > might be DNA or RNA, then perhaps the current situation > with only "protein", "dna", "rna" and "unknown" in the > biosequence.alphabet field in BioSQL is sufficient. The original Biopython bug reporter (Bug 2829, David Wyllie) has replied on the bug. In his case, rather than using the generic nucleotide alphabet, he can be a bit more explicit since he does actually know his sequence is DNA, and this does get recorded in BioSQL fine. Given the "nucleotide" alphabet is a corner case in Biopython, and has no analogue in BioPerl, the status quo is fine. i.e. The biosequence.alphabet field should contain "dna", "rna", "protein" or "unknown" (in lower case). Thanks for your thoughts everyone. Peter From michael.watson at bbsrc.ac.uk Mon May 18 12:45:19 2009 From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C)) Date: Mon, 18 May 2009 13:45:19 +0100 Subject: [BioSQL-l] Full text indexing/Searching in MySQL Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> Hi Has anyone implemented full text indexing/searching for BioSQL in MySQL, either using MySQL's full text features or any other solution? Any tips, advice, documentation, code etc available? Thanks Mick Head of Bioinformatics Institute for Animal Health Compton Berks RG20 7NN 01635 578411 Please consider the environment and don't print this e-mail unless you really need to. 
The information contained in this message may be confidential or legally privileged and is intended solely for the addressee. If you have received this message in error please delete it & notify the originator immediately. Unauthorised use, disclosure, copying or alteration of this message is forbidden & may be unlawful. The contents of this e-mail are the views of the sender and do not necessarily represent the views of the Institute. This email, and associated attachments, has been checked locally for viruses but we can accept no responsibility once it has left our systems. Communications on Institute computers are monitored to secure the effective operation of the systems and for other lawful purposes. The Institute for Animal Health is a company limited by guarantee, registered in England no. 559784. The Institute is also a registered charity, Charity Commissioners Reference No. 228824 From hlapp at gmx.net Mon May 18 13:24:34 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 18 May 2009 09:24:34 -0400 Subject: [BioSQL-l] Full text indexing/Searching in MySQL In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> Message-ID: I've done that using Oracle, not MySQL. I assume that's therefore not what you want to hear about and hence will shut up :) -hilmar On May 18, 2009, at 8:45 AM, michael watson (IAH-C) wrote: > Hi > > > > Has anyone implemented full text indexing/searching for BioSQL in > MySQL, > either using MySQL's full text features or any other solution? > > > > Any tips, advice, documentation, code etc available? > > > > Thanks > > Mick > > > > Head of Bioinformatics > Institute for Animal Health > Compton > Berks > RG20 7NN > 01635 578411
> _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Mon May 18 13:26:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 18 May 2009 14:26:40 +0100 Subject: [BioSQL-l] Full text indexing/Searching in MySQL In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> Message-ID: <320fb6e00905180626o4855aa06v6c6ae665885a3fce@mail.gmail.com> On Mon, May 18, 2009 at 1:45 PM, michael watson (IAH-C) wrote: > > Hi > > Has anyone implemented full text indexing/searching for BioSQL in MySQL, > either using MySQL's full text features or any other solution?
> > Any tips, advice, documentation, code etc available? > > Thanks > > Mick Hilmar mentioned he has done something like this on this thread, where he was storing XML strings as annotation values: http://lists.open-bio.org/pipermail/biosql-l/2009-May/001534.html (You've probably read that - but just in case, worth mentioning). Peter From biopython at maubp.freeserve.co.uk Mon May 18 13:38:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 18 May 2009 14:38:03 +0100 Subject: [BioSQL-l] [Biopython-dev] SwissProt DE lines and bioentry.description field in BioSQL In-Reply-To: References: <320fb6e00905141120o65ab8b0ame4afaa28a7ece525@mail.gmail.com> <074A1006-2AE2-4569-AFF7-BB8931ADF776@gmx.net> <071960DD-2E14-4B00-B1B2-934F93064C89@illinois.edu> <320fb6e00905161628h7bf8e685of23305b4209585bd@mail.gmail.com> <0B3FE389-7C04-4862-B076-A159FF896CC2@gmx.net> <320fb6e00905170540u263945f6i16c70e9ce4ba9182@mail.gmail.com> Message-ID: <320fb6e00905180638q29de63c4if0627eff416c4481@mail.gmail.com> On Sun, May 17, 2009 at 4:21 PM, Hilmar Lapp wrote: > > On May 17, 2009, at 8:40 AM, Peter wrote: >> >> [...] Here you have mapped RecName and AltName fields in the DE lines to >> Name and Synonyms (shouldn't that be Synonym singular?). > > The example is for the GN lines in SwissProt, not the DE lines. Ah, that probably explains some of my confusion. >> In this example, searching the database using one of the SwissProt >> AltNames (synonyms), or filtering on the Flags sounds like a >> reasonable request - but this would be very difficult if the data is >> stored inside XML strings. > > Actually no. Modern full-text indexers (inside or outside the database) can > index XML text columns right away and very well. In fact, for the last > project that I built a full-text search for (on top of a BioSQL database) I > did that by writing custom XML documents to a separate table for each > record I wanted indexed. Oracle's full text indexer did the rest. 
I also built a > separate identifier/name/accession index that pulled all the gene names, > symbols, accession numbers, identifiers etc into a single table for > indexing. OK, when I said searching "would be very difficult if the data is stored inside XML strings", maybe it wasn't so difficult for you - but that still sounds complicated! Sticking with the GN lines and the synonym, if this was stored as a simple tag/value as usual in BioSQL, I would write my SQL statement to search the annotation table where the term id was that associated with a GN synonym, and the annotation value was "HABP1". Simple. Using the XML approach, are you suggesting you could do a full text search on the annotation value field, looking for any rows where the field contains "HABP1", where the term id matches the GN lines' XML string? This sounds simplistic and probably rather slow - presumably why you resorted to the more complicated indexing scheme described above? > What I mean is, a fully normalized relational representation, especially if > nested, is often not the most efficient data structure for efficient > searching and filtering. OK. But do we really need to worry about complex nested structures for the SwissProt annotation (or in general)? Peter From jimp at compbio.dundee.ac.uk Mon May 18 14:01:28 2009 From: jimp at compbio.dundee.ac.uk (James Procter) Date: Mon, 18 May 2009 15:01:28 +0100 Subject: [BioSQL-l] BioSQL at BOSC 2009? In-Reply-To: <74D4CC78-FC7B-4595-9D24-EB6B3ED43318@gmx.net> References: <320fb6e00905160512m66848611l432a81f22866b37f@mail.gmail.com> <1BD503B3-D805-4882-87DD-820138792DB2@gmx.net> <320fb6e00905161423l4df26525hbc9a824419c7a370@mail.gmail.com> <74D4CC78-FC7B-4595-9D24-EB6B3ED43318@gmx.net> Message-ID: <4A116A38.9050705@compbio.dundee.ac.uk> Hi all. Hilmar Lapp wrote: > On May 16, 2009, at 5:23 PM, Peter wrote: > >> I will be staying for all of ISMB Same here. > > > I am too. Should we doodle something once the program is out? 
I'll watch out for the URL if you post it to the list! Jim. -- ------------------------------------------------------------------- J. B. Procter (ENFIN/VAMSAS) Barton Bioinformatics Research Group Phone/Fax:+44(0)1382 388734/345764 http://www.compbio.dundee.ac.uk The University of Dundee is a Scottish Registered Charity, No. SC015096. From roy.chaudhuri at gmail.com Mon May 18 17:37:39 2009 From: roy.chaudhuri at gmail.com (Roy Chaudhuri) Date: Mon, 18 May 2009 18:37:39 +0100 Subject: [BioSQL-l] Full text indexing/Searching in MySQL In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> Message-ID: <4A119CE3.3080208@gmail.com> Hi Mick, > Has anyone implemented full text indexing/searching for BioSQL in MySQL, > either using MySQL's full text features or any other solution? I've kind of done this. The trouble is that full text is only implemented on the non-transactional MyISAM tables, not InnoDB (it has long been promised for InnoDB, but no sign yet). My hack solution was to parse out the fields I was interested in (feature tags such as gene and product) and include them in a separate MyISAM table, cross-referenced to BioSQL using seqfeature_id. This involves duplicating data (which is a bad thing), but should be okay if database updates are infrequent. I mimic atomic changes by building an updated version of the MyISAM table separately, then switching to use the new version at the same time as I commit the BioSQL updates. There's also Sphinx (http://www.sphinxsearch.com), which is a plug-in that can implement full-text searches in InnoDB, but I haven't experimented with that so have no idea how well it works. Cheers. Roy. 
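For anyone wanting to try the side-table approach Roy describes, here is a minimal sketch in Python that only builds the MySQL statements; it does not connect to a database. The table name seqfeature_text, the helper rebuild_statements, and the use of RENAME TABLE for the switch-over are my assumptions for illustration (Roy did not say how he does the swap). The seqfeature_qualifier_value and term tables and their columns are from the standard BioSQL schema.

```python
# Sketch only: "seqfeature_text" and rebuild_statements() are hypothetical
# names, not part of BioSQL, BioPerl or Biopython.
import re

SIDE_TABLE = "seqfeature_text"  # hypothetical MyISAM side table


def rebuild_statements(tags=("gene", "product")):
    """Return MySQL statements that rebuild the side table and swap it in."""
    for tag in tags:
        # Tags are interpolated into the SQL text, so accept only plain names.
        if not re.fullmatch(r"[A-Za-z_]+", tag):
            raise ValueError("unexpected tag name: %r" % tag)
    tag_list = ", ".join("'%s'" % tag for tag in tags)
    new, old = SIDE_TABLE + "_new", SIDE_TABLE + "_old"
    return [
        # Clear leftovers from an earlier, interrupted rebuild.
        "DROP TABLE IF EXISTS %s, %s" % (new, old),
        # Full text indexes need MyISAM here, as Roy notes.
        "CREATE TABLE %s ("
        " seqfeature_id INT UNSIGNED NOT NULL,"
        " tag VARCHAR(255) NOT NULL,"
        " value TEXT NOT NULL,"
        " FULLTEXT KEY ft_value (value)"
        ") ENGINE=MyISAM" % new,
        # Copy (duplicate) the qualifier values of interest out of BioSQL.
        "INSERT INTO %s (seqfeature_id, tag, value)"
        " SELECT q.seqfeature_id, t.name, q.value"
        " FROM seqfeature_qualifier_value q"
        " JOIN term t ON t.term_id = q.term_id"
        " WHERE t.name IN (%s)" % (new, tag_list),
        # Make sure there is something to rename on the very first run.
        "CREATE TABLE IF NOT EXISTS %s LIKE %s" % (SIDE_TABLE, new),
        # Assumption: switch versions with a single RENAME TABLE statement.
        "RENAME TABLE %s TO %s, %s TO %s" % (SIDE_TABLE, old, new, SIDE_TABLE),
        "DROP TABLE %s" % old,
    ]


# Query template; the %s placeholder is filled in by the database driver.
SEARCH_SQL = (
    "SELECT seqfeature_id, tag, value FROM %s"
    " WHERE MATCH(value) AGAINST (%%s)" % SIDE_TABLE
)
```

The statements would be run in order through whatever MySQL driver you already use, timed to coincide with committing the BioSQL updates; the seqfeature_id values returned by the search can then be joined back to the seqfeature table.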
From holland at eaglegenomics.com Mon May 18 18:20:52 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Mon, 18 May 2009 19:20:52 +0100 Subject: [BioSQL-l] Full text indexing/Searching in MySQL In-Reply-To: <4A119CE3.3080208@gmail.com> References: <8975119BCD0AC5419D61A9CF1A923E9507E279D7@iahce2ksrv1.iah.bbsrc.ac.uk> <4A119CE3.3080208@gmail.com> Message-ID: <1242670852.28726.2.camel@buzzybee> There's also Lucene, which is a Java-based full-text indexer which can be attached to all kinds of data sources, including MySQL databases: http://lucene.apache.org/java/docs/ cheers, Richard On Mon, 2009-05-18 at 18:37 +0100, Roy Chaudhuri wrote: > Hi Mick, > > > Has anyone implemented full text indexing/searching for BioSQL in MySQL, > > either using MySQL's full text features or any other solution? > > I've kind of done this. The trouble is that full text is only > implemented on the non-transactional MyISAM tables, not InnoDB (it has > long been promised for InnoDB, but no sign yet). My hack solution was to > parse out the fields I was interested in (feature tags such as gene and > product) and include them in a separate MyISAM table, cross-referenced > to BioSQL using seqfeature_id. This involves duplicating data (which is > a bad thing), but should be okay if database updates are infrequent. I > mimic atomic changes by building an updated version of the MyISAM table > separately, then switching to use the new version at the same time as I > commit the BioSQL updates. > > There's also Sphinx (http://www.sphinxsearch.com), which is a plug-in > that can implement full-text searches in InnoDB, but I haven't > experimented with that so have no idea how well it works. > > Cheers. > Roy. 
> _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From michael.watson at bbsrc.ac.uk Tue May 19 08:17:32 2009 From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C)) Date: Tue, 19 May 2009 09:17:32 +0100 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> Hi I'm using: biosql-1.0.1 bioperl-db-1.5.2_100 bioperl-1.5.2_102 When I run load_seqdatabase.pl on about 3000 GenBank sequences, I get: Loading fmd_180509.gbk ... -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were ("AY312586S1","32307407","AY312586","Foot-and-mouth disease virus O isolate O/SKR/2000 S fragment, complete 1,9762) Duplicate entry 'AY312586-1-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","1") FKs (324,3,4) Duplicate entry '324-3-4-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were ("AY312586S2","32307408","AY312587","Foot-and-mouth disease virus O isolate O/SKR/2000 L fragment, complete 1,9762) Duplicate entry 'AY312587-1-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","1") FKs (323,3,4) Duplicate entry '323-3-4-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert 
in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","2") FKs (323,22,4) Duplicate entry '323-22-4-2' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","3") FKs (323,15,4) Duplicate entry '323-15-4-3' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were ("AY312588S1","32307403","AY312588","Foot-and-mouth disease virus O isolate O/SKR/2002 S fragment, complete 1,9762) Duplicate entry 'AY312588-1-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","1") FKs (326,3,4) Duplicate entry '326-3-4-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were ("AY312588S2","32307404","AY312589","Foot-and-mouth disease virus O isolate O/SKR/2002 L fragment, complete 1,9762) Duplicate entry 'AY312589-1-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","1") FKs (325,3,4) Duplicate entry '325-3-4-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","2") FKs (325,22,4) Duplicate entry '325-22-4-2' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","3") FKs (325,15,4) Duplicate entry 
'325-15-4-3' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were ("S87919S2","247466","S87923","L [foot-and-mouth disease virus FMDV, strain CS8, Genomic RNA, 10 nt, segmen 1,9754) Duplicate entry 'S87923-1-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","1") FKs (782,3,4) Duplicate entry '782-3-4-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","2") FKs (782,13,4) Duplicate entry '782-13-4-2' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were ("S87919S1","247464","S87919","L [foot-and-mouth disease virus FMDV, strain CS8, Genomic RNA, 35 nt, segmen 1,9754) Duplicate entry 'S87919-1-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, values were ("","1") FKs (781,3,4) Duplicate entry '781-3-4-1' for key 2 --------------------------------------------------- -------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were ("","Direct Submission","Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, C-E8D3CBBD80002FA1","1","8170","") FKs () Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 --------------------------------------------------- Could not store NC_011452: ------------- EXCEPTION ------------- MSG: create: object (Bio::Annotation::Reference) failed to insert or to be found by unique key 
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SeqAdaptor.pm:224
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK (eval) load_seqdatabase.pl:622
STACK toplevel load_seqdatabase.pl:604
--------------------------------------
at load_seqdatabase.pl line 635

Any clues?
Thanks Mick Head of Bioinformatics Institute for Animal Health Compton Berks RG20 7NN 01635 578411 From biopython at maubp.freeserve.co.uk Tue May 19 09:31:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 19 May 2009 10:31:05 +0100 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> Message-ID: <320fb6e00905190231t79ac1dc9j49585929e9b5304a@mail.gmail.com> On Tue, May 19, 2009 at 9:17 AM, michael watson (IAH-C) wrote: > > Hi > > I'm using: > > biosql-1.0.1 > bioperl-db-1.5.2_100 > bioperl-1.5.2_102 > > When I run load_seqdatabase.pl on about 3000 GenBank sequences, > I get: > > Loading fmd_180509.gbk ... > ...
> --------------------------------------------------- > > Could not store NC_011452: > > ------------- EXCEPTION ------------- > > MSG: create: object (Bio::Annotation::Reference) failed to insert or to > be found by unique key > > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create > /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 > > ... > > STACK Bio::DB::Persistent::PersistentObject::store > /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271 > > STACK (eval) load_seqdatabase.pl:622 > > STACK toplevel load_seqdatabase.pl:604 > > -------------------------------------- > > at load_seqdatabase.pl line 635 > > Any clues? You got a lot of warnings about feature keys (which I am guessing are from different GenBank entries), but the failure itself seems to be related to the annotation in NC_011452. Try downloading just NC_011452 in GenBank format, and testing that: http://www.ncbi.nlm.nih.gov/nuccore/NC_011452 I would expect that to fail in the same way, and you would at least have isolated the issue to a smaller test case. If it works, then maybe the copy of NC_011452 in your file is corrupted somehow - check for differences. Peter From hlapp at gmx.net Tue May 19 12:25:25 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 19 May 2009 08:25:25 -0400 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> Message-ID: <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> On May 19, 2009, at 4:17 AM, michael watson (IAH-C) wrote: > [...]
> -------------------- WARNING --------------------- > > MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values > were > ("AY312586S1","32307407","AY312586","Foot-and-mouth disease virus O > isolate O/SKR/2000 S fragment, complete > > 1,9762) > > Duplicate entry 'AY312586-1-1' for key 2 > > --------------------------------------------------- This suggests that a sequence with the above accession or GI number was already in the database, or occurs in the file twice. If this situation is possible, you will have to pass the --lookup (or --flatlookup) flag to the script, and specify how you want updates to take place when they are necessary (options --noupdate, --remove, and --mergeobjs). > -------------------- WARNING --------------------- > > MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, > values were ("","1") FKs (324,3,4) > > Duplicate entry '324-3-4-1' for key 2 > --------------------------------------------------- I suspect that 324 is the primary key of the sequence record that raised the duplicate entry warning above. Can you check that? If the insert is turned into an update, these warnings should go away too. > [...] > -------------------- WARNING --------------------- > > MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, > values were ("","1") FKs (323,3,4) > > Duplicate entry '323-3-4-1' for key 2 > > --------------------------------------------------- Similar to before, except 323 is probably the primary key for AY312587. > [...] > -------------------- WARNING --------------------- > > MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, > values were ("","1") FKs (325,3,4) > > Duplicate entry '325-3-4-1' for key 2 > > --------------------------------------------------- And if the order of messages is preserved correctly, 325 would be the primary key of AY312589. > [...] 
> -------------------- WARNING --------------------- > > MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, > values > were ("","Direct Submission","Submitted (12-AUG-2004) National Center > for Biotechnology Information, NIH, > > C-E8D3CBBD80002FA1","1","8170","") FKs () > > Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 > > --------------------------------------------------- This one is odd. Can you check which existing entry you have with reference.crc = 'CRC-E8D3CBBD80002FA1'? -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From michael.watson at bbsrc.ac.uk Wed May 20 09:52:13 2009 From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C)) Date: Wed, 20 May 2009 10:52:13 +0100 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk> Hi Guys Ok, the warnings were due to duplicate sequences - I had downloaded a stream using Bio::DB::GenBank and I guess I assumed that would mean only unique entries were sent back. Using "--flatlookup --remove" gets rid of the warnings. Now for NC_003992.gbk... 
To answer Hilmar's question:

mysql> select * from reference where crc = "CRC-E8D3CBBD80002FA1";
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
| reference_id | dbxref_id | location                                                                                            | title             | authors | crc                  |
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+
|          152 |      NULL | Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA | Direct Submission | NULL    | CRC-E8D3CBBD80002FA1 |
+--------------+-----------+-----------------------------------------------------------------------------------------------------+-------------------+---------+----------------------+

And when I run load_seqdatabase.pl on NC_003992.gbk alone I still get:

perl load_seqdatabase.pl --host localhost --dbname fmd_biosql --format genbank --dbuser removed --dbpass removed --flatlookup --remove NC_003992.gbk

Loading NC_003992.gbk ...
-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were ("","Direct Submission","Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs ()
Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
---------------------------------------------------
Could not store NC_003992:
------------- EXCEPTION -------------
MSG: create: object (Bio::Annotation::Reference) failed to insert or to be found by unique key
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SeqAdaptor.pm:224
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create
/usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/D B/BioSQL/BasePersistenceAdaptor.pm:214 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/D B/BioSQL/BasePersistenceAdaptor.pm:251 STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/D B/Persistent/PersistentObject.pm:271 STACK (eval) load_seqdatabase.pl:622 STACK toplevel load_seqdatabase.pl:604 -------------------------------------- at load_seqdatabase.pl line 635 And I still have: mysql> select * from reference where crc = "CRC-E8D3CBBD80002FA1"; +--------------+-----------+-------------------------------------------- ---------------------------------------------------------+-------------- -----+---------+----------------------+ | reference_id | dbxref_id | location | title | authors | crc | +--------------+-----------+-------------------------------------------- ---------------------------------------------------------+-------------- -----+---------+----------------------+ | 152 | NULL | Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA | Direct Submission | NULL | CRC-E8D3CBBD80002FA1 | +--------------+-----------+-------------------------------------------- ---------------------------------------------------------+-------------- -----+---------+----------------------+ 1 row in set (0.01 sec) Could this be because bases 1 to 8203 of the sequence have three references, and the crc is created on the first and then duplicated on the second, thus causing a problem? Cheers Mick -----Original Message----- From: Hilmar Lapp [mailto:hlapp at gmx.net] Sent: 19 May 2009 13:25 To: michael watson (IAH-C) Cc: biosql-l at lists.open-bio.org Subject: Re: [BioSQL-l] load_seqdatabase.pl warnings and errors On May 19, 2009, at 4:17 AM, michael watson (IAH-C) wrote: > [...] 
> -------------------- WARNING --------------------- > > MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values > were > ("AY312586S1","32307407","AY312586","Foot-and-mouth disease virus O > isolate O/SKR/2000 S fragment, complete > > 1,9762) > > Duplicate entry 'AY312586-1-1' for key 2 > > --------------------------------------------------- This suggests that a sequence with the above accession or GI number was already in the database, or occurs in the file twice. If this situation is possible, you will have to pass the --lookup (or --flatlookup) flag to the script, and specify how you want updates to take place when they are necessary (options --noupdate, --remove, and --mergeobjs). > -------------------- WARNING --------------------- > > MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, > values were ("","1") FKs (324,3,4) > > Duplicate entry '324-3-4-1' for key 2 > --------------------------------------------------- I suspect that 324 is the primary key of the sequence record that raised the duplicate entry warning above. Can you check that? If the insert is turned into an update, these warnings should go away too. > [...] > -------------------- WARNING --------------------- > > MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, > values were ("","1") FKs (323,3,4) > > Duplicate entry '323-3-4-1' for key 2 > > --------------------------------------------------- Similar to before, except 323 is probably the primary key for AY312587. > [...] > -------------------- WARNING --------------------- > > MSG: insert in Bio::DB::BioSQL::SeqFeatureAdaptor (driver) failed, > values were ("","1") FKs (325,3,4) > > Duplicate entry '325-3-4-1' for key 2 > > --------------------------------------------------- And if the order of messages is preserved correctly, 325 would be the primary key of AY312589. > [...] 
> -------------------- WARNING ---------------------
>
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed,
> values
> were ("","Direct Submission","Submitted (12-AUG-2004) National Center
> for Biotechnology Information, NIH,
>
> C-E8D3CBBD80002FA1","1","8170","") FKs ()
>
> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
>
> ---------------------------------------------------

This one is odd. Can you check which existing entry you have with reference.crc = 'CRC-E8D3CBBD80002FA1'?

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Wed May 20 10:59:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 20 May 2009 11:59:19 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com>

On Wed, May 20, 2009 at 10:52 AM, michael watson (IAH-C) wrote:
>
> Hi Guys
>
> Ok, the warnings were due to duplicate sequences - I had downloaded a
> stream using Bio::DB::GenBank and I guess I assumed that would mean only
> unique entries were sent back. Using "--flatlookup --remove" gets rid
> of the warnings.

Great - easy :)

> Now for NC_003992.gbk...
>
> To answer Hilmar's question:
> ...
> And when I run load_seqdatabase.pl on NC_003992.gbk alone I still get:
>
> perl load_seqdatabase.pl --host localhost --dbname fmd_biosql --format
> genbank --dbuser removed --dbpass removed --flatlookup --remove
> NC_003992.gbk
>
> Loading NC_003992.gbk ...
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values
> were ("","Direct Submission","Submitted (12-AUG-2004) National Center
> for Biotechnology Information, NIH, Bethesda, MD 20894,
> USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs ()
> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3
> ---------------------------------------------------
> Could not store NC_003992:
> ------------- EXCEPTION -------------
> MSG: create: object (Bio::Annotation::Reference) failed to insert or to
> be found by unique key
> ...

I would guess the problem is that this rather generic reference in NC_003992 is repeated exactly in another genome (causing the CRC collision):

  CONSRTM   NCBI Genome Project
  TITLE     Direct Submission
  JOURNAL   Submitted (12-AUG-2004) National Center for Biotechnology
            Information, NIH, Bethesda, MD 20894, USA

See http://www.ncbi.nlm.nih.gov/nuccore/NC_011452

i.e. Could there be another direct submission by the NCBI on that date in your collection? You could search the database for that CRC and trace it back to a bioentry, or just try grep for "JOURNAL Submitted (12-AUG-2004) National Center for Biotechnology" on your GenBank files. e.g.
Something like this SQL statement might be interesting: SELECT bioentry.accession, reference.title FROM bioentry, bioentry_reference, reference WHERE bioentry.bioentry_id=bioentry_reference.bioentry_id AND bioentry_reference.reference_id=reference.reference_id AND reference.crc="CRC-E8D3CBBD80002FA1"; Peter From michael.watson at bbsrc.ac.uk Wed May 20 11:25:52 2009 From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C)) Date: Wed, 20 May 2009 12:25:52 +0100 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk> <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com> Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk> We have a winner :) NC_003992, NC_011452, NC_011451, NC_011450 all share at least one reference. Would changing --flatlookup to --lookup change the behaviour so it checks for an existing reference before trying to insert the duplicate? The answer is no :( (see below). I guess this may need some coding then! Thanks! Mick perl load_seqdatabase.pl --host localhost --dbname fmd_biosql --format genbank --dbuser removed --dbpass removed --lookup --remove NC_003992.gbk Loading NC_003992.gbk ... 
-------------------- WARNING --------------------- MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, values were ("","Direct Submission","Submitted (12-AUG-2004) National Center for Biotechnology Information, NIH, Bethesda, MD 20894, USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs () Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 --------------------------------------------------- Could not store NC_003992: ------------- EXCEPTION ------------- MSG: create: object (Bio::Annotation::Reference) failed to insert or to be found by unique key STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271 STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271 STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/SeqAdaptor.pm:224 STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create 
/usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:214
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK Bio::DB::Persistent::PersistentObject::store /usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/Persistent/PersistentObject.pm:271
STACK (eval) load_seqdatabase.pl:622
STACK toplevel load_seqdatabase.pl:604
--------------------------------------
at load_seqdatabase.pl line 635

From biopython at maubp.freeserve.co.uk Wed May 20 11:34:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 20 May 2009 12:34:51 +0100
Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors
In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>
References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk> <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com> <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk>
Message-ID: <320fb6e00905200434x3e1c7978ue1c58382f7478354@mail.gmail.com>

On Wed, May 20, 2009 at 12:25 PM, michael watson (IAH-C) wrote:
>
> We have a winner :)
>
> NC_003992, NC_011452, NC_011451, NC_011450 all share
> at least one reference.
>
> Would changing --flatlookup to --lookup change the behaviour
> so it checks for an existing reference before trying to insert the
> duplicate?
>
> The answer is no :( (see below).
>
> I guess this may need some coding then!

My crude idea for a simple ad-hoc solution would be to remove these pointless references from the records, before loading them into BioSQL.

One way would be to edit the four GenBank files by hand (e.g. to remove the reference or make them unique). You might also do this in a BioPerl script that loads the records, edits the references, and then puts them in the database. Personally I use Python not Perl, so I can't tell you how you might do that with BioPerl.
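For the "edit the files first" route, a minimal Python sketch might look like the following. It uses only the standard library (no Biopython or BioPerl calls), the function name drop_direct_submissions is made up for illustration, and it assumes the usual GenBank flat-file layout where top-level keywords such as REFERENCE start in column one and sub-keywords such as TITLE are indented. Note that it drops every "Direct Submission" reference, not just the clashing one, and does not renumber the remaining REFERENCE lines.

```python
import re

# Matches an indented TITLE line whose whole title is "Direct Submission".
TITLE_RE = re.compile(r"^\s+TITLE\s+Direct Submission\s*$")


def drop_direct_submissions(genbank_text):
    """Return genbank_text without REFERENCE blocks titled 'Direct Submission'.

    Hypothetical helper: a plain-text filter, not a GenBank parser.
    """
    kept = []
    block = None  # lines of the REFERENCE block currently being collected
    for line in genbank_text.splitlines(keepends=True):
        top_level = bool(line) and not line[0].isspace()
        if block is not None and top_level:
            # A new top-level keyword closes the current REFERENCE block;
            # keep the block only if it is not a direct submission.
            if not any(TITLE_RE.match(item) for item in block):
                kept.extend(block)
            block = None
        if line.startswith("REFERENCE"):
            block = [line]
        elif block is not None:
            block.append(line)
        else:
            kept.append(line)
    # Flush a REFERENCE block that runs to the end of the text.
    if block is not None and not any(TITLE_RE.match(item) for item in block):
        kept.extend(block)
    return "".join(kept)
```

The cleaned text could then be written to a new file and loaded with load_seqdatabase.pl exactly as before.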
Hilmar may be able to comment from a BioPerl/BioSQL point of view - clearly CRC collisions of this nature will happen again in future. Peter From holland at eaglegenomics.com Wed May 20 11:44:58 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 20 May 2009 12:44:58 +0100 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <320fb6e00905200434x3e1c7978ue1c58382f7478354@mail.gmail.com> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk> <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com> <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk> <320fb6e00905200434x3e1c7978ue1c58382f7478354@mail.gmail.com> Message-ID: <1242819898.18348.1.camel@buzzybee> Theoretically, although unlikely, it is statistically entirely possible for two completely different references to share the same CRC. Hence the CRC shouldn't really be used as an indicator of uniqueness, although it is still useful as a hashing function for indexing and quick lookup. cheers, Richard On Wed, 2009-05-20 at 12:34 +0100, Peter wrote: > On Wed, May 20, 2009 at 12:25 PM, michael watson (IAH-C) > wrote: > > > > We have a winner :) > > > > NC_003992, NC_011452, NC_011451, NC_011450 all share > > at least one reference. > > > > Would changing --flatlookup to --lookup change the behaviour > > so it checks for an existing reference before trying to insert the > > duplicate? > > > > The answer is no :( (see below). > > > > I guess this may need some coding then! > > My crude idea for a simple ad-hoc solution would be to remove these > pointless references from the records, before loading them into > BioSQL. > > One way would be to edit the four GenBank files by hand (e.g. to > remove the reference or make them unique). 
You might also do this in a > BioPerl script that loads the records, edits the references, and then > puts them in the database. Personally I use Python not Perl, so I > can't tell you how you might do that with BioPerl. > > Hilmar may be able to comment from a BioPerl/BioSQL point of view - > clearly CRC collisions of this nature will happen again in future. > > Peter > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- Richard Holland, BSc MBCS Finance Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From hlapp at gmx.net Wed May 20 15:10:20 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 20 May 2009 11:10:20 -0400 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk> <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com> <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk> Message-ID: <0212C167-7618-4761-A191-C6CE4B41EC2A@gmx.net> Indeed changing the lookup will have no effect since deletion of bioentries doesn't cascade to references (only to bioentry-to- reference associations). What I don't understand yet is how you get the CRC clash. Normally this kind of situation can happen if the first occurrence does not and the second does have PMID, by which it will be looked up, lookup fails (b/c the first occurrence didn't come with PMID), resulting in an insert of the erroneously deemed "new" reference, which then fails with a CRC clash. 
However, there is no PMID nor any other identifier here, so I'll have to look into the code to find out why the second occurrence is either not looked up before an insert is attempted, or if it is looked up, why the lookup fails to find the record stored earlier. -hilmar On May 20, 2009, at 7:25 AM, michael watson (IAH-C) wrote: > We have a winner :) > > NC_003992, NC_011452, NC_011451, NC_011450 all share at least one > reference. > > Would changing --flatlookup to --lookup change the behaviour so it > checks for an existing reference before trying to insert the > duplicate? > > The answer is no :( (see below). > > I guess this may need some coding then! > > Thanks! > Mick > > perl load_seqdatabase.pl --host localhost --dbname fmd_biosql -- > format genbank --dbuser removed --dbpass removed --lookup --remove > NC_003992.gbk > Loading NC_003992.gbk ... > > -------------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, > values were ("","Direct Submission","Submitted (12-AUG-2004) > National Center for Biotechnology Information, NIH, Bethesda, MD > 20894, USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs () > Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 > --------------------------------------------------- > Could not store NC_003992: > ------------- EXCEPTION ------------- > MSG: create: object (Bio::Annotation::Reference) failed to insert or > to be found by unique key > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:206 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:251 > STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > Persistent/PersistentObject.pm:271 > STACK 
Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children / > usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/ > Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:214 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:251 > STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > Persistent/PersistentObject.pm:271 > STACK Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/SeqAdaptor.pm:224 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:214 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:251 > STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > Persistent/PersistentObject.pm:271 > STACK (eval) load_seqdatabase.pl:622 > STACK toplevel load_seqdatabase.pl:604 > > -------------------------------------- > > at load_seqdatabase.pl line 635 > > -----Original Message----- > From: p.j.a.cock at googlemail.com [mailto:p.j.a.cock at googlemail.com] > On Behalf Of Peter > Sent: 20 May 2009 11:59 > To: michael watson (IAH-C) > Cc: Hilmar Lapp; biosql-l at lists.open-bio.org > Subject: Re: [BioSQL-l] load_seqdatabase.pl warnings and errors > > On Wed, May 20, 2009 at 10:52 AM, michael watson (IAH-C) > wrote: >> >> Hi Guys >> >> Ok, the warnings were due to duplicate sequences - I had downloaded a >> stream using 
Bio::DB::GenBank and I guess I assumed that would mean >> only >> unique entries were sent back. Using "--flatlookup --remove" gets >> rid >> of the warnings. > > Great - easy :) > >> Now for NC_003992.gbk... >> >> To answer Hilmar's question: >> ... >> And when I run load_seqdatabase.pl on NC_003992.gbk alone I still >> get: >> >> perl load_seqdatabase.pl --host localhost --dbname fmd_biosql -- >> format >> genbank --dbuser removed --dbpass removed --flatlookup --remove >> NC_003992.gbk >> >> Loading NC_003992.gbk ... >> >> -------------------- WARNING --------------------- >> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, >> values >> were ("","Direct Submission","Submitted (12-AUG-2004) National Center >> for Biotechnology Information, NIH, Bethesda, MD 20894, >> USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs () >> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 >> --------------------------------------------------- >> Could not store NC_003992: >> ------------- EXCEPTION ------------- >> MSG: create: object (Bio::Annotation::Reference) failed to insert >> or to >> be found by unique key >> ... > > I would guess that the problem is this rather generic reference in > NC_003992 may be repeated exactly in another genome (causing the CRC > collision): > > CONSRTM NCBI Genome Project > TITLE Direct Submission > JOURNAL Submitted (12-AUG-2004) National Center for Biotechnology > Information, NIH, Bethesda, MD 20894, USA > > See http://www.ncbi.nlm.nih.gov/nuccore/NC_011452 > > i.e. Could there be another direct submission by the NCBI on that date > in your collection? You could search the database looking for that > CRC and trace it back to a bioentry, or just try grep for "JOURNAL > Submitted (12-AUG-2004) National Center for Biotechnology" on your > GenBank files. e.g. 
Something like this SQL statement might be
> interesting:
>
> SELECT bioentry.accession, reference.title FROM bioentry,
> bioentry_reference, reference WHERE
> bioentry.bioentry_id=bioentry_reference.bioentry_id AND
> bioentry_reference.reference_id=reference.reference_id AND
> reference.crc="CRC-E8D3CBBD80002FA1";
>
> Peter

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Fri May 22 12:27:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 13:27:06 +0100
Subject: [BioSQL-l] RULES in BioSQL PostgreSQL schema
Message-ID: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>

Hi all,

This is a continuation of a thread / bug report from Biopython (Bug 2833) where attempting to import duplicate entries into BioSQL did not raise an error on PostgreSQL (but does on MySQL). Cymon traced this to the RULES present in the schema to help bioperl-db.

On Fri, May 22, 2009 at 3:05 AM, Hilmar Lapp wrote:
>
> On May 21, 2009, at 6:52 PM, Cymon Cox wrote:
>
>> [...]
>>
>> Hi Andrea,
>>
>> The problem appears to be related to the BioSQL schema/PostgreSQL.
>>
>> As you indicated, adding a duplicate entry to bioentry returns an "INSERT 0
>> 0" and doesn't throw an IntegrityError which is what the code is looking
>> for and presumably what MySQL throws.
>>
>> The reason it doesn't throw an error is because of one (or both) of the
>> RULES in the schema:
>
> Indeed, I'd almost forgotten. The rules are there mostly as a remnant from
> earlier versions of PostgreSQL to support transactional loading the way
> bioperl-db (the object-relational mapping for BioPerl) is optimized. You
> probably don't need them anywhere else.
>
>        -hilmar
>
>
> Bioperl-db is optimized such that entities that very likely don't exist yet
> in the database are attempted for insert right away. If the insert fails due
> to a unique key violation, the record is looked up (and then expected to be
> found). In Oracle and MySQL you can do this and the transaction remains
> healthy; i.e., you can commit the transaction later and all statements
> except those that failed will be committed. In PostgreSQL any failed
> statement dooms the entire transaction, and the only way out is a rollback.
> In this case, if you want the loading of one sequence record as one
> transaction, failing to insert a single feature record will doom the entire
> sequence load and you would need to start over with the sequence. To fix
> this, I wrote the rules, which in essence do do the lookups for PostgreSQL
> that the bioperl-db code would otherwise avoid, and on insert do nothing if
> the record is found, which results in zero rows affected when you would
> expect one (which is what bioperl-db cues off of and then triggers a
> lookup).
> The right way to do this meanwhile is to use nested transactions, which
> PostgreSQL supports since v8.0.x, but I haven't gotten around to implement
> support for that in Bioperl-db.
>

Hilmar,

It seems that for Biopython to work properly with BioSQL on PostgreSQL, these bioentry rules should be removed from the schema (as the comments in the schema do suggest). Obviously doing this would break any installation also using the current version of bioperl-db. Do the RULES affect BioJava or BioRuby using BioSQL on PostgreSQL?

Are you happy to remove these RULES in BioSQL v1.0.x (after making the outlined transactional changes in bioperl-db)?
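The insert-first, look-up-on-failure pattern wrapped in a savepoint (the nested transaction Hilmar mentions) can be sketched roughly as below. This is not bioperl-db or Biopython code. It uses Python's sqlite3 module purely as a stand-in so that it runs anywhere, a made-up simplified bioentry table with a single unique column, and a hypothetical helper called store_bioentry. SQLite does not doom a transaction after a failed statement the way PostgreSQL does, so here the savepoint only illustrates the shape of the fix.

```python
import sqlite3


def store_bioentry(conn, accession):
    """Insert a bioentry, or look it up if the accession already exists.

    Hypothetical helper. Returns (bioentry_id, created).
    """
    cur = conn.cursor()
    cur.execute("SAVEPOINT try_insert")
    try:
        cur.execute("INSERT INTO bioentry (accession) VALUES (?)", (accession,))
        new_id = cur.lastrowid
    except sqlite3.IntegrityError:
        # Undo only the failed insert; the enclosing transaction survives.
        cur.execute("ROLLBACK TO SAVEPOINT try_insert")
        cur.execute("RELEASE SAVEPOINT try_insert")
        cur.execute(
            "SELECT bioentry_id FROM bioentry WHERE accession = ?", (accession,)
        )
        return cur.fetchone()[0], False
    cur.execute("RELEASE SAVEPOINT try_insert")
    return new_id, True


# isolation_level=None means transactions are managed by hand below.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    "CREATE TABLE bioentry ("
    "bioentry_id INTEGER PRIMARY KEY, accession TEXT UNIQUE)"
)
conn.execute("BEGIN")
first = store_bioentry(conn, "NC_003992")   # inserted
second = store_bioentry(conn, "NC_003992")  # duplicate, found by lookup
conn.execute("COMMIT")
```

With this shape a duplicate still raises a real integrity error inside the database, which is what the Biopython loader looks for, while the outer transaction stays usable.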
Thanks,

Peter

From hlapp at gmx.net Fri May 22 15:03:11 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 22 May 2009 11:03:11 -0400
Subject: [BioSQL-l] RULES in BioSQL PostgreSQL schema
In-Reply-To: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
Message-ID:

On May 22, 2009, at 8:27 AM, Peter wrote:

> Are you happy to remove these RULES in BioSQL v1.0.x (after
> making the outlined transactional changes in bioperl-db)?

In principle yes. It would also mean dropping support for PostgreSQL v7.x, but I would hope that that's a non-issue.

But if anyone here is still using and relying on PostgreSQL v7.x (or earlier?) do let us know, please.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Fri May 22 15:57:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 22 May 2009 16:57:38 +0100
Subject: [BioSQL-l] [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To:
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com>
Message-ID: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>

On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp wrote:
>
> On May 22, 2009, at 8:27 AM, Peter wrote:
>
>> Are you happy to remove these RULES in BioSQL v1.0.x (after
>> making the outlined transactional changes in bioperl-db)?
>
> In principle yes. It would also mean dropping support for PostgreSQL v7.x,
> but I would hope that that's a non-issue.
>
> But if anyone here is still using and relying on PostgreSQL v7.x (or
> earlier?) do let us know, please.

Great.

In the meantime could you add a big warning about this issue to the INSTALL notes for PostgreSQL (i.e. recommend removing the RULES section if not using bioperl-db)?

http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL

Peter

From hlapp at gmx.net Fri May 22 18:20:58 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 22 May 2009 14:20:58 -0400
Subject: [BioSQL-l] [Biopython-dev] RULES in BioSQL PostgreSQL schema
In-Reply-To: <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>
References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com>
Message-ID: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net>

Yes, agree. Would you mind filing this in Bugzilla for BioSQL?

-hilmar

On May 22, 2009, at 11:57 AM, Peter wrote:

> On Fri, May 22, 2009 at 4:03 PM, Hilmar Lapp wrote:
>>
>> On May 22, 2009, at 8:27 AM, Peter wrote:
>>
>>> Are you happy to remove these RULES in BioSQL v1.0.x (after
>>> making the outlined transactional changes in bioperl-db)?
>>
>> In principle yes. It would also mean dropping support for
>> PostgreSQL v7.x,
>> but I would hope that that's a non-issue.
>>
>> But if anyone here is still using and relying on PostgreSQL v7.x (or
>> earlier?) do let us know, please.
>
> Great.
>
> In the meantime could you add a big warning about this issue to the
> INSTALL notes for PostgreSQL (i.e. recommend removing the RULES
> section if not using bioperl-db)?
> http://code.open-bio.org/svnweb/index.cgi/biosql/view/biosql-schema/trunk/INSTALL > > Peter -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Fri May 22 22:46:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 22 May 2009 23:46:54 +0100 Subject: [BioSQL-l] [Biopython-dev] RULES in BioSQL PostgreSQL schema In-Reply-To: <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net> References: <320fb6e00905220527t529a4676s7dc42b571acee6b2@mail.gmail.com> <320fb6e00905220857u2eca146fq465a97969fe8fc87@mail.gmail.com> <410BDC8E-1305-4FE5-860D-694E6A3EA9E6@gmx.net> Message-ID: <320fb6e00905221546i26edc7a2u2a02fb0d01c374ea@mail.gmail.com> On 5/22/09, Hilmar Lapp wrote: > Yes, agree. Would you mind filing this in Bugzilla for BioSQL? -hilmar I've filed Bug 2839, hopefully this is what you had in mind: http://bugzilla.open-bio.org/show_bug.cgi?id=2839 Peter From michael.watson at bbsrc.ac.uk Wed May 27 12:50:45 2009 From: michael.watson at bbsrc.ac.uk (michael watson (IAH-C)) Date: Wed, 27 May 2009 13:50:45 +0100 Subject: [BioSQL-l] load_seqdatabase.pl warnings and errors In-Reply-To: <0212C167-7618-4761-A191-C6CE4B41EC2A@gmx.net> References: <8975119BCD0AC5419D61A9CF1A923E9507E279FA@iahce2ksrv1.iah.bbsrc.ac.uk> <68905D8B-1C33-416A-B9AD-A2318ACA4589@gmx.net> <8975119BCD0AC5419D61A9CF1A923E9507E27A0A@iahce2ksrv1.iah.bbsrc.ac.uk> <320fb6e00905200359s55d642d5y388cf1f625941f52@mail.gmail.com> <8975119BCD0AC5419D61A9CF1A923E9507E27A0F@iahce2ksrv1.iah.bbsrc.ac.uk> <0212C167-7618-4761-A191-C6CE4B41EC2A@gmx.net> Message-ID: <8975119BCD0AC5419D61A9CF1A923E9507E27A82@iahce2ksrv1.iah.bbsrc.ac.uk> Hi Hilmar I tried to dig around in the code, but quite frankly I quickly got lost. 
What is clear is that the existing reference is being found in neither the cache nor the database, and therefore a unique key violation occurs when the code tries to insert the object. I'm pretty stuffed on this project until I can get this sorted out. If someone tells me where to look I can try and sort out why this happens, but at the moment (for me) it's like looking for a needle in a haystack. Thanks in advance Mick -----Original Message----- From: Hilmar Lapp [mailto:hlapp at gmx.net] Sent: 20 May 2009 16:10 To: michael watson (IAH-C) Cc: Peter; biosql-l at lists.open-bio.org Subject: Re: [BioSQL-l] load_seqdatabase.pl warnings and errors Indeed changing the lookup will have no effect since deletion of bioentries doesn't cascade to references (only to bioentry-to-reference associations). What I don't understand yet is how you get the CRC clash. Normally this kind of situation can happen if the first occurrence does not have a PMID and the second does: the second is looked up by PMID, the lookup fails (b/c the first occurrence didn't come with a PMID), and the reference, erroneously deemed "new", is inserted, which then fails with a CRC clash. However, there is no PMID nor any other identifier here, so I'll have to look into the code to find out why the second occurrence is either not looked up before an insert is attempted, or if it is looked up, why the lookup fails to find the record stored earlier. -hilmar On May 20, 2009, at 7:25 AM, michael watson (IAH-C) wrote: > We have a winner :) > > NC_003992, NC_011452, NC_011451, NC_011450 all share at least one > reference. > > Would changing --flatlookup to --lookup change the behaviour so it > checks for an existing reference before trying to insert the > duplicate? > > The answer is no :( (see below). > > I guess this may need some coding then! > > Thanks! 
> Mick > > perl load_seqdatabase.pl --host localhost --dbname fmd_biosql -- > format genbank --dbuser removed --dbpass removed --lookup --remove > NC_003992.gbk > Loading NC_003992.gbk ... > > -------------------- WARNING --------------------- > MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, > values were ("","Direct Submission","Submitted (12-AUG-2004) > National Center for Biotechnology Information, NIH, Bethesda, MD > 20894, USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs () > Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 > --------------------------------------------------- > Could not store NC_003992: > ------------- EXCEPTION ------------- > MSG: create: object (Bio::Annotation::Reference) failed to insert or > to be found by unique key > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:206 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:251 > STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > Persistent/PersistentObject.pm:271 > STACK Bio::DB::BioSQL::AnnotationCollectionAdaptor::store_children / > usr/users/bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/ > Bio/DB/BioSQL/AnnotationCollectionAdaptor.pm:217 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:214 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:251 > STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > Persistent/PersistentObject.pm:271 > STACK 
Bio::DB::BioSQL::SeqAdaptor::store_children /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/SeqAdaptor.pm:224 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:214 > STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > BioSQL/BasePersistenceAdaptor.pm:251 > STACK Bio::DB::Persistent::PersistentObject::store /usr/users/ > bioinformatics/data/foot-and-mouth/bioperl-db-1.5.2_100/Bio/DB/ > Persistent/PersistentObject.pm:271 > STACK (eval) load_seqdatabase.pl:622 > STACK toplevel load_seqdatabase.pl:604 > > -------------------------------------- > > at load_seqdatabase.pl line 635 > > -----Original Message----- > From: p.j.a.cock at googlemail.com [mailto:p.j.a.cock at googlemail.com] > On Behalf Of Peter > Sent: 20 May 2009 11:59 > To: michael watson (IAH-C) > Cc: Hilmar Lapp; biosql-l at lists.open-bio.org > Subject: Re: [BioSQL-l] load_seqdatabase.pl warnings and errors > > On Wed, May 20, 2009 at 10:52 AM, michael watson (IAH-C) > wrote: >> >> Hi Guys >> >> Ok, the warnings were due to duplicate sequences - I had downloaded a >> stream using Bio::DB::GenBank and I guess I assumed that would mean >> only >> unique entries were sent back. Using "--flatlookup --remove" gets >> rid >> of the warnings. > > Great - easy :) > >> Now for NC_003992.gbk... >> >> To answer Hilmar's question: >> ... >> And when I run load_seqdatabase.pl on NC_003992.gbk alone I still >> get: >> >> perl load_seqdatabase.pl --host localhost --dbname fmd_biosql -- >> format >> genbank --dbuser removed --dbpass removed --flatlookup --remove >> NC_003992.gbk >> >> Loading NC_003992.gbk ... 
>> >> -------------------- WARNING --------------------- >> MSG: insert in Bio::DB::BioSQL::ReferenceAdaptor (driver) failed, >> values >> were ("","Direct Submission","Submitted (12-AUG-2004) National Center >> for Biotechnology Information, NIH, Bethesda, MD 20894, >> USA","CRC-E8D3CBBD80002FA1","1","8203","") FKs () >> Duplicate entry 'CRC-E8D3CBBD80002FA1' for key 3 >> --------------------------------------------------- >> Could not store NC_003992: >> ------------- EXCEPTION ------------- >> MSG: create: object (Bio::Annotation::Reference) failed to insert >> or to >> be found by unique key >> ... > > I would guess that the problem is this rather generic reference in > NC_003992 may be repeated exactly in another genome (causing the CRC > collision): > > CONSRTM NCBI Genome Project > TITLE Direct Submission > JOURNAL Submitted (12-AUG-2004) National Center for Biotechnology > Information, NIH, Bethesda, MD 20894, USA > > See http://www.ncbi.nlm.nih.gov/nuccore/NC_011452 > > i.e. Could there be another direct submission by the NCBI on that date > in your collection? You could search the database looking for that > CRC and trace it back to a bioentry, or just try grep for "JOURNAL > Submitted (12-AUG-2004) National Center for Biotechnology" on your > GenBank files. e.g. Something like this SQL statement might be > interesting: > > SELECT bioentry.accession, reference.title FROM bioentry, > bioentry_reference, reference WHERE > bioentry.bioentry_id=bioentry_reference.bioentry_id AND > bioentry_reference.reference_id=reference.reference_id AND > reference.crc="CRC-E8D3CBBD80002FA1"; > > Peter -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : ===========================================================
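The behaviour discussed in this thread (removing a bioentry leaves its reference rows behind, so a later blind insert of the same reference hits the unique CRC key) can be reproduced with a small sketch. This is not the bioperl-db code: it assumes a heavily simplified subset of the BioSQL tables, uses SQLite only to stay self-contained, and the helper names (make_db, get_or_create_reference, load_entry, remove_entry, accessions_sharing_crc) are hypothetical. The CRC is taken as a given string rather than computed.

```python
import sqlite3

# Simplified subset of the BioSQL tables (assumption: the real schema has
# more columns and constraints than shown here).
SCHEMA = """
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL
);
CREATE TABLE reference (
    reference_id INTEGER PRIMARY KEY,
    title        TEXT,
    location     TEXT NOT NULL,
    crc          TEXT UNIQUE
);
CREATE TABLE bioentry_reference (
    bioentry_id  INTEGER NOT NULL
        REFERENCES bioentry(bioentry_id) ON DELETE CASCADE,
    reference_id INTEGER NOT NULL
        REFERENCES reference(reference_id),
    PRIMARY KEY (bioentry_id, reference_id)
);
"""


def make_db():
    """Create an in-memory database with the simplified tables."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # needed for ON DELETE CASCADE
    conn.executescript(SCHEMA)
    return conn


def get_or_create_reference(conn, title, location, crc):
    """Hypothetical helper: look the reference up by CRC before inserting."""
    row = conn.execute(
        "SELECT reference_id FROM reference WHERE crc = ?", (crc,)
    ).fetchone()
    if row is not None:
        return row[0]  # reuse the row left behind by an earlier load
    cur = conn.execute(
        "INSERT INTO reference (title, location, crc) VALUES (?, ?, ?)",
        (title, location, crc),
    )
    return cur.lastrowid


def load_entry(conn, accession, title, location, crc):
    """Store a bioentry and link it to its (possibly shared) reference."""
    cur = conn.execute(
        "INSERT INTO bioentry (accession) VALUES (?)", (accession,)
    )
    bioentry_id = cur.lastrowid
    reference_id = get_or_create_reference(conn, title, location, crc)
    conn.execute(
        "INSERT INTO bioentry_reference (bioentry_id, reference_id) "
        "VALUES (?, ?)",
        (bioentry_id, reference_id),
    )
    return bioentry_id


def remove_entry(conn, accession):
    """Delete a bioentry; only the association rows cascade."""
    conn.execute("DELETE FROM bioentry WHERE accession = ?", (accession,))


def accessions_sharing_crc(conn, crc):
    """Trace a reference CRC back to the bioentries that use it."""
    rows = conn.execute(
        "SELECT bioentry.accession FROM bioentry "
        "JOIN bioentry_reference "
        "  ON bioentry.bioentry_id = bioentry_reference.bioentry_id "
        "JOIN reference "
        "  ON bioentry_reference.reference_id = reference.reference_id "
        "WHERE reference.crc = ? ORDER BY bioentry.accession",
        (crc,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = make_db()
    crc = "CRC-E8D3CBBD80002FA1"
    for acc in ("NC_011452", "NC_003992"):
        load_entry(conn, acc, "Direct Submission", "Submitted (12-AUG-2004)", crc)
    remove_entry(conn, "NC_003992")  # the reference row survives this
    load_entry(conn, "NC_003992", "Direct Submission", "Submitted (12-AUG-2004)", crc)
    print(accessions_sharing_crc(conn, crc))
```

In this sketch the reload succeeds only because the lookup by CRC happens before the insert; an unconditional INSERT into reference with the same CRC would raise sqlite3.IntegrityError, which mirrors the "Duplicate entry" warning quoted above. Whether a lookup by CRC is the right fix for bioperl-db itself is left open in the thread.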