From cjfields at uiuc.edu Sat Mar 1 15:42:05 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 1 Mar 2008 14:42:05 -0600
Subject: [BioSQL-l] BioSQL bug in bugzilla
Message-ID:

Hilmar,

Just wanted to point out a bug which I thought was bioperl-db-related
but is really BioSQL. Could you take a look to see what you think?

http://bugzilla.open-bio.org/show_bug.cgi?id=2389

chris

From hlapp at gmx.net Sat Mar 1 19:06:55 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 1 Mar 2008 19:06:55 -0500
Subject: [BioSQL-l] biosql usage/user survey
In-Reply-To: <9692f0e9a791c7d0bf942e497668fdce@gmx.net>
References: <9692f0e9a791c7d0bf942e497668fdce@gmx.net>
Message-ID:

I sent this survey request back in 2005 and received a number of
direct responses. I am assuming that since I said I was going to use
them for the paper everyone was assuming that their BioSQL usage would
be made public. I am going to assemble the responses into a Wiki page
as Malcolm suggested; if you responded to me and do not want to appear
on that page, please let me know.

    -hilmar

On Nov 3, 2005, at 11:53 AM, Hilmar Lapp wrote:

> Hi all,
>
> I am writing up a paper on BioSQL and would like to include some
> current usage figures to support its utility.
>
> Therefore, if you are using BioSQL I'd be glad if you could drop me
> an email; if you can include a word or two (not more than 1
> sentence) on what you use it for that'd be great too.
>
> Thanks in advance,
>
> -hilmar
> --
> -------------------------------------------------------------
> Hilmar Lapp email: lapp at gnf.org
> GNF, San Diego, Ca. 92121 phone: +1-858-812-1757
> -------------------------------------------------------------
>

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Sat Mar 1 20:16:24 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 1 Mar 2008 19:16:24 -0600
Subject: [BioSQL-l] multiple species for a sequence
Message-ID: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu>

I'm looking at a bioperl bug I filed a while back that deals with
multiple species in a sequence file, such as found for AJ428955:

ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
XX
AC   AJ428955;
XX
DT   09-JUL-2002 (Rel. 72, Created)
DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
XX
DE   Hepatitis GB virus B subgenomic replicon neoRepB
XX
KW   core-neo fusion protein; core-neo gene; polyprotein.
XX
OS   Hepatitis GB virus B
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
...

We could probably add support in bioperl fairly easily (Bio::Seq could
just return an array or the first species object based on context),
but would BioSQL support sequences like this?

chris

From hlapp at gmx.net Sun Mar 2 12:33:23 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 12:33:23 -0500
Subject: [BioSQL-l] multiple species for a sequence
In-Reply-To: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu>
References: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu>
Message-ID: <86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net>

On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:

> I'm looking at a bioperl bug I filed a while back that deals with
> multiple species in a sequence file, such as found for AJ428955:
>
> ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
> XX
> AC   AJ428955;
> XX
> DT   09-JUL-2002 (Rel. 72, Created)
> DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
> XX
> DE   Hepatitis GB virus B subgenomic replicon neoRepB
> XX
> KW   core-neo fusion protein; core-neo gene; polyprotein.
> XX
> OS   Hepatitis GB virus B
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
> XX
> OS   Encephalomyocarditis virus
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
> OC   Cardiovirus.
>
> ...
>
> We could probably add support in bioperl fairly easily (Bio::Seq
> could just return an array or the first species object based on
> context), but would BioSQL support sequences like this?

No it wouldn't. There may only be one species (taxon) per sequence.

There has been a lot of discussion about this in the past mostly
driven by the former SwissProt peculiarity of collapsing sequences by
sequence identity into a single record. We held out and eventually
UniProt dropped this practice.

I guess we never quite decided what to do about chimeric sequences
like the above. Note that the GenBank record gives this differently:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885

Here, there's one taxon (ORGANISM line) reference, but two localized
'source' features in the feature table. (I'm actually not 100% sure
what the genbank parser would do with this - i.e., whether the second
source feature will override the taxon_id found in the first.) Because
seqfeatures (in BioSQL) don't have a link to taxon, you wouldn't be
able to hit the sequence by its second (chimeric) taxon if that were
your query criteria (though you could store it fine, and if you
queried by dbxrefs of features of type 'source', you would find it).

At the end of the day, BioSQL will evolve (hopefully) quickly to
support what the Bio* toolkits support, and will be much slower to
change in ways that Bio* wouldn't be able to take advantage of anyway.
At least that's my current vision of it, and of course is up for
debate as to whether that's a useful vision as much as anything else.

So, as you say, right now BioPerl, and AFAIAA any of the other Bio*
toolkits, doesn't support more than one species per sequence, but as
soon as that changes, there's a clear need for BioSQL to follow along.

Does that make sense?

    -hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Sun Mar 2 12:39:17 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 12:39:17 -0500
Subject: [BioSQL-l] BioSQL bug in bugzilla
In-Reply-To:
References:
Message-ID:

I don't think it's a good idea to just replace all varchar() types
with type text.

First of all, having reasonable constraints is a Good Thing(tm) in my
book as the majority of times I found them violated it revealed a
parsing error, rather than the constraints not fitting the data.
Second, this won't solve the problem for the other RDBMS versions for
which there is a real performance penalty and other implications when
having unreasonably large column widths.

That said, if the constraint is indeed not compatible with current
data (such as UniProt) we have a problem that needs to be fixed. So,
what I would like to find out is

1) is this in reality a parsing error, or is there indeed a value for
a column that in BioSQL is constrained to 40 chars, and

2) if so, which column in which table is the problem.

Erik - would you mind sending me the full error stack if you still
have it? Usually load_seqdatabase.pl will also print an extra warning
message saying what it couldn't store. That message would be great
too. If you don't have either anymore, do you remember vaguely what
those messages said? Alternatively, do you have the offending UniProt
entry (or its accession)?

I suspect that it's actually the constraint on dbxref.accession.
Does that ring a bell?

    -hilmar

On Mar 1, 2008, at 3:42 PM, Chris Fields wrote:

> Hilmar,
>
> Just wanted to point out a bug which I thought was bioperl-db-related
> but is really BioSQL. Could you take a look to see what you think?
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2389
>
> chris
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Sun Mar 2 13:00:50 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 2 Mar 2008 12:00:50 -0600
Subject: [BioSQL-l] multiple species for a sequence
In-Reply-To: <86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net>
References: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu> <86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net>
Message-ID:

On Mar 2, 2008, at 11:33 AM, Hilmar Lapp wrote:

> On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:
>
>> I'm looking at a bioperl bug I filed a while back that deals with
>> multiple species in a sequence file, such as found for AJ428955:
>>
>> ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
>> XX
>> AC   AJ428955;
>> XX
>> DT   09-JUL-2002 (Rel. 72, Created)
>> DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
>> XX
>> DE   Hepatitis GB virus B subgenomic replicon neoRepB
>> XX
>> KW   core-neo fusion protein; core-neo gene; polyprotein.
>> XX
>> OS   Hepatitis GB virus B
>> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
>> XX
>> OS   Encephalomyocarditis virus
>> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
>> OC   Cardiovirus.
>>
>> ...
>>
>> We could probably add support in bioperl fairly easily (Bio::Seq
>> could just return an array or the first species object based on
>> context), but would BioSQL support sequences like this?
>
> No it wouldn't. There may only be one species (taxon) per sequence.
>
> There has been a lot of discussion about this in the past mostly
> driven by the former SwissProt peculiarity of collapsing sequences
> by sequence identity into a single record. We held out and
> eventually UniProt dropped this practice.

I'm unsure how often these pop up. The behavior of both EMBL and
GenBank parsers assumes one species (as does Bio::Seq); the embl
parser picks up both and just replaces the first with the second:

...
DE   Hepatitis GB virus B subgenomic replicon neoRepB
XX
KW   core-neo fusion protein; core-neo gene; polyprotein.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
XX
RN   [1]
...

> I guess we never quite decided what to do about chimeric sequences
> like the above. Note that the GenBank record gives this differently:
>
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885
>
> Here, there's one taxon (ORGANISM line) reference, but two localized
> 'source' features in the feature table. (I'm actually not 100% sure
> what the genbank parser would do with this - i.e., whether the
> second source feature will override the taxon_id found in the
> first.) Because seqfeatures (in BioSQL) don't have a link to taxon,
> you wouldn't be able to hit the sequence by its second (chimeric)
> taxon if that were your query criteria (though you could store it
> fine, and if you queried by dbxrefs of features of type 'source',
> you would find it).

The genbank parser gets the taxon and tax ID correct; I would think
when it hit the next source feature key it would assign the wrong tax
ID to the species object but maybe there's a secondary check. Both
output the source in feature tables just fine.

> At the end of the day, BioSQL will evolve (hopefully) quickly to
> support what the Bio* toolkits support, and will be much slower to
> change in ways that Bio* wouldn't be able to take advantage of
> anyway. At least that's my current vision of it, and of course is up
> for debate as to whether that's a useful vision as much as anything
> else.
>
> So, as you say, right now BioPerl, and AFAIAA any of the other Bio*
> toolkits, doesn't support more than one species per sequence, but as
> soon as that changes, there's a clear need for BioSQL to follow along.
>
> Does that make sense?
>
> -hilmar

Yes. I think we could add in support for multiple species fairly
easily but I'll probably hold off on anything until after a 1.6
release (i.e. push it to the next developer series, which gives us
more time to think on how to implement this in a BioSQL-friendly way).

chris

From er at xs4all.nl Sun Mar 2 13:34:10 2008
From: er at xs4all.nl (Erik Rijkers)
Date: Sun, 2 Mar 2008 19:34:10 +0100 (CET)
Subject: [BioSQL-l] BioSQL bug in bugzilla
Message-ID: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl>

Hi Hilmar,

Sorry, it's too long ago. I can run it again (with new versions)
sometime next week.

I don't remember which of the two problems (parser or data size) it
was in my case. If it is true what you say (that most errors are due
to the parser), it might indeed be better to leave those constraints
in until such time that the parser has become more trustworthy, and
use the database as a test instrument...

What is really needed of course is a place to run these loading
scripts continually against any appearing new versions of parsable
text, and against the different database backends.

Does that already happen somewhere?

Should we consider such a bioperl buildfarm / loadfarm?

(I might be able to help with any postgres loading tests.)

Thanks,

Erik Rijkers

On Sun, March 2, 2008 18:39, Hilmar Lapp wrote:
> I don't think it's a good idea to just replace all varchar() types
> with type text.
>
> First of all, having reasonable constraints is a Good Thing(tm) in my
> book as the majority of times I found them violated it revealed a
> parsing error, rather than the constraints not fitting the data.
> Second, this won't solve the problem for the other RDBMS versions for
> which there is a real performance penalty and other implications when
> having unreasonably large column widths.
>
> That said, if the constraint is indeed not compatible with current
> data (such as UniProt) we have a problem that needs to be fixed. So,
> what I would like to find out is
>
> 1) is this in reality a parsing error, or is there indeed a value for
> a column that in BioSQL is constrained to 40 chars, and
>
> 2) if so, which column in which table is the problem.
>
> Erik - would you mind sending me the full error stack if you still
> have it? Usually load_seqdatabase.pl will also print an extra warning
> message saying what it couldn't store. That message would be great
> too. If you don't have either anymore, do you remember vaguely what
> those messages said? Alternatively, do you have the offending
> UniProt entry (or its accession)?
>
> I suspect that it's actually the constraint on dbxref.accession. Does
> that ring a bell?
>
> -hilmar
>
>
> On Mar 1, 2008, at 3:42 PM, Chris Fields wrote:
>
>> Hilmar,
>>
>> Just wanted to point out a bug which I thought was bioperl-db-related
>> but is really BioSQL. Could you take a look to see what you think?
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2389
>>
>> chris
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
>

From hlapp at gmx.net Sun Mar 2 14:20:21 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 14:20:21 -0500
Subject: [BioSQL-l] database loading test server (was: BioSQL bug in bugzilla)
In-Reply-To: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl>
References: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl>
Message-ID: <52F7E02B-7192-4F3C-9B48-887583B25CCE@gmx.net>

Hi Erik,

On Mar 2, 2008, at 1:34 PM, Erik Rijkers wrote:

> What is really needed of course is a place to run these loading
> scripts continually against any appearing new versions of parsable
> text, and against the different database backends.

very true indeed.

>
> Does that already happen somewhere?
>
> Should we consider such a bioperl buildfarm / loadfarm?
>
> (I might be able to help with any postgres loading tests.)

Coincidentally we have been batting around the idea to have an OBF
machine dedicated to serve for testing and proof-of-concept
demonstrations of OBF projects. Indeed one of the services we had
thought about setting up is a BioSQL database, and it's reassuring to
hear independently that that would be useful.

    -hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From er at xs4all.nl Sun Mar 2 15:01:46 2008
From: er at xs4all.nl (Erik Rijkers)
Date: Sun, 2 Mar 2008 21:01:46 +0100 (CET)
Subject: [BioSQL-l] database loading test server (was: BioSQL bug in bugzilla)
Message-ID: <9081.156.83.0.185.1204488106.squirrel@webmail.xs4all.nl>

Maybe we can use some ideas from the way the PostgreSQL project has
set up a distributed buildfarm (conceived by Andrew Dunstan, I think);
see:

http://www.pgbuildfarm.org/

It lets members of the community use a standardized setup for building
PostgreSQL on their own machines and automates all steps involved.

I know the projects and the communities are different, but the general
idea to have a standard process to set up machines for whoever wants
to dedicate some hardware and time seems like a good idea.

Erik Rijkers

On Sun, March 2, 2008 20:20, Hilmar Lapp wrote:
> Hi Erik,
>
> On Mar 2, 2008, at 1:34 PM, Erik Rijkers wrote:
>
>> What is really needed of course is a place to run these loading
>> scripts continually against any appearing new versions of parsable
>> text, and against the different database backends.
>
> very true indeed.
>
>>
>> Does that already happen somewhere?
>>
>> Should we consider such a bioperl buildfarm / loadfarm?
>>
>> (I might be able to help with any postgres loading tests.)
>
>
> Coincidentally we have been batting around the idea to have an OBF
> machine dedicated to serve for testing and proof-of-concept
> demonstrations of OBF projects. Indeed one of the services we had
> thought about setting up is a BioSQL database, and it's reassuring to
> hear independently that that would be useful.
>
> -hilmar
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
>

From hlapp at gmx.net Sun Mar 2 15:38:27 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 15:38:27 -0500
Subject: [BioSQL-l] enhancement request scheduling
Message-ID: <5D2BC733-9A44-4EEA-B1D7-6DF90116B50E@gmx.net>

FYI, I have added the chimeric sequence problem and the character
column width issue to the Enhancement Requests page on the wiki:

http://www.biosql.org/wiki/Enhancement_Requests

I've also started to arrange individual requests in a first draft
towards scheduling them for implementation. This is very much up for
debate, so let me know any feedback or disagreement you have or votes
you might want to put in.

    -hilmar

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Sun Mar 2 17:53:34 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 2 Mar 2008 22:53:34 +0000
Subject: [BioSQL-l] database loading test server (was: BioSQL bug in bugzilla)
In-Reply-To: <52F7E02B-7192-4F3C-9B48-887583B25CCE@gmx.net>
References: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl> <52F7E02B-7192-4F3C-9B48-887583B25CCE@gmx.net>
Message-ID: <320fb6e00803021453h553a5c2ay8c50381ef39d0b6a@mail.gmail.com>

>
> Coincidentally we have been batting around the idea to have an OBF
> machine dedicated to serve for testing and proof-of-concept
> demonstrations of OBF projects. Indeed one of the services we had
> thought about setting up is a BioSQL database, and it's reassuring to
> hear independently that that would be useful.
>

The BioSQL test database would be especially useful if we have all the
Bio* projects hooked up to it, to automatically check they can all
read records written by each other.

I still haven't made time to get BioPerl set up on my machine to check
the BioSQL compatibility with Biopython...

Peter

From hlapp at gmx.net Sun Mar 2 22:18:47 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 22:18:47 -0500
Subject: [BioSQL-l] small "bug" correction in package BioSql
In-Reply-To: <473455BE.6040807@ebi.ac.uk>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com> <473336E6.6000100@ebi.ac.uk> <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net> <473455BE.6040807@ebi.ac.uk>
Message-ID: <4C9ACC1A-8C61-4611-8083-EFAD34D186EF@gmx.net>

Just FYI, I added a section to this effect to the Enhancement Requests:

http://www.biosql.org/wiki/Enhancement_Requests#Check_constraint_on_biosequence.alphabet

Feel free to fix/add as appropriate.

    -hilmar

On Nov 9, 2007, at 7:42 AM, Richard Holland wrote:

> I did a bit of poking around in our code and internally BioJava
> represents all the default alphabet names (Protein, DNA, etc.) in
> upper case. It also allows for mixed case alphabet names.
>
> It's not quite as easy as I thought to change these to lower case as
> they are often referenced by text name, meaning other people's code
> might break if I change them.
>
> Also, as it allows for mixed-case alphabet names, I can't do a
> toUpper/toLower fudge on persistence to BioSQL, as I wouldn't
> necessarily get out what I put in!

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From hlapp at gmx.net Sun Mar 2 22:38:59 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 22:38:59 -0500
Subject: [BioSQL-l] Fwd: error on insert new sequences from GenBank: no annotations saved in BioSQL database
References:
Message-ID: <917E3FCD-5FDB-460D-8F63-B897ACA5CDD2@gmx.net>

FYI, I used this to start a page on the recommended mapping of
sequence annotation to BioSQL:

http://www.biosql.org/wiki/Annotation_Mapping

Obviously, this is very rudimentary, but everyone is welcome to add to
it or comment with further questions. Also, one of the most important
questions, namely a consistent vocabulary for annotation (qualifier)
tags, isn't mentioned there (yet).

    -hilmar

Begin forwarded message:

> From: Hilmar Lapp
> Date: November 8, 2007 3:28:19 PM EST
> To: Eric Gibert
> Cc: biopython at lists.open-bio.org, BioJava
> Subject: Re: [Biojava-l] [BioPython] error on insert new sequences
> from GenBank: no annotations saved in BioSQL database
>
> Maybe we need to hold some mini-hackathon to make the different
> toolkits compatible in how they map annotation to the schema.
> Obviously I don't know whether you have the latest Biojava setup
> here, but I'll just comment how BioPerl/Bioperl-db would map this:
>
> 'ORIGIN' - if I'm not mistaken this is only a token that introduces
> the actual sequence. I'm not sure what Biojava is storing as value
> here.
>
> 'DIVISION' - this maps to column division in table bioentry (though I
> agree that if perfectly following the weak typing principle this
> should be tag/value association, but at present it's still an actual
> column)
>
> 'genbank_accessions' - secondary accession numbers indeed go into the
> qualifier value table. The primary accession maps to column accession
> in table bioentry
>
> 'TITLE' - this is part of a publication reference, and should map to
> column title in table reference (which it does in bioperl-db)
>
> 'cross_references' - not sure where these would be coming from in
> GenBank format; for EMBL this will map to the dbxref table
>
> 'data_file_division' - not sure what this is (same as DIVISION?)
>
> 'VERSION' - in BioPerl we parse this apart into a version for the
> accession (which is column version in table bioentry) and the GI
> number, which maps to column identifier in table bioentry
>
> 'references' - these map to table reference (and bioentry_reference
> for association with the bioentry)
>
> 'KEYWORDS' - indeed these map to bioentry_qualifier_value
>
> 'GI' - maps to column identifier in table bioentry
>
> 'SIZE' - not sure what size that is. If it is the length of the
> sequence, it should (and in BioPerl/bioperl-db does) map to column
> length in table biosequence
>
> 'DEFINITION' - maps to column description in table bioentry
>
> 'REFERENCE' - should be the same as for 'references'
>
> 'MDAT' - not sure what this is
>
> 'ORGANISM' - this is the organism and maps to the table taxon (and
> taxon_name), with a foreign key in bioentry pointing to the taxon
>
> 'JOURNAL' - this is part of a reference, see 'references'
>
> 'ACCESSION' - the primary accession, maps to column accession in
> table bioentry
>
> 'LOCUS' - in the file itself this is an entire line consisting of
> multiple fields; BioPerl/bioperl-db maps the locus name (the first
> token after the literal token LOCUS) to column name in table bioentry
>
> 'SOURCE' - this is the organism, see 'ORGANISM'
>
> 'PUBMED' - this is part of a literature reference, and maps to a
> foreign key in the reference table (reference.dbxref) to a dbxref
> entry with PUBMED or PMID as the database and the pubmed ID as the
> accession
>
> 'AUTHORS' - part of a literature reference, maps to column authors in
> table reference
>
> 'TYPE' - not sure what this is. If it's the alphabet, it maps to
> table biosequence, column alphabet
>
> 'CIRCULAR' - this at present indeed maps to bioentry_qualifier_value,
> though there have been plans to make it a column in table biosequence.
>
> Note that this could in fact be the way Biojava stores it too, but
> upon retrieval represents it in the way you are seeing it.
>
> Hth,
>
> -hilmar
>
> On Nov 8, 2007, at 12:50 PM, Eric Gibert wrote:
>
>> Dear all,
>>
>> When I retrieve a BioSQL.BioSeq.DBSeqRecord which was inserted
>> previously by my BioJava application, I have:
>>
>> print "Debug on Seq:", Seq.id, "=", Seq.annotations.keys()
>>
>> Debug on Seq: AJ459190.1 = ['ORIGIN', 'DIVISION',
>> 'genbank_accessions', 'TITLE', 'cross_references',
>> 'data_file_division', 'VERSION', 'references', 'KEYWORDS', 'GI',
>> 'SIZE', 'DEFINITION', 'REFERENCE', 'MDAT', 'ORGANISM', 'JOURNAL',
>> 'ACCESSION', 'LOCUS', 'SOURCE', 'PUBMED', 'AUTHORS', 'TYPE',
>> 'CIRCULAR']
>>
>> but a freshly inserted BioSeq by BioPython 1.44 only gives me:
>> Debug on Seq: EF631597.1 = ['cross_references', 'dates',
>> 'references', 'gi', 'data_file_division']
>>
>>
>> Once I look in the table bioentry_qualifier_value
>>
>> * 20 records for a Sequence imported by BioJava
>> * 1 only for a Sequence inserted by BioPython: the date which
>> should be inserted by "_load_bioentry_date" in BioSQL/Loader.py
>>
>> Quite a few annotations missing, no?
>>
>> Any idea?
>>
>> Eric
>>
>> _______________________________________________
>> BioPython mailing list - BioPython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
> ===========================================================
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From cjfields at uiuc.edu Sun Mar 2 23:36:56 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 2 Mar 2008 22:36:56 -0600
Subject: [BioSQL-l] Fwd: error on insert new sequences from GenBank: no annotations saved in BioSQL database
In-Reply-To: <917E3FCD-5FDB-460D-8F63-B897ACA5CDD2@gmx.net>
References: <917E3FCD-5FDB-460D-8F63-B897ACA5CDD2@gmx.net>
Message-ID:

On Mar 2, 2008, at 9:38 PM, Hilmar Lapp wrote:

> FYI, I used this to start a page on the recommended mapping of
> sequence annotation to BioSQL:
>
> http://www.biosql.org/wiki/Annotation_Mapping
>
> Obviously, this is very rudimentary, but everyone is welcome to add
> to it or comment with further questions. Also, one of the most
> important questions, namely a consistent vocabulary for annotation
> (qualifier) tags, isn't mentioned there (yet).
>
> -hilmar
>
>> ...
>> Maybe we need to hold some mini-hackathon to make the different
>> toolkits compatible in how they map annotation to the schema.
>> Obviously I don't know whether you have the latest Biojava setup
>> here, but I'll just comment how BioPerl/Bioperl-db would map this:

These are the ones I know of:

>> 'cross_references' - not sure where these would be coming from in
>> GenBank format; for EMBL this will map to the dbxref table

GenPept has DBSOURCE, so maybe from there?

>> 'data_file_division' - not sure what this is (same as DIVISION?)

Not sure about that one, but division sounds right.

>> 'MDAT' - not sure what this is

Modification Date, I think. 'MDAT' is a field name used for limits in
Entrez searches:

Field code: MDAT
name: Modification Date
desc: Date of last update
count: 4012
Attributes: is_date,is_singletoken

chris

From markjschreiber at gmail.com Tue Mar 4 21:06:17 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 5 Mar 2008 10:06:17 +0800
Subject: [BioSQL-l] multiple species for a sequence
In-Reply-To:
References: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu> <86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net>
Message-ID: <93b45ca50803041806o6f802548g4e408339d1a40c27@mail.gmail.com>

BioJava doesn't support multiple taxa per sequence. It's something to
consider though.

Philosophically you really have to wonder about the meaning of species
when you have a chimera : ) Should it not be a hybrid species all on
its own?

I wonder what they will do when Craig Venter produces Craigus
ventus...

- Mark

On Mon, Mar 3, 2008 at 2:00 AM, Chris Fields wrote:
>
>
> On Mar 2, 2008, at 11:33 AM, Hilmar Lapp wrote:
>
> > On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:
> >
> >> I'm looking at a bioperl bug I filed a while back that deals with
> >> multiple species in a sequence file, such as found for AJ428955:
> >>
> >> ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
> >> XX
> >> AC   AJ428955;
> >> XX
> >> DT   09-JUL-2002 (Rel. 72, Created)
> >> DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
> >> XX
> >> DE   Hepatitis GB virus B subgenomic replicon neoRepB
> >> XX
> >> KW   core-neo fusion protein; core-neo gene; polyprotein.
> >> XX
> >> OS   Hepatitis GB virus B
> >> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
> >> XX
> >> OS   Encephalomyocarditis virus
> >> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
> >> OC   Cardiovirus.
> >>
> >> ...
> >>
> >> We could probably add support in bioperl fairly easily (Bio::Seq
> >> could just return an array or the first species object based on
> >> context), but would BioSQL support sequences like this?
> >
> > No it wouldn't. There may only be one species (taxon) per sequence.
> >
> > There has been a lot of discussion about this in the past mostly
> > driven by the former SwissProt peculiarity of collapsing sequences
> > by sequence identity into a single record. We held out and
> > eventually UniProt dropped this practice.
>
> I'm unsure how often these pop up. The behavior of both EMBL and
> GenBank parsers assumes one species (as does Bio::Seq); the embl
> parser picks up both and just replaces the first with the second:
>
> ...
> DE   Hepatitis GB virus B subgenomic replicon neoRepB
> XX
> KW   core-neo fusion protein; core-neo gene; polyprotein.
> XX
> OS   Encephalomyocarditis virus
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
> OC   Cardiovirus.
> XX
> RN   [1]
> ...
>
> > I guess we never quite decided what to do about chimeric sequences
> > like the above. Note that the GenBank record gives this differently:
> >
> > http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885
> >
> > Here, there's one taxon (ORGANISM line) reference, but two localized
> > 'source' features in the feature table. (I'm actually not 100% sure
> > what the genbank parser would do with this - i.e., whether the
> > second source feature will override the taxon_id found in the
> > first.) Because seqfeatures (in BioSQL) don't have a link to taxon,
> > you wouldn't be able to hit the sequence by its second (chimeric)
> > taxon if that were your query criteria (though you could store it
> > fine, and if you queried by dbxrefs of features of type 'source',
> > you would find it).
>
> The genbank parser gets the taxon and tax ID correct; I would think
> when it hit the next source feature key it would assign the wrong tax
> ID to the species object but maybe there's a secondary check. Both
> output the source in feature tables just fine.
>
> > At the end of the day, BioSQL will evolve (hopefully) quickly to
> > support what the Bio* toolkits support, and will be much slower to
> > change in ways that Bio* wouldn't be able to take advantage of
> > anyway. At least that's my current vision of it, and of course is up
> > for debate as to whether that's a useful vision as much as anything
> > else.
> >
> > So, as you say, right now BioPerl, and AFAIAA any of the other Bio*
> > toolkits, doesn't support more than one species per sequence, but as
> > soon as that changes, there's a clear need for BioSQL to follow along.
> >
> > Does that make sense?
> >
> > -hilmar
>
> Yes. I think we could add in support for multiple species fairly
> easily but I'll probably hold off on anything until after a 1.6
> release (i.e. push it to the next developer series, which gives us
> more time to think on how to implement this in a BioSQL-friendly way).
>
> chris
>

From cjfields at uiuc.edu Wed Mar 5 18:24:03 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 5 Mar 2008 17:24:03 -0600
Subject: [BioSQL-l] bioperl-db bugs
Message-ID:

Hilmar,

I think I have two bioperl-db bugs sorted out, but I'm trying to
determine whether the solution is a side-effect, a feature, or a bug.
Dmitry has filed two bug reports which are somewhat related: http://bugzilla.open-bio.org/show_bug.cgi?id=2280 http://bugzilla.open-bio.org/show_bug.cgi?id=2281 I have added my comments to it, but maybe you can shed some more light on this. What he is trying to do is copy a persistent Seq object to a different namespace; load_seqdatabase.pl won't let him do that directly using the same sequence file. If he changes the namespace() and store()s it using a script, the seq is moved to the new namespace, not updated. My reasoning is this is a feature (by not changing the primary_key, you don't store a new sequence but update the current one). However, if the primary_key is unset (undef), then it appears you can copy the sequence over (from Dmitry's script, with my addition noted): ... my $ns1 = 'space1'; my $ns2 = 'space2'; my $seqadp = $db->get_object_adaptor('Bio::SeqI'); my $aux_seq = Bio::Seq::RichSeq->new( -accession_number => 'NC_005982', -version => 1, -namespace => $ns1); my $seq = $seqadp->find_by_unique_key($aux_seq); # store the found sequence in the second biodatabase: my $pseq = $seqadp->create_persistent($ns2); $pseq->namespace('bioperl2'); $pseq->primary_key(undef); # my addition, which appears to work $pseq->store(); $seqadp->commit; ... My question: is this an intended effect? The ability to assign undef to primary_key seems intentional based on the method code, but I'm a bit uncertain here. chris From hlapp at gmx.net Thu Mar 6 00:03:26 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 6 Mar 2008 00:03:26 -0500 Subject: [BioSQL-l] Announcement: BioSQL v1.0.0 released Message-ID: BioSQL v1.0.0 Release ===================== I am extremely pleased to announce the release of version 1.0.0 (code-named Tokyo, see below) of BioSQL. 
The release can be downloaded at the following location, in the following formats: http://biosql.org/DIST/biosql-1.0.0.tar.gz http://biosql.org/DIST/biosql-1.0.0.tar.bz2 http://biosql.org/DIST/biosql-1.0.0.zip (has Windows-style EOL) MD5 signatures (http://biosql.org/DIST/SIGNATURES.md5): MD5(biosql-1.0.0.tar.bz2)= 2b09a821b9d94bb1e94c3c79dc2f4cff MD5(biosql-1.0.0.tar.gz)= e47982d979ddb98aae640b5ab55ce2c6 MD5(biosql-1.0.0.zip)= 06913c8639ca4fe7f9000b556d8a04ed The core BioSQL schema is a generic, extensible relational model for sequences, sequence features, their annotation, and ontology terms. It is also designed as the interoperable persistence interface between the Bio* projects. This version of the schema has essentially been the same since November 2004. Software that worked with schema versions downloaded from CVS (or, as of lately, svn) after November 2004 should work with all 1.0.x releases. This release contains - the core BioSQL schema as DDL (Data Definition Language) for the following RDBMSs: MySQL, PostgreSQL, Oracle, HSQLDB, and Apache Derby, - ancillary (but optional) schema files for PostgreSQL, - documentation and an ERD (Entity-Relationship Diagram), and - a Perl script that can pre-load (and update) a BioSQL instance with the NCBI taxonomy. Installation instructions for MySQL and PostgreSQL are in the file INSTALL, and the file doc/bj_and_bsql_oracle_howto.htm has instructions for installing the Oracle version. Additional information regarding BioSQL, including links to language bindings, a roadmap to future releases and enhancements, and possible local optimizations is available from the BioSQL website at http://biosql.org. 
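A downloaded archive can be checked against the MD5 values listed above. The following is a minimal Python sketch and not part of the release: the names md5_of_file and matches_published are made up for illustration, and the checksums are simply copied from this announcement.

```python
import hashlib

# Checksums copied from the announcement (SIGNATURES.md5).
PUBLISHED_MD5 = {
    "biosql-1.0.0.tar.bz2": "2b09a821b9d94bb1e94c3c79dc2f4cff",
    "biosql-1.0.0.tar.gz": "e47982d979ddb98aae640b5ab55ce2c6",
    "biosql-1.0.0.zip": "06913c8639ca4fe7f9000b556d8a04ed",
}


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, name):
    """Compare a local file against the published checksum for `name`."""
    return md5_of_file(path) == PUBLISHED_MD5[name]
```

For example, matches_published("biosql-1.0.0.tar.gz", "biosql-1.0.0.tar.gz") should be true for an intact download, assuming the file sits in the current directory.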
On behalf of the BioSQL developers, Hilmar Lapp Acknowledgments --------------- BioSQL in general and this release in particular owe an enormous debt to a number of people and would not exist without their contributions, the contributions of people on the biosql-l mailing list, and the support of other developers and users from the Bio* community. Ewan Birney created the first version of the schema and during the 2003 BioHackathon in Singapore tested and wrote much of the INSTALL document. Elia Stupka and Chris Mungall made significant changes at the 2002 BioHackathons in Tucson, AZ, and Cape Town, South Africa. Aaron Mackey was instrumental in the changes made at the Singapore BioHackathon, which set the path to the version (code-named 'post-Singapore') that eventually stabilized as v1.0. Matthew Pocock and Thomas Down provided important input for the ontology model. This release and the accompanying work on cleaning up, updating documentation, and jump-starting a useful (wiki) website was irreversibly set in motion at the BioHackathon 2008 in Tokyo, and would not have happened without the active encouragement from several participants, especially Heikki Lehvahslaiho, Mark Schreiber, Richard Holland, and Raoul Bonnal. Finally, without the superb and prompt help from Mauricio Herrera Cuadra and Jason Stajich with various wiki and other admin issues that occasionally reared their heads we wouldn't have made it to this point. In recognition of the role the BioHackathon 2008 played in getting this release out the door, and in keeping with an informal tradition held up since the first BioHackathon, I am code-naming the 1.0.x release series the Tokyo release series of BioSQL. Thank you to everyone! License ------- BioSQL is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
-- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun Mar 9 19:38:18 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 9 Mar 2008 19:38:18 -0400 Subject: [BioSQL-l] bioperl-db bugs In-Reply-To: References: Message-ID: Hi Chris, I added comments to both bug reports. This belongs to BioPerl, though, as it has only to do with its language binding. The tidbit that may be worth keeping in mind for a general BioSQL audience is that the bioentry namespace (foreign key to biodatabase) is part of the (compound) bioentry unique keys. The identifier column used to be unique by itself (and could still be made such in a local instance, there's a comment to this effect in the DDL), but that was changed a while ago. (Also, if one uses any of the Bio* language bindings, changing a unique key constraint to something that differs from what the language binding assumes may be asking for a lot of trouble. Bioperl-db will expect the combination of primary_id() and namespace() to match if the latter is provided.) -hilmar On Mar 5, 2008, at 6:24 PM, Chris Fields wrote: > Hilmar, > > I think I have two bioperl-db bugs sorted out, but I'm trying to > determine whether the solution is a side-effect, a feature, or a > bug. Dmitry has filed two bug reports which are somewhat related: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2280 > http://bugzilla.open-bio.org/show_bug.cgi?id=2281 > > I have added my comments to it, but maybe you can shed some more > light on this. What he is trying to do is copy a persistent Seq > object to a different namespace; load_seqdatabase.pl won't let him > do that directly using the same sequence file. If he changes the > namespace() and store()s it using a script, the seq is moved to the > new namespace, not updated.
> > My reasoning is this is a feature (by not changing the primary_key, > you don't store a new sequence but update the current one). > However, if the primary_key is unset (undef), then it appears you > can copy the sequence over (from Dmitry's script, with my addition > noted): > > ... > my $ns1 = 'space1'; > my $ns2 = 'space2'; > > my $seqadp = $db->get_object_adaptor('Bio::SeqI'); > my $aux_seq = Bio::Seq::RichSeq->new( > -accession_number => 'NC_005982', > -version => 1, > -namespace => $ns1); > my $seq = $seqadp->find_by_unique_key($aux_seq); > > # store the found sequence in the second biodatabase: > my $pseq = $seqadp->create_persistent($ns2); > $pseq->namespace('bioperl2'); > $pseq->primary_key(undef); # my addition, which appears to work > $pseq->store(); > $seqadp->commit; > ... > > My question: is this an intended effect? The ability to assign > undef to primary_key seems intentional based on the method code, > but I'm a bit uncertain here. > > chris > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jswetnam at gmail.com Mon Mar 10 15:27:46 2008 From: jswetnam at gmail.com (James Swetnam) Date: Mon, 10 Mar 2008 15:27:46 -0400 Subject: [BioSQL-l] Possible Mysql 5.x bug Message-ID: <2ABD56A9-9632-4AB1-BC54-B0AF71037DC8@gmail.com> First off, thank you very much to the developers for creating and maintaining such a useful and interesting project. I think I have found a small syntactical bug; as a caveat, however, I am not a database developer and have very little experience in these matters. I do know how to read documentation though, which I've relied heavily on to write this email. 
As per the biopython setup tutorial I'm attempting to run the biosqldb-mysql.sql file on Mac OS X Leopard. Here is my mysql version string: cardozo13:sql james$ mysql -V mysql Ver 14.12 Distrib 5.0.54-20071214, for apple-darwin9.1.0 (powerpc) using EditLine wrapper And my procedure (after grabbing the biosql source via CVS). cardozo13:sql james$ mysqladmin -u root -p create bioseqdb Enter password: cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql Enter password: ERROR 1064 (42000) at line 169: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '--CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id)' at line 1 Interesting. Let's take a look at line 169: --CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id); And an excerpt from the documentation for my version of MySQL (5.0 reference manual), section 1.8.5.6. '--' as the Start of a Comment: Standard SQL uses "--" as a start-comment sequence. MySQL Server uses "#" as the start comment character. MySQL Server 3.23.3 and up also supports a variant of the "--" comment style. That is, the "--" start-comment sequence must be followed by a space (or by a control character such as a newline). The space is required to prevent problems with automatically generated SQL queries that use constructs such as the following, where we automatically insert the value of the payment for payment: OK. So after replacing all the lines in which -- is not followed by a space (thank you regexps), it works beautifully. cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql Enter password: Should this change be implemented? Or am i missing something? James Swetnam Research Technician New York University School of Medicine - Done.
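The kind of regexp fix described in the message above can be sketched as follows. This is only an assumption about what such a substitution might look like, not the exact expression that was used; the function name space_after_comment_marker is hypothetical, and the sketch deliberately touches only "--" at the start of a line so that expressions such as 5--1 inside a statement are left alone.

```python
import re

# "--" at the start of a line (after optional indentation) that is
# immediately followed by a non-whitespace character.
_BARE_COMMENT = re.compile(r"^([ \t]*)--(?=\S)", re.MULTILINE)


def space_after_comment_marker(sql_text):
    """Insert the space MySQL requires after a line-leading '--'."""
    return _BARE_COMMENT.sub(r"\1-- ", sql_text)
```

Lines that already read "-- comment" are returned unchanged, so the function can be run repeatedly over the same DDL file without accumulating spaces.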
From hlapp at gmx.net Mon Mar 10 23:05:32 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Mon, 10 Mar 2008 23:05:32 -0400 Subject: [BioSQL-l] Possible Mysql 5.x bug In-Reply-To: <2ABD56A9-9632-4AB1-BC54-B0AF71037DC8@gmail.com> References: <2ABD56A9-9632-4AB1-BC54-B0AF71037DC8@gmail.com> Message-ID: <9051AFFE-8660-4E21-B25F-93D1FB70D98B@gmx.net> Hi James, thanks for reporting this. Sebastian Bassi beat you to it, though, and it has actually been fixed in svn, and is also fixed in the 1.0.0 release. BioSQL is meanwhile on svn; the anonymous cvs server is still up, but doesn't get updated since the switch-over to svn. Instructions for downloading from svn and download location of the 1.0.0 release are on the BioSQL wiki at http://biosql.org. Let us know if you encounter any difficulties. And great that you're finding the project useful! -hilmar On Mar 10, 2008, at 3:27 PM, James Swetnam wrote: > First off, thank you very much to the developers for creating and > maintaining such a useful and interesting project. I think I have > found a small syntactical bug; as a caveat, however, I am not a > database developer and have very little experience in these matters. > I do know how to read documentation though, which I've relied > heavily > on to write this email. > As per the biopython setup tutorial I'm attempting to run the > biosqldb- > mysql.sql file on Mac OS X Leopard.
Here is my mysql version > string: > cardozo13:sql james$ mysql -V > mysql Ver 14.12 Distrib 5.0.54-20071214, for apple-darwin9.1.0 > (powerpc) using EditLine wrapper > And my procedure (after grabbing the biosql source via CVS). > cardozo13:sql james$ mysqladmin -u root -p create bioseqdb > Enter password: > cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb- > mysql.sqlEnter password: > ERROR 1064 (42000) at line 169: You have an error in your SQL > syntax; > check the manual that corresponds to your MySQL server version > for the > right syntax to use near '--CREATE INDEX ontrel_subjectid ON > term_relationship(subject_term_id)' at line 1 > Interesting. Let's take a look at line 169: > > --CREATE INDEX ontrel_subjectid ON term_relationship > (subject_term_id); > > And an excerpt from the documentation for my version of MySQL (5.0 > reference manual), section 1.8.5.6. '--' as the Start of a Comment: > > Standard SQL uses "--" as a start-comment sequence. MySQL Server > uses > "#" as the start comment character. MySQL Server 3.23.3 and up also > supports a variant of the "--" comment style. That is, the "--" > start- > comment sequence must be followed by a space (or by a control > character such as a newline). The space is required to prevent > problems with automatically generated SQL queries that use > constructs > such as the following, where we automatically insert the value > of the > payment for payment: > > OK. So after replacing all the lines in which -- is not followed > by a > space (thank you regexps), it works beautifully. > > cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql > Enter password: > > Should this change be implemented? Or am i missing something? > > James Swetnam > Research Technician > New York University School of Medicine > > > > > > > > - Done.
> > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Tue Mar 11 14:51:47 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 11 Mar 2008 18:51:47 +0000 Subject: [BioSQL-l] Biopython documentation in BioSQL SVN In-Reply-To: <320fb6e00803111148k52150fdcu3b3c22f72f59f514@mail.gmail.com> References: <320fb6e00803111148k52150fdcu3b3c22f72f59f514@mail.gmail.com> Message-ID: <320fb6e00803111151q645c1f0cgead7842e8ab0d0d@mail.gmail.com> Hello, Over on the Biopython mailing list, James Swetnam drew my attention to the fact that we still had documentation referring to installing BioSQL from CVS (predating both the move to SVN and the official 1.0 release). I've updated our wiki page, http://biopython.org/wiki/BioSQL However, there is some older LaTeX based documentation on our webpage, http://biopython.org/DIST/docs/biosql/python_biosql_basic.html http://biopython.org/DIST/docs/biosql/python_biosql_basic.pdf These are currently living in the BioSQL repository, which I don't think I have access to.
http://code.open-bio.org/svnweb/index.cgi/biosql/browse/biosql-schema/trunk/doc/biopython/ Does it make sense to have this documentation separate from the Biopython code it refers to (which lives in the Biopython repository)? For one thing, it complicates access rights for developers. What I would suggest is just to: (*) add a disclaimer to the top of python_biosql_basic.tex saying this document is deprecated, and giving a link to the wiki page, http://biopython.org/wiki/BioSQL (*) regenerate the PDF and HTML files. (*) Update these three files in BioSQL's SVN repository. (*) Copy the new PDF and HTML files over to the Biopython webserver. Thanks Peter From hlapp at gmx.net Tue Mar 11 15:57:16 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 11 Mar 2008 15:57:16 -0400 Subject: [BioSQL-l] Biopython documentation in BioSQL SVN In-Reply-To: <320fb6e00803111151q645c1f0cgead7842e8ab0d0d@mail.gmail.com> References: <320fb6e00803111148k52150fdcu3b3c22f72f59f514@mail.gmail.com> <320fb6e00803111151q645c1f0cgead7842e8ab0d0d@mail.gmail.com> Message-ID: On Mar 11, 2008, at 2:51 PM, Peter wrote: > However, there is some older LaTeX based documentation on our webpage, > http://biopython.org/DIST/docs/biosql/python_biosql_basic.html > http://biopython.org/DIST/docs/biosql/python_biosql_basic.pdf > > These are currently living in the BioSQL repository, You mean that the originals are, i.e., the source .tex file, right? The files in the BioSQL repository have been updated, and the updates should be in the v1.0.0 release. > [...] > Does it make sense to have this documentation separate from the > Biopython code it refers to (which lives in the Biopython repository)? > For one thing, it complicates access rights for developers. Indeed. You can have write access but that doesn't mean it would then be easy to maintain for you folks (as it being in a non-biopython repository likely makes it slip from your mind again). However, at the end of the day it is your call.
I'm happy to leave it there, especially if there is continuing interest from Biopython folks to keep it updated (if there isn't, I may schedule it for deletion for one of the 1.1 or higher releases). > > What I would suggest is just to: > > (*) add a disclaimer to the top of python_biosql_basic.tex saying this > document is depreciated, and giving a link to the wiki page, > http://biopython.org/wiki/BioSQL Just send me a patch of the change you would like to make. > (*) regenerate the PDF and HTML files. Those have been regenerated already, before the v1.0.0 release (by me, under some pains trying to get HeVeA to do what the original creators seemed to have gotten it to do). > (*) Update these three files in BioSQL's SVN repository. Done already as far as the change to svn is concerned. Actually, some Biopythonist (Sebastian?) walked through the file and made sure everything works as described, giving rise to an additional change. > (*) Copy the new PDF and HTML files over to the Biopython webserver. Feel free to grab them from svn (or from the BioSQL 1.0.0 release, there haven't been any changes since the release). -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Thu Mar 13 11:06:18 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Mar 2008 15:06:18 +0000 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id Message-ID: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> Dear list, One of the unresolved issues with Biopython's BioSQL interface is dealing with the NCBI taxon ID when loading sequences into the database. 
As I understand it, ideally before loading any sequences, the user will have loaded in the entire NCBI taxonomy using the load_ncbi_taxonomy.pl script, as I described here: http://biopython.org/wiki/BioSQL#NCBI_Taxonomy When a new sequence is added to the database with a known taxon id, there is no problem. But what happens if it's a recently sequenced organism which isn't defined yet in the BioSQL taxonomy tables? Could/should the user re-run load_ncbi_taxonomy.pl, and then load in their new sequence? Right now in Biopython, due to what appears to have been intended as a short-term hack, we simply don't record the taxon id at all (!), and I would like to fix this (bug 2422). http://bugzilla.open-bio.org/show_bug.cgi?id=2422 How do BioPerl et al deal with this issue? Do they try and update the taxonomy tables using the available information in the new record's annotation (i.e. the new taxon id and the species name)? Do they lookup the NCBI taxonomy definition via the internet? Do they throw an error and halt? Thanks, Peter (Biopython) From hlapp at gmx.net Thu Mar 13 18:51:13 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 13 Mar 2008 18:51:13 -0400 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> Message-ID: <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> (this is more of a bioperl question than a biosql one) The load_ncbi_taxonomy.pl script is designed to update the taxon tables in a non-disruptive way, and if there weren't many changes it shouldn't actually take that long (except that recalculating the nested set values may take a couple of minutes). Bioperl-db will store the taxon information it finds in the Bio::Species object if it can't locate the taxon by lookup, and will not raise an error.
The problem with this is that it relies on the Bio::SeqIO parser to have gotten the species and lineage information correct, which is sometimes a wrong assumption for exotic species. Most often the error will not manifest itself at the time of storing the erroneously parsed information, but when it is re-retrieved and used to populate a Bio::Species object. For the SymAtlas project we had this situation (new species in sequence updates that the last NCBI taxonomy update hadn't yet brought in) quite regularly. I wrote a SQL script that would fix those 'haphazard' additions such that load_ncbi_taxonomy would update them to their correct values come the next NCBI taxonomy update. I can send you the script (it would be for the Oracle version), but I'm not sure this is a widely viable strategy. -hilmar On Mar 13, 2008, at 11:06 AM, Peter wrote: > Dear list, > > One of the unresolved issues with Biopython's BioSQL interface is > dealing with the NCBI taxon ID when loading sequences into the > database. > > As I understand it, ideally before loading any sequences, the user > will have loaded in the entire NCBI taxonomy using the > load_ncbi_taxonomy.pl script, as I described here: > http://biopython.org/wiki/BioSQL#NCBI_Taxonomy > > When a new sequence is added to the database with a known taxon id, > there is no problem. But happens if its a recently sequenced organism > which isn't defined yet in the BioSQL taxonomy tables? Could/should > the user re-run load_ncbi_taxonomy.pl, and then load in their new > sequence? > > Right now in Biopython due what appears to have been intended as a > short term hack, we simple don't record the taxon id at all (!), and I > would like to fix this (bug 2422). > http://bugzilla.open-bio.org/show_bug.cgi?id=2422 > > How do BioPerl et al deal with this issue? Do they try and update the > taxonomy tables using the available information in the new record's > annotation (i.e. the new taxon id and the species name)?
Do they > lookup the NCBI taxonomy definition via the internet? Do they throw > an error and halt? > > Thanks, > > Peter > (Biopython) > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Thu Mar 13 19:13:32 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Mar 2008 23:13:32 +0000 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> Message-ID: <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp wrote: > (this is more of a bioperl question than a biosql one) Well, yes and no. And I'm not subscribed to the Bioperl list, nor the BioJava one, nor the BioRuby one. > The load_ncbi_taxonomy.pl script is designed to update the taxon > tables in a non-disruptive way, and if there weren't many changes > shouldn't actually take that long (except that recalculating the > nested set values may take a couple of minutes). Do you think when faced with a novel taxon id, Biopython/BioPerl/... could write some minimal taxonomy entry (without any guess work based on the species name), in order to record the sequence's taxon - and then running an improved load_ncbi_taxonomy.pl at a later date would sort out the proper taxonomy? > Bioperl-db will store the taxon information it finds in the > Bio::Species object if it can't locate the taxon by lookup, and will > not raise an error. 
The problem with this is that it relies on the > Bio::SeqIO parser to have gotten the species and lineage information > correct, which is sometimes a wrong assumption for exotic species. > Most often the error will not manifest itself at the time of storing > the erroneously parsed information, but when it is re-retrieved and > used to populate a Bio::Species object. This is what I would like to avoid with Biopython. > For the SymAtlas project we had this situation (new species in > sequence updates that the last NCBI taxonomy update hadn't yet > brought in) quite regularly. I wrote a SQL script would fix those > 'haphazard' additions such that load_ncbi_taxonomy would update them > to their correct values come the next NCBI taxonomy update. I can > send you the script (it would be for the Oracle version), but I'm not > sure this is a widely viable strategy. So this wasn't integrated with load_ncbi_taxonomy.pl at all? Peter From hlapp at gmx.net Thu Mar 13 19:41:43 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 13 Mar 2008 19:41:43 -0400 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> Message-ID: On Mar 13, 2008, at 7:13 PM, Peter wrote: > On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp wrote: >> [...] >> The load_ncbi_taxonomy.pl script is designed to update the taxon >> tables in a non-disruptive way, and if there weren't many changes >> shouldn't actually take that long (except that recalculating the >> nested set values may take a couple of minutes). > > Do you think when faced with a novel taxon id, Biopython/BioPerl/... > could write some minimal taxonomy entry (without any guess work based > on the species name), in order to record the sequence's taxon This is what Bioperl-db does. 
There isn't any guesswork. If Bio::Species has lineage information it will also insert the lineage information, though. > - and then running an improved load_ncbi_taxonomy.pl at a later > date would > sort out the proper taxonomy? If I remember correctly, the script makes (and hence expects) the primary key and the NCBI taxonomy ID to be identical. If your loading procedure can achieve that already then load_ncbi_taxonomy.pl should pick them up and fix them. You can try that by loading the taxonomy through the script, then arbitrarily choose a taxon, create a stub bioentry for it and set its taxon_id foreign key to the chosen taxon, change its taxon_name.name to some bogus value (for the 'scientific name' class, for example) (and feel free to change the left_id and right_id values in taxon too), and rerun the script. It should fix the change you made, and your bioentry should still point to the same taxon (because its primary key did not change, and did not get deleted either; otherwise the bioentry would now have a null value in the foreign key). The Bioperl-db way of storing things does not give control over primary key assignment to Bioperl-db, so the database will assign it. > [...] >> For the SymAtlas project we had this situation (new species in >> sequence updates that the last NCBI taxonomy update hadn't yet >> brought in) quite regularly. I wrote a SQL script would fix those >> 'haphazard' additions such that load_ncbi_taxonomy would update them >> to their correct values come the next NCBI taxonomy update. I can >> send you the script (it would be for the Oracle version), but I'm >> not >> sure this is a widely viable strategy. > > So this wasn't integrated with load_ncbi_taxonomy.pl at all? No, but now that you say it I don't see any reason why I couldn't. Maybe that's just what I should do. 
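The check described above (load the taxonomy, attach a stub bioentry to a taxon, damage the taxon by hand, rerun the loader) can be sketched in a few lines. This is only an illustration under stated assumptions, not how load_ncbi_taxonomy.pl is implemented: the three tables are cut-down stand-ins for the BioSQL ones, refresh_taxon() is a hypothetical helper that mimics an in-place update keyed on a primary key equal to the NCBI taxon ID, the accession STUB0001 is made up, and the nested-set columns are called left_value and right_value here.

```python
import sqlite3

# Cut-down stand-ins for the BioSQL taxon, taxon_name and bioentry tables.
# Assumption: the real schema has more columns and constraints than this.
SCHEMA = """
CREATE TABLE taxon (
    taxon_id      INTEGER PRIMARY KEY,
    ncbi_taxon_id INTEGER UNIQUE,
    left_value    INTEGER,
    right_value   INTEGER
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL REFERENCES taxon(taxon_id) ON DELETE CASCADE,
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL
);
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL,
    taxon_id    INTEGER REFERENCES taxon(taxon_id) ON DELETE SET NULL
);
"""


def refresh_taxon(conn, ncbi_taxon_id, name, left_value, right_value):
    """Hypothetical stand-in for how a taxonomy reload treats one node.

    The row is keyed on a primary key equal to the NCBI taxon ID and is
    updated in place when it already exists, so rows pointing at it survive.
    """
    cur = conn.execute(
        "UPDATE taxon SET ncbi_taxon_id = ?, left_value = ?, right_value = ? "
        "WHERE taxon_id = ?",
        (ncbi_taxon_id, left_value, right_value, ncbi_taxon_id),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO taxon (taxon_id, ncbi_taxon_id, left_value, right_value) "
            "VALUES (?, ?, ?, ?)",
            (ncbi_taxon_id, ncbi_taxon_id, left_value, right_value),
        )
    # Replace the scientific name rather than the taxon row itself.
    conn.execute(
        "DELETE FROM taxon_name "
        "WHERE taxon_id = ? AND name_class = 'scientific name'",
        (ncbi_taxon_id,),
    )
    conn.execute(
        "INSERT INTO taxon_name (taxon_id, name, name_class) "
        "VALUES (?, ?, 'scientific name')",
        (ncbi_taxon_id, name),
    )


def run_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)

    # First load: 9606 is the NCBI taxon ID of Homo sapiens; the nested-set
    # values 1 and 2 are arbitrary placeholders.
    refresh_taxon(conn, 9606, "Homo sapiens", 1, 2)

    # Stub bioentry pointing at the chosen taxon (the accession is made up).
    conn.execute(
        "INSERT INTO bioentry (accession, taxon_id) VALUES ('STUB0001', 9606)"
    )

    # Damage the taxon by hand, as the test procedure suggests.
    conn.execute("UPDATE taxon_name SET name = 'bogus name' WHERE taxon_id = 9606")
    conn.execute(
        "UPDATE taxon SET left_value = NULL, right_value = NULL "
        "WHERE taxon_id = 9606"
    )

    # Second load should repair the taxon without touching the bioentry.
    refresh_taxon(conn, 9606, "Homo sapiens", 1, 2)

    name = conn.execute(
        "SELECT name FROM taxon_name WHERE taxon_id = 9606"
    ).fetchone()[0]
    fk = conn.execute(
        "SELECT taxon_id FROM bioentry WHERE accession = 'STUB0001'"
    ).fetchone()[0]
    conn.close()
    return name, fk


if __name__ == "__main__":
    print(run_demo())
```

Because the taxon row is updated rather than deleted and re-inserted, the ON DELETE SET NULL rule on bioentry.taxon_id never fires, which is the property the test is meant to confirm. A loader that let the database assign the primary key would need a lookup on ncbi_taxon_id instead.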
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From mrphysh at juno.com Thu Mar 13 21:58:25 2008 From: mrphysh at juno.com (mrphysh at juno.com) Date: Fri, 14 Mar 2008 01:58:25 GMT Subject: [BioSQL-l] bioperl basics Message-ID: <20080313.195825.6855.0@webmail20.vgs.untd.com> I am a molecular biologist studying bioinformatics from a Perl background and making progress. I am realizing that without tapping into the existing infrastructure, I will be writing code for ever. Bioperl is the path for me. I am moving forward. the error I encounter is can't locate Cache/FileCache in @INC (@INC contains /etc/perl/ /usr/locaql/lib/perl/5.8.8 .....) and so forth. I found the files in a home directory. I must have told the install to put them there...? anyway: How do I edit this environmental variable..... @INC. I cannot find anything in my book. thanks john brigham I will be writing code for years and need to tap into the From barry.moore at genetics.utah.edu Thu Mar 13 23:08:19 2008 From: barry.moore at genetics.utah.edu (Barry Moore) Date: Thu, 13 Mar 2008 21:08:19 -0600 Subject: [BioSQL-l] bioperl basics In-Reply-To: <20080313.195825.6855.0@webmail20.vgs.untd.com> References: <20080313.195825.6855.0@webmail20.vgs.untd.com> Message-ID: John, @INC is not an environment variable, it is a perl variable that gets populated by the environment variable PERL5LIB.
You would normally set that environment variable with something like export PERL5LIB="/path/to/perl/libraries:$PERL5LIB" if you use the bash shell, or setenv PERL5LIB "/path/to/perl/libraries:$PERL5LIB" if you use the C shell. You'll want to put those lines into the appropriate startup files so that they get set every time you log in. This will be different on a Windows system but I'm afraid I can't help with that. If you are having trouble installing bioperl I would encourage you to read the installation documentation at http://www.bioperl.org/wiki/Installing_BioPerl. Beyond that you will find a wealth of help with your beginning perl questions by searching the web with Google, asking at perlmonks.org or joining one of the many perl mailing lists that you can find at http://lists.cpan.org/. The bioperl mailing list and this mailing list (BioSQL) are devoted specifically to discussions directly related to Bioperl and BioSQL respectively. You should search for answers to questions like this one first on the web, then on one of the general perl mailing lists or web sites mentioned above. When you have questions (even beginner ones) that are specific to Bioperl or BioSQL you are welcome to post to those lists at any time. Barry On Mar 13, 2008, at 7:58 PM, mrphysh at juno.com wrote: > I am a molecular biologist studying bioinformatics from a Perl > background and making progress. I am realizing that without > tapping into the existing infrastructure, I will be writing code > for ever. Bioperl is the path for me. I am moving forward. > > the error I encounter is > > can't locate Cache/FileCache in @INC (@INC contains /etc/perl/ /usr/ > locaql/lib/perl/5.8.8 .....) and so forth. > > I found the files in a home directory. I must have told the > install to put them there...? > > > anyway: How do I edit this environmental variable..... @INC. I > cannot find anything in my book.
> > thanks > john brigham > > > I will be writing code for years and need to tap into the > > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l From markjschreiber at gmail.com Fri Mar 14 09:48:38 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Fri, 14 Mar 2008 21:48:38 +0800 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> Message-ID: <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> From memory BioJava will add it if it is not already in there. If the taxid can be found then the system connects you with whatever is in that taxid, it doesn't overwrite it. This has two curious side effects. Because the details associated with a taxid sometimes change (eg common name changes a lot) you can get connected to an outdated version (if your record is newer than your NCBI taxonomy) or you can get connected with a version that is newer than your record which means when you round-trip you don't get complete identity. For compatibility across the projects some kind of consensus would be good. - Mark On Fri, Mar 14, 2008 at 7:41 AM, Hilmar Lapp wrote: > > > On Mar 13, 2008, at 7:13 PM, Peter wrote: > > > On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp wrote: > >> [...]
> > >> The load_ncbi_taxonomy.pl script is designed to update the taxon > >> tables in a non-disruptive way, and if there weren't many changes > >> shouldn't actually take that long (except that recalculating the > >> nested set values may take a couple of minutes). > > > > Do you think when faced with a novel taxon id, Biopython/BioPerl/... > > could write some minimal taxonomy entry (without any guess work based > > on the species name), in order to record the sequence's taxon > > This is what Bioperl-db does. There isn't any guesswork. If > Bio::Species has lineage information it will also insert the lineage > information, though. > > > > - and then running an improved load_ncbi_taxonomy.pl at a later > > date would > > sort out the proper taxonomy? > > If I remember correctly, the script makes (and hence expects) the > primary key and the NCBI taxonomy ID to be identical. If your loading > procedure can achieve that already then load_ncbi_taxonomy.pl should > pick them up and fix them. You can try that by loading the taxonomy > through the script, then arbitrarily choose a taxon, create a stub > bioentry for it and set its taxon_id foreign key to the chosen > taxon, change its taxon_name.name to some bogus value (for the > 'scientific name' class, for example) (and feel free to change the > left_id and right_id values in taxon too), and rerun the script. It > should fix the change you made, and your bioentry should still point > to the same taxon (because its primary key did not change, and did > not get deleted either; otherwise the bioentry would now have a null > value in the foreign key). > > The Bioperl-db way of storing things does not give control over > primary key assignment to Bioperl-db, so the database will assign it. > > > [...] > > >> For the SymAtlas project we had this situation (new species in > >> sequence updates that the last NCBI taxonomy update hadn't yet > >> brought in) quite regularly. 
I wrote a SQL script would fix those > >> 'haphazard' additions such that load_ncbi_taxonomy would update them > >> to their correct values come the next NCBI taxonomy update. I can > >> send you the script (it would be for the Oracle version), but I'm > >> not > >> sure this is a widely viable strategy. > > > > So this wasn't integrated with load_ncbi_taxonomy.pl at all? > > No, but now that you say it I don't see any reason why I couldn't. > Maybe that's just what I should do. > > -hilmar > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > _______________________________________________ > > > > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > From cjfields at uiuc.edu Fri Mar 14 10:31:09 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 14 Mar 2008 09:31:09 -0500 Subject: [BioSQL-l] [Bioperl-l] Loading sequences with novel NCBI taxon id In-Reply-To: <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> Message-ID: The counter to that perspective (using new sequences with old tax info) would be to regularly update NCBI taxonomy, particularly in circumstances prior to adding new sequences. Hilmar mentioned that once tax is loaded it doesn't take as long to update, so you could set up a cron job to update regularly. I remember someone mentioning weekly or monthly updates on the list quite a while ago, but I'm unsure how often NCBI updates tax information (i.e. with every release, monthly, weekly, etc). 
I can see instances popping up where you used an up-to-date taxonomy but a new sequence contains a tax ID not present. I think bioperl-db handles these but I'm not sure what the other Bio* projects do. chris On Mar 14, 2008, at 8:48 AM, Mark Schreiber wrote: > From memory BioJava will add it if it is not already in there. If the > taxid can be found then the system connects you with whatever is in > that taxid, it doesn't overwrite it. > > This has two curious side effects. Because the details associated with > a taxid sometimes change (eg common name changes a lot) you can get > connected to an outdated version (if your record is newer than your > NCBI taxonomy) or you can get connected with a version that is newer > than your record which means when you round-trip you don't get > complete identity. > > For compatibility across the projects some kind of consensus would > be good. > > - Mark > On Fri, Mar 14, 2008 at 7:41 AM, Hilmar Lapp wrote: >> >> >> On Mar 13, 2008, at 7:13 PM, Peter wrote: >> >>> On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp wrote: >>>> [...] >> >>>> The load_ncbi_taxonomy.pl script is designed to update the taxon >>>> tables in a non-disruptive way, and if there weren't many changes >>>> shouldn't actually take that long (except that recalculating the >>>> nested set values may take a couple of minutes). >>> >>> Do you think when faced with a novel taxon id, Biopython/BioPerl/... >>> could write some minimal taxonomy entry (without any guess work >>> based >>> on the species name), in order to record the sequence's taxon >> >> This is what Bioperl-db does. There isn't any guesswork. If >> Bio::Species has lineage information it will also insert the lineage >> information, though. >> >> >>> - and then running an improved load_ncbi_taxonomy.pl at a later >>> date would >>> sort out the proper taxonomy? >> >> If I remember correctly, the script makes (and hence expects) the >> primary key and the NCBI taxonomy ID to be identical.
If your loading >> procedure can achieve that already then load_ncbi_taxonomy.pl should >> pick them up and fix them. You can try that by loading the taxonomy >> through the script, then arbitrarily choose a taxon, create a stub >> bioentry for it and set its taxon_id foreign key to the chosen >> taxon, change its taxon_name.name to some bogus value (for the >> 'scientific name' class, for example) (and feel free to change the >> left_id and right_id values in taxon too), and rerun the script. It >> should fix the change you made, and your bioentry should still point >> to the same taxon (because its primary key did not change, and did >> not get deleted either; otherwise the bioentry would now have a null >> value in the foreign key). >> >> The Bioperl-db way of storing things does not give control over >> primary key assignment to Bioperl-db, so the database will assign it. >> >>> [...] >> >>>> For the SymAtlas project we had this situation (new species in >>>> sequence updates that the last NCBI taxonomy update hadn't yet >>>> brought in) quite regularly. I wrote a SQL script would fix those >>>> 'haphazard' additions such that load_ncbi_taxonomy would update >>>> them >>>> to their correct values come the next NCBI taxonomy update. I can >>>> send you the script (it would be for the Oracle version), but I'm >>>> not >>>> sure this is a widely viable strategy. >>> >>> So this wasn't integrated with load_ncbi_taxonomy.pl at all? >> >> No, but now that you say it I don't see any reason why I couldn't. >> Maybe that's just what I should do. 
>> >> -hilmar >> >> -- >> =========================================================== >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : >> =========================================================== >> >> >> >> _______________________________________________ >> >> >> >> BioSQL-l mailing list >> BioSQL-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biosql-l >> > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From markjschreiber at gmail.com Fri Mar 14 20:56:37 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Sat, 15 Mar 2008 08:56:37 +0800 Subject: [BioSQL-l] [Bioperl-l] Loading sequences with novel NCBI taxon id In-Reply-To: References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> Message-ID: <93b45ca50803141756m3d7f022cnb57bd39f37270682@mail.gmail.com> I agree. A regular update would be best. Of course if your BioSQL db is limited to one or a few organisms you can just keep a fragment of the db. - Mark On Fri, Mar 14, 2008 at 10:31 PM, Chris Fields wrote: > The counter to that perspective (using new sequences with old tax > info) would be to regularly update NCBI taxonomy, particularly in > circumstances prior to adding new sequences. Hilmar mentioned that > once tax is loaded it doesn't take as long to update, so you could set > up a cron job to update regularly. > > I remember someone mentioning weekly or monthly updates on the list > quite a while ago, but I'm unsure how often NCBI updates tax > information (i.e. with every release, monthly, weekly, etc). 
I can > see instances popping up where you used the an up-to-date taxonomy but > a new sequence contains a tax ID not present. I think bioperl-db > handles these but I'm not sure what other Bio* do. > > chris > > On Mar 14, 2008, at 8:48 AM, Mark Schreiber wrote: > > >> From memory BioJava will add it if it is not already in there. If the > > taxid can be found then the system connects you with whatever is in > > that taxid, it doesn't overwrite it. > > > > This has two curious side effects. Because the details associated with > > a taxid sometimes change (eg common name changes a lot) you can get > > connected to an outdated version (if your record is newer than your > > NCBI taxonomy) or you can get connected with a version that is newer > > than your record which means when you round-trip you don't get > > complete identity. > > > > For compatibility across the projects some kind of consensus would > > be good. > > > > - Mark > > On Fri, Mar 14, 2008 at 7:41 AM, Hilmar Lapp wrote: > >> > >> > >> On Mar 13, 2008, at 7:13 PM, Peter wrote: > >> > >>> On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp wrote: > >>>> [...] > >> > >>>> The load_ncbi_taxonomy.pl script is designed to update the taxon > >>>> tables in a non-disruptive way, and if there weren't many changes > >>>> shouldn't actually take that long (except that recalculating the > >>>> nested set values may take a couple of minutes). > >>> > >>> Do you think when faced with a novel taxon id, Biopython/BioPerl/... > >>> could write some minimal taxonomy entry (without any guess work > >>> based > >>> on the species name), in order to record the sequence's taxon > >> > >> This is what Bioperl-db does. There isn't any guesswork. If > >> Bio::Species has lineage information it will also insert the lineage > >> information, though. > >> > >> > >>> - and then running an improved load_ncbi_taxonomy.pl at a later > >>> date would > >>> sort out the proper taxonomy? 
> >> > >> If I remember correctly, the script makes (and hence expects) the > >> primary key and the NCBI taxonomy ID to be identical. If your loading > >> procedure can achieve that already then load_ncbi_taxonomy.pl should > >> pick them up and fix them. You can try that by loading the taxonomy > >> through the script, then arbitrarily choose a taxon, create a stub > >> bioentry for it and set its taxon_id foreign key to the chosen > >> taxon, change its taxon_name.name to some bogus value (for the > >> 'scientific name' class, for example) (and feel free to change the > >> left_id and right_id values in taxon too), and rerun the script. It > >> should fix the change you made, and your bioentry should still point > >> to the same taxon (because its primary key did not change, and did > >> not get deleted either; otherwise the bioentry would now have a null > >> value in the foreign key). > >> > >> The Bioperl-db way of storing things does not give control over > >> primary key assignment to Bioperl-db, so the database will assign it. > >> > >>> [...] > >> > >>>> For the SymAtlas project we had this situation (new species in > >>>> sequence updates that the last NCBI taxonomy update hadn't yet > >>>> brought in) quite regularly. I wrote a SQL script would fix those > >>>> 'haphazard' additions such that load_ncbi_taxonomy would update > >>>> them > >>>> to their correct values come the next NCBI taxonomy update. I can > >>>> send you the script (it would be for the Oracle version), but I'm > >>>> not > >>>> sure this is a widely viable strategy. > >>> > >>> So this wasn't integrated with load_ncbi_taxonomy.pl at all? > >> > >> No, but now that you say it I don't see any reason why I couldn't. > >> Maybe that's just what I should do. 
> >> > >> -hilmar > >> > >> -- > >> =========================================================== > >> : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > >> =========================================================== > >> > >> > >> > >> _______________________________________________ > >> > >> > >> > >> BioSQL-l mailing list > >> BioSQL-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biosql-l > >> > > _______________________________________________ > > Bioperl-l mailing list > > Bioperl-l at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioperl-l > > Christopher Fields > Postdoctoral Researcher > Lab of Dr. Robert Switzer > Dept of Biochemistry > University of Illinois Urbana-Champaign > > > > From biopython at maubp.freeserve.co.uk Sun Mar 16 15:16:04 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 16 Mar 2008 19:16:04 +0000 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> Message-ID: <320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com> On Fri, Mar 14, 2008 Mark Schreiber wrote: > From memory BioJava will add it if it is not already in there. If the > taxid can be found then the system connects you with whatever is in > that taxid, it doesn't overwrite it. BioPerl does this too, so there is consensus on this at least. But see below regarding the lineage. > This has two curious side effects.
Because the details associated with > a taxid sometimes change (eg common name changes a lot) you can get > connected to an outdated version (if your record is newer than your > NCBI taxonomy) or you can get connected with a version that is newer > than your record which means when you round-trip you don't get > complete identity. This is understandable, even if a little unexpected. I (Peter) wrote: > > > Do you think when faced with a novel taxon id, Biopython/BioPerl/... > > > could write some minimal taxonomy entry (without any guess work based > > > on the species name), in order to record the sequence's taxon Hilmar Lapp replied: > > This is what Bioperl-db does. There isn't any guesswork. If > > Bio::Species has lineage information it will also insert the lineage > > information, though. I am planning to fix Biopython so that, once again, it will record the taxon id against new sequences if the species is already in the table, and add it to the taxonomy if it isn't there already. Should we also try and add the lineage into the taxon/taxon_name tables, linking to existing entries based on matching scientific names where possible? Or, should we just add a single taxonomy entry for the new species, with no lineage links at all? The old Biopython code also used to add taxon table entries for the full lineage - trying to reuse existing entries based on string matching to the scientific name field in the taxon_name table. This strikes me as a little unreliable (which is why I used the term "guess work" in my earlier email). I am also concerned that this complicates the clean up operation for load_ncbi_taxonomy.pl, but have not looked into this. Hilmar Lapp wrote: > > If I remember correctly, the script makes (and hence expects) the > > primary key and the NCBI taxonomy ID to be identical. Really? Perhaps I have misunderstood you. That would cause problems if we want to record a new sequence entry with species information but no NCBI taxonomy ID (e.g.
an in-house sequencing project). The Biopython code doesn't seem to assume the taxon table ID bears any resemblance to the NCBI taxonomy ID. When creating new taxon table entries, we let the database assign the taxon table id (primary key). Peter From hlapp at gmx.net Sun Mar 16 18:00:12 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 16 Mar 2008 18:00:12 -0400 Subject: [BioSQL-l] BioSQL + Embl + Comments In-Reply-To: <1205182450.18769.20.camel@Graco> References: <1205182450.18769.20.camel@Graco> Message-ID: Hi Raoul, On Mar 10, 2008, at 4:54 PM, Raoul Jean Pierre Bonnal wrote: > Dear Hilmar, > I'm here for asking you some help. > > BioRuby guys chosen as example for round trip tests the sequence ID > AJ224122; SV 3; linear; genomic DNA; STD; PLN; 3827 BP. > > I have problem with the references/comments informations. > In biosql "comment" seems to be something generic not directly > binded to > a reference. Comment in BioSQL is a piece of annotation of type comment. The schema at present only allows you to attach those to bioentries, and in fact one particular comment can be assigned to only one bioentry (1:n relationship). > If you look at the AJ224122's embl format a comment is > connected with the reference. You're referring to the following line, right? RC revised by [3] > There is no problem with genbank because there is only a generic > comment > and BioSQL works correctly in this case. > So, how can I manage the problem with Embl ? I was thinking to add a > column the "comment_id" to "bioentry_reference" as fk to "comment" > table > in a way that a bioentry_reference can have more comments. One question here is whether the comment is specific to the association of the reference with the bioentry, or to the reference in general. The next thing to note is that the comment above is not just text; it actually establishes a relationship to another reference (or to another reference-to-bioentry association).
So to really capture it you would want a typed link between bioentry_reference rows (in this case the relationship type would be 'revises' or 'revised by', depending on direction). The question is whether this depth of modeling is needed or useful, aside from the fact that I'm pretty sure that none of the Bio* libraries supports it (but maybe they want to?). So if not, I guess this goes back to the use-case of round-tripping? Maybe to satisfy that a bioentry_reference_qualifier table would suffice (assuming that the comment does apply rather to the reference/ bioentry association than directly to the reference). > > PS: I don't know if this stuff should be emailed to biosql list Yes, I actually hadn't realized that you hadn't posted this to the list. Should have forwarded right away, sorry for sitting on it. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun Mar 16 18:54:45 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 16 Mar 2008 18:54:45 -0400 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: <320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> <320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com> Message-ID: <9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net> On Mar 16, 2008, at 3:16 PM, Peter wrote: > [...] I (Peter) wrote: >>>> Do you think when faced with a novel taxon id, Biopython/ >>>> BioPerl/... >>>> could write some minimal taxonomy entry (without any guess work >>>> based >>>> on the species name), in order to record the sequence's taxon > > Hilmar Lapp replied: >>> This is what Bioperl-db does. 
There isn't any guesswork. If >>> Bio::Species has lineage information it will also insert the lineage >>> information, though. > > I am planing to fix Biopython so that once again, it will record the > taxon id against new sequences if the species is already in the table, > and add it to the taxonomy if it isn't there already. > > Should we also try and add the lineage into the taxon/taxon_name > tables, linking to existing entries based on matching scientific names > where possible? Or, should we just add a single taxonomy entry for > the new species, with no lineage links at all? This should probably depend on how good or complete the lineage information is that you have. BioPerl parses this out of the sequence files (for formats that have it, such as GenBank, EMBL, UniProt), and so except for exotic clades that don't follow the typical patterns it is usually in good shape (though one might say that the majority of clades are exotic). Moreover, it's worth noting that the NCBI taxonomy often contains more nodes in a lineage than are shown in the GenBank record. In this case, unless you know which levels (ranks) to print and which not to, having the full NCBI taxonomy information may in fact cause problems for round-tripping. > > The old Biopython code also used to add taxon table entries for the > full lineage - trying to reuse existing entries based on string > matching to the scientific name field in the taxon_name table. This > strikes me as a little unreliable (which is why I used the term "guess > work" in my earlier email). It's pretty unreliable actually. There is not only synonymy but also rampant homonymy in taxonomic names. There are plenty of examples for the same scientific name in use for a plant and for some animal, for example. So in order to be unambiguous you will need to know (and check) the kingdom. > I am also concerned that this complicates the clean up operation > for load_ncbi_taxonomy.pl, but have not looked into this. It shouldn't. 
The script makes no difference between tip (species or subspecies) nodes or internal nodes. > > Hilmar Lapp wrote: >>> If I remember correctly, the script makes (and hence expects) the >>> primary key and the NCBI taxonomy ID to be identical. > > Really? Perhaps I have misunderstood you. That would cause problems > if we want to record a new sequence entry with species information but > no NCBI taxonomy ID (e.g. an in house sequencing project). The > Biopython code doesn't seem to assume the taxon table ID bears any > resemblance to the the NCBI taxonomy ID. When creating new taxon > table entries, we let the database will assign the taxon table id > (primary key). Right, that's what I said Bioperl-db does too, and is the reason I had to regularly run that SQL script that would migrate the primary keys. Doing that isn't a big deal but I guess this could also be fixed in load_ncbi_taxonomy.pl so that it doesn't need to rely on this assumption. Would someone mind filing the bug report? (We have a BioSQL category now on bugzilla.open-bio.org.) 
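The homonymy point made earlier in this message can be illustrated with a small lookup sketch. It is only a sketch under stated assumptions, not what any Bio* toolkit does: the two tables are reduced stand-ins for the BioSQL taxon and taxon_name tables, resolve_taxon() and the other function names are hypothetical, and the taxon IDs 1 to 4 are invented placeholders rather than NCBI IDs. Morus is used because it names both a plant genus (mulberries) and a bird genus (gannets).

```python
import sqlite3

# Reduced stand-ins for the BioSQL taxon and taxon_name tables (assumption:
# the real tables carry more columns, e.g. ncbi_taxon_id and node_rank).
SCHEMA = """
CREATE TABLE taxon (
    taxon_id        INTEGER PRIMARY KEY,
    parent_taxon_id INTEGER REFERENCES taxon(taxon_id)
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL REFERENCES taxon(taxon_id),
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL
);
"""


def scientific_name(conn, taxon_id):
    """Return the scientific name of a taxon, or None if it has none."""
    row = conn.execute(
        "SELECT name FROM taxon_name "
        "WHERE taxon_id = ? AND name_class = 'scientific name'",
        (taxon_id,),
    ).fetchone()
    return row[0] if row else None


def lineage_names(conn, taxon_id):
    """Scientific names from the given taxon up to the root, child first."""
    names, seen = [], set()
    while taxon_id is not None and taxon_id not in seen:
        seen.add(taxon_id)  # guards against a root that is its own parent
        names.append(scientific_name(conn, taxon_id))
        row = conn.execute(
            "SELECT parent_taxon_id FROM taxon WHERE taxon_id = ?",
            (taxon_id,),
        ).fetchone()
        taxon_id = row[0] if row else None
    return names


def resolve_taxon(conn, name, kingdom):
    """Pick the one taxon called `name` whose lineage contains `kingdom`."""
    candidates = [
        row[0]
        for row in conn.execute(
            "SELECT taxon_id FROM taxon_name "
            "WHERE name = ? AND name_class = 'scientific name'",
            (name,),
        )
    ]
    matches = [t for t in candidates if kingdom in lineage_names(conn, t)]
    if len(matches) != 1:
        raise LookupError(
            f"{len(matches)} taxa named {name!r} under {kingdom!r}"
        )
    return matches[0]


def build_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Placeholder IDs; only the names reflect real taxonomy.
    conn.executemany(
        "INSERT INTO taxon (taxon_id, parent_taxon_id) VALUES (?, ?)",
        [(1, None), (2, None), (3, 1), (4, 2)],
    )
    conn.executemany(
        "INSERT INTO taxon_name (taxon_id, name, name_class) "
        "VALUES (?, ?, 'scientific name')",
        [(1, "Viridiplantae"), (2, "Metazoa"), (3, "Morus"), (4, "Morus")],
    )
    return conn


if __name__ == "__main__":
    db = build_demo_db()
    print(resolve_taxon(db, "Morus", "Viridiplantae"))
    print(resolve_taxon(db, "Morus", "Metazoa"))
```

A lookup by name alone would return both Morus rows here; requiring the expected kingdom in the lineage narrows it to one, and the function refuses to guess when zero or several candidates remain.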
Cheers, -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Mon Mar 17 12:08:43 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Mar 2008 16:08:43 +0000 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: <9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> <320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com> <9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net> Message-ID: <320fb6e00803170908x76f0b9a3he57f4653d2fd433@mail.gmail.com> On Sun, Mar 16, 2008 at 10:54 PM, Hilmar Lapp wrote: > > Should we [Biopython] also try and add the lineage into the taxon/ > > taxon_name tables, linking to existing entries based on matching scientific > > names where possible? Or, should we just add a single taxonomy entry > > for the new species, with no lineage links at all? > > This should probably depend on how good or complete the lineage > information is that you have. BioPerl parses this out of the sequence > files (for formats that have it, such as GenBank, EMBL, UniProt), and > so except for exotic clades that don't follow the typical patterns it > is usually in good shape (though one might say that the majority of > clades are exotic). I'm currently testing with GenBank, EMBL and SwissProt/UniProt files. Some of these files are several years old, and include horrible multi-species SwissProt files with "species" names longer than 255 characters etc. The good news is that as you pointed out on another thread on the BioSQL mailing list earlier this month, they don't seem to do this anymore.
> Moreover, it's worth noting that the NCBI taxonomy often contains > more nodes in a lineage than are shown in the GenBank record. In this > case, unless you know which levels (ranks) to print and which not to, > having the full NCBI taxonomy information may in fact cause problems > for round-tripping. I've come to accept that taxonomy information won't always survive a round trip. > > The old Biopython code also used to add taxon table entries for the > > full lineage - trying to reuse existing entries based on string > > matching to the scientific name field in the taxon_name table. This > > strikes me as a little unreliable (which is why I used the term "guess > > work" in my earlier email). > > It's pretty unreliable actually. There is not only synonymy but also > rampant homonymy in taxonomic names. There are plenty of examples for > the same scientific name in use for a plant and for some animal, for > example. So in order to be unambiguous you will need to know (and > check) the kingdom. I don't think the current Biopython code for recording the lineages checks the kingdom... could someone point me at the relevant bit of BioPerl and I'll see if I can understand exactly what they do? Hilmar Lapp wrote: > If I remember correctly, the script makes (and hence expects) the > primary key and the NCBI taxonomy ID to be identical. > ... > Doing that isn't a big deal but I guess this could also be fixed in > load_ncbi_taxonomy.pl so that it doesn't need to rely on this > assumption. Would someone mind filing the bug report? (We have a > BioSQL category now on bugzilla.open-bio.org.) 
I've filed Bug 2470 on this, http://bugzilla.open-bio.org/show_bug.cgi?id=2470 Regards, Peter From hlapp at gmx.net Tue Mar 18 08:30:34 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 18 Mar 2008 08:30:34 -0400 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: <320fb6e00803170908x76f0b9a3he57f4653d2fd433@mail.gmail.com> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> <320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com> <9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net> <320fb6e00803170908x76f0b9a3he57f4653d2fd433@mail.gmail.com> Message-ID: <418EB160-7848-4F1A-A88B-99B00003F8A2@gmx.net> On Mar 17, 2008, at 12:08 PM, Peter wrote: >> [...] >> It's pretty unreliable actually. There is not only synonymy but also >> rampant homonymy in taxonomic names. There are plenty of examples >> for >> the same scientific name in use for a plant and for some animal, for >> example. So in order to be unambiguous you will need to know (and >> check) the kingdom. > > I don't think the current Biopython code for recording the lineages > checks the > kingdom... could someone point me at the relevant bit of BioPerl > and I'll see > if I can understand exactly what they do? Bioperl-db locates by NCBI taxon id first and then by scientific name. It does not take kingdom into account. You can find the persisted columns, unique key queries etc in Bio/DB/ BioSQL and then the respective adapter, in this case SpeciesAdapter.pm. The unique key queries are defined in get_unique_key_query(). > > Hilmar Lapp wrote: >> If I remember correctly, the script makes (and hence expects) the >> primary key and the NCBI taxonomy ID to be identical. >> ... 
>> Doing that isn't a big deal but I guess this could also be fixed in >> load_ncbi_taxonomy.pl so that it doesn't need to rely on this >> assumption. Would someone mind filing the bug report? (We have a >> BioSQL category now on bugzilla.open-bio.org.) > > I've filed Bug 2470 on this, http://bugzilla.open-bio.org/ > show_bug.cgi?id=2470 Thanks for the help, great, appreciated! -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at nescent.org Sun Mar 9 19:36:26 2008 From: hlapp at nescent.org (Hilmar Lapp) Date: Sun, 9 Mar 2008 19:36:26 -0400 Subject: [BioSQL-l] bioperl-db bugs In-Reply-To: References: Message-ID: Hi Chris, I added comments to both bug reports. This belongs to BioPerl, though, as it has only to do with its language binding. The tidbit may be worth keeping in mind for a general BioSQL audience is that bioentry namespace (foreign key to biodatabase) is part of the (compound) bioentry unique keys. The identifier column used to be unique by itself (and could still be made such in a local instance, there's a comment to this effect in the DDL), but that was changed a while ago. (Also, if one uses any of the Bio* language bindings, changing a unique key constraint to something that differs from what the language binding assumes may be asking for a lot of trouble. Bioperl-db will expect the combination of primary_id() and namespace () to match if the latter is provided.) -hilmar On Mar 5, 2008, at 6:24 PM, Chris Fields wrote: > Hilmar, > > I think I have two bioperl-db bugs sorted out, but I'm trying to > determine whether the solution is a side-effect, a feature, or a > bug. 
Dmitry has filed two bug reports which are somewhat related: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2280 > http://bugzilla.open-bio.org/show_bug.cgi?id=2281 > > I have added my comments to it, but maybe you can shed some more > light on this. What he is trying to do is copy a persistent Seq > object to a different namespace; load_seqdatabase.pl won't let him > do that directly using the same sequence file. If he changes the > namespace() and store()s it using a script, the seq is moved to the > new namespace, not updated. > > My reasoning is this is a feature (by not changing the primary_key, > you don't store a new sequence but update the current one). > However, if the primary_key is unset (undef), then it appears you > can copy the sequence over (from Dmitry's script, with my addition > noted): > > ... > my $ns1 = 'space1'; > my $ns2 = 'space2'; > > my $seqadp = $db->get_object_adaptor('Bio::SeqI'); > my $aux_seq = Bio::Seq::RichSeq->new( > -accession_number => 'NC_005982', > -version => 1, > -namespace => $ns1); > my $seq = $seqadp->find_by_unique_key($aux_seq); > > # store the found sequence in the second biodatabase: > my $pseq = $seqadp->create_persistent($ns2); > $pseq->namespace('bioperl2'); > $pseq->primary_key(undef); # my addition, which appears to work > $pseq->store(); > $seqadp->commit; > ... > > My question: is this an intended effect? The ability to assign > undef to primary_key seems intentional based on the method code, > but I'm a bit uncertain here. 
> > chris > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org : =========================================================== From darin.london at duke.edu Tue Mar 18 14:16:59 2008 From: darin.london at duke.edu (darin.london at duke.edu) Date: Tue, 18 Mar 2008 13:16:59 -0500 Subject: [BioSQL-l] BOSC 2008 Announcement and Call For Submissions Message-ID: <200803181816.m2IIGx2k007275@tenero.duhs.duke.edu> BOSC 2008 Call for Abstracts The 9th annual Bioinformatics Open Source Conference (BOSC 2008) will take place in Toronto, Ontario, Canada, as one of several Special Interest Group (SIG) meetings occurring in conjunction with the 16th annual Intelligent Systems for Molecular Biology Conference (ISMB 2008). The Bioinformatics Open Source Conference (BOSC) is sponsored by the Open Bioinformatics Foundation (O|B|F), a non-profit group dedicated to promoting the practice and philosophy of Open Source software development within the biological research community. Many Open Source bioinformatics packages are widely used by the research community across many application areas and form a cornerstone in enabling research in the genomic and post-genomic era. Open source bioinformatics software has facilitated rapid innovation and dissemination of new computational methods as well as informatics infrastructure. Since the work of the Open Source Bioinformatics Community represents some of the most cutting edge of Bioinformatics in general, the overall theme for the conference this year is "Tackling Hard Problems with Emerging Technologies". Topics under this umbrella include cyberinfrastructure, grid computing and workflow management and discovery, and visualization. 
We will also have a series of update talks about the main Open Source Bioinformatics Software suites. One of the hallmarks of BOSC is the coming together of the open source developer community in one location. A face-to-face meeting of this community creates synergy where participants can work together to create use cases, prototype working code, or run bootcamps for developers from other projects as short, informal, and hands-on tutorials in new software packages and emerging technologies. In short, BOSC is not just a conference for presentations of completed work, but is a dynamic meeting where collaborative work gets done. This year, BOSC is accepting abstract submissions on the conference theme "Tackling Hard Problems with Emerging Technologies". The conference theme reflects that there are new technologies emerging on both the scientific front (new sequencing technologies, etc.) and the IT front (workflows, mashup/web 2.0, improvements in all of the major programming languages, etc.), which may allow the open source community to solve problems that were previously intractable. Abstracts may be submitted for the following topics. 1. Cyberinfrastructure - We are interested in presentations on topics dealing with the development of infrastructure on the web to facilitate software and data re-use (mashups, or traditional), interoperability and inter-process communication, system/service discovery, and data movement and modeling in distributed systems. This may include peer-to-peer systems of data transfer, Web Services, various flavors of data representation (SOAP, JSON, XML, others), and technologies commonly referred to under the Web 2.0 paradigm (e.g. folksonomies/tagging, user-based content generation, content feeds, and Social Networking). 2. Grid Computing and Workflow Management and Discovery - We particularly invite talks that report progress in making workflow systems easier to use and on how to do distributed-collaborative research , e.g. 
workflows that encompass the coordination of systems running in different parts of the world. 3. Visualization - Visualization is a maturing area of open source software development. We particularly invite talks that demonstrate innovative visualization systems in the context of workflows. 4. Open Source Software - Speakers will present talks on the use, development, or philosophy of open source software in bioinformatics. 5. Bio* Open Source Project Updates - We invite abstracts from the representatives of the open source projects sponsored by or affiliated to the O|B|F (see Projects). Please consult the official BOSC 2008 website at http://www.open-bio.org/wiki/Upcoming_BOSC_conference for all updates and extra information. Submission Process: All abstracts must be submitted through our Open Conference Systems site (http://events.open-bio.org/BOSC2008/openconf.php). The form will ask for a small Abstract Text to be pasted into it, and a full paper. The small Abstract text should be a summary, while the longer abstract should provide more details, including the open-source license requirement details. Full-length abstracts are limited to one page with one inch (2.5 cm) margins on the top, sides, and bottom. The full-length abstract should include the title, authors, and affiliations. We prefer your abstract to be in PDF format, although plain t Important Dates: May 11: Abstract submission deadline. June 2: Notification of accepted talks. June 4: Early registration discount cut-off. July 18-19: BOSC 2008! We hope to see you at BOSC 2008! Kam Dahlquist and Darin London BOSC 2008 Co-organizers From er at xs4all.nl Thu Mar 20 15:24:12 2008 From: er at xs4all.nl (Erik) Date: Thu, 20 Mar 2008 20:24:12 +0100 (CET) Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer Message-ID: <5095.156.83.1.251.1206041052.squirrel@webmail.xs4all.nl> Hi, (latest BioSQL, bioperl-db, and bioperl-live installed.)
Postgres 8.3 will not auto-cast text (='character varying') to integer any longer, which causes test t/16odba.t to fail: ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: error while executing query in Bio::DB::BioSQL::SeqAdaptor::find_by_query: ERROR: operator does not exist: character varying = integer LINE 1: ...eq.taxon_id FROM bioentry seq WHERE seq.identifier = 5456929 It seems likely to cause many similar statements to fail; how should this be solved? I tried to fix it but I couldn't find the place where the statement/clauses are put together. Thanks, Erik Rijkers From hlapp at gmx.net Thu Mar 20 18:49:41 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 20 Mar 2008 18:49:41 -0400 Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer In-Reply-To: <5095.156.83.1.251.1206041052.squirrel@webmail.xs4all.nl> References: <5095.156.83.1.251.1206041052.squirrel@webmail.xs4all.nl> Message-ID: <0F80B40B-0232-4367-8433-992588B6E71B@gmx.net> Hi Erik, thanks for the report. Given the error message, it looks more like the integer (which in reality is a string) can't be automatically converted to a string. That would be equally interesting, though. DBI I thought used to bind all parameters as string by default, but maybe that has changed? The parameter values are indeed all bound generically (and the query is created dynamically too), and I'm leaving it up to the DBD drivers to do the "Right Thing". I could obviously force everything into type string, but that is likely to have its own repercussions on various RDBMSs.
So could you file this as a bug report on bugzilla.open-bio.org (category bioperl-db, this is actually not a BioSQL problem), and run the following test on your 8.3 instance (which minor version actually?): CREATE TABLE t1 (a varchar(10), b text, c integer); SELECT * from t1 WHERE a = 1; SELECT * from t1 WHERE b = 1; SELECT * from t1 WHERE c = '1'; INSERT INTO t1 (a,b,c) VALUES ('a','b',1); SELECT * from t1 WHERE a = 1; SELECT * from t1 WHERE b = 1; SELECT * from t1 WHERE c = '1'; SELECT * from t1 WHERE a = 1::text; SELECT * from t1 WHERE b = 1::text; SELECT * from t1 WHERE c = integer '1'; DROP TABLE t1; These work all fine on my 8.1.4 instance. -hilmar On Mar 20, 2008, at 3:24 PM, Erik wrote: > Hi, > > (latest BioSQL, bioperl-db, and bioperl-live installed.) > > Postgres 8.3 will not auto-cast text (='character > varying') to integer any longer, which causes test > t/16odba.t to fail: > > > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: error while executing query in > Bio::DB::BioSQL::SeqAdaptor::find_by_query: ERROR: > operator does not exist: character varying = integer > LINE 1: ...eq.taxon_id FROM bioentry seq WHERE > seq.identifier = 5456929 > > It seems likely to cause many similar statements to fail; > how should this be solved? > > I tried to fix it but I couldn't find the place where the > statement/clauses are put together. 
> > > Thanks, > > Erik Rijkers > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From er at xs4all.nl Thu Mar 20 19:30:03 2008 From: er at xs4all.nl (Erik) Date: Fri, 21 Mar 2008 00:30:03 +0100 (CET) Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer Message-ID: <15786.156.83.1.157.1206055803.squirrel@webmail.xs4all.nl> On Thu, March 20, 2008 23:49, Hilmar Lapp wrote: > Hi Erik, thanks for the report. Given the error message, > it looks > more like the integer (which in reality is a string) can't > be automatically converted to a string. you are right, of course :) Here is the postgres 8.3.1 result of your sql statements: CREATE TABLE t1 (a varchar(10), b text, c integer); SELECT * from t1 WHERE a = 1; -- fails in 8.3.1 SELECT * from t1 WHERE b = 1; -- fails in 8.3.1 SELECT * from t1 WHERE c = '1'; -- ok INSERT INTO t1 (a,b,c) VALUES ('a','b',1); SELECT * from t1 WHERE a = 1; -- fails in 8.3.1 SELECT * from t1 WHERE b = 1; -- fails in 8.3.1 SELECT * from t1 WHERE c = '1'; -- ok SELECT * from t1 WHERE a = 1::text; -- ok SELECT * from t1 WHERE b = 1::text; -- ok SELECT * from t1 WHERE c = integer '1'; -- ok The failure is always (virtually) the same: ERROR: operator does not exist: character varying = integer LINE 1: SELECT * from t1 WHERE a = 1; ^ HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts. 
Then there is the cast function: for instance, I can let the test in t/16odba.t proceed faultlessly with $seq = $biodb->get_Seq_by_id( "cast(5456929 as text)" ); I am also doubtful/curious as to how this would affect the various loading scripts which I was going to use - I want to set up a GBrowse with human/mouse/flybase sequence annotation to show ChipSeq data against. But one thing at a time, I guess... > So could you file this as a bug report on > bugzilla.open-bio.org > (category bioperl-db, this is actually not a BioSQL > problem), I'll make an entry in bugzilla/bioperl-db. Thanks for your quick reply! Erik Rijkers From hlapp at gmx.net Thu Mar 20 20:34:42 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 20 Mar 2008 20:34:42 -0400 Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer In-Reply-To: <15786.156.83.1.157.1206055803.squirrel@webmail.xs4all.nl> References: <15786.156.83.1.157.1206055803.squirrel@webmail.xs4all.nl> Message-ID: <987C9C0E-840B-44AD-B3E9-0FC2809FF4F4@gmx.net> On Mar 20, 2008, at 7:30 PM, Erik wrote: > Here is the postgres 8.3.1 result of your sql statements: > > CREATE TABLE t1 (a varchar(10), b text, c integer); > > SELECT * from t1 WHERE a = 1; -- fails in 8.3.1 > SELECT * from t1 WHERE b = 1; -- fails in 8.3.1 > SELECT * from t1 WHERE c = '1'; -- ok > > [...] > The failure is always (virtually) the same: > ERROR: operator does not exist: character varying = integer > LINE 1: SELECT * from t1 WHERE a = 1; > ^ > HINT: No operator matches the given name and argument > type(s). You might need to add explicit type casts. So it's indeed the backend that changed behavior. It's actually documented as I see now: http://www.postgresql.org/docs/8.3/static/release-8-3.html scroll to section E.2.2. Migration to Version 8.3, E.2.2.1.
General, and the first item there: Non-character data types are no longer automatically cast to TEXT (Peter, Tom) Previously, if a non-character value was supplied to an operator or function that requires text input, it was automatically cast to text, for most (though not all) built-in data types. This no longer happens: an explicit cast to text is now required for all non-character-string types. I can see the arguments there but this will prevent upgrading to 8.3 for many many applications, and the comments from the Pg developers ('fix your SQL to use casts') that I've seen there on the mailing lists are just not helpful. Fixing SQL for many legacy applications is just not an option. In the case of Bioperl-db it's very non-trivial, because all of a sudden we would be changing from a hands-off and let-the-driver-figure-it-out approach to forcing types everywhere. So I think at this point with this change I have to declare Bioperl-db officially incompatible with PostgreSQL 8.3+ until we've found a solution to this, which is too bad because it seems 8.3 has some really nice performance features added. One possible solution might be to create a CAST in the database (namely the one that was taken away, restoring behavior to pre-8.3). Another possibility is to move the parameter binding method into the driver adaptor which would then delegate to the DBI method but would be overridden for the PostgreSQL adapter to force all bindings to type string. Which leads me back to the surprise observation that the parameter was bound as an integer in the first place, when DBD::Pg used to bind everything as string unless you told it otherwise. Which DBD::Pg version is it that you are using? I would suspect (or hope) that maybe there is soon an update release of DBD::Pg that fixes this problem by going back to binding everything as string by default (and as the tests show PostgreSQL will still convert strings to integer if necessary).
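The second option above (binding moved into the driver adaptor, with a PostgreSQL-specific override) might be shaped roughly as follows. This is only a Python sketch of the idea, not Bioperl-db code: the class names, the bind_values method, and the adaptor_for lookup are all hypothetical, and the real adaptors are Perl modules built on DBI.

```python
class DriverAdaptor:
    """Hypothetical default adaptor: pass values to the driver unchanged."""

    def bind_values(self, values):
        # Hands-off: let the database driver decide how to bind each value.
        return list(values)


class PgDriverAdaptor(DriverAdaptor):
    """Hypothetical PostgreSQL adaptor: bind every value as a string."""

    def bind_values(self, values):
        # NULLs stay NULL; everything else is stringified so that the
        # server sees an untyped string rather than a typed integer.
        return [None if v is None else str(v) for v in values]


# Hypothetical registry keyed by driver name (e.g. the -driver option).
ADAPTORS = {"Pg": PgDriverAdaptor}


def adaptor_for(driver_name):
    """Return the adaptor for a driver, falling back to the default."""
    return ADAPTORS.get(driver_name, DriverAdaptor)()


pg_params = adaptor_for("Pg").bind_values([5456929, "NC_005982", None])
other_params = adaptor_for("mysql").bind_values([5456929, "NC_005982", None])
```

The point of the design is that only the PostgreSQL adaptor changes behavior; other drivers keep the hands-off default, so any repercussions of forcing strings stay confined to one backend.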
Depending on what I (or can someone else update us on this?) find out for the DBD::Pg plans, I'll probably start looking into moving the parameter binding into the driver adapters. Though it does feel pathetic that this is now also not transparent between drivers. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From er at xs4all.nl Thu Mar 20 20:51:43 2008 From: er at xs4all.nl (Erik) Date: Fri, 21 Mar 2008 01:51:43 +0100 (CET) Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer Message-ID: <4483.156.83.1.157.1206060703.squirrel@webmail.xs4all.nl> On Fri, March 21, 2008 01:34, Hilmar Lapp wrote: > > So I think at this point with this change I have to > declare Bioperl- > db officially incompatible with PostgreSQL 8.3+ until > we've found a > solution to this, which is too bad because it seems 8.3 > has some > really nice performance features added. Pg 8.3 is indeed very noticeably faster, and it has other excellent new features like full text indexing. (This also means that downgrading is not really an option.) > Which DBD::Pg version is it that you are using?
DBD::Pg 2.3.0 Thanks, Erik Rijkers From hlapp at gmx.net Thu Mar 20 21:36:50 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 20 Mar 2008 21:36:50 -0400 Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer In-Reply-To: <4483.156.83.1.157.1206060703.squirrel@webmail.xs4all.nl> References: <4483.156.83.1.157.1206060703.squirrel@webmail.xs4all.nl> Message-ID: <071CB899-AB3E-40B8-9477-82AE98DB88B1@gmx.net> On Mar 20, 2008, at 8:51 PM, Erik wrote: > On Fri, March 21, 2008 01:34, Hilmar Lapp wrote: >> >> So I think at this point with this change I have to declare >> Bioperl-db officially incompatible with PostgreSQL 8.3+ until >> we've found a solution to this, which is too bad because it seems >> 8.3 has some really nice performance features added. > > Pg 8.3 is indeed very noticably faster, and it has other > excellent new features like full text indexing. (This also > makes that downgrading is not really an option) Right, I saw that too. It is, however, just migrated from what was a contrib module before, so downgrading and using the contrib module is an option. Furthermore, folding these new features together with a behavior change that is backwards incompatible was a choice the PostgreSQL people made, not we. We also aren't doing poor typing that deserves fixing; we're just not doing any typing by treating everything as a string. This is the Perl paradigm. At this point it's actually unclear to me how this new behavior is compatible with untyped scripting languages unless you know the type of each column that you're binding a value for, because if you actually force typecasts to string for everything you get an error if an integer is indeed what's needed. I'm wondering what I'm missing. 
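For what "knowing the type of each column" would mean in practice, here is a small Python sketch under stated assumptions: the COLUMN_TYPES map and the coerce_for_column helper are invented for illustration, and the two column types shown are taken from the error message earlier in this thread (seq.identifier is character varying) and from taxon_id being an integer key. It is not how Bioperl-db currently works.

```python
# Hypothetical per-column type map; a real application would have to
# derive this from the schema or maintain it alongside the adaptors.
COLUMN_TYPES = {
    "bioentry.identifier": str,  # character varying
    "bioentry.taxon_id": int,    # integer
}


def coerce_for_column(column, value, column_types=COLUMN_TYPES):
    """Convert a loosely typed value to the type its column expects."""
    if value is None:
        return None  # NULL needs no conversion
    if column not in column_types:
        raise KeyError("no type registered for column %r" % column)
    return column_types[column](value)


identifier = coerce_for_column("bioentry.identifier", 5456929)
taxon_id = coerce_for_column("bioentry.taxon_id", "9606")
```

The cost is visible even in this toy version: every column that can appear in a query needs an entry, which is exactly the bookkeeping an untyped binding layer was meant to avoid.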
-hilmar

BTW what does the following query yield on your 8.3.1 database:

select s.typname as source, t.typname as target,
       f.proname as function, c.castcontext
from pg_cast c, pg_type s, pg_type t, pg_proc f
where c.castsource = s.oid
and c.casttarget = t.oid
and c.castfunc = f.oid
and t.typname = 'text';

On my 8.1.4 database I get:

   source    | target | function | castcontext
-------------+--------+----------+-------------
 bpchar      | text   | text     | i
 char        | text   | text     | i
 name        | text   | text     | i
 int8        | text   | text     | i
 int2        | text   | text     | i
 int4        | text   | text     | i
 oid         | text   | text     | i
 float4      | text   | text     | i
 float8      | text   | text     | i
 macaddr     | text   | text     | e
 cidr        | text   | text     | e
 inet        | text   | text     | e
 date        | text   | text     | i
 time        | text   | text     | i
 timestamp   | text   | text     | i
 timestamptz | text   | text     | i
 interval    | text   | text     | i
 timetz      | text   | text     | i
 numeric     | text   | text     | i
(19 rows)

-- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From greg at turnstep.com Thu Mar 20 22:41:10 2008 From: greg at turnstep.com (Greg Sabino Mullane) Date: Fri, 21 Mar 2008 02:41:10 -0000 Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer In-Reply-To: <987C9C0E-840B-44AD-B3E9-0FC2809FF4F4@gmx.net> Message-ID: <19ecb7a297f64722c4f63f10ed2ebdce@biglumber.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 > Which leads me back to the surprise observation that the parameter > was bound as an integer in the first place, when DBD::Pg used to bind > everything as string unless you told it otherwise. Which DBD::Pg > version is it that you are using? I would suspect (or hope) that > maybe there is soon an update release of DBD::Pg that fixes this > problem by going back to binding everything as string by default (and > as the tests show PostgreSQL will still convert strings to integer if > necessary).
> > Depending on what I (or can someone else update us on this?) find out > for the DBD::Pg plans, I'll probably start looking into moving the > parameter binding into the driver adapters. Though it does feel > pathetic that this is now also not transparent between drivers. What you are probably looking for is already there, namely: $dbh->{pg_server_prepare} = 0; There's good reasons for the casting enforcement in 8.3, although I've been a sharp critic of the change, and certainly of the suddeness of it. Another solution to consider is adding the casts back in: http://people.planetpostgresql.org/peter/index.php?/archives/2008/03.html (the March 4th entry) - -- Greg Sabino Mullane greg at turnstep.com PGP Key: 0x14964AC8 200803202237 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8 -----BEGIN PGP SIGNATURE----- iEYEAREDAAYFAkfjIBYACgkQvJuQZxSWSsiamwCdEbNrC4F4oU7AGHrbHAm1YNXG HbUAoIRJtGW4brvMKklxZYG6pusbcTqf =Zawx -----END PGP SIGNATURE----- From hlapp at gmx.net Fri Mar 21 08:52:39 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 21 Mar 2008 08:52:39 -0400 Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer In-Reply-To: <19ecb7a297f64722c4f63f10ed2ebdce@biglumber.com> References: <19ecb7a297f64722c4f63f10ed2ebdce@biglumber.com> Message-ID: Hi Greg - thanks for your email, it's very helpful. On Mar 20, 2008, at 10:41 PM, Greg Sabino Mullane wrote: >> >> Depending on what I (or can someone else update us on this?) find out >> for the DBD::Pg plans, I'll probably start looking into moving the >> parameter binding into the driver adapters. Though it does feel >> pathetic that this is now also not transparent between drivers. > > What you are probably looking for is already there, namely: > > $dbh->{pg_server_prepare} = 0; So disabling server-side prepares will leave values quoted? 
Having server-side prepares would be very useful though, especially for Bioperl-db with its many lookup queries that all use similar parameter values. > > There's good reasons for the casting enforcement in 8.3 I do understand that, but it's also a sharp contrast to other RDBMSs that doesn't make it easier for people to choose Pg when they should, and doesn't help writing cross-platform database applications either. > although I've been a sharp critic of the change, and certainly of > the suddeness > of it. Another solution to consider is adding the casts back in: > > http://people.planetpostgresql.org/peter/index.php?/archives/ > 2008/03.html > (the March 4th entry) Thanks for this, that helps a lot. Do you have links to some of the key threads showing what rationale went into the decision? (Or should I just search for your name?) I'd like to read up on that first before pouring more oil into the fire. I suspect that many of those who made the decision are never faced with needing to write cross-RDBMS code. Also, I wonder why this wasn't made a configurable option so it can be disabled by a simple config file change (such as the move away from automatic OID columns). But obviously this is the wrong list for discussing this (though Bioperl-db *is* one of those pieces of software that must be cross-RDBMS).
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From er at xs4all.nl Fri Mar 21 17:43:47 2008 From: er at xs4all.nl (Erik) Date: Fri, 21 Mar 2008 22:43:47 +0100 (CET) Subject: [BioSQL-l] [Bioperl-l] postgres 8.3 - load_seqdatabase.pl / swissprot Message-ID: <16589.156.83.1.157.1206135827.squirrel@webmail.xs4all.nl> Hi, PostgreSQL 8.3.1 DBD::Pg 2.3.0 perl 5.8.8 (The following error may have to do with the 8.3 problems that I reported yesterday (bug 2472) - I don't know) I ran biosql-schema/scripts/load_ncbi_taxonomy.pl without problem. Then I ran scripts/biosql/load_seqdatabase.pl as: perl scripts/biosql/load_seqdatabase.pl \ -driver Pg \ -dbuser xxxxxxx \ -dbname bioseqdb \ -namespace swissprot \ -format swiss \ /DATA/ms/ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat It took two hours to load 26504 records (7%) of uniprot_sprot.dat (is it expected to be so slow?), then failed with: Could not store Q2UXW0: ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: create: object (Bio::Species) failed to insert or to be found by unique key STACK: Error::throw STACK: Bio::Root::Root::throw /home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/Root/Root.pm:357 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 STACK: Bio::DB::Persistent::PersistentObject::create /home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:244 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:169 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK: Bio::DB::Persistent::PersistentObject::store 
/home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271 STACK: scripts/biosql/load_seqdatabase.pl:630 ----------------------------------------------------------- I don't know if this is directly related to the 8.3 casting problems I reported yesterday (bug 2472), or a separate Bio::Species issue regards, Erik Rijkers From hlapp at gmx.net Sat Mar 22 14:18:45 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 22 Mar 2008 14:18:45 -0400 Subject: [BioSQL-l] Call for Student Applications - NESCent participates in the Google Summer of Code In-Reply-To: <0025B440-EF1E-4632-9DB4-B98489BF3550@duke.edu> Message-ID: <5AC4F213-8D88-41C6-B380-59B2EF7831F0@gmx.net> Hi all - just wanted to draw your attention to our Google Summer of Code participation this year. One of the projects deals directly with BioPerl, another one builds on BioSQL (and could be implemented taking advantage of BioPerl or Bio::Phylo, or Biojava). Cheers, -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== Phyloinformatics Summer of Code 2008 http://phyloinformatics.net/Phyloinformatics_Summer_of_Code_2008 *** Please disseminate this announcement widely to appropriate students at your institution *** The National Evolutionary Synthesis Center (NESCent: http:// www.nescent.org/) is participating in 2008 for the second year as a mentoring organization in the Google Summer of Code (http:// code.google.com/soc). Through this program, Google provides undergraduate, masters, and PhD students with a unique opportunity to obtain hands-on experience writing and extending open-source software under the mentorship of experienced developers from around the world. 
Our goal in participating is to train future researchers and developers to not only have awareness and understanding of the value of open-source and collaboratively developed software, but also to gain the programming and remote collaboration skills needed to successfully contribute to such projects. Students will receive a stipend from Google, and may work from their home, or home institution, for the duration of the 3 month program. Students will each have one or more dedicated mentors with expertise in phylogenetic methods and open-source software development. NESCent is particularly targeting students interested in both evolutionary biology and software development. Project ideas (see URL below) range from visualizing phylogenetic data in R, to development of a Mesquite module, web-services for phylogenetic data providers or geophylogeny mashups, implementing phyloXML support, navigating databases of networks, topology queries for PhyloCode registries, to phylogenetic tree mining in a MapReduce framework, and more. The project ideas are flexible and many can be adjusted in scope to match the skills of the student. If the program sounds interesting to you but you are unsure whether you have the necessary skills, please email the mentors at the address below. We will work with you to find a project that fits your interests and skills. INQUIRIES: Email any questions, including self-proposed project ideas, to phylosoc {at} nescent {dot} org. TO APPLY: Apply on-line at the Google Summer of Code website (http://code.google.com/soc/2008), where you will also find GSoC program rules and eligibility requirements. The 1-week application period for students opens on Monday March 24th and runs through Monday, March 31st, 2008. 
Hilmar Lapp and Todd Vision US National Evolutionary Synthesis Center ===== URLs: ===== 2008 NESCent Phyloinformatics Summer of Code: http://phyloinformatics.net/Phyloinformatics_Summer_of_Code_2008 Eligibility requirements: http://code.google.com/opensource/gsoc/2008/faqs.html#0.1_eligibility Stipends: http://code.google.com/opensource/gsoc/2008/faqs.html#0.1_administrivia To sign up for quarterly NESCent newsletters: with announcements about upcoming programs at the Center: http://www.nescent.org/about/contact.php From hlapp at gmx.net Sat Mar 22 16:01:51 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 22 Mar 2008 16:01:51 -0400 Subject: [BioSQL-l] [Bioperl-l] postgres 8.3 - load_seqdatabase.pl / swissprot In-Reply-To: <16589.156.83.1.157.1206135827.squirrel@webmail.xs4all.nl> References: <16589.156.83.1.157.1206135827.squirrel@webmail.xs4all.nl> Message-ID: <69D3EA33-810B-40EA-8687-752FA1A34FBF@gmx.net> Forgot to respond to this: On Mar 21, 2008, at 5:43 PM, Erik wrote: > It took two hours to load 26504 records (7%) of uniprot_sprot.dat > (is it expected to be so slow?) The last time I used to load those regularly it was a bit faster (~ 5 seqs/s) but it is in a ballpark that wouldn't raise a red flag for me. BTW you can make it print statistics using the --logchunk N option, where N is the number of seqs after which you want the current count and the #recs/s printed. You may get it to be faster if you tune the database (e.g., make sure there is enough memory for index reorganization, transaction log and tablespace datafile are on separate disks, etc; fiddling with the query optimizer has probably little effect as almost all queries are simple lookups or inserts). That all said, the strength of load_seqdatabase.pl isn't speed. It doesn't make use of any bulk upload optimizations, and therefore the initial load of a very large database will take its time. 
The power is more in subsequent updates where you can configure what you want to happen, and during which the database is never in an inconsistent state, so it can run in the background. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From greg at turnstep.com Sun Mar 23 20:42:36 2008 From: greg at turnstep.com (Greg Sabino Mullane) Date: Mon, 24 Mar 2008 00:42:36 -0000 Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer In-Reply-To: Message-ID: <4ab14dcc59d7566b55ba87027055e9fd@biglumber.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 >> Depending on what I (or can someone else update us on this?) find out >> for the DBD::Pg plans, I'll probably start looking into moving the >> parameter binding into the driver adapters. Though it does feel >> pathetic that this is now also not transparent between drivers. > > What you are probably looking for is already there, namely: > > $dbh->{pg_server_prepare} = 0; > So disabling server-side prepares will leave values quoted? Having > server-side prepares would be very useful though, especially for > Bioperl-db with its many lookup queries that all use similar > parameter values. Yes, it forces DBD::Pg to do the quoting itself, which basically means that everything is shipped to the server as a single SQL string, and no placeholders are used. In the grand scheme of things, the speed difference is not large for most queries. Certainly one way would be to turn this on for 8.3 and above, and slowly migrate the queries/schema over time. >> There's good reasons for the casting enforcement in 8.3 > I do understand that, but it's also a sharp contrast to other RDBMSs > that doesn't it make it easier for people to choose Pg when they > should, and doesn't help writing cross-platform database applications > either. 
I'm not overly familiar with how other databases treat this, but I've heard DB2 can be a stickler about this too. I've not dug into the bioperl code in a while, to be honest, so I'm not sure what sort of queries we're talking about. Certainly long-term the code and schema should move away from implicit casting. Maybe a better short-term solution is adding the more obvious casts (e.g. text<->int) back in.

> Do you have links to some of the key threads showing what rationale
> went into the decision? (Or should I just search for your name?) I'd
> like to read up on that first before pouring more oil into the fire.
> I suspect that many of those who made the decision are never faced
> with needing to write cross-RDBMS code.
>
> Also, I wonder why this wasn't made a configurable option so it can
> be disabled by a simple config file change (such as the move away
> from automatic OID columns). But obviously this is the wrong list for
> discussing this (though Bioperl-db *is* one of those pieces of
> software that must be cross-RDBMS).

I did ask about that, and was told it would not have been easy to do so. But I agree, a phasing-in period (heck, even a warning) would have been nice. Feel free to pour some oil on the fire, I think this is one of many apps that have been affected. (I've run across two other major cross-DB apps (Interchange and MediaWiki) that are struggling with the same pain. I managed to painfully fix the latter, but the former is way too complex to tackle at the moment). I could not find the thread(s?)
I weighed in on, but you can find some relevant discussions by googling "strict-typing benefits grokbase" - -- Greg Sabino Mullane greg at turnstep.com PGP Key: 0x14964AC8 200803232039 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8 -----BEGIN PGP SIGNATURE----- iEYEAREDAAYFAkfm+NAACgkQvJuQZxSWSsi4ogCdGNWvCJIzXxb+YKzdm6wwxQMv p3AAnizkWXoo/rvxv4KVdC8tD0vF87k3 =dNYi -----END PGP SIGNATURE----- From biopython at maubp.freeserve.co.uk Tue Mar 25 11:56:16 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 25 Mar 2008 15:56:16 +0000 Subject: [BioSQL-l] [BioPython] Concerns the update of BioSQL.taxon table In-Reply-To: <320fb6e00803250853i629e59aj310ddc5667ea57d@mail.gmail.com> References: <711039.40736.qm@web26505.mail.ukl.yahoo.com> <320fb6e00803250853i629e59aj310ddc5667ea57d@mail.gmail.com> Message-ID: <320fb6e00803250856n1001d74dxeb8560652f594e51@mail.gmail.com> On Tue, Mar 25, 2008 at 3:53 PM, Peter wrote: > Hi Eric, > > Your issue is almost certainly due to switching from Biopython 1.44 to > 1.45, rather than from a prerelease BioSQL to the recently released > BioSQL 1.0.0. > > For background, you should read Bug 2422 and the BioSQL thread it points to. > http://bugzilla.open-bio.org/show_bug.cgi?id=2422 > > Biopython 1.44 never recorded the taxon id (and therefore didn't use > the taxon/taxon_name tables) > Biopython 1.45 does record the taxon id, and attempts to fill in > missing taxon/taxon_name entries > > I'm a little unclear on what is going wrong for you. Did you pre-load > the NCBI taxonomy for example? The script you are talking about, is > this your own? > > Peter > P.S. Did you mean to send your original message to the BioSQL list as well Eric? 
You need biosql-l at lists.open-bio.org not biosql at lists.open-bio.org

Peter

From ericgibert at yahoo.fr Wed Mar 26 07:29:24 2008
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Wed, 26 Mar 2008 11:29:24 +0000 (GMT)
Subject: [BioSQL-l] Concerns the update of BioSQL.taxon table
Message-ID: <290936.61510.qm@web26510.mail.ukl.yahoo.com>

Thank you Peter for the correct email of the BioSQL list.

No, it is not something linked to the BioPython 1.45 upgrade: same behavior as 1.44. My problem is linked to the fact that the BioSQL schema version 1.0.0 defines a *unique* index on taxon.ncbi_taxon_id. I did not have this index before.

I have written a script that connects to the NCBI taxonomy database and gets the XML data for the species. Then it updates the taxon table, replacing the NULL ncbi_taxon_id and node_rank with their values for the whole lineage. I call it after loading the BioSeqs into the database.

Example:
I load a BioSeq for Nannophya pygmaea, then I run my script to update the ncbi_taxon_id and rank:

+----------+---------------+-----------------+--------------+
| taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
+----------+---------------+-----------------+--------------+
|       13 |          2759 |            NULL | superkingdom |
|       14 |         33208 |              13 | kingdom      |
|       15 |          6656 |              14 | phylum       |
|       16 |          6960 |              15 | superclass   |
|       17 |         50557 |              16 | class        |
|       18 |          7496 |              17 | no rank      |
|       19 |         33339 |              18 | subclass     |
|       20 |          6961 |              19 | order        |
|       21 |          6962 |              20 | suborder     |
|       22 |          6964 |              21 | family       |
|       23 |        229390 |              22 | genus        |
|       24 |        229391 |              23 | species      |

No problem.

Now I insert/load another Libellulidae (Orthetrum sabina): 'empty/NULL' taxon records are inserted by the db.load() BioPython function:

|       25 |          NULL |            NULL | NULL         |
|       26 |          NULL |              25 | NULL         |
|       27 |          NULL |              26 | NULL         |
|       28 |          NULL |              27 | NULL         |
|       29 |          NULL |              28 | NULL         |
|       30 |          NULL |              29 | NULL         |
|       31 |          NULL |              30 | NULL         |
|       32 |          NULL |              31 | NULL         |
|       33 |          NULL |              32 | NULL         |
|       34 |          NULL |              33 | NULL         |
|       35 |          NULL |              34 | genus        |
|       36 |        320892 |              35 | species      |

Then I try to run my script: this time I have an update failure because record 34 is the SAME family, hence the same ncbi_taxon_id, as record 22: 'duplicate entry on key 2'.

Either this *unique* index is new and it is a BioSQL "issue" (as said, this index did not exist in my previous BioSQL db so I never encountered this issue before), OR the way BioPython "repeats" existing taxons is incorrect/not compatible. In that case, when inserting the second BioSeq, record 34 should not be created but record 35 (the genus) should "point" to the already existing family at record 22 as its father.

Thus I would like to have confirmation from the BioSQL team that the unique index is valid. If that is the case, then we can have a separate BioPython discussion about how to improve the management of the taxon table.

Best regards,

Eric

From holland at ebi.ac.uk Wed Mar 26 08:00:03 2008
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 26 Mar 2008 12:00:03 +0000
Subject: [BioSQL-l] Concerns the update of BioSQL.taxon table
In-Reply-To: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
References: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
Message-ID: <47EA3AC3.20104@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Purely from a database perspective, the index is correct. There should be no need to have a duplicate entry in ncbi_taxon_id.
The implication is that taxon_id is a 1:1 mapping to ncbi_taxon_id. There should be no need to have two separate local taxon_id values referring to one NCBI taxon. Ideally, when you run your update script, for each taxon_id record it processes it should be checking for an existing entry with the same ncbi_taxon_id, getting the taxon_id for that existing entry, then removing the duplicate entry and updating the relevant parent_taxon_id values in other records to refer to the existing taxon_id instead. BioPython would need to be making similar checks when it inserts new entries. If it isn't, then it needs to be fixed. cheers, Richard Eric Gibert wrote: > Thank you Peter for the correct email of the BioSQL list. > > No, it is not something linked to BioPython 1.45 upgrade: same behavior as 1.44. My problem is linked to the fact that the BioSQl schema version 1.0.0 defines a *unique* index on taxon.ncbi_taxon_id. I did not have this index before. > > I have written a script that connects to the taxonomy database of NCBI and get the XML data for the species. Then it updates the taxon table, replacing the ncbi_taxon_id and node_rank NULL by their values for all the lineage. I call it after the loading of BioSeqs in the database. > > Example: > I load a BioSeq for Nannophya pygmaea then I run my script to update the ncbi_taxon_id and rank: > +----------+---------------+-----------------+--------------+ > | taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank | > +----------+---------------+-----------------+--------------+ > | 13 | 2759 | NULL | superkingdom | > | 14 | 33208 | 13 | kingdom | > | 15 | 6656 | 14 | phylum | > | 16 | 6960 | 15 | superclass | > | 17 | 50557 | 16 | class | > | 18 | 7496 | 17 | no rank | > | 19 | 33339 | 18 | subclass | > | 20 | 6961 | 19 | order | > | 21 | 6962 | 20 | suborder | > | 22 | 6964 | 21 | family | > | 23 | 229390 | 22 | genus | > | 24 | 229391 | 23 | species | > > No problem. 
> > Now I insert/load another Libellulideae (Orthetrum sabina ): 'empty/NULL' taxons records are inserted by the db.load() BioPython function: > | 25 | NULL | NULL | NULL | > | 26 | NULL | 25 | NULL | > | 27 | NULL | 26 | NULL | > | 28 | NULL | 27 | NULL | > | 29 | NULL | 28 | NULL | > | 30 | NULL | 29 | NULL | > | 31 | NULL | 30 | NULL | > | 32 | NULL | 31 | NULL | > | 33 | NULL | 32 | NULL | > | 34 | NULL | 33 | NULL | > | 35 | NULL | 34 | genus | > | 36 | 320892 | 35 | species | > > then I try to run my script: this time I have an update failure because the record 34 is the SAME family hence same ncbi_taxon_id as record 22: 'duplicate entry on key 2'. > > Either this *unique* index is new and it is a BioSQL "issue" (as said, this index did not exist in my previous BioSQL db so I never encountered this issue before), OR the way BioPython "repeats" existing taxons is incorrect/not compatible. In that case, when inserting the second BioSeq, record 34 should not be created but record 35 (the genus) should "point" to the already existing family at record 22 as its father. > > Thus I would have the confirmation on by BioSQL team that the unique index is valid. If that is the case, then we can have a BioPython separate talk about how to improve the management of the taxon table. > > > Best regards, > > Eric > > > > > > > _____________________________________________________________________________ > Envoyez avec Yahoo! Mail. Capacit? de stockage illimit?e pour vos emails. http://mail.yahoo.fr > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > - -- Richard Holland (BioMart) EMBL EBI, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire CB10 1SD, UK Tel. 
+44 (0)1223 494416
http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH6jrD4C5LeMEKA/QRAu7rAJ9TBYt0CeTTrPi0QN7Vm/UwiBANQwCfeoqz
0uTvcXXteholK+4xxuxjCXw=
=qhOf
-----END PGP SIGNATURE-----

From biopython at maubp.freeserve.co.uk Wed Mar 26 08:30:50 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 26 Mar 2008 12:30:50 +0000
Subject: [BioSQL-l] [BioPython] Concerns the update of BioSQL.taxon table
In-Reply-To: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
References: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
Message-ID: <320fb6e00803260530w72cca900mc19654798d5d7e13@mail.gmail.com>

On Wed, Mar 26, 2008 at 11:29 AM, Eric Gibert wrote:
> Thank you Peter for the correct email of the BioSQL list.
>
> No, it is not something linked to BioPython 1.45 upgrade: same behavior as 1.44.
> My problem is linked to the fact that the BioSQL schema version 1.0.0 defines a
> *unique* index on taxon.ncbi_taxon_id. I did not have this index before.
>
> I have written a script that connects to the taxonomy database of NCBI and gets
> the XML data for the species. Then it updates the taxon table, replacing the
> ncbi_taxon_id and node_rank NULL by their values for all the lineage. I call it
> after the loading of BioSeqs in the database.

So you wrote your own version of the BioSQL perl script load_ncbi_taxonomy.pl?

> Example:
> I load a BioSeq for Nannophya pygmaea then I run my script to update the ncbi_taxon_id and rank:
> +----------+---------------+-----------------+--------------+
> | taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
> +----------+---------------+-----------------+--------------+
> |       13 |          2759 |            NULL | superkingdom |
> |       14 |         33208 |              13 | kingdom      |
> |       15 |          6656 |              14 | phylum       |
> |       16 |          6960 |              15 | superclass   |
> |       17 |         50557 |              16 | class        |
> |       18 |          7496 |              17 | no rank      |
> |       19 |         33339 |              18 | subclass     |
> |       20 |          6961 |              19 | order        |
> |       21 |          6962 |              20 | suborder     |
> |       22 |          6964 |              21 | family       |
> |       23 |        229390 |              22 | genus        |
> |       24 |        229391 |              23 | species      |
>
> No problem.
>
> Now I insert/load another Libellulidae (Orthetrum sabina): 'empty/NULL'
> taxon records are inserted by the db.load() BioPython function:

These records are "guesswork" based on the lineage in the GenBank file - we don't know the NCBI taxon ids, so they are NULL, nor the rank, but there is a scientific name in the linked taxon_name table. I am open to the idea of not writing this guessed lineage, and just writing one entry for the species and the given NCBI taxon ID.

However, as the new entry Orthetrum sabina should share some of its lineage with Nannophya pygmaea, then I agree Biopython *should* be re-using those existing taxon entries, if it can match them safely using the scientific name. Re-reading the relevant bit of old code, it doesn't seem to do this. I've filed bug 2475:
http://bugzilla.open-bio.org/show_bug.cgi?id=2475

This is actually a tricky problem, requiring a 'clever' parent linkage as you said in your earlier email. Hilmar wrote this about the equivalent code in BioPerl:

>> It's pretty unreliable actually. There is not only synonymy but also
>> rampant homonymy in taxonomic names. There are plenty of examples
>> for the same scientific name in use for a plant and for some animal, for
>> example. So in order to be unambiguous you will need to know (and
>> check) the kingdom.

See http://lists.open-bio.org/pipermail/biosql-l/2008-March/001207.html

Eric wrote:
> then I try to run my script: this time I have an update failure because the
> record 34 is the SAME family hence same ncbi_taxon_id as record 22:
> 'duplicate entry on key 2'.
>
> Either this *unique* index is new and it is a BioSQL "issue" (as said, this index
> did not exist in my previous BioSQL db so I never encountered this issue before),

Hopefully Hilmar from BioSQL can answer this.

> OR the way BioPython "repeats" existing taxons is incorrect/not compatible.
> In that case, when inserting the second BioSeq, record 34 should not be created
> but record 35 (the genus) should "point" to the already existing family at record
> 22 as its father.

This example might be easier to follow if the scientific names from the taxon_name table were included. I would check the lineage but the NCBI webpage is being very slow for me right now.

In the short term, as a quick fix, your script could first remove taxon entries with a blank NCBI taxon ID (and clear any keys pointing to them). Not elegant - but it would work.

Thanks Eric

Peter

From hlapp at gmx.net Wed Mar 26 09:29:01 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 26 Mar 2008 09:29:01 -0400
Subject: [BioSQL-l] Concerns the update of BioSQL.taxon table
In-Reply-To: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
References: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
Message-ID:

On Mar 26, 2008, at 7:29 AM, Eric Gibert wrote:
> Either this *unique* index is new and it is a BioSQL "issue" (as
> said, this index did not exist in my previous BioSQL db so I never
> encountered this issue before)

The unique index has been there since Feb 2003 (the Singapore Biohackathon). I'm not sure how you got a version that doesn't have it.
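
The merge that Richard describes earlier in this thread (look for a row that already carries the NCBI taxon ID, repoint the children of the duplicate, then drop the duplicate) might look roughly like the following. This is only a minimal sketch in Python against an in-memory SQLite database, assuming a cut-down `taxon` table with just the four columns shown in Eric's listing. `merge_duplicate_taxon` is a hypothetical helper, not part of Biopython or BioSQL, and a real script would also have to repoint any `taxon_name` and `bioentry` rows that refer to the duplicate.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Cut-down stand-in for the BioSQL taxon table (assumption: only the
-- columns from Eric's listing, with the unique key on ncbi_taxon_id).
CREATE TABLE taxon (
    taxon_id        INTEGER PRIMARY KEY,
    ncbi_taxon_id   INTEGER UNIQUE,
    parent_taxon_id INTEGER,
    node_rank       TEXT
);
INSERT INTO taxon VALUES (22, 6964, 21, 'family');     -- existing family
INSERT INTO taxon VALUES (34, NULL, 33, NULL);         -- guessed duplicate
INSERT INTO taxon VALUES (35, NULL, 34, 'genus');
INSERT INTO taxon VALUES (36, 320892, 35, 'species');
""")


def merge_duplicate_taxon(conn, taxon_id, ncbi_taxon_id):
    """Assign ncbi_taxon_id to taxon_id, or merge it into an existing row.

    Returns the taxon_id of the row that survives.
    """
    existing = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ? AND taxon_id <> ?",
        (ncbi_taxon_id, taxon_id),
    ).fetchone()
    if existing is None:
        # No other row has this NCBI ID, so a plain update is safe.
        conn.execute(
            "UPDATE taxon SET ncbi_taxon_id = ? WHERE taxon_id = ?",
            (ncbi_taxon_id, taxon_id),
        )
        return taxon_id
    keep = existing[0]
    # Repoint the children of the duplicate, then remove the duplicate itself.
    conn.execute(
        "UPDATE taxon SET parent_taxon_id = ? WHERE parent_taxon_id = ?",
        (keep, taxon_id),
    )
    conn.execute("DELETE FROM taxon WHERE taxon_id = ?", (taxon_id,))
    return keep


survivor = merge_duplicate_taxon(conn, 34, 6964)
genus_parent = conn.execute(
    "SELECT parent_taxon_id FROM taxon WHERE taxon_id = 35"
).fetchone()[0]
```

In Eric's example this leaves record 35 (the genus) pointing at the existing family record 22, and the update never attempts to write a second row with the same ncbi_taxon_id.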
The unique key constraint on the identifier column is also necessary - otherwise you cannot guarantee lookups by the NCBI taxonID to return either one or zero rows.

Like Peter and Richard, I also don't understand what the point would be in allowing the same taxon (which in essence is a node), as identified by taxonID, to exist more than once.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From pan.mueller at yahoo.de Thu Mar 27 15:33:34 2008
From: pan.mueller at yahoo.de (Peter Müller)
Date: Thu, 27 Mar 2008 20:33:34 +0100 (CET)
Subject: [BioSQL-l] bioentries in a sequence cluster
Message-ID: <664425.11239.qm@web28203.mail.ukl.yahoo.com>

Dear list,

I have a few questions, but maybe with a working example, I can derive the rest.

With perl-db I can fetch a Bio::Cluster object with this query:
(I found no documentation about c::subject and p::object ...)

    $query->datacollections(
        ["Bio::PrimarySeqI c::subject",
         "Bio::PrimarySeqI p::object",
         "Bio::PrimarySeqI<=>Bio::ClusterI<=>Bio::Ontology::TermI"]);

    $query->where(["p.accession_number = 'NM_000015'"]);

    my $adp = $db->get_object_adaptor('Bio::Cluster');
    my $qres = $adp->find_by_query($query);

That's great - but here I ask for a sequence accession-number.

Is it possible to ask for the Clone (IMAGE:4722596) or for an STS accession-number where the result is also a cluster object?
"give me the cluster(s) where in the sequence-line is a clone-entry with this number 'IMAGE:4722596'" ....
"give me the cluster(s) where in the STS-line is an accession-number with this value 'PMC310725P3'" ...
PROTID and NID would also be interesting.
UniGene-snippet: STS ACC=PMC310725P3 UNISTS=272646 PROTSIM ORG=10090; PROTGI=6754794; PROTID=NP_035004.1; PCT=76.55; ALN=288 SEQUENCE ACC=BG569293.1; NID=g13576946; CLONE=IMAGE:4722596; END=5'; LID=6989; SEQTYPE=EST; TRACE=44157214 regards pan Machen Sie Yahoo! zu Ihrer Startseite. Los geht's: http://de.yahoo.com/set From hlapp at gmx.net Sun Mar 30 01:00:25 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 30 Mar 2008 01:00:25 -0400 Subject: [BioSQL-l] bioentries in a sequence cluster In-Reply-To: <664425.11239.qm@web28203.mail.ukl.yahoo.com> References: <664425.11239.qm@web28203.mail.ukl.yahoo.com> Message-ID: <8083537C-C721-48C2-A838-AAC2B178468A@gmx.net> On Mar 27, 2008, at 3:33 PM, Peter M?ller wrote: > > > Dear list, > > I have a few questions, but maybe with a working example, I can > derive the rest. > > With perl-db I can fetch a Bio::Cluster Object wit this query: > (I found no documentation about c::subject and p::object ...) Yes, sorry, this needs a lot more documentation. The suffix of the alias separated from it by '::' is the 'context'. This is needed if the same entity participates more than once in an association. What's confusing the issue further here is that at the object level each object entity (Bio::PrimarySeq, Bio::ClusterI, Bio::Ontology::TermI) is participating only once, though in reality Bio::ClusterI and Bio::PrimarySeqI both map to table bioentry. > > $query->datacollections( > ["Bio::PrimarySeqI c::subject", > "Bio::PrimarySeqI p::object", I think that Bio::PrimarySeqI can be substituted with Bio::ClusterI in the second line. This would make the mapping clearer I guess. I'm not sure why I wrote the example that way, but I'd be surprised if Bio::ClusterI does not work here. > "Bio::PrimarySeqI<=>Bio::ClusterI<=>Bio::Ontology::TermI"]); > > $query->where(["p.accession_number = 'NM_000015'"]); Actually I think you need to use c.accession_number to query by sequence accession. 
The c (child) alias is the cluster member, and the p (parent) alias is the cluster itself. > > my $adp = $db->get_object_adaptor('Bio::Cluster'); > my $qres = $adp->find_by_query($query); > > > That's great - but here I ask for a sequence accession-number. > > Is it possible to aks for the Clone (IMAGE:4722596) or for an STS > accession-number where the result is also a cluster object? > "give me the cluster(s) where in the sequence-line is a clone-entry > with this number 'IMAGE:4722596' .... > "give me the cluster(s) where in the STS-line is an accession- > number with this value 'PMC310725P3'... > PROTID and NID would be also interesting. PID and NID should become the primary_id() of the sequence members. Hence, you would say c.primary_id where you have c.accession_number above. Each STS line should be in a qualifier/value pair attached to the cluster bioentry, under the tag 'sts' (which from what I can see would consist of whole lines, not ACC= and UNISTS= values parsed out, though I may be mistaken). So you would add "Bio::PrimarySeqI<=>Bio::Annotation::SimpleValue sv" to the datacollections, and "sv.value = 'ACC=PMC310725P3 UNISTS=272646'" and "sv.tagname = 'sts'" to the where() array. The same goes for IMAGE clone IDs, except that the tag name is 'clone' and the qualifier/value is attached to the member sequence, not the cluster; also here not the entire line is stored, but rather parsed into tokens. Does this help? -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Sat Mar 1 20:42:05 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Sat, 1 Mar 2008 14:42:05 -0600 Subject: [BioSQL-l] BioSQL bug in bugzilla Message-ID: Hilmar, Just wanted to point out a bug which I thought was bioperl-db-related but is really BioSQL. Could you take a look to see what you think? 
http://bugzilla.open-bio.org/show_bug.cgi?id=2389 chris From hlapp at gmx.net Sun Mar 2 00:06:55 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 1 Mar 2008 19:06:55 -0500 Subject: [BioSQL-l] biosql usage/user survey In-Reply-To: <9692f0e9a791c7d0bf942e497668fdce@gmx.net> References: <9692f0e9a791c7d0bf942e497668fdce@gmx.net> Message-ID: I sent this survey request back in 2005 and received a number of direct responses. I am assuming that since I said I was going to use them for the paper everyone was assuming that their BioSQL usage would be made public. I am going to assemble the responses into a Wiki page as Malcolm suggested; if you responded to me and do not want to appear on that page, please let me know. -hilmar On Nov 3, 2005, at 11:53 AM, Hilmar Lapp wrote: > Hi all, > > I am writing up a paper on BioSQL and would like to include some > current usage figures to support its utility. > > Therefore, if you are using BioSQL I'd be glad if you could drop me > an email; if you can include a word or two (not more than 1 > sentence) on what you use it for that'd be great too. > > Thanks in advance, > > -hilmar > -- > ------------------------------------------------------------- > Hilmar Lapp email: lapp at gnf.org > GNF, San Diego, Ca. 92121 phone: +1-858-812-1757 > ------------------------------------------------------------- > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Sun Mar 2 01:16:24 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Sat, 1 Mar 2008 19:16:24 -0600 Subject: [BioSQL-l] multiple species for a sequence Message-ID: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu> I'm looking at a bioperl bug I filed a while back that deals with multiple species in a sequence file, such as found for AJ428955: ID AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP. XX AC AJ428955; XX DT 09-JUL-2002 (Rel. 
72, Created) DT 15-APR-2005 (Rel. 83, Last updated, Version 4) XX DE Hepatitis GB virus B subgenomic replicon neoRepB XX KW core-neo fusion protein; core-neo gene; polyprotein. XX OS Hepatitis GB virus B OC Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae. XX OS Encephalomyocarditis virus OC Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae; OC Cardiovirus. ... We could probably add support in bioperl fairly easily (Bio::Seq could just return an array or the first species object based on context), but would BioSQL support sequences like this? chris From hlapp at gmx.net Sun Mar 2 17:33:23 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 2 Mar 2008 12:33:23 -0500 Subject: [BioSQL-l] multiple species for a sequence In-Reply-To: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu> References: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu> Message-ID: <86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net> On Mar 1, 2008, at 8:16 PM, Chris Fields wrote: > I'm looking at a bioperl bug I filed a while back that deals with > multiple species in a sequence file, such as found for AJ428955: > > ID AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP. > XX > AC AJ428955; > XX > DT 09-JUL-2002 (Rel. 72, Created) > DT 15-APR-2005 (Rel. 83, Last updated, Version 4) > XX > DE Hepatitis GB virus B subgenomic replicon neoRepB > XX > KW core-neo fusion protein; core-neo gene; polyprotein. > XX > OS Hepatitis GB virus B > OC Viruses; ssRNA positive-strand viruses, no DNA stage; > Flaviviridae. > XX > OS Encephalomyocarditis virus > OC Viruses; ssRNA positive-strand viruses, no DNA stage; > Picornaviridae; > OC Cardiovirus. > > ... > > We could probably add support in bioperl fairly easily (Bio::Seq > could just return an array or the first species object based on > context), but would BioSQL support sequences like this? No it wouldn't. There may only be one species (taxon) per sequence. 
There has been a lot of discussion about this in the past mostly driven by the former SwissProt peculiarity of collapsing sequences by sequence identity into a single record. We held out and eventually UniProt dropped this practice. I guess we never quite decided what to do about chimeric sequences like the above. Note that the GenBank record gives this differently: http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885 Here, there's one taxon (ORGANISM line) reference, but two localized 'source' features in the feature table. (I'm actually not 100% sure what the genbank parser would do with this - i.e., whether the second source feature will override the taxon_id found in the first.) Because seqfeatures (in BioSQL) don't have a link to taxon, you wouldn't be able to hit the sequence by its second (chimeric) taxon if that were your query criteria (though you could store it fine, and if you queried by dbxrefs of features of type 'source', you would find it). At the end of the day, BioSQL will evolve (hopefully) quickly to support what the Bio* toolkits support, and will be much slower to change in ways that Bio* wouldn't be able to take advantage of anyway. At least that's my current vision of it, and of course is up for debate as to whether that's a useful vision as much as anything else. So, as you say, right now BioPerl, and AFAIAA any of the other Bio* toolkits, doesn't support more than one species per sequence, but as soon as that changes, there's a clear need for BioSQL to follow along. Does that make sense? 
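A minimal sketch of the querying limitation described above, assuming a heavily simplified stand-in for a few BioSQL tables (the real schema has more columns and links feature types through term rows). The taxon ids 1 and 2, the `type_name` column, and the convention of storing a source feature's taxon as a dbxref with dbname 'taxon' are assumptions made for illustration only.

```python
import sqlite3

# Simplified stand-in for a handful of BioSQL tables; NOT the real DDL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL,
    taxon_id    INTEGER REFERENCES taxon(taxon_id)  -- one taxon per entry
);
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, dbname TEXT, accession TEXT);
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    bioentry_id   INTEGER REFERENCES bioentry(bioentry_id),
    type_name     TEXT  -- hypothetical shortcut; BioSQL uses a term reference
);
CREATE TABLE seqfeature_dbxref (seqfeature_id INTEGER, dbxref_id INTEGER);
""")

# Placeholder taxon ids (1 and 2), not real NCBI taxon ids.
conn.executemany("INSERT INTO taxon VALUES (?, ?)", [
    (1, "Hepatitis GB virus B"),
    (2, "Encephalomyocarditis virus"),
])
# The chimeric entry can point at only one taxon.
conn.execute("INSERT INTO bioentry VALUES (1, 'AJ428955', 1)")
# Two localized 'source' features, each carrying its own taxon dbxref.
conn.executemany("INSERT INTO dbxref VALUES (?, ?, ?)", [
    (1, "taxon", "1"),
    (2, "taxon", "2"),
])
conn.executemany("INSERT INTO seqfeature VALUES (?, ?, ?)", [
    (1, 1, "source"),
    (2, 1, "source"),
])
conn.executemany("INSERT INTO seqfeature_dbxref VALUES (?, ?)", [(1, 1), (2, 2)])


def accessions_by_entry_taxon(conn, taxon_id):
    """Find entries through the single bioentry-to-taxon link."""
    rows = conn.execute(
        "SELECT accession FROM bioentry WHERE taxon_id = ?", (taxon_id,))
    return [row[0] for row in rows]


def accessions_by_source_dbxref(conn, taxon_id):
    """Find entries through dbxrefs attached to 'source' features."""
    rows = conn.execute(
        """SELECT DISTINCT b.accession
           FROM bioentry b
           JOIN seqfeature f ON f.bioentry_id = b.bioentry_id
           JOIN seqfeature_dbxref fx ON fx.seqfeature_id = f.seqfeature_id
           JOIN dbxref x ON x.dbxref_id = fx.dbxref_id
           WHERE f.type_name = 'source'
             AND x.dbname = 'taxon' AND x.accession = ?""",
        (str(taxon_id),))
    return [row[0] for row in rows]
```

With this toy data, asking for the second taxon through the bioentry link returns nothing, while going through the source-feature dbxrefs does find AJ428955. That is the asymmetry the message describes; how a given toolkit actually stores source-feature dbxrefs may differ.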
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun Mar 2 17:39:17 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 2 Mar 2008 12:39:17 -0500 Subject: [BioSQL-l] BioSQL bug in bugzilla In-Reply-To: References: Message-ID: I don't think it's a good idea to just replace all varchar() types with type text. First of all, having reasonable constraints is a Good Thing(tm) in my book as the majority of times I found them violated it revealed a parsing error, rather than the constraints not fitting the data. Second, this won't solve the problem for the other RDBMS versions for which there is a real performance penalty and other implications when having unreasonably large column widths. That said, if the constraint is indeed not compatible with current data (such as Uniprot) we have a problem that needs to be fixed. So, what I would like to find out is 1) is this in reality a parsing error, or is there indeed a value for a column that in BioSQL is constrained to 40 chars, and 2) if so, which column in which table is the problem. Erik - would you mind sending me the full error stack if you still have it? Usually load_seqdatabase.pl will also print an extra warning message saying what it couldn't store. That message would be great too. If you don't have either anymore, do you remember vaguely what those messages said? Alternatively, do you have the offending uniprot entry (or its accession)? I suspect that it's actually the constraint on dbxref.accession. Does that ring a bell? -hilmar On Mar 1, 2008, at 3:42 PM, Chris Fields wrote: > Hilmar, > > Just wanted to point out a bug which I thought was bioperl-db- > related but is really BioSQL. Could you take a look to see what > you think?
> > http://bugzilla.open-bio.org/show_bug.cgi?id=2389 > > chris > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Sun Mar 2 18:00:50 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 2 Mar 2008 12:00:50 -0600 Subject: [BioSQL-l] multiple species for a sequence In-Reply-To: <86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net> References: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu> <86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net> Message-ID: On Mar 2, 2008, at 11:33 AM, Hilmar Lapp wrote: > On Mar 1, 2008, at 8:16 PM, Chris Fields wrote: > >> I'm looking at a bioperl bug I filed a while back that deals with >> multiple species in a sequence file, such as found for AJ428955: >> >> ID AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP. >> XX >> AC AJ428955; >> XX >> DT 09-JUL-2002 (Rel. 72, Created) >> DT 15-APR-2005 (Rel. 83, Last updated, Version 4) >> XX >> DE Hepatitis GB virus B subgenomic replicon neoRepB >> XX >> KW core-neo fusion protein; core-neo gene; polyprotein. >> XX >> OS Hepatitis GB virus B >> OC Viruses; ssRNA positive-strand viruses, no DNA stage; >> Flaviviridae. >> XX >> OS Encephalomyocarditis virus >> OC Viruses; ssRNA positive-strand viruses, no DNA stage; >> Picornaviridae; >> OC Cardiovirus. >> >> ... >> >> We could probably add support in bioperl fairly easily (Bio::Seq >> could just return an array or the first species object based on >> context), but would BioSQL support sequences like this? > > No it wouldn't. There may only be one species (taxon) per sequence. 
> > There has been a lot of discussion about this in the past mostly > driven by the former SwissProt peculiarity of collapsing sequences > by sequence identity into a single record. We held out and > eventually UniProt dropped this practice. I'm unsure how often these pop up. The behavior of both EMBL and GenBank parsers assumes one species (as does Bio::Seq); the embl parser picks up both and just replaces the first with the second: ... DE Hepatitis GB virus B subgenomic replicon neoRepB XX KW core-neo fusion protein; core-neo gene; polyprotein. XX OS Encephalomyocarditis virus OC Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae; OC Cardiovirus. XX RN [1] ... > I guess we never quite decided what to do about chimeric sequences > like the above. Note that the GenBank record gives this differently: > > http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885 > > Here, there's one taxon (ORGANISM line) reference, but two localized > 'source' features in the feature table. (I'm actually not 100% sure > what the genbank parser would do with this - i.e., whether the > second source feature will override the taxon_id found in the > first.) Because seqfeatures (in BioSQL) don't have a link to taxon, > you wouldn't be able to hit the sequence by its second (chimeric) > taxon if that were your query criteria (though you could store it > fine, and if you queried by dbxrefs of features of type 'source', > you would find it). The genbank parser gets the taxon and tax ID correct; I would think when it hit the next source feature key it would assign the wrong tax ID to the species object but maybe there's a secondary check. Both output the source in feature tables just fine. > At the end of the day, BioSQL will evolve (hopefully) quickly to > support what the Bio* toolkits support, and will be much slower to > change in ways that Bio* wouldn't be able to take advantage of > anyway. 
At least that's my current vision of it, and of course is up > for debate as to whether that's a useful vision as much as anything > else. > > So, as you say, right now BioPerl, and AFAIAA any of the other Bio* > toolkits, doesn't support more than one species per sequence, but as > soon as that changes, there's a clear need for BioSQL to follow along. > > Does that make sense? > > -hilmar Yes. I think we could add in support for multiple species fairly easily but I'll probably hold off on anything until after a 1.6 release (i.e. push it to the next developer series, which gives us more time to think on how to implement this in a BioSQL-friendly way). chris From er at xs4all.nl Sun Mar 2 18:34:10 2008 From: er at xs4all.nl (Erik Rijkers) Date: Sun, 2 Mar 2008 19:34:10 +0100 (CET) Subject: [BioSQL-l] BioSQL bug in bugzilla Message-ID: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl> Hi Hilmar, Sorry, it's too long ago. I can run it again (with new versions) sometime next week. I don't remember which of the two problems (parser or data size) it was in my case. If it is true what you say (that most errors are due to the parser), it might indeed be better to leave those constraints in until such time that the parser has become more trustworthy, and use the database as a test instrument... What is really needed of course is a place to run these loading scripts continually against any newly appearing versions of parsable text, and against the different database backends. Does that already happen somewhere? Should we consider such a bioperl buildfarm / loadfarm? (I might be able to help with any postgres loading tests.) Thanks, Erik Rijkers On Sun, March 2, 2008 18:39, Hilmar Lapp wrote: > I don't think it's a good idea to just replace all > varchar() types > with type text.
> > First of all, having reasonable constraints is a Good > Thing(tm) in my > book as the majority of times I found them violated it > revealed a > parsing error, rather than the constraints not fitting the > data. > Second, this won't solve the problem for the other RDBMS > versions for > which there is a real performance penalty and other > implications when > having unreasonably large column widths. > > That said, if the constraint is indeed not compatible with > current > data (such as Uniprot) we have a problem that needs to be > fixed. So, > what I would like to find out is > > 1) is this in reality a parsing error, or is there indeed > a value for > a column that in BioSQL is constrained to 40 chars, and > > 2) if so, which column in which table is the problem. > > Erik - would you mind sending me the full error stack if > you still > have it? Usually load_seqdatabase.pl will also print an > extra warning > message saying what it couldn't store. That message would > be great > too. If you don't have either anymore, do you remember > vaguely what > those messsages said? Alternatively, do you have the > offending > uniprot entry (or its accession)? > > I suspect that it's actually the constraint on > dbxref.accession. Does > that ring a bell? > > -hilmar > > > On Mar 1, 2008, at 3:42 PM, Chris Fields wrote: > >> Hilmar, >> >> Just wanted to point out a bug which I thought was >> bioperl-db- >> related but is really BioSQL. Could you take a look to >> see what >> you think? 
>> >> http://bugzilla.open-bio.org/show_bug.cgi?id=2389 >> >> chris >> _______________________________________________ >> BioSQL-l mailing list >> BioSQL-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biosql-l > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net > : > =========================================================== > > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > From hlapp at gmx.net Sun Mar 2 19:20:21 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 2 Mar 2008 14:20:21 -0500 Subject: [BioSQL-l] database loading test server (was: BioSQL bug in bugzilla) In-Reply-To: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl> References: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl> Message-ID: <52F7E02B-7192-4F3C-9B48-887583B25CCE@gmx.net> Hi Erik, On Mar 2, 2008, at 1:34 PM, Erik Rijkers wrote: > What is really needed of course is a place to run these > loading scrips continually against any appearing new > versions of parsable text, and against the different > database backends. very true indeed. > > Does that already happen somewhere? > > Should we consider such a bioperl buildfarm / loadfarm? > > (I might be able to help with any postgres loading tests.) Coincidentally we have been batting around the idea to have a OBF machine dedicated to serve for testing and proof-of-concept demonstrations of OBF projects. Indeed one of the services we had thought about setting up is a BioSQL database, and it's reassuring to hear independently that that would be useful. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From er at xs4all.nl Sun Mar 2 20:01:46 2008 From: er at xs4all.nl (Erik Rijkers) Date: Sun, 2 Mar 2008 21:01:46 +0100 (CET) Subject: [BioSQL-l] database loading test server (was: BioSQL bug in bugzilla) Message-ID: <9081.156.83.0.185.1204488106.squirrel@webmail.xs4all.nl> Maybe we can use some ideas from the way the PostgreSQL project has setup a distributed buildfarm (conceived by Andrew Dunstan, I think): see: http://www.pgbuildfarm.org/ it lets members of the community use a standardized setup for building postgresql on their own machines and automates all steps involved. I know the projects and the communities are different, but the general idea to have a standard process to set up machines for whomever wants to dedicate some hardware and time seems like a good idea. Erik Rijkers On Sun, March 2, 2008 20:20, Hilmar Lapp wrote: > Hi Erik, > > On Mar 2, 2008, at 1:34 PM, Erik Rijkers wrote: > >> What is really needed of course is a place to run these >> loading scrips continually against any appearing new >> versions of parsable text, and against the different >> database backends. > > very true indeed. > >> >> Does that already happen somewhere? >> >> Should we consider such a bioperl buildfarm / loadfarm? >> >> (I might be able to help with any postgres loading >> tests.) > > > Coincidentally we have been batting around the idea to > have a OBF > machine dedicated to serve for testing and > proof-of-concept > demonstrations of OBF projects. Indeed one of the services > we had > thought about setting up is a BioSQL database, and it's > reassuring to > hear independently that that would be useful. 
> > -hilmar > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net > : > =========================================================== > > > > From hlapp at gmx.net Sun Mar 2 20:38:27 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 2 Mar 2008 15:38:27 -0500 Subject: [BioSQL-l] enhancement request scheduling Message-ID: <5D2BC733-9A44-4EEA-B1D7-6DF90116B50E@gmx.net> FYI, I have added the chimeric sequence problem and the character column width issue to the Enhancement Requests page on the wiki: http://www.biosql.org/wiki/Enhancement_Requests I've also started to arrange individual requests in a first draft towards scheduling them for implementation. This is very much up for debate, so let me know any feedback or disagreement you have or votes you might want to put in. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Sun Mar 2 22:53:34 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 2 Mar 2008 22:53:34 +0000 Subject: [BioSQL-l] database loading test server (was: BioSQL bug in bugzilla) In-Reply-To: <52F7E02B-7192-4F3C-9B48-887583B25CCE@gmx.net> References: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl> <52F7E02B-7192-4F3C-9B48-887583B25CCE@gmx.net> Message-ID: <320fb6e00803021453h553a5c2ay8c50381ef39d0b6a@mail.gmail.com> > > Coincidentally we have been batting around the idea to have a OBF > machine dedicated to serve for testing and proof-of-concept > demonstrations of OBF projects. Indeed one of the services we had > thought about setting up is a BioSQL database, and it's reassuring to > hear independently that that would be useful. 
> The BioSQL test database would be especially useful if we have all the Bio* projects hooked up to it, to automatically check they can all read records written by each other. I still haven't made time to get BioPerl setup on my machine to check the BioSQL compatibility with Biopython... Peter From hlapp at gmx.net Mon Mar 3 03:18:47 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 2 Mar 2008 22:18:47 -0500 Subject: [BioSQL-l] small "bug" correction in package BioSql In-Reply-To: <473455BE.6040807@ebi.ac.uk> References: <762277.43372.qm@web26507.mail.ukl.yahoo.com> <473336E6.6000100@ebi.ac.uk> <9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net> <473455BE.6040807@ebi.ac.uk> Message-ID: <4C9ACC1A-8C61-4611-8083-EFAD34D186EF@gmx.net> Just FYI, I added a section to this extent to the Enhancement Requests: http://www.biosql.org/wiki/ Enhancement_Requests#Check_constraint_on_biosequence.alphabet Feel free to fix/add as appropriate. -hilmar On Nov 9, 2007, at 7:42 AM, Richard Holland wrote: > I did a bit of poking around in our code and internally BioJava > represents all the default alphabet names (Protein, DNA, etc.) in > upper > case. It also allows for mixed case alphabet names. > > It's not quite as easy as I thought to change these to lower case as > they are often referenced by text name, meaning other people's code > might break if I change them. > > Also, as it allows for mixed-case alphabet names, I can't do a > toUpper/toLower fudge on persistence to BioSQL, as I wouldn't > necessarily get out what I put in! 
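A small sketch of the round-trip problem Richard describes, assuming a naive scheme that lowercases alphabet names on the way into BioSQL and uppercases them on the way back out. The function names and the mixed-case name "MyCustomAlphabet" are hypothetical and not taken from BioJava.

```python
def to_biosql(alphabet_name):
    """Hypothetical persistence step: store alphabet names in lower case."""
    return alphabet_name.lower()


def from_biosql(stored_name):
    """Hypothetical retrieval step: restore the upper-case convention."""
    return stored_name.upper()


def round_trips(alphabet_name):
    """True if storing and reloading gives back the original name."""
    return from_biosql(to_biosql(alphabet_name)) == alphabet_name


if __name__ == "__main__":
    # Upper-case default names survive; a mixed-case name does not.
    for name in ["DNA", "PROTEIN", "MyCustomAlphabet"]:
        print(name, "->", to_biosql(name), "->", from_biosql(to_biosql(name)),
              "| round-trips:", round_trips(name))
```

The upper-case defaults come back unchanged, but the case information in a mixed-case name is lost once it has been lowercased, which is why a simple toUpper/toLower conversion is not enough on its own.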
-- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Mon Mar 3 03:38:59 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 2 Mar 2008 22:38:59 -0500 Subject: [BioSQL-l] Fwd: error on insert new sequences from GenBank: no annotations saved in BioSQL database References: Message-ID: <917E3FCD-5FDB-460D-8F63-B897ACA5CDD2@gmx.net> FYI, I used this to start a page on the recommended mapping of sequence annotation to BioSQL: http://www.biosql.org/wiki/Annotation_Mapping Obviously, this is very rudimentary, but everyone is welcome to add to it or comment with further questions. Also, one of the most important questions, namely a consistent vocabulary for annotation (qualifier) tags, isn't mentioned there (yet). -hilmar Begin forwarded message: > From: Hilmar Lapp > Date: November 8, 2007 3:28:19 PM EST > To: Eric Gibert > Cc: biopython at lists.open-bio.org, BioJava > Subject: Re: [Biojava-l] [BioPython] error on insert new sequences > from GenBank: no annotations saved in BioSQL database > > Maybe we need to hold some mini-hackathon to make the different > toolkits compatible in how they map annotation to the schema. > Obviously I don't know whether you have the latest Biojava setup > here, but I'll just comment how BioPerl/Bioperl-db would map this: > > 'ORIGIN' - if I'm not mistaken this is only a token that introduces > the actual sequence. I'm not sure what Biojava is storing as value > here. > > 'DIVISION' - this maps to column division in table bioentry (though I > agree that if perfectly following the weak typing principle this > should be tag/value association, but at present it's still an actual > column) > > 'genbank_accessions' - secondary accession numbers indeed go into the > qualifier value table. 
The primary accession maps to column accession > in table bioentry > > 'TITLE' - this is part of a publication reference, and should map to > column title in table reference (which it does in bioperl-db) > > 'cross_references' - not sure where these would be coming from in > GenBank format; for EMBL this will map to the dbxref table > > 'data_file_division' - not sure what this is (same as DIVISION?) > > 'VERSION' - in BioPerl we parse this apart into a version for the > accession (which is column version in table bioentry) and the GI > number, which maps to column identifier in table bioentry > > 'references' - these map to table reference (and bioentry_reference > for association with the bioentry) > > 'KEYWORDS' - indeed these map to bioentry_qualifier_value > > 'GI' - maps to column identifier in table bioentry > > 'SIZE' - not sure what size that is. If it is the length of the > sequence, it should (and in BioPerl/bioperl-db does) map to column > length in table biosequence > > 'DEFINITION' - maps to column description in table bioentry > > 'REFERENCE' - should be the same as for 'references' > > 'MDAT' - not sure what this is > > 'ORGANISM' - this is the organism and maps to the table taxon (and > taxon_name), with a foreign key in bioentry pointing to the taxon > > 'JOURNAL' - this is part of a reference, see 'references' > > 'ACCESSION' - the primary accession, maps to column accession in > table bioentry > > 'LOCUS' - in the file itself this is an entire line consisting of > multiple fields; BioPerl/bioperl-db maps the locus name (the first > token after the literal token LOCUS) to column name in table bioentry > > 'SOURCE' - this is the organism, see 'ORGANISM' > > 'PUBMED' - this is part of a literature reference, and maps to a > foreign key in the reference table (reference.dbxref) to a dbxref > entry with PUBMED or PMID as the database and the pubmed ID as the > accession > > 'AUTHORS' - part of a literature reference, maps to column authors in > table 
reference > > 'TYPE' - not sure what this is. If it's the alphabet, it maps to > table biosequence, column alphabet > > 'CIRCULAR' - this at present indeed maps to bioentry_qualifier_value, > though there have been plans to make it a column in table biosequence. > > Note that this could in fact be the way Biojava stores it too, but > upon retrieval represents it in the way you are seeing it. > > Hth, > > -hilmar > > On Nov 8, 2007, at 12:50 PM, Eric Gibert wrote: > >> Dear all, >> >> When I retrieve a BioSQL.BioSeq.DBSeqRecord which was inserted >> previously by my BioJava application, I have: >> >> print "Debug on Seq:", Seq.id, "=", Seq.annotations.keys() >> >> Debug on Seq: AJ459190.1 = ['ORIGIN', 'DIVISION', >> 'genbank_accessions', 'TITLE', 'cross_references', >> 'data_file_division', 'VERSION', 'references', 'KEYWORDS', 'GI', >> 'SIZE', 'DEFINITION', 'REFERENCE', 'MDAT', 'ORGANISM', 'JOURNAL', >> 'ACCESSION', 'LOCUS', 'SOURCE', 'PUBMED', 'AUTHORS', 'TYPE', >> 'CIRCULAR'] >> >> but a freshly inserted BioSeq by BioPython 1.44 only gives me: >> Debug on Seq: EF631597.1 = ['cross_references', 'dates', >> 'references', 'gi', 'data_file_division'] >> >> >> Once I look in the table bioentry_qualifier_value >> >> * 20 records for a Sequence imported by BioJava >> * 1 only for a Sequence inserted by BioPython: the date which >> should be inserted by "_load_bioentry_date" in BioSQL/Loader.py >> >> Quite a few annotations missing, no? >> >> Any idea? >> >> Eric >> >> >> >> >> >> _____________________________________________________________________ >> _ >> _______ >> Ne gardez plus qu'une seule adresse mail ! Copiez vos mails vers >> Yahoo! 
Mail >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > > > > _______________________________________________ > Biojava-l mailing list - Biojava-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biojava-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From cjfields at uiuc.edu Mon Mar 3 04:36:56 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Sun, 2 Mar 2008 22:36:56 -0600 Subject: [BioSQL-l] Fwd: error on insert new sequences from GenBank: no annotations saved in BioSQL database In-Reply-To: <917E3FCD-5FDB-460D-8F63-B897ACA5CDD2@gmx.net> References: <917E3FCD-5FDB-460D-8F63-B897ACA5CDD2@gmx.net> Message-ID: On Mar 2, 2008, at 9:38 PM, Hilmar Lapp wrote: > FYI, I used this to start a page on the recommended mapping of > sequence annotation to BioSQL: > > http://www.biosql.org/wiki/Annotation_Mapping > > Obviously, this is very rudimentary, but everyone is welcome to add > to it or comment with further questions. Also, one of the most > important questions, namely a consistent vocabulary for annotation > (qualifier) tags, isn't mentioned there (yet). > > -hilmar > >> ... >> Maybe we need to hold some mini-hackathon to make the different >> toolkits compatible in how they map annotation to the schema. 
>> Obviously I don't know whether you have the latest Biojava setup >> here, but I'll just comment how BioPerl/Bioperl-db would map this: These are the ones I know of: >> 'cross_references' - not sure where these would be coming from in >> GenBank format; for EMBL this will map to the dbxref table GenPept has DBSOURCE, so maybe from there? >> 'data_file_division' - not sure what this is (same as DIVISION?) Not sure about that one, but division sounds right. >> 'MDAT' - not sure what this is Modification Date, I think. 'MDAT' is a field name used for limits in Entrez searches: Field code: MDAT name: Modification Date desc: Date of last update count: 4012 Attributes: is_date,is_singletoken chris From markjschreiber at gmail.com Wed Mar 5 02:06:17 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Wed, 5 Mar 2008 10:06:17 +0800 Subject: [BioSQL-l] multiple species for a sequence In-Reply-To: References: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu> <86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net> Message-ID: <93b45ca50803041806o6f802548g4e408339d1a40c27@mail.gmail.com> BioJava doesn't support multiple taxa per sequence. It's something to consider though. Philosophically you really have to wonder about the meaning of species when you have a chimera : ) Should it not be a hybrid species all on its own? I wonder what they will do when Craig Venter produces Craigus ventus... - Mark On Mon, Mar 3, 2008 at 2:00 AM, Chris Fields wrote: > > > On Mar 2, 2008, at 11:33 AM, Hilmar Lapp wrote: > > > On Mar 1, 2008, at 8:16 PM, Chris Fields wrote: > > > >> I'm looking at a bioperl bug I filed a while back that deals with > >> multiple species in a sequence file, such as found for AJ428955: > >> > >> ID AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP. > >> XX > >> AC AJ428955; > >> XX > >> DT 09-JUL-2002 (Rel. 72, Created) > >> DT 15-APR-2005 (Rel.
83, Last updated, Version 4) > >> XX > >> DE Hepatitis GB virus B subgenomic replicon neoRepB > >> XX > >> KW core-neo fusion protein; core-neo gene; polyprotein. > >> XX > >> OS Hepatitis GB virus B > >> OC Viruses; ssRNA positive-strand viruses, no DNA stage; > >> Flaviviridae. > >> XX > >> OS Encephalomyocarditis virus > >> OC Viruses; ssRNA positive-strand viruses, no DNA stage; > >> Picornaviridae; > >> OC Cardiovirus. > >> > >> ... > >> > >> We could probably add support in bioperl fairly easily (Bio::Seq > >> could just return an array or the first species object based on > >> context), but would BioSQL support sequences like this? > > > > No it wouldn't. There may only be one species (taxon) per sequence. > > > > There has been a lot of discussion about this in the past mostly > > driven by the former SwissProt peculiarity of collapsing sequences > > by sequence identity into a single record. We held out and > > eventually UniProt dropped this practice. > > I'm unsure how often these pop up. The behavior of both EMBL and > GenBank parsers assumes one species (as does Bio::Seq); the embl > parser picks up both and just replaces the first with the second: > > ... > > DE Hepatitis GB virus B subgenomic replicon neoRepB > XX > KW core-neo fusion protein; core-neo gene; polyprotein. > XX > > OS Encephalomyocarditis virus > OC Viruses; ssRNA positive-strand viruses, no DNA stage; > Picornaviridae; > OC Cardiovirus. > XX > RN [1] > ... > > > > I guess we never quite decided what to do about chimeric sequences > > like the above. Note that the GenBank record gives this differently: > > > > http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885 > > > > Here, there's one taxon (ORGANISM line) reference, but two localized > > 'source' features in the feature table. (I'm actually not 100% sure > > what the genbank parser would do with this - i.e., whether the > > second source feature will override the taxon_id found in the > > first.) 
Because seqfeatures (in BioSQL) don't have a link to taxon, > > you wouldn't be able to hit the sequence by its second (chimeric) > > taxon if that were your query criteria (though you could store it > > fine, and if you queried by dbxrefs of features of type 'source', > > you would find it). > > The genbank parser gets the taxon and tax ID correct; I would think > when it hit the next source feature key it would assign the wrong tax > ID to the species object but maybe there's a secondary check. Both > output the source in feature tables just fine. > > > > At the end of the day, BioSQL will evolve (hopefully) quickly to > > support what the Bio* toolkits support, and will be much slower to > > change in ways that Bio* wouldn't be able to take advantage of > > anyway. At least that's my current vision of it, and of course is up > > for debate as to whether that's a useful vision as much as anything > > else. > > > > So, as you say, right now BioPerl, and AFAIAA any of the other Bio* > > toolkits, doesn't support more than one species per sequence, but as > > soon as that changes, there's a clear need for BioSQL to follow along. > > > > Does that make sense? > > > > -hilmar > > Yes. I think we could add in support for multiple species fairly > easily but I'll probably hold off on anything until after a 1.6 > release (i.e. push it to the next developer series, which gives us > more time to think on how to implement this in a BioSQL-friendly way). > > chris > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > From cjfields at uiuc.edu Wed Mar 5 23:24:03 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Wed, 5 Mar 2008 17:24:03 -0600 Subject: [BioSQL-l] bioperl-db bugs Message-ID: Hilmar, I think I have two bioperl-db bugs sorted out, but I'm trying to determine whether the solution is a side-effect, a feature, or a bug. 
Dmitry has filed two bug reports which are somewhat related: http://bugzilla.open-bio.org/show_bug.cgi?id=2280 http://bugzilla.open-bio.org/show_bug.cgi?id=2281 I have added my comments to it, but maybe you can shed some more light on this. What he is trying to do is copy a persistent Seq object to a different namespace; load_seqdatabase.pl won't let him do that directly using the same sequence file. If he changes the namespace() and store()s it using a script, the seq is moved to the new namespace, not updated. My reasoning is this is a feature (by not changing the primary_key, you don't store a new sequence but update the current one). However, if the primary_key is unset (undef), then it appears you can copy the sequence over (from Dmitry's script, with my addition noted): ... my $ns1 = 'space1'; my $ns2 = 'space2'; my $seqadp = $db->get_object_adaptor('Bio::SeqI'); my $aux_seq = Bio::Seq::RichSeq->new( -accession_number => 'NC_005982', -version => 1, -namespace => $ns1); my $seq = $seqadp->find_by_unique_key($aux_seq); # store the found sequence in the second biodatabase: my $pseq = $seqadp->create_persistent($ns2); $pseq->namespace('bioperl2'); $pseq->primary_key(undef); # my addition, which appears to work $pseq->store(); $seqadp->commit; ... My question: is this an intended effect? The ability to assign undef to primary_key seems intentional based on the method code, but I'm a bit uncertain here. chris From hlapp at gmx.net Thu Mar 6 05:03:26 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 6 Mar 2008 00:03:26 -0500 Subject: [BioSQL-l] Announcement: BioSQL v1.0.0 released Message-ID: BioSQL v1.0.0 Release ===================== I am extremely pleased to announce the release of version 1.0.0 (code-named Tokyo, see below) of BioSQL. 
The release can be downloaded at the following location, in the following formats:

    http://biosql.org/DIST/biosql-1.0.0.tar.gz
    http://biosql.org/DIST/biosql-1.0.0.tar.bz2
    http://biosql.org/DIST/biosql-1.0.0.zip (has Windows-style EOL)

MD5 signatures (http://biosql.org/DIST/SIGNATURES.md5):

    MD5(biosql-1.0.0.tar.bz2)= 2b09a821b9d94bb1e94c3c79dc2f4cff
    MD5(biosql-1.0.0.tar.gz)= e47982d979ddb98aae640b5ab55ce2c6
    MD5(biosql-1.0.0.zip)= 06913c8639ca4fe7f9000b556d8a04ed

The core BioSQL schema is a generic, extensible relational model for sequences, sequence features, their annotation, and ontology terms. It is also designed as the interoperable persistence interface between the Bio* projects.

This version of the schema has essentially been the same since November 2004. Software that worked with schema versions downloaded from CVS (or, as of lately, svn) after November 2004 should work with all 1.0.x releases.

This release contains
- the core BioSQL schema as DDL (Data Definition Language) for the following RDBMSs: MySQL, PostgreSQL, Oracle, HSQLDB, and Apache Derby,
- ancillary (but optional) schema files for PostgreSQL,
- documentation and an ERD (Entity-Relationship Diagram), and
- a Perl script that can pre-load (and update) a BioSQL instance with the NCBI taxonomy.

Installation instructions for MySQL and PostgreSQL are in the file INSTALL, and the file doc/bj_and_bsql_oracle_howto.htm has instructions for installing the Oracle version.

Additional information regarding BioSQL, including links to language bindings, a roadmap to future releases and enhancements, and possible local optimizations is available from the BioSQL website at http://biosql.org.
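
A minimal sketch of checking a downloaded archive against the MD5 signatures listed above. The function name check_md5 is made up for illustration; the sketch assumes either GNU md5sum or the BSD/macOS md5 tool is on the PATH, and it only compares hex digests.

```shell
#!/bin/sh
# check_md5 FILE EXPECTED_HEX   (hypothetical helper, not part of BioSQL)
# Succeeds (exit status 0) only if the MD5 digest of FILE equals EXPECTED_HEX.
check_md5() {
    file=$1
    expected=$2
    if command -v md5sum >/dev/null 2>&1; then
        # GNU coreutils prints "<digest>  <filename>"; keep the first field.
        actual=$(md5sum "$file" | awk '{print $1}')
    else
        # Assumption: BSD/macOS md5 is available; -q prints only the digest.
        actual=$(md5 -q "$file")
    fi
    [ "$actual" = "$expected" ]
}
```

For example, after downloading the gzipped tarball, one might run: check_md5 biosql-1.0.0.tar.gz e47982d979ddb98aae640b5ab55ce2c6 && echo "checksum matches"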
On behalf of the BioSQL developers,
Hilmar Lapp

Acknowledgments
---------------

BioSQL in general and this release in particular owe an enormous amount to a number of people and would not exist without their contributions, the contributions of people on the biosql-l mailing list, and the support of other developers and users from the Bio* community.

Ewan Birney created the first version of the schema and during the 2003 BioHackathon in Singapore tested and wrote much of the INSTALL document. Elia Stupka and Chris Mungall made significant changes at the 2002 BioHackathons in Tucson, AZ, and Cape Town, South Africa. Aaron Mackey was instrumental in the changes made at the Singapore BioHackathon, which set the path to the version (code-named 'post-Singapore') that eventually stabilized as v1.0. Matthew Pocock and Thomas Down provided important input for the ontology model.

This release and the accompanying work on cleaning up, updating documentation, and jump-starting a useful (wiki) website were irreversibly set in motion at the BioHackathon 2008 in Tokyo, and would not have happened without the active encouragement from several participants, especially Heikki Lehvahslaiho, Mark Schreiber, Richard Holland, and Raoul Bonnal.

Finally, without the superb and prompt help from Mauricio Herrera Cuadra and Jason Stajich with various wiki and other admin issues that occasionally reared their heads we wouldn't have made it to this point.

In recognition of the role the BioHackathon 2008 played in getting this release out the door, and in keeping with an informal tradition held up since the first BioHackathon, I am code-naming the 1.0.x release series the Tokyo release series of BioSQL.

Thank you to everyone!

License
-------

BioSQL is free software: you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
-- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun Mar 9 23:38:18 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 9 Mar 2008 19:38:18 -0400 Subject: [BioSQL-l] bioperl-db bugs In-Reply-To: References: Message-ID: Hi Chris, I added comments to both bug reports. This belongs to BioPerl, though, as it has only to do with its language binding. The tidbit may be worth keeping in mind for a general BioSQL audience is that bioentry namespace (foreign key to biodatabase) is part of the (compound) bioentry unique keys. The identifier column used to be unique by itself (and could still be made such in a local instance, there's a comment to this effect in the DDL), but that was changed a while ago. (Also, if one uses any of the Bio* language bindings, changing a unique key constraint to something that differs from what the language binding assumes may be asking for a lot of trouble. Bioperl-db will expect the combination of primary_id() and namespace () to match if the latter is provided.) -hilmar On Mar 5, 2008, at 6:24 PM, Chris Fields wrote: > Hilmar, > > I think I have two bioperl-db bugs sorted out, but I'm trying to > determine whether the solution is a side-effect, a feature, or a > bug. Dmitry has filed two bug reports which are somewhat related: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2280 > http://bugzilla.open-bio.org/show_bug.cgi?id=2281 > > I have added my comments to it, but maybe you can shed some more > light on this. What he is trying to do is copy a persistent Seq > object to a different namespace; load_seqdatabase.pl won't let him > do that directly using the same sequence file. If he changes the > namespace() and store()s it using a script, the seq is moved to the > new namespace, not updated. 
> > My reasoning is this is a feature (by not changing the primary_key, > you don't store a new sequence but update the current one). > However, if the primary_key is unset (undef), then it appears you > can copy the sequence over (from Dmitry's script, with my addition > noted): > > ... > my $ns1 = 'space1'; > my $ns2 = 'space2'; > > my $seqadp = $db->get_object_adaptor('Bio::SeqI'); > my $aux_seq = Bio::Seq::RichSeq->new( > -accession_number => 'NC_005982', > -version => 1, > -namespace => $ns1); > my $seq = $seqadp->find_by_unique_key($aux_seq); > > # store the found sequence in the second biodatabase: > my $pseq = $seqadp->create_persistent($ns2); > $pseq->namespace('bioperl2'); > $pseq->primary_key(undef); # my addition, which appears to work > $pseq->store(); > $seqadp->commit; > ... > > My question: is this an intended effect? The ability to assign > undef to primary_key seems intentional based on the method code, > but I'm a bit uncertain here. > > chris > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From jswetnam at gmail.com Mon Mar 10 19:27:46 2008 From: jswetnam at gmail.com (James Swetnam) Date: Mon, 10 Mar 2008 15:27:46 -0400 Subject: [BioSQL-l] Possible Mysql 5.x bug Message-ID: <2ABD56A9-9632-4AB1-BC54-B0AF71037DC8@gmail.com> First off, thank you very much to the developers for creating and maintaining such a useful and interesting project. I think I have found a small syntactical bug; as a caveat, however, I am not a database developer and have very little experience in these matters. I do know how to read documentation though, which I've relied heavily on to write this email. 
As per the biopython setup tutorial I'm attempting to run the biosqldb-mysql.sql file on Mac OS X Leopard. Here is my mysql version string:

    cardozo13:sql james$ mysql -V
    mysql Ver 14.12 Distrib 5.0.54-20071214, for apple-darwin9.1.0 (powerpc) using EditLine wrapper

And my procedure (after grabbing the biosql source via CVS).

    cardozo13:sql james$ mysqladmin -u root -p create bioseqdb
    Enter password:
    cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql
    Enter password:
    ERROR 1064 (42000) at line 169: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near '--CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id)' at line 1

Interesting. Let's take a look at line 169:

    --CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);

And an excerpt from the documentation for my version of MySQL (5.0 reference manual), section 1.8.5.6. '--' as the Start of a Comment:

    Standard SQL uses "--" as a start-comment sequence. MySQL Server uses "#" as the start comment character. MySQL Server 3.23.3 and up also supports a variant of the "--" comment style. That is, the "--" start-comment sequence must be followed by a space (or by a control character such as a newline). The space is required to prevent problems with automatically generated SQL queries that use constructs such as the following, where we automatically insert the value of the payment for payment:

OK. So after replacing all the lines in which -- is not followed by a space (thank you regexps), it works beautifully.

    cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql
    Enter password:

Should this change be implemented? Or am I missing something?

James Swetnam
Research Technician
New York University School of Medicine
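
The replacement described above could be sketched roughly as follows. This is not the exact regexp that was used (the message does not show it); the function name fix_sql_comments is hypothetical, and the sketch only touches lines that begin with "--" immediately followed by a non-space character.

```shell
#!/bin/sh
# fix_sql_comments [FILE...]   (hypothetical helper)
# Insert a space after a leading "--" when the next character is not
# whitespace, so that MySQL accepts the line as a comment.
# Lines that already read "-- ..." and ordinary SQL are left unchanged.
# Reads the named files, or standard input when no file is given.
fix_sql_comments() {
    sed -E 's/^--([^[:space:]])/-- \1/' "$@"
}
```

Usage might look like: fix_sql_comments biosqldb-mysql.sql > biosqldb-mysql.fixed.sql, after which the fixed file is the one piped into mysql. Writing to a new file rather than editing in place keeps the original around for comparison.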
From hlapp at gmx.net Tue Mar 11 03:05:32 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 10 Mar 2008 23:05:32 -0400
Subject: [BioSQL-l] Possible Mysql 5.x bug
In-Reply-To: <2ABD56A9-9632-4AB1-BC54-B0AF71037DC8@gmail.com>
References: <2ABD56A9-9632-4AB1-BC54-B0AF71037DC8@gmail.com>
Message-ID: <9051AFFE-8660-4E21-B25F-93D1FB70D98B@gmx.net>

Hi James,

thanks for reporting this. Sebastian Bassi beat you to it, though, and it has actually been fixed in svn, and is also fixed in the 1.0.0 release.

BioSQL is meanwhile on svn; the anonymous cvs server is still up, but doesn't get updated since the switch-over to svn. Instructions for downloading from svn and download location of the 1.0.0 release are on the BioSQL wiki at http://biosql.org.

Let us know if you encounter any difficulties. And great that you're finding the project useful!

-hilmar

On Mar 10, 2008, at 3:27 PM, James Swetnam wrote:

> First off, thank you very much to the developers for creating and
> maintaining such a useful and interesting project. I think I have
> found a small syntactical bug; as a caveat, however, I am not a
> database developer and have very little experience in these matters.
> I do know how to read documentation though, which I've relied heavily
> on to write this email.
> As per the biopython setup tutorial I'm attempting to run the
> biosqldb-mysql.sql file on Mac OS X Leopard.
> Here is my mysql version string:
> cardozo13:sql james$ mysql -V
> mysql Ver 14.12 Distrib 5.0.54-20071214, for apple-darwin9.1.0
> (powerpc) using EditLine wrapper
> And my procedure (after grabbing the biosql source via CVS).
> cardozo13:sql james$ mysqladmin -u root -p create bioseqdb
> Enter password:
> cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql
> Enter password:
> ERROR 1064 (42000) at line 169: You have an error in your SQL syntax;
> check the manual that corresponds to your MySQL server version for the
> right syntax to use near '--CREATE INDEX ontrel_subjectid ON
> term_relationship(subject_term_id)' at line 1
> Interesting. Let's take a look at line 169:
>
> --CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);
>
> And an excerpt from the documentation for my version of MySQL (5.0
> reference manual), section 1.8.5.6. '--' as the Start of a Comment:
>
> Standard SQL uses "--" as a start-comment sequence. MySQL Server uses
> "#" as the start comment character. MySQL Server 3.23.3 and up also
> supports a variant of the "--" comment style. That is, the "--"
> start-comment sequence must be followed by a space (or by a control
> character such as a newline). The space is required to prevent
> problems with automatically generated SQL queries that use constructs
> such as the following, where we automatically insert the value of the
> payment for payment:
>
> OK. So after replacing all the lines in which -- is not followed by a
> space (thank you regexps), it works beautifully.
>
> cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql
> Enter password:
>
> Should this change be implemented? Or am I missing something?
>
> James Swetnam
> Research Technician
> New York University School of Medicine
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Tue Mar 11 18:51:47 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 11 Mar 2008 18:51:47 +0000
Subject: [BioSQL-l] Biopython documentation in BioSQL SVN
In-Reply-To: <320fb6e00803111148k52150fdcu3b3c22f72f59f514@mail.gmail.com>
References: <320fb6e00803111148k52150fdcu3b3c22f72f59f514@mail.gmail.com>
Message-ID: <320fb6e00803111151q645c1f0cgead7842e8ab0d0d@mail.gmail.com>

Hello,

Over on the Biopython mailing list, James Swetnam drew my attention to the fact that we still had documentation referring to installing BioSQL from CVS (predating both the move to SVN and the official 1.0 release). I've updated our wiki page, http://biopython.org/wiki/BioSQL

However, there is some older LaTeX based documentation on our webpage,
http://biopython.org/DIST/docs/biosql/python_biosql_basic.html
http://biopython.org/DIST/docs/biosql/python_biosql_basic.pdf

These are currently living in the BioSQL repository, which I don't think I have access to.
http://code.open-bio.org/svnweb/index.cgi/biosql/browse/biosql-schema/trunk/doc/biopython/ Does it make sense to have this documentation separate from the Biopython code it refers to (which lives in the Biopython repository)? For one thing, it complicates access rights for developers. What I would suggest is just to: (*) add a disclaimer to the top of python_biosql_basic.tex saying this document is depreciated, and giving a link to the wiki page, http://biopython.org/wiki/BioSQL (*) regenerate the PDF and HTML files. (*) Update these three files in BioSQL's SVN repository. (*) Copy the new PDF and HTML files over to the Biopython webserver. Thanks Peter From hlapp at gmx.net Tue Mar 11 19:57:16 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 11 Mar 2008 15:57:16 -0400 Subject: [BioSQL-l] Biopython documentation in BioSQL SVN In-Reply-To: <320fb6e00803111151q645c1f0cgead7842e8ab0d0d@mail.gmail.com> References: <320fb6e00803111148k52150fdcu3b3c22f72f59f514@mail.gmail.com> <320fb6e00803111151q645c1f0cgead7842e8ab0d0d@mail.gmail.com> Message-ID: On Mar 11, 2008, at 2:51 PM, Peter wrote: > However, there is some older LaTeX based documentation on our webpage, > http://biopython.org/DIST/docs/biosql/python_biosql_basic.html > http://biopython.org/DIST/docs/biosql/python_biosql_basic.pdf > > These are currently living in the BioSQL repository, You mean that the originals are, i.e., the source .tex file, right? The files in the BioSQL repository have been updated, and the updates should be in the v1.0.0 release. > [...] > Does it make sense to have this documentation separate from the > Biopython code it refers to (which lives in the Biopython repository)? > For one thing, it complicates access rights for developers. Indeed. You can have write access but that doesn't mean it would then be easy to maintain for you folks (as it being in a non-biopython repository likely makes it slip from your mind again). However, at the end of the day it is your call. 
I'm happy to leave it there, especially if there is continuing interest from Biopython folks to keep it updated (if there isn't, I may schedule it for deletion for one of the 1.1 or higher releases). > > What I would suggest is just to: > > (*) add a disclaimer to the top of python_biosql_basic.tex saying this > document is depreciated, and giving a link to the wiki page, > http://biopython.org/wiki/BioSQL Just send me a patch of the change you would like to make. > (*) regenerate the PDF and HTML files. Those have been regenerated already, before the v1.0.0 release (by me, under some pains trying to get HeVeA to do what the original creators seemed to have gotten it to do). > (*) Update these three files in BioSQL's SVN repository. Done already as far as the change to svn is concerned. Actually, some Biopythonist (Sebastian?) walked through the file and made sure everything works as described, giving rise to an additional change. > (*) Copy the new PDF and HTML files over to the Biopython webserver. Feel free to grab them from svn (or from the BioSQL 1.0.0 release, there haven't been any changes since the release). -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Thu Mar 13 15:06:18 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Mar 2008 15:06:18 +0000 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id Message-ID: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> Dear list, One of the unresolved issues with Biopython's BioSQL interface is dealing with the NCBI taxon ID when loading sequences into the database. 
As I understand it, ideally before loading any sequences, the user will have loaded in the entire NCBI taxonomy using the load_ncbi_taxonomy.pl script, as I described here:
http://biopython.org/wiki/BioSQL#NCBI_Taxonomy

When a new sequence is added to the database with a known taxon id, there is no problem. But what happens if it's a recently sequenced organism which isn't defined yet in the BioSQL taxonomy tables? Could/should the user re-run load_ncbi_taxonomy.pl, and then load in their new sequence?

Right now in Biopython, due to what appears to have been intended as a short term hack, we simply don't record the taxon id at all (!), and I would like to fix this (bug 2422).
http://bugzilla.open-bio.org/show_bug.cgi?id=2422

How do BioPerl et al deal with this issue? Do they try and update the taxonomy tables using the available information in the new record's annotation (i.e. the new taxon id and the species name)? Do they lookup the NCBI taxonomy definition via the internet? Do they throw an error and halt?

Thanks,

Peter
(Biopython)

From hlapp at gmx.net Thu Mar 13 22:51:13 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 13 Mar 2008 18:51:13 -0400
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
Message-ID: <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>

(this is more of a bioperl question than a biosql one)

The load_ncbi_taxonomy.pl script is designed to update the taxon tables in a non-disruptive way, and if there weren't many changes shouldn't actually take that long (except that recalculating the nested set values may take a couple of minutes).

Bioperl-db will store the taxon information it finds in the Bio::Species object if it can't locate the taxon by lookup, and will not raise an error.
The problem with this is that it relies on the Bio::SeqIO parser to have gotten the species and lineage information correct, which is sometimes a wrong assumption for exotic species. Most often the error will not manifest itself at the time of storing the erroneously parsed information, but when it is re-retrieved and used to populate a Bio::Species object.

For the SymAtlas project we had this situation (new species in sequence updates that the last NCBI taxonomy update hadn't yet brought in) quite regularly. I wrote a SQL script that would fix those 'haphazard' additions such that load_ncbi_taxonomy would update them to their correct values come the next NCBI taxonomy update. I can send you the script (it would be for the Oracle version), but I'm not sure this is a widely viable strategy.

-hilmar

On Mar 13, 2008, at 11:06 AM, Peter wrote:

> Dear list,
>
> One of the unresolved issues with Biopython's BioSQL interface is
> dealing with the NCBI taxon ID when loading sequences into the
> database.
>
> As I understand it, ideally before loading any sequences, the user
> will have loaded in the entire NCBI taxonomy using the
> load_ncbi_taxonomy.pl script, as I described here:
> http://biopython.org/wiki/BioSQL#NCBI_Taxonomy
>
> When a new sequence is added to the database with a known taxon id,
> there is no problem. But what happens if it's a recently sequenced
> organism which isn't defined yet in the BioSQL taxonomy tables?
> Could/should the user re-run load_ncbi_taxonomy.pl, and then load in
> their new sequence?
>
> Right now in Biopython, due to what appears to have been intended as a
> short term hack, we simply don't record the taxon id at all (!), and I
> would like to fix this (bug 2422).
> http://bugzilla.open-bio.org/show_bug.cgi?id=2422
>
> How do BioPerl et al deal with this issue? Do they try and update the
> taxonomy tables using the available information in the new record's
> annotation (i.e. the new taxon id and the species name)?
Do they > lookup the NCBI taxonomy definition via the internet? Do they throw > an error and halt? > > Thanks, > > Peter > (Biopython) > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Thu Mar 13 23:13:32 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 13 Mar 2008 23:13:32 +0000 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> Message-ID: <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp wrote: > (this is more of a bioperl question than a biosql one) Well, yes and no. And I'm not subscribed to the Bioperl list, nor the BioJava one, nor the BioRuby one. > The load_ncbi_taxonomy.pl script is designed to update the taxon > tables in a non-disruptive way, and if there weren't many changes > shouldn't actually take that long (except that recalculating the > nested set values may take a couple of minutes). Do you think when faced with a novel taxon id, Biopython/BioPerl/... could write some minimal taxonomy entry (without any guess work based on the species name), in order to record the sequence's taxon - and then running an improved load_ncbi_taxonomy.pl at a later date would sort out the proper taxonomy? > Bioperl-db will store the taxon information it finds in the > Bio::Species object if it can't locate the taxon by lookup, and will > not raise an error. 
The problem with this is that it relies on the > Bio::SeqIO parser to have gotten the species and lineage information > correct, which is sometimes a wrong assumption for exotic species. > Most often the error will not manifest itself at the time of storing > the erroneously parsed information, but when it is re-retrieved and > used to populate a Bio::Species object. This is what I would like to avoid with Biopython. > For the SymAtlas project we had this situation (new species in > sequence updates that the last NCBI taxonomy update hadn't yet > brought in) quite regularly. I wrote a SQL script would fix those > 'haphazard' additions such that load_ncbi_taxonomy would update them > to their correct values come the next NCBI taxonomy update. I can > send you the script (it would be for the Oracle version), but I'm not > sure this is a widely viable strategy. So this wasn't integrated with load_ncbi_taxonomy.pl at all? Peter From hlapp at gmx.net Thu Mar 13 23:41:43 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 13 Mar 2008 19:41:43 -0400 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> Message-ID: On Mar 13, 2008, at 7:13 PM, Peter wrote: > On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp wrote: >> [...] >> The load_ncbi_taxonomy.pl script is designed to update the taxon >> tables in a non-disruptive way, and if there weren't many changes >> shouldn't actually take that long (except that recalculating the >> nested set values may take a couple of minutes). > > Do you think when faced with a novel taxon id, Biopython/BioPerl/... > could write some minimal taxonomy entry (without any guess work based > on the species name), in order to record the sequence's taxon This is what Bioperl-db does. 
There isn't any guesswork. If Bio::Species has lineage information it will also insert the lineage information, though. > - and then running an improved load_ncbi_taxonomy.pl at a later > date would > sort out the proper taxonomy? If I remember correctly, the script makes (and hence expects) the primary key and the NCBI taxonomy ID to be identical. If your loading procedure can achieve that already then load_ncbi_taxonomy.pl should pick them up and fix them. You can try that by loading the taxonomy through the script, then arbitrarily choose a taxon, create a stub bioentry for it and set its taxon_id foreign key to the chosen taxon, change its taxon_name.name to some bogus value (for the 'scientific name' class, for example) (and feel free to change the left_id and right_id values in taxon too), and rerun the script. It should fix the change you made, and your bioentry should still point to the same taxon (because its primary key did not change, and did not get deleted either; otherwise the bioentry would now have a null value in the foreign key). The Bioperl-db way of storing things does not give control over primary key assignment to Bioperl-db, so the database will assign it. > [...] >> For the SymAtlas project we had this situation (new species in >> sequence updates that the last NCBI taxonomy update hadn't yet >> brought in) quite regularly. I wrote a SQL script would fix those >> 'haphazard' additions such that load_ncbi_taxonomy would update them >> to their correct values come the next NCBI taxonomy update. I can >> send you the script (it would be for the Oracle version), but I'm >> not >> sure this is a widely viable strategy. > > So this wasn't integrated with load_ncbi_taxonomy.pl at all? No, but now that you say it I don't see any reason why I couldn't. Maybe that's just what I should do. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From mrphysh at juno.com Fri Mar 14 01:58:25 2008 From: mrphysh at juno.com (mrphysh at juno.com) Date: Fri, 14 Mar 2008 01:58:25 GMT Subject: [BioSQL-l] bioperl basics Message-ID: <20080313.195825.6855.0@webmail20.vgs.untd.com> I am a molecular biologist studying bioinformatics from a Perl background and making progress. I am realizing that without tapping into the existing infrastructure, I will be writing code forever. Bioperl is the path for me. I am moving forward. The error I encounter is: can't locate Cache/FileCache in @INC (@INC contains /etc/perl/ /usr/locaql/lib/perl/5.8.8 .....) and so forth. I found the files in a home directory. I must have told the install to put them there...? Anyway: how do I edit this environmental variable, @INC? I cannot find anything in my book. thanks john brigham From barry.moore at genetics.utah.edu Fri Mar 14 03:08:19 2008 From: barry.moore at genetics.utah.edu (Barry Moore) Date: Thu, 13 Mar 2008 21:08:19 -0600 Subject: [BioSQL-l] bioperl basics In-Reply-To: <20080313.195825.6855.0@webmail20.vgs.untd.com> References: <20080313.195825.6855.0@webmail20.vgs.untd.com> Message-ID: John, @INC is not an environment variable; it is a Perl variable that Perl populates from its built-in defaults plus, among other things, the environment variable PERL5LIB.
You would normally set that environment variable by doing something like export PERL5LIB="/path/to/perl/libraries:$PERL5LIB" if you use the bash shell, or setenv PERL5LIB "/path/to/perl/libraries:$PERL5LIB" if you use the C shell, and you'll want to put those lines into the appropriate startup files so that they get set every time you log in. This will be different on a Windows system but I'm afraid I can't help with that. If you are having trouble installing bioperl I would encourage you to read the installation documentation at http://www.bioperl.org/wiki/Installing_BioPerl. Beyond that you will find a wealth of help with your beginning perl questions by searching the web with Google, asking at perlmonks.org or joining one of the many perl mailing lists that you can find at http://lists.cpan.org/. The bioperl mailing list and this mailing list (BioSQL) are devoted specifically to discussions directly related to Bioperl and BioSQL respectively. You should search for answers to questions like this one first on the web, then on one of the general perl mailing lists or web sites mentioned above. When you have questions (even beginner ones) that are specific to Bioperl or BioSQL you are welcome to post to those lists at any time. Barry On Mar 13, 2008, at 7:58 PM, mrphysh at juno.com wrote: > I am a molecular biologist studying bioinformatics from a Perl > background and making progress. I am realizing that without > tapping into the existing infrastructure, I will be writing code > for ever. Bioperl is the path for me. I am moving forward. > > the error I encounter is > > can't locate Cache/FileCache in @INC (@INC contains /etc/perl/ /usr/ > locaql/lib/perl/5.8.8 .....) and so forth. > > I found the files in a home directory. I must have told the > install to put them there...? > > > anyway: How do I edit this environmental variable..... @INC. I > cannot find anything in my book.
> > thanks > john brigham From markjschreiber at gmail.com Fri Mar 14 13:48:38 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Fri, 14 Mar 2008 21:48:38 +0800 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> Message-ID: <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> From memory BioJava will add it if it is not already in there. If the taxid can be found then the system connects you with whatever is in that taxid, it doesn't overwrite it. This has two curious side effects. Because the details associated with a taxid sometimes change (eg common name changes a lot) you can get connected to an outdated version (if your record is newer than your NCBI taxonomy) or you can get connected with a version that is newer than your record which means when you round-trip you don't get complete identity. For compatibility across the projects some kind of consensus would be good. - Mark On Fri, Mar 14, 2008 at 7:41 AM, Hilmar Lapp wrote: > [...] From cjfields at uiuc.edu Fri Mar 14 14:31:09 2008 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 14 Mar 2008 09:31:09 -0500 Subject: [BioSQL-l] [Bioperl-l] Loading sequences with novel NCBI taxon id In-Reply-To: <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> Message-ID: The counter to that perspective (using new sequences with old tax info) would be to regularly update NCBI taxonomy, particularly in circumstances prior to adding new sequences. Hilmar mentioned that once tax is loaded it doesn't take as long to update, so you could set up a cron job to update regularly. I remember someone mentioning weekly or monthly updates on the list quite a while ago, but I'm unsure how often NCBI updates tax information (i.e. with every release, monthly, weekly, etc).
I can see instances popping up where you used an up-to-date taxonomy but a new sequence contains a tax ID not present. I think bioperl-db handles these but I'm not sure what other Bio* do. chris On Mar 14, 2008, at 8:48 AM, Mark Schreiber wrote: > [...] Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From markjschreiber at gmail.com Sat Mar 15 00:56:37 2008 From: markjschreiber at gmail.com (Mark Schreiber) Date: Sat, 15 Mar 2008 08:56:37 +0800 Subject: [BioSQL-l] [Bioperl-l] Loading sequences with novel NCBI taxon id In-Reply-To: References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> Message-ID: <93b45ca50803141756m3d7f022cnb57bd39f37270682@mail.gmail.com> I agree. A regular update would be best. Of course if your BioSQL db is limited to one or a few organisms you can just keep a fragment of the db. - Mark On Fri, Mar 14, 2008 at 10:31 PM, Chris Fields wrote: > The counter to that perspective (using new sequences with old tax > info) would be to regularly update NCBI taxonomy, particularly in > circumstances prior to adding new sequences. Hilmar mentioned that > once tax is loaded it doesn't take as long to update, so you could set > up a cron job to update regularly. > > I remember someone mentioning weekly or monthly updates on the list > quite a while ago, but I'm unsure how often NCBI updates tax > information (i.e. with every release, monthly, weekly, etc).
I can > see instances popping up where you used an up-to-date taxonomy but > a new sequence contains a tax ID not present. I think bioperl-db > handles these but I'm not sure what other Bio* do. > > chris > > [...] From biopython at maubp.freeserve.co.uk Sun Mar 16 19:16:04 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 16 Mar 2008 19:16:04 +0000 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> Message-ID: <320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com> On Fri, Mar 14, 2008 Mark Schreiber wrote: > From memory BioJava will add it if it is not already in there. If the > taxid can be found then the system connects you with whatever is in > that taxid, it doesn't overwrite it. BioPerl does this too, so there is consensus on this at least. But see below regarding the lineage. > This has two curious side effects.
Because the details associated with > a taxid sometimes change (eg common name changes a lot) you can get > connected to an outdated version (if your record is newer than your > NCBI taxonomy) or you can get connected with a version that is newer > than your record which means when you round-trip you don't get > complete identity. This is understandable, even if a little unexpected. I (Peter) wrote: > > > Do you think when faced with a novel taxon id, Biopython/BioPerl/... > > > could write some minimal taxonomy entry (without any guess work based > > > on the species name), in order to record the sequence's taxon Hilmar Lapp replied: > > This is what Bioperl-db does. There isn't any guesswork. If > > Bio::Species has lineage information it will also insert the lineage > > information, though. I am planning to fix Biopython so that once again, it will record the taxon id against new sequences if the species is already in the table, and add it to the taxonomy if it isn't there already. Should we also try and add the lineage into the taxon/taxon_name tables, linking to existing entries based on matching scientific names where possible? Or, should we just add a single taxonomy entry for the new species, with no lineage links at all? The old Biopython code also used to add taxon table entries for the full lineage - trying to reuse existing entries based on string matching to the scientific name field in the taxon_name table. This strikes me as a little unreliable (which is why I used the term "guess work" in my earlier email). I am also concerned that this complicates the clean up operation for load_ncbi_taxonomy.pl, but have not looked into this. Hilmar Lapp wrote: > > If I remember correctly, the script makes (and hence expects) the > > primary key and the NCBI taxonomy ID to be identical. Really? Perhaps I have misunderstood you. That would cause problems if we want to record a new sequence entry with species information but no NCBI taxonomy ID (e.g.
an in-house sequencing project). The Biopython code doesn't seem to assume the taxon table ID bears any resemblance to the NCBI taxonomy ID. When creating new taxon table entries, we let the database assign the taxon table id (primary key). Peter From hlapp at gmx.net Sun Mar 16 22:00:12 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 16 Mar 2008 18:00:12 -0400 Subject: [BioSQL-l] BioSQL + Embl + Comments In-Reply-To: <1205182450.18769.20.camel@Graco> References: <1205182450.18769.20.camel@Graco> Message-ID: Hi Raoul, On Mar 10, 2008, at 4:54 PM, Raoul Jean Pierre Bonnal wrote: > Dear Hilmar, > I'm here for asking you some help. > > BioRuby guys chosen as example for round trip tests the sequence ID > AJ224122; SV 3; linear; genomic DNA; STD; PLN; 3827 BP. > > I have problem with the references/comments informations. > In biosql "comment" seems to be something generic not directly > binded to > a reference. Comment in BioSQL is a piece of annotation of type comment. The schema at present only allows you to attach those to bioentries, and in fact one particular comment can be assigned to only one bioentry (1:n relationship). > If you look at the AJ224122's embl format a comment is > connected with the reference. You're referring to the following line, right? RC revised by [3] > There is no problem with genbank because there is only a generic > comment > and BioSQL works correctly in this case. > So, how can I manage the problem with Embl ? I was thinking to add a > column the "comment_id" to "bioentry_reference" as fk to "comment" > table > in a way that a bioentry_reference can have more comments. One question here is whether the comment is specific to the association of the reference with the bioentry, or to the reference in general. The next thing to note is that the comment above is not just text, it actually establishes a relationship to another reference (or to another reference to bioentry association).
So to really capture it you would want a typed link between bioentry_reference rows (in this case the relationship type would be 'revises' or 'revised by', depending on direction). The question is whether this depth of modeling is needed or useful, aside from the fact that I'm pretty sure that none of the Bio* libraries supports it (but maybe they want to?). So if not, I guess this goes back to the use-case of round-tripping? Maybe to satisfy that a bioentry_reference_qualifier table would suffice (assuming that the comment does apply rather to the reference/ bioentry association than directly to the reference). > > PS: I don't know if this stuff should be emailed to biosql list Yes, I actually hadn't realized that you hadn't posted this to the list. Should have forwarded right away, sorry for sitting on it. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Sun Mar 16 22:54:45 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sun, 16 Mar 2008 18:54:45 -0400 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: <320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> <320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com> Message-ID: <9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net> On Mar 16, 2008, at 3:16 PM, Peter wrote: > [...] I (Peter) wrote: >>>> Do you think when faced with a novel taxon id, Biopython/ >>>> BioPerl/... >>>> could write some minimal taxonomy entry (without any guess work >>>> based >>>> on the species name), in order to record the sequence's taxon > > Hilmar Lapp replied: >>> This is what Bioperl-db does. 
There isn't any guesswork. If >>> Bio::Species has lineage information it will also insert the lineage >>> information, though. > > I am planing to fix Biopython so that once again, it will record the > taxon id against new sequences if the species is already in the table, > and add it to the taxonomy if it isn't there already. > > Should we also try and add the lineage into the taxon/taxon_name > tables, linking to existing entries based on matching scientific names > where possible? Or, should we just add a single taxonomy entry for > the new species, with no lineage links at all? This should probably depend on how good or complete the lineage information is that you have. BioPerl parses this out of the sequence files (for formats that have it, such as GenBank, EMBL, UniProt), and so except for exotic clades that don't follow the typical patterns it is usually in good shape (though one might say that the majority of clades are exotic). Moreover, it's worth noting that the NCBI taxonomy often contains more nodes in a lineage than are shown in the GenBank record. In this case, unless you know which levels (ranks) to print and which not to, having the full NCBI taxonomy information may in fact cause problems for round-tripping. > > The old Biopython code also used to add taxon table entries for the > full lineage - trying to reuse existing entries based on string > matching to the scientific name field in the taxon_name table. This > strikes me as a little unreliable (which is why I used the term "guess > work" in my earlier email). It's pretty unreliable actually. There is not only synonymy but also rampant homonymy in taxonomic names. There are plenty of examples for the same scientific name in use for a plant and for some animal, for example. So in order to be unambiguous you will need to know (and check) the kingdom. > I am also concerned that this complicates the clean up operation > for load_ncbi_taxonomy.pl, but have not looked into this. It shouldn't. 
The script makes no difference between tip (species or subspecies) nodes or internal nodes. > > Hilmar Lapp wrote: >>> If I remember correctly, the script makes (and hence expects) the >>> primary key and the NCBI taxonomy ID to be identical. > > Really? Perhaps I have misunderstood you. That would cause problems > if we want to record a new sequence entry with species information but > no NCBI taxonomy ID (e.g. an in house sequencing project). The > Biopython code doesn't seem to assume the taxon table ID bears any > resemblance to the the NCBI taxonomy ID. When creating new taxon > table entries, we let the database will assign the taxon table id > (primary key). Right, that's what I said Bioperl-db does too, and is the reason I had to regularly run that SQL script that would migrate the primary keys. Doing that isn't a big deal but I guess this could also be fixed in load_ncbi_taxonomy.pl so that it doesn't need to rely on this assumption. Would someone mind filing the bug report? (We have a BioSQL category now on bugzilla.open-bio.org.) 
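To make the point about key assignment concrete, here is a minimal sketch of storing a stub entry for a novel taxon while letting the database assign the primary key. The two tables are simplified stand-ins for the BioSQL `taxon` and `taxon_name` tables, and `get_or_create_taxon` is a hypothetical helper, not code from Bioperl-db or Biopython. The lookup order (NCBI taxon id when available, scientific name only for entries without one) is an assumption made for the sketch; matching on name alone is subject to the homonymy problem mentioned above.

```python
import sqlite3

# Simplified stand-ins for the BioSQL taxon and taxon_name tables.
SCHEMA = """
CREATE TABLE taxon (
    taxon_id      INTEGER PRIMARY KEY,  -- assigned by the database
    ncbi_taxon_id INTEGER UNIQUE        -- NULL for in-house species
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL REFERENCES taxon(taxon_id),
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL,
    UNIQUE (taxon_id, name, name_class)
);
"""


def get_or_create_taxon(conn, ncbi_taxon_id, scientific_name):
    """Return the taxon primary key, inserting a minimal stub if needed.

    Hypothetical helper: no lineage is guessed, only one taxon row and
    one scientific-name row are written for a novel taxon.
    """
    if ncbi_taxon_id is not None:
        row = conn.execute(
            "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?",
            (ncbi_taxon_id,),
        ).fetchone()
    else:
        # No NCBI id available: fall back to the scientific name
        # (ambiguous in general because of homonyms).
        row = conn.execute(
            "SELECT taxon_id FROM taxon_name "
            "WHERE name = ? AND name_class = 'scientific name'",
            (scientific_name,),
        ).fetchone()
    if row is not None:
        return row[0]

    # Novel taxon: let the database pick the primary key.
    cur = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id) VALUES (?)", (ncbi_taxon_id,)
    )
    taxon_id = cur.lastrowid
    conn.execute(
        "INSERT INTO taxon_name (taxon_id, name, name_class) "
        "VALUES (?, ?, 'scientific name')",
        (taxon_id, scientific_name),
    )
    return taxon_id


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    key = get_or_create_taxon(conn, 9606, "Homo sapiens")
    # The database-assigned key is generally not the NCBI taxon id.
    print(key, key == 9606)
    conn.close()
```

With keys assigned this way, a loader that looks rows up by primary key equal to the NCBI taxon id will not find the stub, which is the mismatch under discussion here.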
Cheers, -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Mon Mar 17 16:08:43 2008 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 17 Mar 2008 16:08:43 +0000 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: <9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> <320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com> <9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net> Message-ID: <320fb6e00803170908x76f0b9a3he57f4653d2fd433@mail.gmail.com> On Sun, Mar 16, 2008 at 10:54 PM, Hilmar Lapp wrote: > > Should we [Biopython] also try and add the lineage into the taxon/ > > taxon_name tables, linking to existing entries based on matching scientific > > names where possible? Or, should we just add a single taxonomy entry > > for the new species, with no lineage links at all? > > This should probably depend on how good or complete the lineage > information is that you have. BioPerl parses this out of the sequence > files (for formats that have it, such as GenBank, EMBL, UniProt), and > so except for exotic clades that don't follow the typical patterns it > is usually in good shape (though one might say that the majority of > clades are exotic). I'm currently testing with GenBank, EMBL and SwissProt/UniProt files. Some of these files are several years old, and include horrible multi-species SwissProt files with "species" names longer than 255 characters etc. The good news is that, as you pointed out on another thread on the BioSQL mailing list earlier this month, they don't seem to do this anymore.
> Moreover, it's worth noting that the NCBI taxonomy often contains > more nodes in a lineage than are shown in the GenBank record. In this > case, unless you know which levels (ranks) to print and which not to, > having the full NCBI taxonomy information may in fact cause problems > for round-tripping. I've come to accept that taxonomy information won't always survive a round trip. > > The old Biopython code also used to add taxon table entries for the > > full lineage - trying to reuse existing entries based on string > > matching to the scientific name field in the taxon_name table. This > > strikes me as a little unreliable (which is why I used the term "guess > > work" in my earlier email). > > It's pretty unreliable actually. There is not only synonymy but also > rampant homonymy in taxonomic names. There are plenty of examples for > the same scientific name in use for a plant and for some animal, for > example. So in order to be unambiguous you will need to know (and > check) the kingdom. I don't think the current Biopython code for recording the lineages checks the kingdom... could someone point me at the relevant bit of BioPerl and I'll see if I can understand exactly what they do? Hilmar Lapp wrote: > If I remember correctly, the script makes (and hence expects) the > primary key and the NCBI taxonomy ID to be identical. > ... > Doing that isn't a big deal but I guess this could also be fixed in > load_ncbi_taxonomy.pl so that it doesn't need to rely on this > assumption. Would someone mind filing the bug report? (We have a > BioSQL category now on bugzilla.open-bio.org.) 
I've filed Bug 2470 on this, http://bugzilla.open-bio.org/show_bug.cgi?id=2470 Regards, Peter From hlapp at gmx.net Tue Mar 18 12:30:34 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Tue, 18 Mar 2008 08:30:34 -0400 Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id In-Reply-To: <320fb6e00803170908x76f0b9a3he57f4653d2fd433@mail.gmail.com> References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com> <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net> <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com> <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com> <320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com> <9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net> <320fb6e00803170908x76f0b9a3he57f4653d2fd433@mail.gmail.com> Message-ID: <418EB160-7848-4F1A-A88B-99B00003F8A2@gmx.net> On Mar 17, 2008, at 12:08 PM, Peter wrote: >> [...] >> It's pretty unreliable actually. There is not only synonymy but also >> rampant homonymy in taxonomic names. There are plenty of examples >> for >> the same scientific name in use for a plant and for some animal, for >> example. So in order to be unambiguous you will need to know (and >> check) the kingdom. > > I don't think the current Biopython code for recording the lineages > checks the > kingdom... could someone point me at the relevant bit of BioPerl > and I'll see > if I can understand exactly what they do? Bioperl-db locates by NCBI taxon id first and then by scientific name. It does not take kingdom into account. You can find the persisted columns, unique key queries etc in Bio/DB/ BioSQL and then the respective adapter, in this case SpeciesAdapter.pm. The unique key queries are defined in get_unique_key_query(). > > Hilmar Lapp wrote: >> If I remember correctly, the script makes (and hence expects) the >> primary key and the NCBI taxonomy ID to be identical. >> ... 
>> Doing that isn't a big deal but I guess this could also be fixed in >> load_ncbi_taxonomy.pl so that it doesn't need to rely on this >> assumption. Would someone mind filing the bug report? (We have a >> BioSQL category now on bugzilla.open-bio.org.) > > I've filed Bug 2470 on this, http://bugzilla.open-bio.org/ > show_bug.cgi?id=2470 Thanks for the help, great, appreciated! -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at nescent.org Sun Mar 9 23:36:26 2008 From: hlapp at nescent.org (Hilmar Lapp) Date: Sun, 9 Mar 2008 19:36:26 -0400 Subject: [BioSQL-l] bioperl-db bugs In-Reply-To: References: Message-ID: Hi Chris, I added comments to both bug reports. This belongs to BioPerl, though, as it has only to do with its language binding. The tidbit may be worth keeping in mind for a general BioSQL audience is that bioentry namespace (foreign key to biodatabase) is part of the (compound) bioentry unique keys. The identifier column used to be unique by itself (and could still be made such in a local instance, there's a comment to this effect in the DDL), but that was changed a while ago. (Also, if one uses any of the Bio* language bindings, changing a unique key constraint to something that differs from what the language binding assumes may be asking for a lot of trouble. Bioperl-db will expect the combination of primary_id() and namespace () to match if the latter is provided.) -hilmar On Mar 5, 2008, at 6:24 PM, Chris Fields wrote: > Hilmar, > > I think I have two bioperl-db bugs sorted out, but I'm trying to > determine whether the solution is a side-effect, a feature, or a > bug. 
Dmitry has filed two bug reports which are somewhat related: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2280 > http://bugzilla.open-bio.org/show_bug.cgi?id=2281 > > I have added my comments to it, but maybe you can shed some more > light on this. What he is trying to do is copy a persistent Seq > object to a different namespace; load_seqdatabase.pl won't let him > do that directly using the same sequence file. If he changes the > namespace() and store()s it using a script, the seq is moved to the > new namespace, not updated. > > My reasoning is this is a feature (by not changing the primary_key, > you don't store a new sequence but update the current one). > However, if the primary_key is unset (undef), then it appears you > can copy the sequence over (from Dmitry's script, with my addition > noted): > > ... > my $ns1 = 'space1'; > my $ns2 = 'space2'; > > my $seqadp = $db->get_object_adaptor('Bio::SeqI'); > my $aux_seq = Bio::Seq::RichSeq->new( > -accession_number => 'NC_005982', > -version => 1, > -namespace => $ns1); > my $seq = $seqadp->find_by_unique_key($aux_seq); > > # store the found sequence in the second biodatabase: > my $pseq = $seqadp->create_persistent($ns2); > $pseq->namespace('bioperl2'); > $pseq->primary_key(undef); # my addition, which appears to work > $pseq->store(); > $seqadp->commit; > ... > > My question: is this an intended effect? The ability to assign > undef to primary_key seems intentional based on the method code, > but I'm a bit uncertain here. 
> > chris > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- informatics.nescent.org : =========================================================== From darin.london at duke.edu Tue Mar 18 18:16:59 2008 From: darin.london at duke.edu (darin.london at duke.edu) Date: Tue, 18 Mar 2008 13:16:59 -0500 Subject: [BioSQL-l] BOSC 2008 Announcement and Call For Submissions Message-ID: <200803181816.m2IIGx2k007275@tenero.duhs.duke.edu> BOSC 2008 Call for Abstracts The 9th annual Bioinformatics Open Source Conference (BOSC 2008) will take place in Toronto, Ontario, Canada, as one of several Special Interest Group (SIG) meetings occurring in conjunction with the 16th annual Intelligent Systems for Molecular Biology Conference (ISMB 2008). The Bioinformatics Open Source Conference (BOSC) is sponsored by the Open Bioinformatics Foundation (O|B|F), a non-profit group dedicated to promoting the practice and philosophy of Open Source software development within the biological research community. Many Open Source bioinformatics packages are widely used by the research community across many application areas and form a cornerstone in enabling research in the genomic and post-genomic era. Open source bioinformatics software has facilitated rapid innovation and dissemination of new computational methods as well as informatics infrastructure. Since the work of the Open Source Bioinformatics Community represents some of the most cutting edge of Bioinformatics in general, the overall theme for the conference this year is "Tackling Hard Problems with Emerging Technologies". Topics under this umbrella include cyberinfrastructure, grid computing and workflow management and discovery, and visualization. 
We will also have a series of update talks about the main Open Source Bioinformatics Software suites. One of the hallmarks of BOSC is the coming together of the open source developer community in one location. A face-to-face meeting of this community creates synergy where participants can work together to create use cases, prototype working code, or run bootcamps for developers from other projects as short, informal, and hands-on tutorials in new software packages and emerging technologies. In short, BOSC is not just a conference for presentations of completed work, but is a dynamic meeting where collaborative work gets done. This year, BOSC is accepting abstract submissions on the conference theme "Tackling Hard Problems with Emerging Technologies". The conference theme reflects that there are new technologies emerging on both the scientific front (new sequencing technologies, etc.) and the IT front (workflows, mashup/web 2.0, improvements in all of the major programming languages, etc.), which may allow the open source community to solve problems that were previously intractable. Abstracts may be submitted for the following topics. 1. Cyberinfrastructure - We are interested in presentations on topics dealing with the development of infrastructure on the web to facilitate software and data re-use (mashups, or traditional), interoperability and inter-process communication, system/service discovery, and data movement and modeling in distributed systems. This may include peer-to-peer systems of data transfer, Web Services, various flavors of data representation (SOAP, JSON, XML, others), and technologies commonly referred to under the Web 2.0 paradigm (e.g. folksonomies/tagging, user-based content generation, content feeds, and Social Networking). 2. Grid Computing and Workflow Management and Discovery - We particularly invite talks that report progress in making workflow systems easier to use and on how to do distributed-collaborative research , e.g. 
workflows that encompass the coordination of systems running in different parts of the world. 3. Visualization - Visualization is a maturing area of open source software development. We particularly invite talks that demonstrate innovative visualization systems in the context of workflows. 4. Open Source Software - Speakers will present talks on the use, development, or philosophy of open source software in bioinformatics. 5. Bio* Open Source Project Updates - We invite abstracts from the representatives of the open source projects sponsored by or affiliated to the O|B|F (see Projects). Please consult the official BOSC 2008 website at http://www.open-bio.org/wiki/Upcoming_BOSC_conference for all updates and extra information. Submission Process: All abstracts must be submitted through our Open Conference Systems site (http://events.open-bio.org/BOSC2008/openconf.php). The form will ask for a small Abstract Text to be pasted into it, and a full paper. The small Abstract text should be a summary, while the longer abstract (should provide more details, including the open-source license requirement details) Full-length abstracts are limited to one page with one inch (2.5 cm) margins on the top, sides, and bottom. The full-length abstract should include the title, authors, and affiliations. We prefer your abstract to be in PDF format, although plain t Important Dates: May 11: Abstract submission deadline. June 2: Notification of accepted talks. June 4: Early registration discount cut-off. July 18-19: BOSC 2008! We hope to see you at BOSC 2008! Kam Dahlquist and Darin London BOSC 2008 Co-organizers From er at xs4all.nl Thu Mar 20 19:24:12 2008 From: er at xs4all.nl (Erik) Date: Thu, 20 Mar 2008 20:24:12 +0100 (CET) Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer Message-ID: <5095.156.83.1.251.1206041052.squirrel@webmail.xs4all.nl> Hi, (latest BioSQL, bioperl-db, and bioperl-live installed.) 
Postgres 8.3 will not auto-cast text (='character varying') to integer any longer, which causes test t/16odba.t to fail: ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: error while executing query in Bio::DB::BioSQL::SeqAdaptor::find_by_query: ERROR: operator does not exist: character varying = integer LINE 1: ...eq.taxon_id FROM bioentry seq WHERE seq.identifier = 5456929 It seems likely to cause many similar statements to fail; how should this be solved? I tried to fix it but I couldn't find the place where the statement/clauses are put together. Thanks, Erik Rijkers From hlapp at gmx.net Thu Mar 20 22:49:41 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 20 Mar 2008 18:49:41 -0400 Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer In-Reply-To: <5095.156.83.1.251.1206041052.squirrel@webmail.xs4all.nl> References: <5095.156.83.1.251.1206041052.squirrel@webmail.xs4all.nl> Message-ID: <0F80B40B-0232-4367-8433-992588B6E71B@gmx.net> Hi Erik, thanks for the report. Given the error message, it looks more like the integer (which in reality is a string) can't be automatically converted to a string. That would be equally interesting, though. I thought DBI used to bind all parameters as string by default, but maybe that has changed? The parameter values are indeed all bound generically (and the query is created dynamically too), and I'm leaving it up to the DBD drivers to do the "Right Thing". I could obviously force everything into type string, but that is likely to have its own repercussions on various RDBMSs.
So could you file this as a bug report on bugzilla.open-bio.org (category bioperl-db, this is actually not a BioSQL problem), and run the following test on your 8.3 instance (which minor version actually?): CREATE TABLE t1 (a varchar(10), b text, c integer); SELECT * from t1 WHERE a = 1; SELECT * from t1 WHERE b = 1; SELECT * from t1 WHERE c = '1'; INSERT INTO t1 (a,b,c) VALUES ('a','b',1); SELECT * from t1 WHERE a = 1; SELECT * from t1 WHERE b = 1; SELECT * from t1 WHERE c = '1'; SELECT * from t1 WHERE a = 1::text; SELECT * from t1 WHERE b = 1::text; SELECT * from t1 WHERE c = integer '1'; DROP TABLE t1; These work all fine on my 8.1.4 instance. -hilmar On Mar 20, 2008, at 3:24 PM, Erik wrote: > Hi, > > (latest BioSQL, bioperl-db, and bioperl-live installed.) > > Postgres 8.3 will not auto-cast text (='character > varying') to integer any longer, which causes test > t/16odba.t to fail: > > > ------------- EXCEPTION: Bio::Root::Exception ------------- > MSG: error while executing query in > Bio::DB::BioSQL::SeqAdaptor::find_by_query: ERROR: > operator does not exist: character varying = integer > LINE 1: ...eq.taxon_id FROM bioentry seq WHERE > seq.identifier = 5456929 > > It seems likely to cause many similar statements to fail; > how should this be solved? > > I tried to fix it but I couldn't find the place where the > statement/clauses are put together. 
> > > Thanks, > > Erik Rijkers > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From er at xs4all.nl Thu Mar 20 23:30:03 2008 From: er at xs4all.nl (Erik) Date: Fri, 21 Mar 2008 00:30:03 +0100 (CET) Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer Message-ID: <15786.156.83.1.157.1206055803.squirrel@webmail.xs4all.nl> On Thu, March 20, 2008 23:49, Hilmar Lapp wrote: > Hi Erik, thanks for the report. Given the error message, > it looks > more like the integer (which in reality is a string) can't > be automatically converted to a string. you are right, of course :) Here is the postgres 8.3.1 result of your sql statements: CREATE TABLE t1 (a varchar(10), b text, c integer); SELECT * from t1 WHERE a = 1; -- fails in 8.3.1 SELECT * from t1 WHERE b = 1; -- fails in 8.3.1 SELECT * from t1 WHERE c = '1'; -- ok INSERT INTO t1 (a,b,c) VALUES ('a','b',1); SELECT * from t1 WHERE a = 1; -- fails in 8.3.1 SELECT * from t1 WHERE b = 1; -- fails in 8.3.1 SELECT * from t1 WHERE c = '1'; -- ok SELECT * from t1 WHERE a = 1::text; -- ok SELECT * from t1 WHERE b = 1::text; -- ok SELECT * from t1 WHERE c = integer '1'; -- ok The failure is always (virtually) the same: ERROR: operator does not exist: character varying = integer LINE 1: SELECT * from t1 WHERE a = 1; ^ HINT: No operator matches the given name and argument type(s). You might need to add explicit type casts. 
Then there is the cast function: for instance, I can let the test in t/16odba.t proceed faultlessly with $seq = $biodb->get_Seq_by_id( "cast(5456929 as text)" ); I am also doubtful/curious as to how this would affect the various loading scripts which I was going to use - I want to set up a GBrowse with human/mouse/flybase sequence annotation to show ChipSeq data against. But one thing at a time, I guess... > So could you file this as a bug report on > bugzilla.open-bio.org > (category bioperl-db, this is actually not a BioSQL > problem), I'll make an entry in bugzilla/bioperl-db. Thanks for your quick reply! Erik Rijkers From hlapp at gmx.net Fri Mar 21 00:34:42 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 20 Mar 2008 20:34:42 -0400 Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer In-Reply-To: <15786.156.83.1.157.1206055803.squirrel@webmail.xs4all.nl> References: <15786.156.83.1.157.1206055803.squirrel@webmail.xs4all.nl> Message-ID: <987C9C0E-840B-44AD-B3E9-0FC2809FF4F4@gmx.net> On Mar 20, 2008, at 7:30 PM, Erik wrote: > Here is the postgres 8.3.1 result of your sql statements: > > CREATE TABLE t1 (a varchar(10), b text, c integer); > > SELECT * from t1 WHERE a = 1; -- fails in 8.3.1 > SELECT * from t1 WHERE b = 1; -- fails in 8.3.1 > SELECT * from t1 WHERE c = '1'; -- ok > > [...] > The failure is always (virtually) the same: > ERROR: operator does not exist: character varying = integer > LINE 1: SELECT * from t1 WHERE a = 1; > ^ > HINT: No operator matches the given name and argument > type(s). You might need to add explicit type casts. So it's indeed the backend that changed behavior. It's actually documented, as I see now: http://www.postgresql.org/docs/8.3/static/release-8-3.html scroll to section E.2.2. Migration to Version 8.3, E.2.2.1.
General, and the first item there: Non-character data types are no longer automatically cast to TEXT (Peter, Tom) Previously, if a non-character value was supplied to an operator or function that requires text input, it was automatically cast to text, for most (though not all) built-in data types. This no longer happens: an explicit cast to text is now required for all non-character-string types. I can see the arguments there but this will prevent upgrading to 8.3 for a great many applications, and the comments from the Pg developers ('fix your SQL to use casts') that I've seen there on the mailing lists are just not helpful. Fixing the SQL is just not an option for many legacy applications. In the case of Bioperl-db it's very non-trivial, because all of a sudden we would be changing from a hands-off and let-the-driver-figure-it-out approach to forcing types everywhere. So I think at this point with this change I have to declare Bioperl-db officially incompatible with PostgreSQL 8.3+ until we've found a solution to this, which is too bad because it seems 8.3 has some really nice performance features added. One possible solution might be to create a CAST in the database (namely the one that was taken away, restoring behavior to pre-8.3). Another possibility is to move the parameter binding method into the driver adaptor which would then delegate to the DBI method but would be overridden for the PostgreSQL adapter to force all bindings to type string. Which leads me back to the surprise observation that the parameter was bound as an integer in the first place, when DBD::Pg used to bind everything as string unless you told it otherwise. Which DBD::Pg version is it that you are using? I would suspect (or hope) that maybe there is soon an update release of DBD::Pg that fixes this problem by going back to binding everything as string by default (and as the tests show PostgreSQL will still convert strings to integer if necessary).
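[Editorial sketch] The second possibility above (binding moved into the driver adapter, with a PostgreSQL-specific override that forces strings) can be outlined as follows. This is only a language-neutral illustration written in Python rather than Perl; it is not bioperl-db code. The class and method names (DriverAdapter, PgDriverAdapter, prepare_values, RecordingCursor) are hypothetical, and a recording stand-in replaces the real database handle so the sketch runs without a server. Whether string binding alone avoids the 8.3 error is exactly the open question in this thread, so the sketch makes no claim about that.

```python
class DriverAdapter:
    """Default behaviour: hand values to the driver unchanged."""

    def prepare_values(self, values):
        return list(values)

    def execute(self, cursor, sql, values):
        # Delegate to the underlying handle (DBI's execute in the Perl setting).
        cursor.execute(sql, self.prepare_values(values))


class PgDriverAdapter(DriverAdapter):
    """PostgreSQL override: bind every non-NULL value as a string."""

    def prepare_values(self, values):
        return [None if v is None else str(v) for v in values]


class RecordingCursor:
    """Stand-in for a real cursor; it only records what it was asked to run."""

    def __init__(self):
        self.calls = []

    def execute(self, sql, params):
        self.calls.append((sql, params))


if __name__ == "__main__":
    cursor = RecordingCursor()
    query = "SELECT bioentry_id FROM bioentry WHERE identifier = ?"
    PgDriverAdapter().execute(cursor, query, [5456929])
    print(cursor.calls)
```

The point of the design is that only the PostgreSQL adapter changes; every other driver keeps the hands-off behaviour, at the cost of binding no longer being uniform across drivers.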
Depending on what I (or can someone else update us on this?) find out about the DBD::Pg plans, I'll probably start looking into moving the parameter binding into the driver adapters. Though it does feel pathetic that this is now also not transparent between drivers. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From er at xs4all.nl Fri Mar 21 00:51:43 2008 From: er at xs4all.nl (Erik) Date: Fri, 21 Mar 2008 01:51:43 +0100 (CET) Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer Message-ID: <4483.156.83.1.157.1206060703.squirrel@webmail.xs4all.nl> On Fri, March 21, 2008 01:34, Hilmar Lapp wrote: > > So I think at this point with this change I have to > declare Bioperl- > db officially incompatible with PostgreSQL 8.3+ until > we've found a > solution to this, which is too bad because it seems 8.3 > has some > really nice performance features added. Pg 8.3 is indeed very noticeably faster, and it has other excellent new features like full text indexing. (This also means that downgrading is not really an option.) > Which DBD::Pg version is it that you are using?
DBD::Pg 2.3.0 Thanks, Erik Rijkers From hlapp at gmx.net Fri Mar 21 01:36:50 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 20 Mar 2008 21:36:50 -0400 Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer In-Reply-To: <4483.156.83.1.157.1206060703.squirrel@webmail.xs4all.nl> References: <4483.156.83.1.157.1206060703.squirrel@webmail.xs4all.nl> Message-ID: <071CB899-AB3E-40B8-9477-82AE98DB88B1@gmx.net> On Mar 20, 2008, at 8:51 PM, Erik wrote: > On Fri, March 21, 2008 01:34, Hilmar Lapp wrote: >> >> So I think at this point with this change I have to declare >> Bioperl-db officially incompatible with PostgreSQL 8.3+ until >> we've found a solution to this, which is too bad because it seems >> 8.3 has some really nice performance features added. > > Pg 8.3 is indeed very noticably faster, and it has other > excellent new features like full text indexing. (This also > makes that downgrading is not really an option) Right, I saw that too. It is, however, just migrated from what was a contrib module before, so downgrading and using the contrib module is an option. Furthermore, folding these new features together with a behavior change that is backwards incompatible was a choice the PostgreSQL people made, not we. We also aren't doing poor typing that deserves fixing; we're just not doing any typing by treating everything as a string. This is the Perl paradigm. At this point it's actually unclear to me how this new behavior is compatible with untyped scripting languages unless you know the type of each column that you're binding a value for, because if you actually force typecasts to string for everything you get an error if an integer is indeed what's needed. I'm wondering what I'm missing. 
-hilmar BTW what does the following query yield on your 8.3.1 database:

select s.typname as source, t.typname as target, f.proname as function, c.castcontext
from pg_cast c, pg_type s, pg_type t, pg_proc f
where c.castsource = s.oid and c.casttarget = t.oid and c.castfunc = f.oid
and t.typname = 'text';

On my 8.1.4 database I get:

   source    | target | function | castcontext
-------------+--------+----------+-------------
 bpchar      | text   | text     | i
 char        | text   | text     | i
 name        | text   | text     | i
 int8        | text   | text     | i
 int2        | text   | text     | i
 int4        | text   | text     | i
 oid         | text   | text     | i
 float4      | text   | text     | i
 float8      | text   | text     | i
 macaddr     | text   | text     | e
 cidr        | text   | text     | e
 inet        | text   | text     | e
 date        | text   | text     | i
 time        | text   | text     | i
 timestamp   | text   | text     | i
 timestamptz | text   | text     | i
 interval    | text   | text     | i
 timetz      | text   | text     | i
 numeric     | text   | text     | i
(19 rows)

-- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From greg at turnstep.com Fri Mar 21 02:41:10 2008 From: greg at turnstep.com (Greg Sabino Mullane) Date: Fri, 21 Mar 2008 02:41:10 -0000 Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer In-Reply-To: <987C9C0E-840B-44AD-B3E9-0FC2809FF4F4@gmx.net> Message-ID: <19ecb7a297f64722c4f63f10ed2ebdce@biglumber.com> -----BEGIN PGP SIGNED MESSAGE----- Hash: RIPEMD160 > Which leads me back to the surprise observation that the parameter > was bound as an integer in the first place, when DBD::Pg used to bind > everything as string unless you told it otherwise. Which DBD::Pg > version is it that you are using? I would suspect (or hope) that > maybe there is soon an update release of DBD::Pg that fixes this > problem by going back to binding everything as string by default (and > as the tests show PostgreSQL will still convert strings to integer if > necessary).
> > Depending on what I (or can someone else update us on this?) find out > for the DBD::Pg plans, I'll probably start looking into moving the > parameter binding into the driver adapters. Though it does feel > pathetic that this is now also not transparent between drivers. What you are probably looking for is already there, namely: $dbh->{pg_server_prepare} = 0; There's good reasons for the casting enforcement in 8.3, although I've been a sharp critic of the change, and certainly of the suddeness of it. Another solution to consider is adding the casts back in: http://people.planetpostgresql.org/peter/index.php?/archives/2008/03.html (the March 4th entry) - -- Greg Sabino Mullane greg at turnstep.com PGP Key: 0x14964AC8 200803202237 http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8 -----BEGIN PGP SIGNATURE----- iEYEAREDAAYFAkfjIBYACgkQvJuQZxSWSsiamwCdEbNrC4F4oU7AGHrbHAm1YNXG HbUAoIRJtGW4brvMKklxZYG6pusbcTqf =Zawx -----END PGP SIGNATURE----- From hlapp at gmx.net Fri Mar 21 12:52:39 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Fri, 21 Mar 2008 08:52:39 -0400 Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer In-Reply-To: <19ecb7a297f64722c4f63f10ed2ebdce@biglumber.com> References: <19ecb7a297f64722c4f63f10ed2ebdce@biglumber.com> Message-ID: Hi Greg - thanks for your email, it's very helpful. On Mar 20, 2008, at 10:41 PM, Greg Sabino Mullane wrote: >> >> Depending on what I (or can someone else update us on this?) find out >> for the DBD::Pg plans, I'll probably start looking into moving the >> parameter binding into the driver adapters. Though it does feel >> pathetic that this is now also not transparent between drivers. > > What you are probably looking for is already there, namely: > > $dbh->{pg_server_prepare} = 0; So disabling server-side prepares will leave values quoted? 
Having server-side prepares would be very useful though, especially for Bioperl-db with its many lookup queries that all use similar parameter values. > > There's good reasons for the casting enforcement in 8.3 I do understand that, but it's also a sharp contrast to other RDBMSs that doesn't make it easier for people to choose Pg when they should, and doesn't help writing cross-platform database applications either. > although I've been a sharp critic of the change, and certainly of > the suddeness > of it. Another solution to consider is adding the casts back in: > > http://people.planetpostgresql.org/peter/index.php?/archives/ > 2008/03.html > (the March 4th entry) Thanks for this, that helps a lot. Do you have links to some of the key threads showing what rationale went into the decision? (Or should I just search for your name?) I'd like to read up on that first before pouring more oil onto the fire. I suspect that many of those who made the decision are never faced with needing to write cross-RDBMS code. Also, I wonder why this wasn't made a configurable option so it can be disabled by a simple config file change (such as the move away from automatic OID columns). But obviously this is the wrong list for discussing this (though Bioperl-db *is* one of those pieces of software that must be cross-RDBMS).
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From er at xs4all.nl Fri Mar 21 21:43:47 2008 From: er at xs4all.nl (Erik) Date: Fri, 21 Mar 2008 22:43:47 +0100 (CET) Subject: [BioSQL-l] [Bioperl-l] postgres 8.3 - load_seqdatabase.pl / swissprot Message-ID: <16589.156.83.1.157.1206135827.squirrel@webmail.xs4all.nl> Hi, PostgreSQL 8.3.1 DBD::Pg 2.3.0 perl 5.8.8 (The following error may have to do with the 8.3 problems that I reported yesterday (bug 2472) - I don't know) I ran biosql-schema/scripts/load_ncbi_taxonomy.pl without problem. Then I ran scripts/biosql/load_seqdatabase.pl as: perl scripts/biosql/load_seqdatabase.pl \ -driver Pg \ -dbuser xxxxxxx \ -dbname bioseqdb \ -namespace swissprot \ -format swiss \ /DATA/ms/ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat It took two hours to load 26504 records (7%) of uniprot_sprot.dat (is it expected to be so slow?), then failed with: Could not store Q2UXW0: ------------- EXCEPTION: Bio::Root::Exception ------------- MSG: create: object (Bio::Species) failed to insert or to be found by unique key STACK: Error::throw STACK: Bio::Root::Root::throw /home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/Root/Root.pm:357 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206 STACK: Bio::DB::Persistent::PersistentObject::create /home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:244 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create /home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:169 STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store /home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251 STACK: Bio::DB::Persistent::PersistentObject::store 
/home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271 STACK: scripts/biosql/load_seqdatabase.pl:630 ----------------------------------------------------------- I don't know if this is directly related to the 8.3 casting problems I reported yesterday (bug 2472), or a separate Bio::Species issue regards, Erik Rijkers From hlapp at gmx.net Sat Mar 22 18:18:45 2008 From: hlapp at gmx.net (Hilmar Lapp) Date: Sat, 22 Mar 2008 14:18:45 -0400 Subject: [BioSQL-l] Call for Student Applications - NESCent participates in the Google Summer of Code In-Reply-To: <0025B440-EF1E-4632-9DB4-B98489BF3550@duke.edu> Message-ID: <5AC4F213-8D88-41C6-B380-59B2EF7831F0@gmx.net> Hi all - just wanted to draw your attention to our Google Summer of Code participation this year. One of the projects deals directly with BioPerl, another one builds on BioSQL (and could be implemented taking advantage of BioPerl or Bio::Phylo, or Biojava). Cheers, -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== Phyloinformatics Summer of Code 2008 http://phyloinformatics.net/Phyloinformatics_Summer_of_Code_2008 *** Please disseminate this announcement widely to appropriate students at your institution *** The National Evolutionary Synthesis Center (NESCent: http:// www.nescent.org/) is participating in 2008 for the second year as a mentoring organization in the Google Summer of Code (http:// code.google.com/soc). Through this program, Google provides undergraduate, masters, and PhD students with a unique opportunity to obtain hands-on experience writing and extending open-source software under the mentorship of experienced developers from around the world. 
Our goal in participating is to train future researchers and developers to not only have awareness and understanding of the value of open-source and collaboratively developed software, but also to gain the programming and remote collaboration skills needed to successfully contribute to such projects. Students will receive a stipend from Google, and may work from their home, or home institution, for the duration of the 3 month program. Students will each have one or more dedicated mentors with expertise in phylogenetic methods and open-source software development. NESCent is particularly targeting students interested in both evolutionary biology and software development. Project ideas (see URL below) range from visualizing phylogenetic data in R, to development of a Mesquite module, web-services for phylogenetic data providers or geophylogeny mashups, implementing phyloXML support, navigating databases of networks, topology queries for PhyloCode registries, to phylogenetic tree mining in a MapReduce framework, and more. The project ideas are flexible and many can be adjusted in scope to match the skills of the student. If the program sounds interesting to you but you are unsure whether you have the necessary skills, please email the mentors at the address below. We will work with you to find a project that fits your interests and skills. INQUIRIES: Email any questions, including self-proposed project ideas, to phylosoc {at} nescent {dot} org. TO APPLY: Apply on-line at the Google Summer of Code website (http://code.google.com/soc/2008), where you will also find GSoC program rules and eligibility requirements. The 1-week application period for students opens on Monday March 24th and runs through Monday, March 31st, 2008. 
Hilmar Lapp and Todd Vision
US National Evolutionary Synthesis Center

===== URLs: =====

2008 NESCent Phyloinformatics Summer of Code:
http://phyloinformatics.net/Phyloinformatics_Summer_of_Code_2008

Eligibility requirements:
http://code.google.com/opensource/gsoc/2008/faqs.html#0.1_eligibility

Stipends:
http://code.google.com/opensource/gsoc/2008/faqs.html#0.1_administrivia

To sign up for quarterly NESCent newsletters with announcements about upcoming programs at the Center:
http://www.nescent.org/about/contact.php

From hlapp at gmx.net Sat Mar 22 20:01:51 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 22 Mar 2008 16:01:51 -0400
Subject: [BioSQL-l] [Bioperl-l] postgres 8.3 - load_seqdatabase.pl / swissprot
In-Reply-To: <16589.156.83.1.157.1206135827.squirrel@webmail.xs4all.nl>
References: <16589.156.83.1.157.1206135827.squirrel@webmail.xs4all.nl>
Message-ID: <69D3EA33-810B-40EA-8687-752FA1A34FBF@gmx.net>

Forgot to respond to this:

On Mar 21, 2008, at 5:43 PM, Erik wrote:

> It took two hours to load 26504 records (7%) of uniprot_sprot.dat
> (is it expected to be so slow?)

When I last loaded those regularly it was a bit faster (~5 seqs/s), but your rate is in a ballpark that wouldn't raise a red flag for me. BTW you can make it print statistics using the --logchunk N option, where N is the number of seqs after which you want the current count and the #recs/s printed.

You may get it to be faster if you tune the database (e.g., make sure there is enough memory for index reorganization, transaction log and tablespace datafile are on separate disks, etc; fiddling with the query optimizer has probably little effect as almost all queries are simple lookups or inserts).

That all said, the strength of load_seqdatabase.pl isn't speed. It doesn't make use of any bulk upload optimizations, and therefore the initial load of a very large database will take its time.
The power is more in subsequent updates where you can configure what you want to happen, and during which the database is never in an inconsistent state, so it can run in the background.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From greg at turnstep.com Mon Mar 24 00:42:36 2008
From: greg at turnstep.com (Greg Sabino Mullane)
Date: Mon, 24 Mar 2008 00:42:36 -0000
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
In-Reply-To:
Message-ID: <4ab14dcc59d7566b55ba87027055e9fd@biglumber.com>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160

>> Depending on what I (or can someone else update us on this?) find out
>> for the DBD::Pg plans, I'll probably start looking into moving the
>> parameter binding into the driver adapters. Though it does feel
>> pathetic that this is now also not transparent between drivers.
>
> What you are probably looking for is already there, namely:
>
> $dbh->{pg_server_prepare} = 0;

> So disabling server-side prepares will leave values quoted? Having
> server-side prepares would be very useful though, especially for
> Bioperl-db with its many lookup queries that all use similar
> parameter values.

Yes, it forces DBD::Pg to do the quoting itself, which basically means that everything is shipped to the server as a single SQL string, and no placeholders are used. In the grand scheme of things, the speed difference is not large for most queries. Certainly one way would be to apply this setting for 8.3 and above, and slowly migrate the queries/schema over time.

>> There's good reasons for the casting enforcement in 8.3

> I do understand that, but it's also a sharp contrast to other RDBMSs
> that doesn't make it easier for people to choose Pg when they
> should, and doesn't help writing cross-platform database applications
> either.
I'm not overly familiar with how other databases treat this, but I've heard DB2 can be a stickler about this too. I've not dug into the bioperl code in a while, to be honest, so I'm not sure what sort of queries we're talking about. Certainly long-term the code and schema should move away from implicit casting. Maybe a better short-term solution is adding the more obvious casts (e.g. text<->int) back in.

> Do you have links to some of the key threads showing what rationale
> went into the decision? (Or should I just search for your name?) I'd
> like to read up on that first before pouring more oil into the fire.
> I suspect that many of those who made the decision are never faced
> with needing to write cross-RDBMS code.
>
> Also, I wonder why this wasn't made a configurable option so it can
> be disabled by a simple config file change (such as the move away
> from automatic OID columns). But obviously this is the wrong list for
> discussing this (though Bioperl-db *is* one of those pieces of
> software that must be cross-RDBMS).

I did ask about that, and was told it would not have been easy to do so. But I agree, a phasing in period (heck, even a warning) would have been nice. Feel free to pour some oil on the fire, I think this is one of many apps that has been affected. (I've run across two other major cross-DB apps (Interchange and MediaWiki) that are struggling with the same pain. I managed to painfully fix the latter, but the former is way too complex to tackle at the moment). I could not find the thread(s?)
I weighed in on, but you can find some relevant discussions by googling "strict-typing benefits grokbase"

- --
Greg Sabino Mullane greg at turnstep.com
PGP Key: 0x14964AC8 200803232039
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAkfm+NAACgkQvJuQZxSWSsi4ogCdGNWvCJIzXxb+YKzdm6wwxQMv
p3AAnizkWXoo/rvxv4KVdC8tD0vF87k3
=dNYi
-----END PGP SIGNATURE-----

From biopython at maubp.freeserve.co.uk Tue Mar 25 15:56:16 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 25 Mar 2008 15:56:16 +0000
Subject: [BioSQL-l] [BioPython] Concerns the update of BioSQL.taxon table
In-Reply-To: <320fb6e00803250853i629e59aj310ddc5667ea57d@mail.gmail.com>
References: <711039.40736.qm@web26505.mail.ukl.yahoo.com>
 <320fb6e00803250853i629e59aj310ddc5667ea57d@mail.gmail.com>
Message-ID: <320fb6e00803250856n1001d74dxeb8560652f594e51@mail.gmail.com>

On Tue, Mar 25, 2008 at 3:53 PM, Peter wrote:
> Hi Eric,
>
> Your issue is almost certainly due to switching from Biopython 1.44 to
> 1.45, rather than from a prerelease BioSQL to the recently released
> BioSQL 1.0.0.
>
> For background, you should read Bug 2422 and the BioSQL thread it points to.
> http://bugzilla.open-bio.org/show_bug.cgi?id=2422
>
> Biopython 1.44 never recorded the taxon id (and therefore didn't use
> the taxon/taxon_name tables)
> Biopython 1.45 does record the taxon id, and attempts to fill in
> missing taxon/taxon_name entries
>
> I'm a little unclear on what is going wrong for you. Did you pre-load
> the NCBI taxonomy for example? The script you are talking about, is
> this your own?
>
> Peter
>

P.S. Did you mean to send your original message to the BioSQL list as well Eric?
You need biosql-l at lists.open-bio.org not biosql at lists.open-bio.org

Peter

From ericgibert at yahoo.fr Wed Mar 26 11:29:24 2008
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Wed, 26 Mar 2008 11:29:24 +0000 (GMT)
Subject: [BioSQL-l] Concerns the update of BioSQL.taxon table
Message-ID: <290936.61510.qm@web26510.mail.ukl.yahoo.com>

Thank you Peter for the correct email of the BioSQL list.

No, it is not something linked to the BioPython 1.45 upgrade: same behavior as 1.44. My problem is linked to the fact that the BioSQL schema version 1.0.0 defines a *unique* index on taxon.ncbi_taxon_id. I did not have this index before.

I have written a script that connects to the taxonomy database of NCBI and gets the XML data for the species. Then it updates the taxon table, replacing the NULL ncbi_taxon_id and node_rank by their values for all the lineage. I call it after the loading of BioSeqs in the database.

Example:
I load a BioSeq for Nannophya pygmaea then I run my script to update the ncbi_taxon_id and rank:

+----------+---------------+-----------------+--------------+
| taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
+----------+---------------+-----------------+--------------+
|       13 |          2759 |            NULL | superkingdom |
|       14 |         33208 |              13 | kingdom      |
|       15 |          6656 |              14 | phylum       |
|       16 |          6960 |              15 | superclass   |
|       17 |         50557 |              16 | class        |
|       18 |          7496 |              17 | no rank      |
|       19 |         33339 |              18 | subclass     |
|       20 |          6961 |              19 | order        |
|       21 |          6962 |              20 | suborder     |
|       22 |          6964 |              21 | family       |
|       23 |        229390 |              22 | genus        |
|       24 |        229391 |              23 | species      |

No problem.
Now I insert/load another Libellulidae (Orthetrum sabina): 'empty/NULL' taxon records are inserted by the db.load() BioPython function:

|       25 |          NULL |            NULL | NULL         |
|       26 |          NULL |              25 | NULL         |
|       27 |          NULL |              26 | NULL         |
|       28 |          NULL |              27 | NULL         |
|       29 |          NULL |              28 | NULL         |
|       30 |          NULL |              29 | NULL         |
|       31 |          NULL |              30 | NULL         |
|       32 |          NULL |              31 | NULL         |
|       33 |          NULL |              32 | NULL         |
|       34 |          NULL |              33 | NULL         |
|       35 |          NULL |              34 | genus        |
|       36 |        320892 |              35 | species      |

then I try to run my script: this time I have an update failure because the record 34 is the SAME family hence same ncbi_taxon_id as record 22: 'duplicate entry on key 2'.

Either this *unique* index is new and it is a BioSQL "issue" (as said, this index did not exist in my previous BioSQL db so I never encountered this issue before), OR the way BioPython "repeats" existing taxons is incorrect/not compatible. In that case, when inserting the second BioSeq, record 34 should not be created but record 35 (the genus) should "point" to the already existing family at record 22 as its father.

Thus I would like confirmation from the BioSQL team that the unique index is valid. If that is the case, then we can have a separate BioPython discussion about how to improve the management of the taxon table.

Best regards,

Eric

From holland at ebi.ac.uk Wed Mar 26 12:00:03 2008
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 26 Mar 2008 12:00:03 +0000
Subject: [BioSQL-l] Concerns the update of BioSQL.taxon table
In-Reply-To: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
References: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
Message-ID: <47EA3AC3.20104@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Purely from a database perspective, the index is correct. There should be no need to have a duplicate entry in ncbi_taxon_id.
The implication is that taxon_id is a 1:1 mapping to ncbi_taxon_id. There should be no need to have two separate local taxon_id values referring to one NCBI taxon.

Ideally, when you run your update script, for each taxon_id record it processes it should be checking for an existing entry with the same ncbi_taxon_id, getting the taxon_id for that existing entry, then removing the duplicate entry and updating the relevant parent_taxon_id values in other records to refer to the existing taxon_id instead.

BioPython would need to be making similar checks when it inserts new entries. If it isn't, then it needs to be fixed.

cheers,
Richard

Eric Gibert wrote:
> Thank you Peter for the correct email of the BioSQL list.
>
> No, it is not something linked to BioPython 1.45 upgrade: same behavior as 1.44. My problem is linked to the fact that the BioSQL schema version 1.0.0 defines a *unique* index on taxon.ncbi_taxon_id. I did not have this index before.
>
> I have written a script that connects to the taxonomy database of NCBI and gets the XML data for the species. Then it updates the taxon table, replacing the ncbi_taxon_id and node_rank NULL by their values for all the lineage. I call it after the loading of BioSeqs in the database.
>
> Example:
> I load a BioSeq for Nannophya pygmaea then I run my script to update the ncbi_taxon_id and rank:
> +----------+---------------+-----------------+--------------+
> | taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
> +----------+---------------+-----------------+--------------+
> |       13 |          2759 |            NULL | superkingdom |
> |       14 |         33208 |              13 | kingdom      |
> |       15 |          6656 |              14 | phylum       |
> |       16 |          6960 |              15 | superclass   |
> |       17 |         50557 |              16 | class        |
> |       18 |          7496 |              17 | no rank      |
> |       19 |         33339 |              18 | subclass     |
> |       20 |          6961 |              19 | order        |
> |       21 |          6962 |              20 | suborder     |
> |       22 |          6964 |              21 | family       |
> |       23 |        229390 |              22 | genus        |
> |       24 |        229391 |              23 | species      |
>
> No problem.
>
> Now I insert/load another Libellulidae (Orthetrum sabina): 'empty/NULL' taxons records are inserted by the db.load() BioPython function:
> |       25 |          NULL |            NULL | NULL         |
> |       26 |          NULL |              25 | NULL         |
> |       27 |          NULL |              26 | NULL         |
> |       28 |          NULL |              27 | NULL         |
> |       29 |          NULL |              28 | NULL         |
> |       30 |          NULL |              29 | NULL         |
> |       31 |          NULL |              30 | NULL         |
> |       32 |          NULL |              31 | NULL         |
> |       33 |          NULL |              32 | NULL         |
> |       34 |          NULL |              33 | NULL         |
> |       35 |          NULL |              34 | genus        |
> |       36 |        320892 |              35 | species      |
>
> then I try to run my script: this time I have an update failure because the record 34 is the SAME family hence same ncbi_taxon_id as record 22: 'duplicate entry on key 2'.
>
> Either this *unique* index is new and it is a BioSQL "issue" (as said, this index did not exist in my previous BioSQL db so I never encountered this issue before), OR the way BioPython "repeats" existing taxons is incorrect/not compatible. In that case, when inserting the second BioSeq, record 34 should not be created but record 35 (the genus) should "point" to the already existing family at record 22 as its father.
>
> Thus I would like confirmation from the BioSQL team that the unique index is valid. If that is the case, then we can have a separate BioPython discussion about how to improve the management of the taxon table.
>
> Best regards,
>
> Eric
>

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel.
+44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH6jrD4C5LeMEKA/QRAu7rAJ9TBYt0CeTTrPi0QN7Vm/UwiBANQwCfeoqz
0uTvcXXteholK+4xxuxjCXw=
=qhOf
-----END PGP SIGNATURE-----

From biopython at maubp.freeserve.co.uk Wed Mar 26 12:30:50 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 26 Mar 2008 12:30:50 +0000
Subject: [BioSQL-l] [BioPython] Concerns the update of BioSQL.taxon table
In-Reply-To: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
References: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
Message-ID: <320fb6e00803260530w72cca900mc19654798d5d7e13@mail.gmail.com>

On Wed, Mar 26, 2008 at 11:29 AM, Eric Gibert wrote:
> Thank you Peter for the correct email of the BioSQL list.
>
> No, it is not something linked to BioPython 1.45 upgrade: same behavior as 1.44.
> My problem is linked to the fact that the BioSQL schema version 1.0.0 defines a
> *unique* index on taxon.ncbi_taxon_id. I did not have this index before.
>
> I have written a script that connects to the taxonomy database of NCBI and gets
> the XML data for the species. Then it updates the taxon table, replacing the
> ncbi_taxon_id and node_rank NULL by their values for all the lineage. I call it
> after the loading of BioSeqs in the database.

So you wrote your own version of the BioSQL perl script load_ncbi_taxonomy.pl?
> Example:
> I load a BioSeq for Nannophya pygmaea then I run my script to update the ncbi_taxon_id and rank:
> +----------+---------------+-----------------+--------------+
> | taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
> +----------+---------------+-----------------+--------------+
> |       13 |          2759 |            NULL | superkingdom |
> |       14 |         33208 |              13 | kingdom      |
> |       15 |          6656 |              14 | phylum       |
> |       16 |          6960 |              15 | superclass   |
> |       17 |         50557 |              16 | class        |
> |       18 |          7496 |              17 | no rank      |
> |       19 |         33339 |              18 | subclass     |
> |       20 |          6961 |              19 | order        |
> |       21 |          6962 |              20 | suborder     |
> |       22 |          6964 |              21 | family       |
> |       23 |        229390 |              22 | genus        |
> |       24 |        229391 |              23 | species      |
>
> No problem.
>
> Now I insert/load another Libellulidae (Orthetrum sabina): 'empty/NULL'
> taxons records are inserted by the db.load() BioPython function:

These records are "guess work" based on the lineage in the GenBank file - we don't know the NCBI taxon ids, so they are NULL, nor the rank, but there is a scientific name in the linked taxon_name table.

I am open to the idea of not writing this guessed lineage, and just writing one entry for the species and the given NCBI taxon ID.

However, as the new entry Orthetrum sabina should share some of its lineage with Nannophya pygmaea, I agree Biopython *should* be re-using those existing taxon entries, if it can match them safely using the scientific name. Re-reading the relevant bit of old code, it doesn't seem to do this. I've filed bug 2475:
http://bugzilla.open-bio.org/show_bug.cgi?id=2475

This is actually a tricky problem, requiring some 'clever' parent linkage as you said in your earlier email. Hilmar wrote this about the equivalent code in BioPerl:

>> It's pretty unreliable actually. There is not only synonymy but also
>> rampant homonymy in taxonomic names. There are plenty of examples
>> for the same scientific name in use for a plant and for some animal, for
>> example.
>> So in order to be unambiguous you will need to know (and
>> check) the kingdom.

See http://lists.open-bio.org/pipermail/biosql-l/2008-March/001207.html

Eric wrote:
> then I try to run my script: this time I have an update failure because the
> record 34 is the SAME family hence same ncbi_taxon_id as record 22:
> 'duplicate entry on key 2'.
>
> Either this *unique* index is new and it is a BioSQL "issue" (as said, this index
> did not exist in my previous BioSQL db so I never encountered this issue before),

Hopefully Hilmar from BioSQL can answer this.

> OR the way BioPython "repeats" existing taxons is incorrect/not compatible.
> In that case, when inserting the second BioSeq, record 34 should not be created
> but record 35 (the genus) should "point" to the already existing family at record
> 22 as its father.

This example might be easier to follow if the scientific names from the taxon_name table were included. I would check the lineage but the NCBI webpage is being very slow for me right now.

In the short term, as a quick fix, your script could first remove taxon entries with a blank NCBI taxon ID (and clear any keys pointing to them). Not elegant - but it would work.

Thanks Eric,

Peter

From hlapp at gmx.net Wed Mar 26 13:29:01 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 26 Mar 2008 09:29:01 -0400
Subject: [BioSQL-l] Concerns the update of BioSQL.taxon table
In-Reply-To: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
References: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
Message-ID:

On Mar 26, 2008, at 7:29 AM, Eric Gibert wrote:

> Either this *unique* index is new and it is a BioSQL "issue" (as
> said, this index did not exist in my previous BioSQL db so I never
> encountered this issue before)

The unique index has been there since Feb 2003 (the Singapore Biohackathon). I'm not sure how you got a version that doesn't have it.
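
[Illustration added to the archive, not part of the original message.] The behaviour discussed in this thread can be reproduced with a small sketch: a unique index on ncbi_taxon_id accepts any number of NULL placeholders but rejects a second row with the same non-NULL value, and the clean-up Richard describes re-points the children of the duplicate node before deleting it. Everything below is an assumption made for illustration: it uses Python's sqlite3 with a cut-down taxon table rather than the real BioSQL schema, and the helper name merge_duplicate_taxon is hypothetical. In a real BioSQL database the taxon_name and bioentry rows that reference the duplicate would need the same re-pointing.

```python
import sqlite3


def merge_duplicate_taxon(conn, dup_id, ncbi_taxon_id):
    """Give row dup_id the NCBI id, or fold it into the row that already has it.

    Hypothetical helper; returns the taxon_id that represents the NCBI taxon.
    """
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?", (ncbi_taxon_id,)
    ).fetchone()
    if row is None or row[0] == dup_id:
        # No other node carries this NCBI id, so a plain update is safe.
        conn.execute(
            "UPDATE taxon SET ncbi_taxon_id = ? WHERE taxon_id = ?",
            (ncbi_taxon_id, dup_id),
        )
        return dup_id
    keep_id = row[0]
    # Re-point children of the duplicate at the existing node, then drop it.
    conn.execute(
        "UPDATE taxon SET parent_taxon_id = ? WHERE parent_taxon_id = ?",
        (keep_id, dup_id),
    )
    conn.execute("DELETE FROM taxon WHERE taxon_id = ?", (dup_id,))
    return keep_id


# Cut-down stand-in for the BioSQL taxon table (assumption: only four columns).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE taxon (
           taxon_id        INTEGER PRIMARY KEY,
           ncbi_taxon_id   INTEGER UNIQUE,
           parent_taxon_id INTEGER,
           node_rank       TEXT)"""
)
# A few rows from Eric's example: the known family (22) and the guessed
# lineage with NULL placeholders (34, 35) for the second species.
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?, ?, ?)",
    [
        (22, 6964, 21, "family"),
        (23, 229390, 22, "genus"),
        (34, None, 33, None),
        (35, None, 34, "genus"),
        (36, 320892, 35, "species"),
    ],
)

# The naive update hits the unique index, as in Eric's report.
try:
    conn.execute("UPDATE taxon SET ncbi_taxon_id = 6964 WHERE taxon_id = 34")
    naive_update_failed = False
except sqlite3.IntegrityError:
    naive_update_failed = True

# Merging instead leaves a single family node, with genus 35 attached to it.
kept_id = merge_duplicate_taxon(conn, 34, 6964)
```

The sketch leaves open the harder question raised earlier in the thread, namely how to avoid creating the placeholder rows in the first place when only scientific names are available to match on.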
The unique key constraint on the identifier column is also necessary - otherwise you cannot guarantee lookups by the NCBI taxonID to return either one or zero rows.

Like Peter and Richard, I also don't understand what the point would be in allowing the same taxon (which in essence is a node), as identified by taxonID, to exist more than once.

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From pan.mueller at yahoo.de Thu Mar 27 19:33:34 2008
From: pan.mueller at yahoo.de (Peter Müller)
Date: Thu, 27 Mar 2008 20:33:34 +0100 (CET)
Subject: [BioSQL-l] bioentries in a sequence cluster
Message-ID: <664425.11239.qm@web28203.mail.ukl.yahoo.com>

Dear list,

I have a few questions, but maybe with a working example, I can derive the rest.

With bioperl-db I can fetch a Bio::Cluster object with this query:
(I found no documentation about c::subject and p::object ...)

$query->datacollections(
    ["Bio::PrimarySeqI c::subject",
     "Bio::PrimarySeqI p::object",
     "Bio::PrimarySeqI<=>Bio::ClusterI<=>Bio::Ontology::TermI"]);

$query->where(["p.accession_number = 'NM_000015'"]);

my $adp = $db->get_object_adaptor('Bio::Cluster');
my $qres = $adp->find_by_query($query);

That's great - but here I ask for a sequence accession-number.

Is it possible to ask for the Clone (IMAGE:4722596) or for an STS accession-number where the result is also a cluster object?
"give me the cluster(s) where in the sequence-line is a clone-entry with this number 'IMAGE:4722596' ....
"give me the cluster(s) where in the STS-line is an accession-number with this value 'PMC310725P3'...
PROTID and NID would be also interesting.
UniGene-snippet:

STS         ACC=PMC310725P3 UNISTS=272646
PROTSIM     ORG=10090; PROTGI=6754794; PROTID=NP_035004.1; PCT=76.55; ALN=288
SEQUENCE    ACC=BG569293.1; NID=g13576946; CLONE=IMAGE:4722596; END=5'; LID=6989; SEQTYPE=EST; TRACE=44157214

regards
pan

From hlapp at gmx.net Sun Mar 30 05:00:25 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 30 Mar 2008 01:00:25 -0400
Subject: [BioSQL-l] bioentries in a sequence cluster
In-Reply-To: <664425.11239.qm@web28203.mail.ukl.yahoo.com>
References: <664425.11239.qm@web28203.mail.ukl.yahoo.com>
Message-ID: <8083537C-C721-48C2-A838-AAC2B178468A@gmx.net>

On Mar 27, 2008, at 3:33 PM, Peter Müller wrote:

>
> Dear list,
>
> I have a few questions, but maybe with a working example, I can
> derive the rest.
>
> With perl-db I can fetch a Bio::Cluster Object wit this query:
> (I found no documentation about c::subject and p::object ...)

Yes, sorry, this needs a lot more documentation. The suffix of the alias separated from it by '::' is the 'context'. This is needed if the same entity participates more than once in an association. What's confusing the issue further here is that at the object level each object entity (Bio::PrimarySeqI, Bio::ClusterI, Bio::Ontology::TermI) is participating only once, though in reality Bio::ClusterI and Bio::PrimarySeqI both map to table bioentry.

>
> $query->datacollections(
> ["Bio::PrimarySeqI c::subject",
> "Bio::PrimarySeqI p::object",

I think that Bio::PrimarySeqI can be substituted with Bio::ClusterI in the second line. This would make the mapping clearer I guess. I'm not sure why I wrote the example that way, but I'd be surprised if Bio::ClusterI does not work here.

> "Bio::PrimarySeqI<=>Bio::ClusterI<=>Bio::Ontology::TermI"]);
>
> $query->where(["p.accession_number = 'NM_000015'"]);

Actually I think you need to use c.accession_number to query by sequence accession.
The c (child) alias is the cluster member, and the p (parent) alias is the cluster itself.

>
> my $adp = $db->get_object_adaptor('Bio::Cluster');
> my $qres = $adp->find_by_query($query);
>
>
> That's great - but here I ask for a sequence accession-number.
>
> Is it possible to aks for the Clone (IMAGE:4722596) or for an STS
> accession-number where the result is also a cluster object?
> "give me the cluster(s) where in the sequence-line is a clone-entry
> with this number 'IMAGE:4722596' ....
> "give me the cluster(s) where in the STS-line is an accession-
> number with this value 'PMC310725P3'...
> PROTID and NID would be also interesting.

PID and NID should become the primary_id() of the sequence members. Hence, you would say c.primary_id where you have c.accession_number above.

Each STS line should be in a qualifier/value pair attached to the cluster bioentry, under the tag 'sts' (which from what I can see would consist of whole lines, not ACC= and UNISTS= values parsed out, though I may be mistaken). So you would add "Bio::PrimarySeqI<=>Bio::Annotation::SimpleValue sv" to the datacollections, and "sv.value = 'ACC=PMC310725P3 UNISTS=272646'" and "sv.tagname = 'sts'" to the where() array.

The same goes for IMAGE clone IDs, except that the tag name is 'clone' and the qualifier/value is attached to the member sequence, not the cluster; also here not the entire line is stored, but rather parsed into tokens.

Does this help?

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================