From cjfields at uiuc.edu  Sat Mar  1 15:42:05 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 1 Mar 2008 14:42:05 -0600
Subject: [BioSQL-l] BioSQL bug in bugzilla
Message-ID: <C4A1873F-C433-492A-8282-CDE6F54B0493@uiuc.edu>

Hilmar,

Just wanted to point out a bug which I thought was bioperl-db-related  
but is really BioSQL.  Could you take a look to see what you think?

http://bugzilla.open-bio.org/show_bug.cgi?id=2389

chris

From hlapp at gmx.net  Sat Mar  1 19:06:55 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 1 Mar 2008 19:06:55 -0500
Subject: [BioSQL-l] biosql usage/user survey
In-Reply-To: <9692f0e9a791c7d0bf942e497668fdce@gmx.net>
References: <9692f0e9a791c7d0bf942e497668fdce@gmx.net>
Message-ID: <E68FD0A3-4203-4A99-BB35-8430EEC10CCF@gmx.net>

I sent this survey request back in 2005 and received a number of  
direct responses. I am assuming that since I said I was going to use  
them for the paper everyone was assuming that their BioSQL usage  
would be made public.

I am going to assemble the responses into a Wiki page as Malcolm  
suggested; if you responded to me and do not want to appear on that  
page, please let me know.

	-hilmar


On Nov 3, 2005, at 11:53 AM, Hilmar Lapp wrote:

> Hi all,
>
> I am writing up a paper on BioSQL and would like to include some  
> current usage figures to support its utility.
>
> Therefore, if you are using BioSQL I'd be glad if you could drop me  
> an email; if you can include a word or two (not more than 1  
> sentence) on what you use it for that'd be great too.
>
> Thanks in advance,
>
> 	-hilmar
> -- 
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------
>

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From cjfields at uiuc.edu  Sat Mar  1 20:16:24 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 1 Mar 2008 19:16:24 -0600
Subject: [BioSQL-l] multiple species for a sequence
Message-ID: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu>

I'm looking at a bioperl bug I filed a while back that deals with  
multiple species in a sequence file, such as found for AJ428955:

ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
XX
AC   AJ428955;
XX
DT   09-JUL-2002 (Rel. 72, Created)
DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
XX
DE   Hepatitis GB virus B subgenomic replicon neoRepB
XX
KW   core-neo fusion protein; core-neo gene; polyprotein.
XX
OS   Hepatitis GB virus B
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.

...

We could probably add support in bioperl fairly easily (Bio::Seq could  
just return an array or the first species object based on context),  
but would BioSQL support sequences like this?
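
To make the idea concrete, here is a minimal sketch in Python (not BioPerl code) of collecting every OS line from an entry and handing back either all species or only the first. The function names and the first_only flag are hypothetical, and the sketch assumes that each OS line names a separate organism, which would not hold for an organism name that wraps across several OS lines.

```python
def organism_names(embl_text):
    """Collect the text of every OS line in one EMBL flat-file entry."""
    names = []
    for line in embl_text.splitlines():
        # EMBL lines start with a two-letter code followed by three spaces.
        if line.startswith("OS   "):
            names.append(line[5:].strip())
    return names


def species(embl_text, first_only=False):
    """Hypothetical accessor: all species, or just the first one."""
    names = organism_names(embl_text)
    if first_only:
        return names[0] if names else None
    return names


record = """\
ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
XX
OS   Hepatitis GB virus B
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
"""

print(species(record))
print(species(record, first_only=True))
```

The explicit flag plays the role that list versus scalar context would play in Perl.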

chris


From hlapp at gmx.net  Sun Mar  2 12:33:23 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 12:33:23 -0500
Subject: [BioSQL-l] multiple species for a sequence
In-Reply-To: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu>
References: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu>
Message-ID: <86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net>


On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:

> I'm looking at a bioperl bug I filed a while back that deals with  
> multiple species in a sequence file, such as found for AJ428955:
>
> ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
> XX
> AC   AJ428955;
> XX
> DT   09-JUL-2002 (Rel. 72, Created)
> DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
> XX
> DE   Hepatitis GB virus B subgenomic replicon neoRepB
> XX
> KW   core-neo fusion protein; core-neo gene; polyprotein.
> XX
> OS   Hepatitis GB virus B
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage;  
> Flaviviridae.
> XX
> OS   Encephalomyocarditis virus
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage;  
> Picornaviridae;
> OC   Cardiovirus.
>
> ...
>
> We could probably add support in bioperl fairly easily (Bio::Seq  
> could just return an array or the first species object based on  
> context), but would BioSQL support sequences like this?

No it wouldn't. There may only be one species (taxon) per sequence.

There has been a lot of discussion about this in the past mostly  
driven by the former SwissProt peculiarity of collapsing sequences by  
sequence identity into a single record. We held out and eventually  
UniProt dropped this practice.

I guess we never quite decided what to do about chimeric sequences  
like the above. Note that the GenBank record gives this differently:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885

Here, there's one taxon (ORGANISM line) reference, but two localized  
'source' features in the feature table. (I'm actually not 100% sure  
what the genbank parser would do with this - i.e., whether the second  
source feature will override the taxon_id found in the first.)  
Because seqfeatures (in BioSQL) don't have a link to taxon, you  
wouldn't be able to hit the sequence by its second (chimeric) taxon  
if that were your query criteria (though you could store it fine, and  
if you queried by dbxrefs of features of type 'source', you would  
find it).
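
A rough sketch of the two query paths, using Python's sqlite3 module and a heavily simplified subset of the BioSQL tables (only the columns needed here; the real schema has more columns and constraints). The taxon IDs 1001 and 1002 are placeholders rather than real NCBI taxonomy IDs, and the assumption that a 'source' feature carries its taxon as a dbxref with dbname 'taxon' is mine for the purpose of illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, accession TEXT, taxon_id INTEGER);
CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE seqfeature (seqfeature_id INTEGER PRIMARY KEY, bioentry_id INTEGER, type_term_id INTEGER);
CREATE TABLE dbxref (dbxref_id INTEGER PRIMARY KEY, dbname TEXT, accession TEXT);
CREATE TABLE seqfeature_dbxref (seqfeature_id INTEGER, dbxref_id INTEGER);

-- One chimeric entry: bioentry points at taxon 1001 only (placeholder IDs).
INSERT INTO bioentry VALUES (1, 'AJ428955', 1001);
INSERT INTO term VALUES (1, 'source');
-- Two 'source' features, each with its own taxon dbxref.
INSERT INTO seqfeature VALUES (1, 1, 1);
INSERT INTO seqfeature VALUES (2, 1, 1);
INSERT INTO dbxref VALUES (1, 'taxon', '1001');
INSERT INTO dbxref VALUES (2, 'taxon', '1002');
INSERT INTO seqfeature_dbxref VALUES (1, 1);
INSERT INTO seqfeature_dbxref VALUES (2, 2);
""")


def by_bioentry_taxon(taxon_id):
    """Find entries through the single taxon link on bioentry."""
    rows = conn.execute(
        "SELECT accession FROM bioentry WHERE taxon_id = ?", (taxon_id,)
    )
    return [r[0] for r in rows]


def by_source_feature_dbxref(taxon_accession):
    """Find entries through dbxrefs attached to features of type 'source'."""
    rows = conn.execute(
        """
        SELECT DISTINCT b.accession
        FROM bioentry b
        JOIN seqfeature f ON f.bioentry_id = b.bioentry_id
        JOIN term t ON t.term_id = f.type_term_id
        JOIN seqfeature_dbxref fx ON fx.seqfeature_id = f.seqfeature_id
        JOIN dbxref x ON x.dbxref_id = fx.dbxref_id
        WHERE t.name = 'source' AND x.dbname = 'taxon' AND x.accession = ?
        """,
        (taxon_accession,),
    )
    return [r[0] for r in rows]


print(by_bioentry_taxon(1002))           # second taxon: no hit via bioentry
print(by_source_feature_dbxref("1002"))  # but found via the source feature
```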

At the end of the day, BioSQL will evolve (hopefully) quickly to  
support what the Bio* toolkits support, and will be much slower to  
change in ways that Bio* wouldn't be able to take advantage of  
anyway. At least that's my current vision of it, and of course is up  
for debate as to whether that's a useful vision as much as anything  
else.

So, as you say, right now BioPerl, and AFAIAA any of the other Bio*  
toolkits, doesn't support more than one species per sequence, but as  
soon as that changes, there's a clear need for BioSQL to follow along.

Does that make sense?

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sun Mar  2 12:39:17 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 12:39:17 -0500
Subject: [BioSQL-l] BioSQL bug in bugzilla
In-Reply-To: <C4A1873F-C433-492A-8282-CDE6F54B0493@uiuc.edu>
References: <C4A1873F-C433-492A-8282-CDE6F54B0493@uiuc.edu>
Message-ID: <CCBB04C4-5436-4742-94DE-450511C5C628@gmx.net>

I don't think it's a good idea to just replace all varchar() types  
with type text.

First of all, having reasonable constraints is a Good Thing(tm) in my  
book as the majority of times I found them violated it revealed a  
parsing error, rather than the constraints not fitting the data.  
Second, this won't solve the problem for the other RDBMS versions for  
which there is a real performance penalty and other implications when  
having unreasonably large column widths.

That said, if the constraint is indeed not compatible with current  
data (such as Uniprot) we have a problem that needs to be fixed. So,  
what I would like to find out is

1) is this in reality a parsing error, or is there indeed a value for  
a column that in BioSQL is constrained to 40 chars, and

2) if so, which column in which table is the problem.

Erik - would you mind sending me the full error stack if you still  
have it? Usually load_seqdatabase.pl will also print an extra warning  
message saying what it couldn't store. That message would be great  
too. If you don't have either anymore, do you remember vaguely what  
those messages said? Alternatively, do you have the offending  
uniprot entry (or its accession)?

I suspect that it's actually the constraint on dbxref.accession. Does  
that ring a bell?
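
One way to tell the two cases apart before loading is to scan the parsed values for anything longer than the column allows. The following is only a sketch in Python: the 40-character limit on dbxref.accession is the suspicion voiced above, not a value checked against the DDL, and the sample rows are made up.

```python
# Assumed width limits; the real values are in the BioSQL DDL for your RDBMS.
COLUMN_LIMITS = {("dbxref", "accession"): 40}


def find_overlong(rows, limits=COLUMN_LIMITS):
    """Yield (table, column, value, length) for each value over its limit.

    rows is an iterable of (table, column, value) tuples.
    """
    for table, column, value in rows:
        limit = limits.get((table, column))
        if limit is not None and value is not None and len(value) > limit:
            yield table, column, value, len(value)


sample = [
    ("dbxref", "accession", "P12345"),
    ("dbxref", "accession", "X" * 41),  # made-up overlong value
]

for table, column, value, length in find_overlong(sample):
    print(f"{table}.{column}: {length} chars exceeds limit: {value[:20]}...")
```

An overlong value that looks like several fields run together would point to the parser; one that looks like a genuine accession would point to the constraint.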

	-hilmar


On Mar 1, 2008, at 3:42 PM, Chris Fields wrote:

> Hilmar,
>
> Just wanted to point out a bug which I thought was bioperl-db-related
> but is really BioSQL.  Could you take a look to see what you think?
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2389
>
> chris
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From cjfields at uiuc.edu  Sun Mar  2 13:00:50 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 2 Mar 2008 12:00:50 -0600
Subject: [BioSQL-l] multiple species for a sequence
In-Reply-To: <86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net>
References: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu>
	<86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net>
Message-ID: <BFA173E9-1DB8-47F8-BC5F-25C35324C69B@uiuc.edu>


On Mar 2, 2008, at 11:33 AM, Hilmar Lapp wrote:

> On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:
>
>> I'm looking at a bioperl bug I filed a while back that deals with  
>> multiple species in a sequence file, such as found for AJ428955:
>>
>> ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
>> XX
>> AC   AJ428955;
>> XX
>> DT   09-JUL-2002 (Rel. 72, Created)
>> DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
>> XX
>> DE   Hepatitis GB virus B subgenomic replicon neoRepB
>> XX
>> KW   core-neo fusion protein; core-neo gene; polyprotein.
>> XX
>> OS   Hepatitis GB virus B
>> OC   Viruses; ssRNA positive-strand viruses, no DNA stage;  
>> Flaviviridae.
>> XX
>> OS   Encephalomyocarditis virus
>> OC   Viruses; ssRNA positive-strand viruses, no DNA stage;  
>> Picornaviridae;
>> OC   Cardiovirus.
>>
>> ...
>>
>> We could probably add support in bioperl fairly easily (Bio::Seq  
>> could just return an array or the first species object based on  
>> context), but would BioSQL support sequences like this?
>
> No it wouldn't. There may only be one species (taxon) per sequence.
>
> There has been a lot of discussion about this in the past mostly  
> driven by the former SwissProt peculiarity of collapsing sequences  
> by sequence identity into a single record. We held out and  
> eventually UniProt dropped this practice.

I'm unsure how often these pop up.  The behavior of both EMBL and  
GenBank parsers assumes one species (as does Bio::Seq); the embl  
parser picks up both and just replaces the first with the second:

...
DE   Hepatitis GB virus B subgenomic replicon neoRepB
XX
KW   core-neo fusion protein; core-neo gene; polyprotein.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
XX
RN   [1]
...

> I guess we never quite decided what to do about chimeric sequences  
> like the above. Note that the GenBank record gives this differently:
>
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885
>
> Here, there's one taxon (ORGANISM line) reference, but two localized  
> 'source' features in the feature table. (I'm actually not 100% sure  
> what the genbank parser would do with this - i.e., whether the  
> second source feature will override the taxon_id found in the  
> first.) Because seqfeatures (in BioSQL) don't have a link to taxon,  
> you wouldn't be able to hit the sequence by its second (chimeric)  
> taxon if that were your query criteria (though you could store it  
> fine, and if you queried by dbxrefs of features of type 'source',  
> you would find it).

The genbank parser gets the taxon and tax ID correct; I would think  
when it hit the next source feature key it would assign the wrong tax  
ID to the species object but maybe there's a secondary check.  Both  
output the source in feature tables just fine.

> At the end of the day, BioSQL will evolve (hopefully) quickly to  
> support what the Bio* toolkits support, and will be much slower to  
> change in ways that Bio* wouldn't be able to take advantage of  
> anyway. At least that's my current vision of it, and of course is up  
> for debate as to whether that's a useful vision as much as anything  
> else.
>
> So, as you say, right now BioPerl, and AFAIAA any of the other Bio*  
> toolkits, doesn't support more than one species per sequence, but as  
> soon as that changes, there's a clear need for BioSQL to follow along.
>
> Does that make sense?
>
> 	-hilmar

Yes.  I think we could add in support for multiple species fairly  
easily but I'll probably hold off on anything until after a 1.6  
release (i.e. push it to the next developer series, which gives us  
more time to think on how to implement this in a BioSQL-friendly way).

chris

From er at xs4all.nl  Sun Mar  2 13:34:10 2008
From: er at xs4all.nl (Erik Rijkers)
Date: Sun, 2 Mar 2008 19:34:10 +0100 (CET)
Subject: [BioSQL-l] BioSQL bug in bugzilla
Message-ID: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl>

Hi Hilmar,

Sorry, it's too long ago.  I can run it again (with new
versions) somewhere next week.  I don't remember which of
the two problems (parser or data size) it was in my case.

If it is true what you say (that most errors are due to
the parser), it might indeed be better to leave those
constraints in until such time that the parser has become
more trustworthy, and use the database as a test
instrument...

What is really needed of course is a place to run these
loading scripts continually against any appearing new
versions of parsable text, and against the different
database backends.

Does that already happen somewhere?

Should we consider such a bioperl buildfarm / loadfarm?

(I might be able to help with any postgres loading tests.)

Thanks,

Erik Rijkers

On Sun, March 2, 2008 18:39, Hilmar Lapp wrote:
> I don't think it's a good idea to just replace all varchar() types
> with type text.
>
> First of all, having reasonable constraints is a Good Thing(tm) in my
> book as the majority of times I found them violated it revealed a
> parsing error, rather than the constraints not fitting the data.
> Second, this won't solve the problem for the other RDBMS versions for
> which there is a real performance penalty and other implications when
> having unreasonably large column widths.
>
> That said, if the constraint is indeed not compatible with current
> data (such as Uniprot) we have a problem that needs to be fixed. So,
> what I would like to find out is
>
> 1) is this in reality a parsing error, or is there indeed a value for
> a column that in BioSQL is constrained to 40 chars, and
>
> 2) if so, which column in which table is the problem.
>
> Erik - would you mind sending me the full error stack if you still
> have it? Usually load_seqdatabase.pl will also print an extra warning
> message saying what it couldn't store. That message would be great
> too. If you don't have either anymore, do you remember vaguely what
> those messages said? Alternatively, do you have the offending
> uniprot entry (or its accession)?
>
> I suspect that it's actually the constraint on dbxref.accession. Does
> that ring a bell?
>
> 	-hilmar
>
>
> On Mar 1, 2008, at 3:42 PM, Chris Fields wrote:
>
>> Hilmar,
>>
>> Just wanted to point out a bug which I thought was bioperl-db-related
>> but is really BioSQL.  Could you take a look to see what you think?
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2389
>>
>> chris
>
> --
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
>


From hlapp at gmx.net  Sun Mar  2 14:20:21 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 14:20:21 -0500
Subject: [BioSQL-l] database loading test server (was: BioSQL bug in
	bugzilla)
In-Reply-To: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl>
References: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl>
Message-ID: <52F7E02B-7192-4F3C-9B48-887583B25CCE@gmx.net>

Hi Erik,

On Mar 2, 2008, at 1:34 PM, Erik Rijkers wrote:

> What is really needed of course is a place to run these
> loading scripts continually against any appearing new
> versions of parsable text, and against the different
> database backends.

very true indeed.

>
> Does that already happen somewhere?
>
> Should we consider such a bioperl buildfarm / loadfarm?
>
> (I might be able to help with any postgres loading tests.)


Coincidentally we have been batting around the idea to have an OBF  
machine dedicated to serve for testing and proof-of-concept  
demonstrations of OBF projects. Indeed one of the services we had  
thought about setting up is a BioSQL database, and it's reassuring to  
hear independently that that would be useful.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From er at xs4all.nl  Sun Mar  2 15:01:46 2008
From: er at xs4all.nl (Erik Rijkers)
Date: Sun, 2 Mar 2008 21:01:46 +0100 (CET)
Subject: [BioSQL-l] database loading test server (was: BioSQL bug in
 bugzilla)
Message-ID: <9081.156.83.0.185.1204488106.squirrel@webmail.xs4all.nl>

Maybe we can use some ideas from the way the PostgreSQL
project has set up a distributed buildfarm (conceived by
Andrew Dunstan, I think):

  see: http://www.pgbuildfarm.org/

it lets members of the community use a standardized setup
for building postgresql on their own machines and
automates all steps involved.

I know the projects and the communities are different, but
the general idea to have a standard process to set up
machines for whomever wants to dedicate some hardware and
time seems like a good idea.


Erik Rijkers

On Sun, March 2, 2008 20:20, Hilmar Lapp wrote:
> Hi Erik,
>
> On Mar 2, 2008, at 1:34 PM, Erik Rijkers wrote:
>
>> What is really needed of course is a place to run these
>> loading scripts continually against any appearing new
>> versions of parsable text, and against the different
>> database backends.
>
> very true indeed.
>
>>
>> Does that already happen somewhere?
>>
>> Should we consider such a bioperl buildfarm / loadfarm?
>>
>> (I might be able to help with any postgres loading tests.)
>
>
> Coincidentally we have been batting around the idea to have an OBF
> machine dedicated to serve for testing and proof-of-concept
> demonstrations of OBF projects. Indeed one of the services we had
> thought about setting up is a BioSQL database, and it's reassuring to
> hear independently that that would be useful.
>
> 	-hilmar
> --
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
>
>
>
>


From hlapp at gmx.net  Sun Mar  2 15:38:27 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 15:38:27 -0500
Subject: [BioSQL-l] enhancement request scheduling
Message-ID: <5D2BC733-9A44-4EEA-B1D7-6DF90116B50E@gmx.net>

FYI, I have added the chimeric sequence problem and the character  
column width issue to the Enhancement Requests page on the wiki:

http://www.biosql.org/wiki/Enhancement_Requests

I've also started to arrange individual requests in a first draft  
towards scheduling them for implementation. This is very much up for  
debate, so let me know any feedback or disagreement you have or votes  
you might want to put in.

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Sun Mar  2 17:53:34 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 2 Mar 2008 22:53:34 +0000
Subject: [BioSQL-l] database loading test server (was: BioSQL bug in
	bugzilla)
In-Reply-To: <52F7E02B-7192-4F3C-9B48-887583B25CCE@gmx.net>
References: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl>
	<52F7E02B-7192-4F3C-9B48-887583B25CCE@gmx.net>
Message-ID: <320fb6e00803021453h553a5c2ay8c50381ef39d0b6a@mail.gmail.com>

>
>  Coincidentally we have been batting around the idea to have an OBF
>  machine dedicated to serve for testing and proof-of-concept
>  demonstrations of OBF projects. Indeed one of the services we had
>  thought about setting up is a BioSQL database, and it's reassuring to
>  hear independently that that would be useful.
>

The BioSQL test database would be especially useful if we have all the
Bio* projects hooked up to it, to automatically check they can all
read records written by each other.  I still haven't made time to get
BioPerl set up on my machine to check the BioSQL compatibility with
Biopython...

Peter

From hlapp at gmx.net  Sun Mar  2 22:18:47 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 22:18:47 -0500
Subject: [BioSQL-l] small "bug" correction in package BioSql
In-Reply-To: <473455BE.6040807@ebi.ac.uk>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
	<ECAC265E-DDD2-4403-800E-8B150A980093@gmx.net>
	<473336E6.6000100@ebi.ac.uk>
	<9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net>
	<473455BE.6040807@ebi.ac.uk>
Message-ID: <4C9ACC1A-8C61-4611-8083-EFAD34D186EF@gmx.net>

Just FYI, I added a section to this effect to the Enhancement Requests:

http://www.biosql.org/wiki/Enhancement_Requests#Check_constraint_on_biosequence.alphabet

Feel free to fix/add as appropriate.

	-hilmar

On Nov 9, 2007, at 7:42 AM, Richard Holland wrote:

> I did a bit of poking around in our code and internally BioJava
> represents all the default alphabet names (Protein, DNA, etc.) in
> upper case. It also allows for mixed case alphabet names.
>
> It's not quite as easy as I thought to change these to lower case as
> they are often referenced by text name, meaning other people's code
> might break if I change them.
>
> Also, as it allows for mixed-case alphabet names, I can't do a
> toUpper/toLower fudge on persistence to BioSQL, as I wouldn't
> necessarily get out what I put in!
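
The round-trip problem described above can be shown in a few lines. This is a sketch in Python rather than BioJava code; store_lowercased is a hypothetical stand-in for a persistence layer that lower-cases on write, and "MyCustomAlphabet" is a made-up mixed-case name.

```python
def store_lowercased(name):
    """Pretend persistence layer that normalizes to lower case on write."""
    return name.lower()


for name in ["DNA", "PROTEIN", "MyCustomAlphabet"]:
    stored = store_lowercased(name)
    restored = stored.upper()
    # Upper-casing on read restores all-caps names but not mixed-case ones.
    print(name, "->", stored, "->", restored, restored == name)
```

The all-upper-case default names survive the lower/upper round trip, but the mixed-case name does not, which is why a simple case conversion on persistence is not safe.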

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sun Mar  2 22:38:59 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 22:38:59 -0500
Subject: [BioSQL-l] Fwd: error on insert new sequences from GenBank: no
	annotations saved in BioSQL database
References: <A1A51FDB-C4A6-4894-8C9C-12A210B73C0D@gmx.net>
Message-ID: <917E3FCD-5FDB-460D-8F63-B897ACA5CDD2@gmx.net>

FYI, I used this to start a page on the recommended mapping of  
sequence annotation to BioSQL:

http://www.biosql.org/wiki/Annotation_Mapping

Obviously, this is very rudimentary, but everyone is welcome to add  
to it or comment with further questions. Also, one of the most  
important questions, namely a consistent vocabulary for annotation  
(qualifier) tags, isn't mentioned there (yet).

	-hilmar

Begin forwarded message:

> From: Hilmar Lapp <hlapp at gmx.net>
> Date: November 8, 2007 3:28:19 PM EST
> To: Eric Gibert <ericgibert at yahoo.fr>
> Cc: biopython at lists.open-bio.org, BioJava <biojava-l at biojava.org>
> Subject: Re: [Biojava-l] [BioPython] error on insert new sequences  
> from GenBank: no annotations saved in BioSQL database
>
> Maybe we need to hold some mini-hackathon to make the different
> toolkits compatible in how they map annotation to the schema.
> Obviously I don't know whether you have the latest Biojava setup
> here, but I'll just comment how BioPerl/Bioperl-db would map this:
>
> 'ORIGIN' - if I'm not mistaken this is only a token that introduces
> the actual sequence. I'm not sure what Biojava is storing as value here.
>
> 'DIVISION' - this maps to column division in table bioentry (though I
> agree that if perfectly following the weak typing principle this
> should be a tag/value association, but at present it's still an actual
> column)
>
> 'genbank_accessions' - secondary accession numbers indeed go into the
> qualifier value table. The primary accession maps to column accession
> in table bioentry
>
> 'TITLE' - this is part of a publication reference, and should map to
> column title in table reference (which it does in bioperl-db)
>
> 'cross_references' - not sure where these would be coming from in
> GenBank format; for EMBL this will map to the dbxref table
>
> 'data_file_division' - not sure what this is (same as DIVISION?)
>
> 'VERSION' - in BioPerl we parse this apart into a version for the
> accession (which is column version in table bioentry) and the GI
> number, which maps to column identifier in table bioentry
>
> 'references' - these map to table reference (and bioentry_reference
> for association with the bioentry)
>
> 'KEYWORDS' - indeed these map to bioentry_qualifier_value
>
> 'GI' - maps to column identifier in table bioentry
>
> 'SIZE' - not sure what size that is. If it is the length of the
> sequence, it should (and in BioPerl/bioperl-db does) map to column
> length in table biosequence
>
> 'DEFINITION' - maps to column description in table bioentry
>
> 'REFERENCE' - should be the same as for 'references'
>
> 'MDAT' - not sure what this is
>
> 'ORGANISM' - this is the organism and maps to the table taxon (and
> taxon_name), with a foreign key in bioentry pointing to the taxon
>
> 'JOURNAL' - this is part of a reference, see 'references'
>
> 'ACCESSION' - the primary accession, maps to column accession in
> table bioentry
>
> 'LOCUS' - in the file itself this is an entire line consisting of
> multiple fields; BioPerl/bioperl-db maps the locus name (the first
> token after the literal token LOCUS) to column name in table bioentry
>
> 'SOURCE' - this is the organism, see 'ORGANISM'
>
> 'PUBMED' - this is part of a literature reference, and maps to a
> foreign key in the reference table (reference.dbxref) to a dbxref
> entry with PUBMED or PMID as the database and the pubmed ID as the
> accession
>
> 'AUTHORS' - part of a literature reference, maps to column authors in
> table reference
>
> 'TYPE' - not sure what this is. If it's the alphabet, it maps to
> table biosequence, column alphabet
>
> 'CIRCULAR' - this at present indeed maps to bioentry_qualifier_value,
> though there have been plans to make it a column in table biosequence.
>
> Note that this could in fact be the way Biojava stores it too, but
> upon retrieval represents it in the way you are seeing it.
>
> Hth,
>
> 	-hilmar
>
> On Nov 8, 2007, at 12:50 PM, Eric Gibert wrote:
>
>> Dear all,
>>
>> When I retrieve a BioSQL.BioSeq.DBSeqRecord which was inserted
>> previously by my BioJava application, I have:
>>
>> print "Debug on Seq:", Seq.id, "=", Seq.annotations.keys()
>>
>> Debug on Seq: AJ459190.1 = ['ORIGIN', 'DIVISION',
>> 'genbank_accessions', 'TITLE', 'cross_references',
>> 'data_file_division', 'VERSION', 'references', 'KEYWORDS', 'GI',
>> 'SIZE', 'DEFINITION', 'REFERENCE', 'MDAT', 'ORGANISM', 'JOURNAL',
>> 'ACCESSION', 'LOCUS', 'SOURCE', 'PUBMED', 'AUTHORS', 'TYPE',
>> 'CIRCULAR']
>>
>> but a freshly inserted BioSeq by BioPython 1.44 only gives me:
>> Debug on Seq: EF631597.1 =  ['cross_references', 'dates',
>> 'references', 'gi', 'data_file_division']
>>
>>
>> Once I look in the table bioentry_qualifier_value
>>
>> * 20 records for a Sequence imported by BioJava
>> * 1 only for a Sequence inserted by BioPython: the date which
>> should be inserted by "_load_bioentry_date" in BioSQL/Loader.py
>>
>> Quite a few annotations missing, no?
>>
>> Any idea?
>>
>> Eric
>>
>>
>>
>>
>>
>> _______________________________________________
>> BioPython mailing list  -  BioPython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
> -- 
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
>
>
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
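
For reference, the mapping described in the forwarded message above can be condensed into a lookup table. This is only a sketch in Python of what the message states for BioPerl/bioperl-db; tags the message is unsure about (ORIGIN, MDAT, data_file_division) are left out, and the helper name biosql_target is hypothetical.

```python
# GenBank annotation tag -> BioSQL target, as described in the message above.
ANNOTATION_MAP = {
    "LOCUS": "bioentry.name",
    "ACCESSION": "bioentry.accession",
    "VERSION": "bioentry.version",  # the GI part goes to bioentry.identifier
    "GI": "bioentry.identifier",
    "DEFINITION": "bioentry.description",
    "DIVISION": "bioentry.division",
    "KEYWORDS": "bioentry_qualifier_value",
    "genbank_accessions": "bioentry_qualifier_value",  # secondary accessions
    "CIRCULAR": "bioentry_qualifier_value",
    "ORGANISM": "taxon / taxon_name (foreign key in bioentry)",
    "SOURCE": "taxon / taxon_name (foreign key in bioentry)",
    "references": "reference / bioentry_reference",
    "REFERENCE": "reference / bioentry_reference",
    "TITLE": "reference.title",
    "AUTHORS": "reference.authors",
    "JOURNAL": "reference",
    "PUBMED": "dbxref (via foreign key in reference)",
    "cross_references": "dbxref",  # stated for EMBL input
    "SIZE": "biosequence.length",  # if it is the sequence length
    "TYPE": "biosequence.alphabet",  # if it is the alphabet
}


def biosql_target(tag):
    """Return the described BioSQL target for a tag, or None if not covered."""
    return ANNOTATION_MAP.get(tag)


print(biosql_target("DEFINITION"))
print(biosql_target("MDAT"))
```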

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From cjfields at uiuc.edu  Sun Mar  2 23:36:56 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 2 Mar 2008 22:36:56 -0600
Subject: [BioSQL-l] Fwd: error on insert new sequences from GenBank: no
	annotations saved in BioSQL database
In-Reply-To: <917E3FCD-5FDB-460D-8F63-B897ACA5CDD2@gmx.net>
References: <A1A51FDB-C4A6-4894-8C9C-12A210B73C0D@gmx.net>
	<917E3FCD-5FDB-460D-8F63-B897ACA5CDD2@gmx.net>
Message-ID: <E90DD156-E6DE-47F8-AE0C-5FC1D039E377@uiuc.edu>


On Mar 2, 2008, at 9:38 PM, Hilmar Lapp wrote:

> FYI, I used this to start a page on the recommended mapping of  
> sequence annotation to BioSQL:
>
> http://www.biosql.org/wiki/Annotation_Mapping
>
> Obviously, this is very rudimentary, but everyone is welcome to add  
> to it or comment with further questions. Also, one of the most  
> important questions, namely a consistent vocabulary for annotation  
> (qualifier) tags, isn't mentioned there (yet).
>
> 	-hilmar
>
>> ...
>> Maybe we need to hold some mini-hackathon to make the different
>> toolkits compatible in how they map annotation to the schema.
>> Obviously I don't know whether you have the latest Biojava setup
>> here, but I'll just comment how BioPerl/Bioperl-db would map this:

These are the ones I know of:

>> 'cross_references' - not sure where these would be coming from in
>> GenBank format; for EMBL this will map to the dbxref table

GenPept has DBSOURCE, so maybe from there?

>> 'data_file_division' - not sure what this is (same as DIVISION?)

Not sure about that one, but division sounds right.

>> 'MDAT' - not sure what this is

Modification Date, I think.  'MDAT' is a field name used for limits in  
Entrez searches:

	Field code: MDAT
	      name: Modification Date
	      desc: Date of last update
	     count: 4012
	Attributes: is_date,is_singletoken

chris

From markjschreiber at gmail.com  Tue Mar  4 21:06:17 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 5 Mar 2008 10:06:17 +0800
Subject: [BioSQL-l] multiple species for a sequence
In-Reply-To: <BFA173E9-1DB8-47F8-BC5F-25C35324C69B@uiuc.edu>
References: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu>
	<86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net>
	<BFA173E9-1DB8-47F8-BC5F-25C35324C69B@uiuc.edu>
Message-ID: <93b45ca50803041806o6f802548g4e408339d1a40c27@mail.gmail.com>

BioJava doesn't support multiple taxa per sequence.  It's something to
consider though.

Philosophically you really have to wonder about the meaning of species
when you have a chimera : )  Should it not be a hybrid species all on
its own?  I wonder what they will do when Craig Venter produces
Craigus ventus...

- Mark

On Mon, Mar 3, 2008 at 2:00 AM, Chris Fields <cjfields at uiuc.edu> wrote:
>
>
>  On Mar 2, 2008, at 11:33 AM, Hilmar Lapp wrote:
>
>  > On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:
>  >
>  >> I'm looking at a bioperl bug I filed a while back that deals with
>  >> multiple species in a sequence file, such as found for AJ428955:
>  >>
>  >> ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
>  >> XX
>  >> AC   AJ428955;
>  >> XX
>  >> DT   09-JUL-2002 (Rel. 72, Created)
>  >> DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
>  >> XX
>  >> DE   Hepatitis GB virus B subgenomic replicon neoRepB
>  >> XX
>  >> KW   core-neo fusion protein; core-neo gene; polyprotein.
>  >> XX
>  >> OS   Hepatitis GB virus B
>  >> OC   Viruses; ssRNA positive-strand viruses, no DNA stage;
>  >> Flaviviridae.
>  >> XX
>  >> OS   Encephalomyocarditis virus
>  >> OC   Viruses; ssRNA positive-strand viruses, no DNA stage;
>  >> Picornaviridae;
>  >> OC   Cardiovirus.
>  >>
>  >> ...
>  >>
>  >> We could probably add support in bioperl fairly easily (Bio::Seq
>  >> could just return an array or the first species object based on
>  >> context), but would BioSQL support sequences like this?
>  >
>  > No it wouldn't. There may only be one species (taxon) per sequence.
>  >
>  > There has been a lot of discussion about this in the past mostly
>  > driven by the former SwissProt peculiarity of collapsing sequences
>  > by sequence identity into a single record. We held out and
>  > eventually UniProt dropped this practice.
>
>  I'm unsure how often these pop up.  The behavior of both EMBL and
>  GenBank parsers assumes one species (as does Bio::Seq); the embl
>  parser picks up both and just replaces the first with the second:
>
>  ...
>
> DE   Hepatitis GB virus B subgenomic replicon neoRepB
>  XX
>  KW   core-neo fusion protein; core-neo gene; polyprotein.
>  XX
>
> OS   Encephalomyocarditis virus
>  OC   Viruses; ssRNA positive-strand viruses, no DNA stage;
>  Picornaviridae;
>  OC   Cardiovirus.
>  XX
>  RN   [1]
>  ...
>
>
>  > I guess we never quite decided what to do about chimeric sequences
>  > like the above. Note that the GenBank record gives this differently:
>  >
>  > http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885
>  >
>  > Here, there's one taxon (ORGANISM line) reference, but two localized
>  > 'source' features in the feature table. (I'm actually not 100% sure
>  > what the genbank parser would do with this - i.e., whether the
>  > second source feature will override the taxon_id found in the
>  > first.) Because seqfeatures (in BioSQL) don't have a link to taxon,
>  > you wouldn't be able to hit the sequence by its second (chimeric)
>  > taxon if that were your query criteria (though you could store it
>  > fine, and if you queried by dbxrefs of features of type 'source',
>  > you would find it).
>
>  The genbank parser gets the taxon and tax ID correct; I would think
>  when it hit the next source feature key it would assign the wrong tax
>  ID to the species object but maybe there's a secondary check.  Both
>  output the source in feature tables just fine.
>
>
>  > At the end of the day, BioSQL will evolve (hopefully) quickly to
>  > support what the Bio* toolkits support, and will be much slower to
>  > change in ways that Bio* wouldn't be able to take advantage of
>  > anyway. At least that's my current vision of it, and of course is up
>  > for debate as to whether that's a useful vision as much as anything
>  > else.
>  >
>  > So, as you say, right now BioPerl, and AFAIAA any of the other Bio*
>  > toolkits, doesn't support more than one species per sequence, but as
>  > soon as that changes, there's a clear need for BioSQL to follow along.
>  >
>  > Does that make sense?
>  >
>  >       -hilmar
>
>  Yes.  I think we could add in support for multiple species fairly
>  easily but I'll probably hold off on anything until after a 1.6
>  release (i.e. push it to the next developer series, which gives us
>  more time to think on how to implement this in a BioSQL-friendly way).
>
>  chris
>
>
> _______________________________________________
>  BioSQL-l mailing list
>  BioSQL-l at lists.open-bio.org
>  http://lists.open-bio.org/mailman/listinfo/biosql-l
>
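
The EMBL record quoted in this thread carries two OS lines, and the
parser behaviour described above keeps only the last one. As a minimal
sketch (not BioPerl's or BioJava's actual parser), the following Python
collects every OS line instead of overwriting. The function name
species_names is hypothetical, and the sketch assumes each OS line names
one organism; real EMBL records can wrap a long name over several OS
lines, which this does not handle.

```python
def species_names(embl_text):
    """Return the organism named on every OS line, in file order."""
    names = []
    for line in embl_text.splitlines():
        # EMBL line codes are two letters followed by three spaces.
        if line.startswith("OS   "):
            names.append(line[5:].strip())
    return names


# Abridged from the AJ428955 record quoted in the thread.
record = """\
ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
XX
OS   Hepatitis GB virus B
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
"""

print(species_names(record))
```

Returning a list leaves it to the caller to decide whether to take the
first entry (today's single-species behaviour) or all of them.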

From cjfields at uiuc.edu  Wed Mar  5 18:24:03 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 5 Mar 2008 17:24:03 -0600
Subject: [BioSQL-l] bioperl-db bugs
Message-ID: <C46D2101-D4FF-47E9-89BA-0D84114CCB7B@uiuc.edu>

Hilmar,

I think I have two bioperl-db bugs sorted out, but I'm trying to  
determine whether the solution is a side-effect, a feature, or a bug.   
Dmitry has filed two bug reports which are somewhat related:

http://bugzilla.open-bio.org/show_bug.cgi?id=2280
http://bugzilla.open-bio.org/show_bug.cgi?id=2281

I have added my comments to it, but maybe you can shed some more light  
on this.  What he is trying to do is copy a persistent Seq object to a  
different namespace; load_seqdatabase.pl won't let him do that  
directly using the same sequence file.  If he changes the namespace()  
and store()s it using a script, the seq is moved to the new namespace,  
not updated.

My reasoning is this is a feature (by not changing the primary_key,  
you don't store a new sequence but update the current one).  However,  
if the primary_key is unset (undef), then it appears you can copy the  
sequence over (from Dmitry's script, with my addition noted):

...
my $ns1 = 'space1';
my $ns2 = 'space2';

my $seqadp = $db->get_object_adaptor('Bio::SeqI');
my $aux_seq = Bio::Seq::RichSeq->new(
     -accession_number => 'NC_005982',
     -version => 1,
     -namespace => $ns1);
my $seq = $seqadp->find_by_unique_key($aux_seq);

# store the found sequence in the second biodatabase:
my $pseq = $seqadp->create_persistent($seq);  # wrap the sequence found above
$pseq->namespace($ns2);
$pseq->primary_key(undef);  # my addition, which appears to work
$pseq->store();
$seqadp->commit;
...

My question: is this an intended effect?  The ability to assign undef  
to primary_key seems intentional based on the method code, but I'm a  
bit uncertain here.

chris

From hlapp at gmx.net  Thu Mar  6 00:03:26 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 6 Mar 2008 00:03:26 -0500
Subject: [BioSQL-l] Announcement: BioSQL v1.0.0 released
Message-ID: <B496379B-74F4-4A95-9F50-3DC13770EE5F@gmx.net>

BioSQL v1.0.0 Release
=====================

I am extremely pleased to announce the release of version 1.0.0
(code-named Tokyo, see below) of BioSQL. The release can be
downloaded at the following location, in the following formats:

http://biosql.org/DIST/biosql-1.0.0.tar.gz
http://biosql.org/DIST/biosql-1.0.0.tar.bz2
http://biosql.org/DIST/biosql-1.0.0.zip (has Windows-style EOL)

MD5 signatures (http://biosql.org/DIST/SIGNATURES.md5):
MD5(biosql-1.0.0.tar.bz2)= 2b09a821b9d94bb1e94c3c79dc2f4cff
MD5(biosql-1.0.0.tar.gz)= e47982d979ddb98aae640b5ab55ce2c6
MD5(biosql-1.0.0.zip)= 06913c8639ca4fe7f9000b556d8a04ed
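
For anyone who wants to check a download against the signatures above,
here is a minimal sketch using Python's standard hashlib module. The
helper names md5_hex and matches_published are hypothetical and not part
of the release.

```python
import hashlib


def md5_hex(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at path, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare a file's MD5 with a published value, ignoring case and spaces."""
    return md5_hex(path) == published_hex.strip().lower()
```

With the tarball in the current directory, a call such as
matches_published("biosql-1.0.0.tar.gz", "e47982d979ddb98aae640b5ab55ce2c6")
should return True for an intact download.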

The core BioSQL schema is a generic, extensible relational model for
sequences, sequence features, their annotation, and ontology terms. It
is also designed as the interoperable persistence interface between
the Bio* projects.

This version of the schema has essentially been the same since
November 2004. Software that worked with schema versions downloaded
from CVS (or, as of lately, svn) after November 2004 should work with
all 1.0.x releases.

This release contains
  - the core BioSQL schema as DDL (Data Definition Language) for the
    following RDBMSs: MySQL, PostgreSQL, Oracle, HSQLDB, and Apache
    Derby,
  - ancillary (but optional) schema files for PostgreSQL,
  - documentation and an ERD (Entity-Relationship Diagram), and
  - a Perl script that can pre-load (and update) a BioSQL instance with
    the NCBI taxonomy.

Installation instructions for MySQL and PostgreSQL are in the file
INSTALL, and the file doc/bj_and_bsql_oracle_howto.htm has
instructions for installing the Oracle version.

Additional information regarding BioSQL, including links to language
bindings, a roadmap to future releases and enhancements, and possible
local optimizations is available from the BioSQL website at
http://biosql.org.

On behalf of the BioSQL developers,

       Hilmar Lapp

Acknowledgments
---------------

BioSQL in general and this release in particular owe an enormous
amount to a number of people and would not exist without their
contributions, the contributions of people on the biosql-l mailing
list, and the support of other developers and users from the Bio*
community.

Ewan Birney created the first version of the schema and during the
2003 BioHackathon in Singapore tested and wrote much of the INSTALL
document. Elia Stupka and Chris Mungall made significant changes at
the 2002 BioHackathons in Tucson, AZ, and Cape Town, South
Africa. Aaron Mackey was instrumental in the changes made at the
Singapore BioHackathon, which set the path to the version (code-named
'post-Singapore') that eventually stabilized as v1.0. Matthew Pocock
and Thomas Down provided important input for the ontology model.

This release and the accompanying work on cleaning up, updating
documentation, and jump-starting a useful (wiki) website was
irreversibly set in motion at the BioHackathon 2008 in Tokyo, and
would not have happened without the active encouragement from
several participants, especially Heikki Lehvaslaiho, Mark Schreiber,
Richard Holland, and Raoul Bonnal. Finally, without the superb and
prompt help from Mauricio Herrera Cuadra and Jason Stajich with
various wiki and other admin issues that occasionally reared their
heads we wouldn't have made it to this point.

In recognition of the role the BioHackathon 2008 played in getting
this release out the door, and in keeping with an informal tradition
held up since the first BioHackathon, I am code-naming the 1.0.x
release series the Tokyo release series of BioSQL.

Thank you to everyone!

License
-------

BioSQL is free software: you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.


-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sun Mar  9 19:38:18 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 9 Mar 2008 19:38:18 -0400
Subject: [BioSQL-l] bioperl-db bugs
In-Reply-To: <C46D2101-D4FF-47E9-89BA-0D84114CCB7B@uiuc.edu>
References: <C46D2101-D4FF-47E9-89BA-0D84114CCB7B@uiuc.edu>
Message-ID: <DBCA4AD5-9C5D-499F-9506-5E8F3245DB28@gmx.net>

Hi Chris,

I added comments to both bug reports. This belongs to BioPerl,  
though, as it has only to do with its language binding.

The tidbit that may be worth keeping in mind for a general BioSQL
audience is that the bioentry namespace (foreign key to biodatabase) is
part of the (compound) bioentry unique keys. The identifier column used
to be unique by itself (and could still be made such in a local
instance; there's a comment to this effect in the DDL), but that was
changed a while ago. (Also, if one uses any of the Bio* language
bindings, changing a unique key constraint to something that differs
from what the language binding assumes may be asking for a lot of
trouble. Bioperl-db will expect the combination of primary_id() and
namespace() to match if the latter is provided.)

	-hilmar
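
To make the compound-key point concrete, here is a small sketch using
Python's sqlite3 module. The two tables are a heavily trimmed stand-in
for the real BioSQL DDL (only the columns needed for the unique keys are
kept), the add_entry helper is hypothetical, and the identifier value
'id-1' is made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE biodatabase (
    biodatabase_id INTEGER PRIMARY KEY,
    name           TEXT NOT NULL UNIQUE
);
CREATE TABLE bioentry (
    bioentry_id    INTEGER PRIMARY KEY,
    biodatabase_id INTEGER NOT NULL REFERENCES biodatabase (biodatabase_id),
    accession      TEXT NOT NULL,
    identifier     TEXT,
    version        INTEGER NOT NULL,
    -- the namespace is part of both compound unique keys
    UNIQUE (accession, biodatabase_id, version),
    UNIQUE (identifier, biodatabase_id)
);
""")
conn.executemany(
    "INSERT INTO biodatabase (biodatabase_id, name) VALUES (?, ?)",
    [(1, "space1"), (2, "space2")],
)


def add_entry(db_id, accession, version, identifier):
    """Insert one bioentry row (hypothetical helper for this sketch)."""
    conn.execute(
        "INSERT INTO bioentry (biodatabase_id, accession, identifier, version)"
        " VALUES (?, ?, ?, ?)",
        (db_id, accession, identifier, version),
    )


add_entry(1, "NC_005982", 1, "id-1")
add_entry(2, "NC_005982", 1, "id-1")  # same keys, other namespace: accepted

try:
    add_entry(1, "NC_005982", 1, "id-1")  # same namespace again: rejected
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)
```

The same accession, version and identifier can therefore live in two
namespaces side by side, which is what copying a sequence to a second
biodatabase relies on.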

On Mar 5, 2008, at 6:24 PM, Chris Fields wrote:

> Hilmar,
>
> I think I have two bioperl-db bugs sorted out, but I'm trying to  
> determine whether the solution is a side-effect, a feature, or a  
> bug.  Dmitry has filed two bug reports which are somewhat related:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2280
> http://bugzilla.open-bio.org/show_bug.cgi?id=2281
>
> I have added my comments to it, but maybe you can shed some more  
> light on this.  What he is trying to do is copy a persistent Seq  
> object to a different namespace; load_seqdatabase.pl won't let him  
> do that directly using the same sequence file.  If he changes the  
> namespace() and store()s it using a script, the seq is moved to the  
> new namespace, not updated.
>
> My reasoning is this is a feature (by not changing the primary_key,  
> you don't store a new sequence but update the current one).   
> However, if the primary_key is unset (undef), then it appears you  
> can copy the sequence over (from Dmitry's script, with my addition  
> noted):
>
> ...
> my $ns1 = 'space1';
> my $ns2 = 'space2';
>
> my $seqadp = $db->get_object_adaptor('Bio::SeqI');
> my $aux_seq = Bio::Seq::RichSeq->new(
>     -accession_number => 'NC_005982',
>     -version => 1,
>     -namespace => $ns1);
> my $seq = $seqadp->find_by_unique_key($aux_seq);
>
> # store the found sequence in the second biodatabase:
> my $pseq = $seqadp->create_persistent($seq);  # wrap the sequence found above
> $pseq->namespace($ns2);
> $pseq->primary_key(undef);  # my addition, which appears to work
> $pseq->store();
> $seqadp->commit;
> ...
>
> My question: is this an intended effect?  The ability to assign  
> undef to primary_key seems intentional based on the method code,  
> but I'm a bit uncertain here.
>
> chris
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From jswetnam at gmail.com  Mon Mar 10 15:27:46 2008
From: jswetnam at gmail.com (James Swetnam)
Date: Mon, 10 Mar 2008 15:27:46 -0400
Subject: [BioSQL-l] Possible Mysql 5.x bug
Message-ID: <2ABD56A9-9632-4AB1-BC54-B0AF71037DC8@gmail.com>

First off, thank you very much to the developers for creating and
maintaining such a useful and interesting project.  I think I have
found a small syntactical bug; as a caveat, however, I am not a
database developer and have very little experience in these matters.
I do know how to read documentation though, which I've relied heavily
on to write this email.

As per the biopython setup tutorial I'm attempting to run the
biosqldb-mysql.sql file on Mac OS X Leopard.  Here is my mysql version
string:

cardozo13:sql james$ mysql -V
mysql  Ver 14.12 Distrib 5.0.54-20071214, for apple-darwin9.1.0
(powerpc) using  EditLine wrapper

And my procedure (after grabbing the biosql source via CVS).

cardozo13:sql james$ mysqladmin -u root -p create bioseqdb
Enter password:
cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql
Enter password:
ERROR 1064 (42000) at line 169: You have an error in your SQL syntax;
check the manual that corresponds to your MySQL server version for the
right syntax to use near '--CREATE INDEX ontrel_subjectid ON
term_relationship(subject_term_id)' at line 1

Interesting.  Let's take a look at line 169:

--CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);

And an excerpt from the documentation for my version of MySQL (5.0
reference manual), section 1.8.5.6. '--' as the Start of a Comment:

Standard SQL uses "--" as a start-comment sequence. MySQL Server uses
"#" as the start comment character. MySQL Server 3.23.3 and up also
supports a variant of the "--" comment style. That is, the "--"
start-comment sequence must be followed by a space (or by a control
character such as a newline). The space is required to prevent
problems with automatically generated SQL queries that use constructs
such as the following, where we automatically insert the value of the
payment for payment:

OK. So after replacing all the lines in which -- is not followed by a
space (thank you regexps), it works beautifully.

cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql
Enter password:

Should this change be implemented?  Or am I missing something?

James Swetnam
Research Technician
New York University School of Medicine


From hlapp at gmx.net  Mon Mar 10 23:05:32 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 10 Mar 2008 23:05:32 -0400
Subject: [BioSQL-l] Possible Mysql 5.x bug
In-Reply-To: <2ABD56A9-9632-4AB1-BC54-B0AF71037DC8@gmail.com>
References: <2ABD56A9-9632-4AB1-BC54-B0AF71037DC8@gmail.com>
Message-ID: <9051AFFE-8660-4E21-B25F-93D1FB70D98B@gmx.net>

Hi James,

thanks for reporting this. Sebastian Bassi beat you to it, though,  
and it has actually been fixed in svn, and is also fixed in the 1.0.0  
release.

BioSQL has meanwhile moved to svn; the anonymous cvs server is still up,
but has not been updated since the switch-over to svn. Instructions for
downloading from svn and download location of the 1.0.0 release are  
on the BioSQL wiki at http://biosql.org.

Let us know if you encounter any difficulties. And great that you're  
finding the project useful!

	-hilmar
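
For anyone still holding an older checkout, the kind of regexp fix James
describes can be sketched in Python as follows. This is not the change
that went into svn, only an illustration; fix_mysql_comments is a
hypothetical name. It touches only "--" at the start of a line (after
optional indentation), so expressions such as credit--1 inside a
statement are left alone.

```python
import re

# "--" at the start of a line (after optional spaces or tabs) that is
# immediately followed by a non-whitespace character.
_BARE_COMMENT = re.compile(r"^([ \t]*)--(?=\S)", flags=re.MULTILINE)


def fix_mysql_comments(sql_text):
    """Insert the space MySQL requires after a line-leading '--'."""
    return _BARE_COMMENT.sub(r"\1-- ", sql_text)


ddl = "--CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);\n"
print(fix_mysql_comments(ddl), end="")
```

Lines that already read "-- comment", and lines consisting of "--"
alone, pass through unchanged.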

On Mar 10, 2008, at 3:27 PM, James Swetnam wrote:

> First off, thank you very much to the developers for creating and
> maintaining such a useful and interesting project.  I think I have
> found a small syntactical bug; as a caveat, however, I am not a
> database developer and have very little experience in these matters.
> I do know how to read documentation though, which I've relied heavily
> on to write this email.
>
> As per the biopython setup tutorial I'm attempting to run the
> biosqldb-mysql.sql file on Mac OS X Leopard.  Here is my mysql version
> string:
>
> cardozo13:sql james$ mysql -V
> mysql  Ver 14.12 Distrib 5.0.54-20071214, for apple-darwin9.1.0
> (powerpc) using  EditLine wrapper
>
> And my procedure (after grabbing the biosql source via CVS).
>
> cardozo13:sql james$ mysqladmin -u root -p create bioseqdb
> Enter password:
> cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql
> Enter password:
> ERROR 1064 (42000) at line 169: You have an error in your SQL syntax;
> check the manual that corresponds to your MySQL server version for the
> right syntax to use near '--CREATE INDEX ontrel_subjectid ON
> term_relationship(subject_term_id)' at line 1
>
> Interesting.  Let's take a look at line 169:
>
> --CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);
>
> And an excerpt from the documentation for my version of MySQL (5.0
> reference manual), section 1.8.5.6. '--' as the Start of a Comment:
>
> Standard SQL uses "--" as a start-comment sequence. MySQL Server uses
> "#" as the start comment character. MySQL Server 3.23.3 and up also
> supports a variant of the "--" comment style. That is, the "--"
> start-comment sequence must be followed by a space (or by a control
> character such as a newline). The space is required to prevent
> problems with automatically generated SQL queries that use constructs
> such as the following, where we automatically insert the value of the
> payment for payment:
>
> OK. So after replacing all the lines in which -- is not followed by a
> space (thank you regexps), it works beautifully.
>
> cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql
> Enter password:
>
> Should this change be implemented?  Or am I missing something?
>
> James Swetnam
> Research Technician
> New York University School of Medicine
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Tue Mar 11 14:51:47 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 11 Mar 2008 18:51:47 +0000
Subject: [BioSQL-l] Biopython documentation in BioSQL SVN
In-Reply-To: <320fb6e00803111148k52150fdcu3b3c22f72f59f514@mail.gmail.com>
References: <320fb6e00803111148k52150fdcu3b3c22f72f59f514@mail.gmail.com>
Message-ID: <320fb6e00803111151q645c1f0cgead7842e8ab0d0d@mail.gmail.com>

Hello,

Over on the Biopython mailing list, James Swetnam drew my
attention to the fact that we still had documentation referring to
installing BioSQL from CVS (predating both the move to SVN
and the official 1.0 release).

I've updated our wiki page, http://biopython.org/wiki/BioSQL

However, there is some older LaTeX based documentation on our webpage,
http://biopython.org/DIST/docs/biosql/python_biosql_basic.html
http://biopython.org/DIST/docs/biosql/python_biosql_basic.pdf

These are currently living in the BioSQL repository, which I don't
think I have access to.
http://code.open-bio.org/svnweb/index.cgi/biosql/browse/biosql-schema/trunk/doc/biopython/

Does it make sense to have this documentation separate from the
Biopython code it refers to (which lives in the Biopython repository)?
For one thing, it complicates access rights for developers.

What I would suggest is just to:

(*) add a disclaimer to the top of python_biosql_basic.tex saying this
    document is deprecated, and giving a link to the wiki page,
    http://biopython.org/wiki/BioSQL
(*) regenerate the PDF and HTML files.
(*) Update these three files in BioSQL's SVN repository.
(*) Copy the new PDF and HTML files over to the Biopython webserver.

Thanks

Peter

From hlapp at gmx.net  Tue Mar 11 15:57:16 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 11 Mar 2008 15:57:16 -0400
Subject: [BioSQL-l] Biopython documentation in BioSQL SVN
In-Reply-To: <320fb6e00803111151q645c1f0cgead7842e8ab0d0d@mail.gmail.com>
References: <320fb6e00803111148k52150fdcu3b3c22f72f59f514@mail.gmail.com>
	<320fb6e00803111151q645c1f0cgead7842e8ab0d0d@mail.gmail.com>
Message-ID: <B66B13C8-6520-46F8-AD96-DABB4D06F91D@gmx.net>


On Mar 11, 2008, at 2:51 PM, Peter wrote:

> However, there is some older LaTeX based documentation on our webpage,
> http://biopython.org/DIST/docs/biosql/python_biosql_basic.html
> http://biopython.org/DIST/docs/biosql/python_biosql_basic.pdf
>
> These are currently living in the BioSQL repository,

You mean that the originals are, i.e., the source .tex file, right?  
The files in the BioSQL repository have been updated, and the updates  
should be in the v1.0.0 release.

> [...]
> Does it make sense to have this documentation separate from the
> Biopython code it refers to (which lives in the Biopython repository)?
> For one thing, it complicates access rights for developers.

Indeed. You can have write access but that doesn't mean it would then
be easy for you folks to maintain (its being in a non-Biopython
repository would likely make it slip from your minds again).

However, at the end of the day it is your call. I'm happy to leave it  
there, especially if there is continuing interest from Biopython  
folks to keep it updated (if there isn't, I may schedule it for  
deletion for one of the 1.1 or higher releases).

>
> What I would suggest is just to:
>
> (*) add a disclaimer to the top of python_biosql_basic.tex saying this
>     document is deprecated, and giving a link to the wiki page,
>     http://biopython.org/wiki/BioSQL

Just send me a patch of the change you would like to make.

> (*) regenerate the PDF and HTML files.

Those have been regenerated already, before the v1.0.0 release (by  
me, under some pains trying to get HeVeA to do what the original  
creators seemed to have gotten it to do).

> (*) Update these three files in BioSQL's SVN repository.

Done already as far as the change to svn is concerned. Actually, some  
Biopythonist (Sebastian?) walked through the file and made sure  
everything works as described, giving rise to an additional change.

> (*) Copy the new PDF and HTML files over to the Biopython webserver.


Feel free to grab them from svn (or from the BioSQL 1.0.0 release,  
there haven't been any changes since the release).

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Thu Mar 13 11:06:18 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Mar 2008 15:06:18 +0000
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
Message-ID: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>

Dear list,

One of the unresolved issues with Biopython's BioSQL interface is
dealing with the NCBI taxon ID when loading sequences into the
database.

As I understand it, ideally before loading any sequences, the user
will have loaded in the entire NCBI taxonomy using the
load_ncbi_taxonomy.pl script, as I described here:
http://biopython.org/wiki/BioSQL#NCBI_Taxonomy

When a new sequence is added to the database with a known taxon id,
there is no problem.  But what happens if it's a recently sequenced organism
which isn't defined yet in the BioSQL taxonomy tables?  Could/should
the user re-run load_ncbi_taxonomy.pl, and then load in their new
sequence?

Right now in Biopython, due to what appears to have been intended as a
short term hack, we simply don't record the taxon id at all (!), and I
would like to fix this (bug 2422).
http://bugzilla.open-bio.org/show_bug.cgi?id=2422

How do BioPerl et al deal with this issue?  Do they try and update the
taxonomy tables using the available information in the new record's
annotation (i.e. the new taxon id and the species name)?  Do they
look up the NCBI taxonomy definition via the internet?  Do they throw
an error and halt?

Thanks,

Peter
(Biopython)

From hlapp at gmx.net  Thu Mar 13 18:51:13 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 13 Mar 2008 18:51:13 -0400
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
Message-ID: <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>

(this is more of a bioperl question than a biosql one)

The load_ncbi_taxonomy.pl script is designed to update the taxon
tables in a non-disruptive way, and if there weren't many changes it
shouldn't actually take that long (except that recalculating the
nested set values may take a couple of minutes).

Bioperl-db will store the taxon information it finds in the  
Bio::Species object if it can't locate the taxon by lookup, and will  
not raise an error. The problem with this is that it relies on the  
Bio::SeqIO parser to have gotten the species and lineage information  
correct, which is sometimes a wrong assumption for exotic species.  
Most often the error will not manifest itself at the time of storing  
the erroneously parsed information, but when it is re-retrieved and  
used to populate a Bio::Species object.

For the SymAtlas project we had this situation (new species in  
sequence updates that the last NCBI taxonomy update hadn't yet  
brought in) quite regularly. I wrote a SQL script that would fix those
'haphazard' additions such that load_ncbi_taxonomy would update them  
to their correct values come the next NCBI taxonomy update. I can  
send you the script (it would be for the Oracle version), but I'm not  
sure this is a widely viable strategy.

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Thu Mar 13 19:13:32 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Mar 2008 23:13:32 +0000
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
Message-ID: <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>

On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
> (this is more of a bioperl question than a biosql one)

Well, yes and no.  And I'm not subscribed to the Bioperl list, nor the
BioJava one, nor the BioRuby one.

>  The load_ncbi_taxonomy.pl script is designed to update the taxon
>  tables in a non-disruptive way, and if there weren't many changes
>  shouldn't actually take that long (except that recalculating the
>  nested set values may take a couple of minutes).

Do you think when faced with a novel taxon id, Biopython/BioPerl/...
could write some minimal taxonomy entry (without any guess work based
on the species name), in order to record the sequence's taxon - and
then running an improved load_ncbi_taxonomy.pl at a later date would
sort out the proper taxonomy?

>  Bioperl-db will store the taxon information it finds in the
>  Bio::Species object if it can't locate the taxon by lookup, and will
>  not raise an error. The problem with this is that it relies on the
>  Bio::SeqIO parser to have gotten the species and lineage information
>  correct, which is sometimes a wrong assumption for exotic species.
>  Most often the error will not manifest itself at the time of storing
>  the erroneously parsed information, but when it is re-retrieved and
>  used to populate a Bio::Species object.

This is what I would like to avoid with Biopython.

>  For the SymAtlas project we had this situation (new species in
>  sequence updates that the last NCBI taxonomy update hadn't yet
>  brought in) quite regularly. I wrote a SQL script would fix those
>  'haphazard' additions such that load_ncbi_taxonomy would update them
>  to their correct values come the next NCBI taxonomy update. I can
>  send you the script (it would be for the Oracle version), but I'm not
>  sure this is a widely viable strategy.

So this wasn't integrated with load_ncbi_taxonomy.pl at all?

Peter

From hlapp at gmx.net  Thu Mar 13 19:41:43 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 13 Mar 2008 19:41:43 -0400
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
Message-ID: <CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>


On Mar 13, 2008, at 7:13 PM, Peter wrote:

> On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>> [...]
>>  The load_ncbi_taxonomy.pl script is designed to update the taxon
>>  tables in a non-disruptive way, and if there weren't many changes
>>  shouldn't actually take that long (except that recalculating the
>>  nested set values may take a couple of minutes).
>
> Do you think when faced with a novel taxon id, Biopython/BioPerl/...
> could write some minimal taxonomy entry (without any guess work based
> on the species name), in order to record the sequence's taxon

This is what Bioperl-db does. There isn't any guesswork. If  
Bio::Species has lineage information it will also insert the lineage  
information, though.

> - and then running an improved load_ncbi_taxonomy.pl at a later  
> date would
> sort out the proper taxonomy?

If I remember correctly, the script makes (and hence expects) the
primary key and the NCBI taxonomy ID to be identical. If your loading
procedure can achieve that already, then load_ncbi_taxonomy.pl should
pick them up and fix them. You can test this as follows: load the
taxonomy through the script, arbitrarily choose a taxon, create a stub
bioentry for it, and set the bioentry's taxon_id foreign key to the
chosen taxon. Then change the taxon's taxon_name.name to some bogus
value (for the 'scientific name' class, for example), and feel free to
change the left_value and right_value columns in taxon too. Finally,
rerun the script. It should fix the change you made, and your bioentry
should still point to the same taxon (because the taxon's primary key
did not change, and the row did not get deleted either; otherwise the
bioentry would now have a null value in the foreign key).
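
The experiment described above can be sketched in Python with an
in-memory SQLite database. This is a heavily simplified stand-in: the
three tables carry only the columns needed here (the real BioSQL
schema has more), and refresh_taxon is a hypothetical helper that
mimics an in-place fix keyed on the primary key, not what
load_ncbi_taxonomy.pl actually executes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE taxon (
    taxon_id      INTEGER PRIMARY KEY,
    ncbi_taxon_id INTEGER UNIQUE
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL REFERENCES taxon(taxon_id),
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL
);
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    taxon_id    INTEGER REFERENCES taxon(taxon_id) ON DELETE SET NULL,
    name        TEXT NOT NULL
);
""")

# Stub taxon whose primary key equals its NCBI taxon id (9606 is
# Homo sapiens), deliberately stored with a bogus scientific name.
conn.execute("INSERT INTO taxon (taxon_id, ncbi_taxon_id) VALUES (9606, 9606)")
conn.execute("INSERT INTO taxon_name VALUES (9606, 'bogus name', 'scientific name')")
conn.execute("INSERT INTO bioentry (taxon_id, name) VALUES (9606, 'stub entry')")


def refresh_taxon(conn, ncbi_taxon_id, scientific_name):
    """Hypothetical refresh step: repair the names, keep the taxon row."""
    conn.execute(
        "INSERT OR IGNORE INTO taxon (taxon_id, ncbi_taxon_id) VALUES (?, ?)",
        (ncbi_taxon_id, ncbi_taxon_id),
    )
    conn.execute(
        "DELETE FROM taxon_name WHERE taxon_id = ? AND name_class = 'scientific name'",
        (ncbi_taxon_id,),
    )
    conn.execute(
        "INSERT INTO taxon_name VALUES (?, ?, 'scientific name')",
        (ncbi_taxon_id, scientific_name),
    )


refresh_taxon(conn, 9606, "Homo sapiens")

# The bioentry still points at the same taxon, which now has the
# corrected name.
row = conn.execute(
    "SELECT b.taxon_id, n.name FROM bioentry b "
    "JOIN taxon_name n ON n.taxon_id = b.taxon_id "
    "WHERE b.name = 'stub entry'"
).fetchone()
names = [r[0] for r in conn.execute("SELECT name FROM taxon_name WHERE taxon_id = 9606")]
```

The point of the sketch is that the taxon row itself is never deleted,
so the foreign key in bioentry never has a reason to become null.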

The Bioperl-db way of storing things does not give control over  
primary key assignment to Bioperl-db, so the database will assign it.

> [...]
>>  For the SymAtlas project we had this situation (new species in
>>  sequence updates that the last NCBI taxonomy update hadn't yet
>>  brought in) quite regularly. I wrote a SQL script would fix those
>>  'haphazard' additions such that load_ncbi_taxonomy would update them
>>  to their correct values come the next NCBI taxonomy update. I can
>>  send you the script (it would be for the Oracle version), but I'm  
>> not
>>  sure this is a widely viable strategy.
>
> So this wasn't integrated with load_ncbi_taxonomy.pl at all?

No, but now that you say it I don't see any reason why I couldn't.  
Maybe that's just what I should do.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From mrphysh at juno.com  Thu Mar 13 21:58:25 2008
From: mrphysh at juno.com (mrphysh at juno.com)
Date: Fri, 14 Mar 2008 01:58:25 GMT
Subject: [BioSQL-l] bioperl basics
Message-ID: <20080313.195825.6855.0@webmail20.vgs.untd.com>

I am a molecular biologist studying bioinformatics from a Perl background and making progress.  I am realizing that without tapping into the existing infrastructure, I will be writing code forever.  Bioperl is the path for me.  I am moving forward.

The error I encounter is

can't locate Cache/FileCache in @INC (@INC contains /etc/perl/ /usr/local/lib/perl/5.8.8 .....)    and so forth.

I found the files in a home directory.  I must have told the install to put them there...?


Anyway: how do I edit this environmental variable..... @INC?  I cannot find anything in my book.

thanks
john brigham


I will be writing code for years and need to tap into the  


From barry.moore at genetics.utah.edu  Thu Mar 13 23:08:19 2008
From: barry.moore at genetics.utah.edu (Barry Moore)
Date: Thu, 13 Mar 2008 21:08:19 -0600
Subject: [BioSQL-l] bioperl basics
In-Reply-To: <20080313.195825.6855.0@webmail20.vgs.untd.com>
References: <20080313.195825.6855.0@webmail20.vgs.untd.com>
Message-ID: <E6BF1E75-E367-4F99-B910-FF8D4C307E86@genetics.utah.edu>

John,

@INC is not an environment variable; it is a Perl variable, and the
directories listed in the environment variable PERL5LIB are added to
it.  You would normally set that environment variable with something
like

  export PERL5LIB="/path/to/perl/libraries:$PERL5LIB"

if you use the bash shell, or

  setenv PERL5LIB "/path/to/perl/libraries:$PERL5LIB"

if you use the C shell.  You'll want to put those lines into the
appropriate startup files so that they get set every time you log in.
This will be different on a Windows system, but I'm afraid I can't
help with that.

If you are having trouble installing bioperl I would encourage you to
read the installation documentation at
http://www.bioperl.org/wiki/Installing_BioPerl.  Beyond that you will
find a wealth of help with your beginning perl questions by searching
the web with Google, asking at perlmonks.org or joining one of the
many perl mailing lists that you can find at http://lists.cpan.org/.

The bioperl mailing list and this mailing list (BioSQL) are devoted  
specifically to discussions directly related to Bioperl and BioSQL  
respectively.  You should search for answers to questions like this  
one first on the web, then on one of the general perl mailing lists  
or web sites mentioned above.  When you have questions (even beginner
ones) that are specific to Bioperl or BioSQL you are welcome to post
to those lists at any time.

Barry




From markjschreiber at gmail.com  Fri Mar 14 09:48:38 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 14 Mar 2008 21:48:38 +0800
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
	<CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
Message-ID: <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>

If I remember correctly, BioJava will add the taxon if it is not
already in there. If the taxid can be found, then the system connects
you with whatever is stored under that taxid; it doesn't overwrite it.

This has two curious side effects. Because the details associated with
a taxid sometimes change (e.g. the common name changes a lot), you can
get connected to an outdated version (if your record is newer than your
NCBI taxonomy), or you can get connected with a version that is newer
than your record, which means that when you round-trip you don't get
complete identity.

For compatibility across the projects some kind of consensus would be good.

- Mark


From cjfields at uiuc.edu  Fri Mar 14 10:31:09 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 14 Mar 2008 09:31:09 -0500
Subject: [BioSQL-l] [Bioperl-l] Loading sequences with novel NCBI taxon
	id
In-Reply-To: <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
	<CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
	<93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
Message-ID: <CE3675B2-2AFD-46AA-A348-16C9FEA51E0E@uiuc.edu>

The counter to that perspective (using new sequences with old tax
info) would be to regularly update the NCBI taxonomy, particularly
before adding new sequences.  Hilmar mentioned that once tax is loaded
it doesn't take as long to update, so you could set up a cron job to
update regularly.

I remember someone mentioning weekly or monthly updates on the list
quite a while ago, but I'm unsure how often NCBI updates tax
information (i.e. with every release, monthly, weekly, etc).  I can
see instances popping up where you used an up-to-date taxonomy but
a new sequence contains a tax ID that is not present.  I think
bioperl-db handles these but I'm not sure what the other Bio* projects
do.

chris


Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign


From markjschreiber at gmail.com  Fri Mar 14 20:56:37 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sat, 15 Mar 2008 08:56:37 +0800
Subject: [BioSQL-l] [Bioperl-l] Loading sequences with novel NCBI taxon
	id
In-Reply-To: <CE3675B2-2AFD-46AA-A348-16C9FEA51E0E@uiuc.edu>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
	<CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
	<93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
	<CE3675B2-2AFD-46AA-A348-16C9FEA51E0E@uiuc.edu>
Message-ID: <93b45ca50803141756m3d7f022cnb57bd39f37270682@mail.gmail.com>

I agree. A regular update would be best.

Of course if your BioSQL db is limited to one or a few organisms you can
just keep a fragment of the db.

- Mark


From biopython at maubp.freeserve.co.uk  Sun Mar 16 15:16:04 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 16 Mar 2008 19:16:04 +0000
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
	<CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
	<93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
Message-ID: <320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com>

On Fri, Mar 14, 2008 Mark Schreiber wrote:
> From memory BioJava will add it if it is not already in there. If the
> taxid can be found then the system connects you with whatever is in
> that taxid, it doesn't overwrite it.

BioPerl does this too, so there is consensus on this at least.  But see
below regarding the lineage.

>  This has two curious side effects. Because the details associated with
>  a taxid sometimes change (eg common name changes a lot) you can get
>  connected to an outdated version (if your record is newer than your
>  NCBI taxonomy) or you can get connected with a version that is newer
>  than your record which means when you round-trip you don't get
>  complete identity.

This is understandable, even if a little unexpected.

I (Peter) wrote:
>  > > Do you think when faced with a novel taxon id, Biopython/BioPerl/...
>  > > could write some minimal taxonomy entry (without any guess work based
>  > > on the species name), in order to record the sequence's taxon

Hilmar Lapp replied:
>  > This is what Bioperl-db does. There isn't any guesswork. If
>  > Bio::Species has lineage information it will also insert the lineage
>  > information, though.

I am planning to fix Biopython so that once again, it will record the
taxon id against new sequences if the species is already in the table,
and add it to the taxonomy if it isn't there already.
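
As a rough illustration of that plan (and not the actual Biopython
code), a lookup-or-insert step could look like the Python sketch
below. It uses an in-memory SQLite database with a cut-down taxon and
taxon_name schema; get_or_create_taxon is a hypothetical helper name,
and the primary key is left to the database to assign.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon (
    taxon_id      INTEGER PRIMARY KEY,   -- assigned by the database
    ncbi_taxon_id INTEGER UNIQUE
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL REFERENCES taxon(taxon_id),
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL
);
""")


def get_or_create_taxon(conn, ncbi_taxon_id, scientific_name):
    """Return the taxon primary key, adding a minimal entry if needed."""
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?",
        (ncbi_taxon_id,),
    ).fetchone()
    if row is not None:
        # Known taxon: reuse it as stored, never overwrite its names.
        return row[0]
    cur = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id) VALUES (?)", (ncbi_taxon_id,)
    )
    taxon_id = cur.lastrowid
    # Minimal stub: one scientific name, no lineage links at all.
    conn.execute(
        "INSERT INTO taxon_name (taxon_id, name, name_class) "
        "VALUES (?, ?, 'scientific name')",
        (taxon_id, scientific_name),
    )
    return taxon_id


first = get_or_create_taxon(conn, 9606, "Homo sapiens")
# A second record with the same taxon id but a different spelling is
# linked to the existing entry; the stored name is left untouched.
second = get_or_create_taxon(conn, 9606, "Homo sapiens (alternative spelling)")
names = [r[0] for r in conn.execute(
    "SELECT name FROM taxon_name WHERE taxon_id = ?", (first,)
)]
```

The second call also shows the round-trip effect Mark described: the
record is connected to whatever the database already holds for that
taxon id, not to the name it arrived with.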

Should we also try and add the lineage into the taxon/taxon_name
tables, linking to existing entries based on matching scientific names
where possible?  Or, should we just add a single taxonomy entry for
the new species, with no lineage links at all?

The old Biopython code also used to add taxon table entries for the
full lineage - trying to reuse existing entries based on string
matching to the scientific name field in the taxon_name table.  This
strikes me as a little unreliable (which is why I used the term "guess
work" in my earlier email).  I am also concerned that this complicates
the clean up operation for load_ncbi_taxonomy.pl, but have not looked
into this.

Hilmar Lapp wrote:
>  > If I remember correctly, the script makes (and hence expects) the
>  > primary key and the NCBI taxonomy ID to be identical.

Really? Perhaps I have misunderstood you.  That would cause problems
if we want to record a new sequence entry with species information but
no NCBI taxonomy ID (e.g. an in-house sequencing project).  The
Biopython code doesn't seem to assume the taxon table ID bears any
resemblance to the NCBI taxonomy ID.  When creating new taxon
table entries, we let the database assign the taxon table id
(primary key).

Peter

From hlapp at gmx.net  Sun Mar 16 18:00:12 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 16 Mar 2008 18:00:12 -0400
Subject: [BioSQL-l] BioSQL + Embl + Comments
In-Reply-To: <1205182450.18769.20.camel@Graco>
References: <1205182450.18769.20.camel@Graco>
Message-ID: <D9EE6AD3-AC97-4ACD-85AE-4C7E87C1A7A9@gmx.net>

Hi Raoul,

On Mar 10, 2008, at 4:54 PM, Raoul Jean Pierre Bonnal wrote:

> Dear Hilmar,
> I'm here for asking you some help.
>
> The BioRuby guys chose, as the example for round trip tests, the
> sequence ID
> AJ224122; SV 3; linear; genomic DNA; STD; PLN; 3827 BP.
>
> I have a problem with the reference/comment information.
> In BioSQL a "comment" seems to be something generic, not directly
> bound to a reference.

Comment in BioSQL is a piece of annotation of type comment. The  
schema at present only allows you to attach those to bioentries, and  
in fact one particular comment can be assigned to only one bioentry  
(1:n relationship).

> If you look at the AJ224122's embl format a comment is
> connected with the reference.

You're referring to the following line, right?

RC   revised by [3]

> There is no problem with GenBank because there is only a generic
> comment, and BioSQL works correctly in this case.
> So, how can I manage the problem with EMBL? I was thinking of adding
> a "comment_id" column to "bioentry_reference" as a fk to the "comment"
> table, so that a bioentry_reference can have more comments.

One question here is whether the comment is specific to the  
association of the reference with the bioentry, or to the reference  
in general.

The next thing to note is that the comment above is not just text, it  
actually establishes a relationship to another reference (or to  
another reference to bioentry association). So to really capture it  
you would want a typed link between bioentry_reference rows (in this  
case the relationship type would be 'revises' or 'revised by',  
depending on direction).

The question is whether this depth of modeling is needed or useful,  
aside from the fact that I'm pretty sure that none of the Bio*  
libraries supports it (but maybe they want to?).

So if not, I guess this goes back to the use-case of round-tripping?  
Maybe to satisfy that a bioentry_reference_qualifier table would  
suffice (assuming that the comment applies to the reference/bioentry  
association rather than directly to the reference).
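A minimal sketch of what such a qualifier table could look like, written in Python against an in-memory SQLite database. The table bioentry_reference_qualifier and its columns are hypothetical (no such table is part of the BioSQL schema discussed here), and bioentry_reference is reduced to a simplified stand-in with only its key columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-in for the bioentry/reference association table.
CREATE TABLE bioentry_reference (
    bioentry_id  INTEGER NOT NULL,
    reference_id INTEGER NOT NULL,
    rank         INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (bioentry_id, reference_id, rank)
);
-- Hypothetical table: free-text qualifiers attached to one association row.
CREATE TABLE bioentry_reference_qualifier (
    bioentry_id  INTEGER NOT NULL,
    reference_id INTEGER NOT NULL,
    rank         INTEGER NOT NULL,
    qualifier    TEXT NOT NULL,
    value        TEXT,
    FOREIGN KEY (bioentry_id, reference_id, rank)
        REFERENCES bioentry_reference (bioentry_id, reference_id, rank)
);
""")

# One reference attached to one bioentry, carrying the EMBL RC line as text.
conn.execute("INSERT INTO bioentry_reference VALUES (?, ?, ?)", (1, 10, 1))
conn.execute(
    "INSERT INTO bioentry_reference_qualifier VALUES (?, ?, ?, ?, ?)",
    (1, 10, 1, "RC", "revised by [3]"),
)

rows = conn.execute(
    "SELECT qualifier, value FROM bioentry_reference_qualifier "
    "WHERE bioentry_id = ? AND reference_id = ?",
    (1, 10),
).fetchall()
print(rows)
```

This only stores the RC text for round-tripping; it does not model the 'revised by' relationship between references as a typed link.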

>
> PS: I don't know if this stuff should be emailed to biosql list

Yes, I actually hadn't realized that you hadn't posted this to the  
list. Should have forwarded right away, sorry for sitting on it.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sun Mar 16 18:54:45 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 16 Mar 2008 18:54:45 -0400
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
	<CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
	<93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
	<320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com>
Message-ID: <9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net>


On Mar 16, 2008, at 3:16 PM, Peter wrote:

> [...] I (Peter) wrote:
>>>> Do you think when faced with a novel taxon id, Biopython/BioPerl/...
>>>> could write some minimal taxonomy entry (without any guess work
>>>> based on the species name), in order to record the sequence's taxon
>
> Hilmar Lapp replied:
>>> This is what Bioperl-db does. There isn't any guesswork. If
>>> Bio::Species has lineage information it will also insert the lineage
>>> information, though.
>
> I am planning to fix Biopython so that once again, it will record the
> taxon id against new sequences if the species is already in the table,
> and add it to the taxonomy if it isn't there already.
>
> Should we also try and add the lineage into the taxon/taxon_name
> tables, linking to existing entries based on matching scientific names
> where possible?  Or, should we just add a single taxonomy entry for
> the new species, with no lineage links at all?

This should probably depend on how good or complete the lineage  
information is that you have. BioPerl parses this out of the sequence  
files (for formats that have it, such as GenBank, EMBL, UniProt), and  
so except for exotic clades that don't follow the typical patterns it  
is usually in good shape (though one might say that the majority of  
clades are exotic).

Moreover, it's worth noting that the NCBI taxonomy often contains  
more nodes in a lineage than are shown in the GenBank record. In this  
case, unless you know which levels (ranks) to print and which not to,  
having the full NCBI taxonomy information may in fact cause problems  
for round-tripping.

>
> The old Biopython code also used to add taxon table entries for the
> full lineage - trying to reuse existing entries based on string
> matching to the scientific name field in the taxon_name table.  This
> strikes me as a little unreliable (which is why I used the term "guess
> work" in my earlier email).

It's pretty unreliable actually. There is not only synonymy but also  
rampant homonymy in taxonomic names. There are plenty of examples for  
the same scientific name in use for a plant and for some animal, for  
example. So in order to be unambiguous you will need to know (and  
check) the kingdom.
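To illustrate the homonymy problem, here is a small Python sketch. It is not code from Biopython or Bioperl-db; the TAXA list, the local taxon_id values and the find_taxon helper are all made up for illustration. The genus name Pieris is a real homonym (a butterfly genus and a plant genus).

```python
# Illustrative records only; taxon_id values are local placeholders,
# not NCBI taxonomy IDs.
TAXA = [
    {"taxon_id": 1, "name": "Pieris", "kingdom": "Metazoa"},
    {"taxon_id": 2, "name": "Pieris", "kingdom": "Viridiplantae"},
]


def find_taxon(name, kingdom=None):
    """Return the single taxon matching name (and kingdom, if given).

    Raises LookupError when the name alone is ambiguous.
    """
    hits = [
        t for t in TAXA
        if t["name"] == name and (kingdom is None or t["kingdom"] == kingdom)
    ]
    if len(hits) > 1:
        raise LookupError(f"{name!r} is ambiguous without a kingdom")
    return hits[0] if hits else None


print(find_taxon("Pieris", kingdom="Viridiplantae"))
```

Matching on the name alone raises here rather than silently picking one of the two entries, which is the failure mode a pure string match would hide.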

> I am also concerned that this complicates the clean up operation  
> for load_ncbi_taxonomy.pl, but have not looked into this.

It shouldn't. The script makes no difference between tip (species or  
subspecies) nodes or internal nodes.

>
> Hilmar Lapp wrote:
>>> If I remember correctly, the script makes (and hence expects) the
>>> primary key and the NCBI taxonomy ID to be identical.
>
> Really? Perhaps I have misunderstood you.  That would cause problems
> if we want to record a new sequence entry with species information but
> no NCBI taxonomy ID (e.g. an in house sequencing project).  The
> Biopython code doesn't seem to assume the taxon table ID bears any
> resemblance to the NCBI taxonomy ID.  When creating new taxon
> table entries, we let the database assign the taxon table id
> (primary key).

Right, that's what I said Bioperl-db does too, and is the reason I  
had to regularly run that SQL script that would migrate the primary  
keys.

Doing that isn't a big deal but I guess this could also be fixed in  
load_ncbi_taxonomy.pl so that it doesn't need to rely on this  
assumption. Would someone mind filing the bug report? (We have a  
BioSQL category now on bugzilla.open-bio.org.)
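As a sketch of the situation being described, the following Python snippet uses an in-memory SQLite table in which the primary key is assigned by the database and the NCBI taxonomy ID is a separate, optional column. The table is a cut-down stand-in for the BioSQL taxon table (only two columns), so treat the details as assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon ("
    " taxon_id INTEGER PRIMARY KEY,"   # surrogate key, assigned by the database
    " ncbi_taxon_id INTEGER UNIQUE"    # optional; NULL for in-house species
    ")"
)

# A species known to NCBI (9606 is Homo sapiens) and one without any NCBI ID.
conn.execute("INSERT INTO taxon (ncbi_taxon_id) VALUES (?)", (9606,))
conn.execute("INSERT INTO taxon (ncbi_taxon_id) VALUES (?)", (None,))

ids = conn.execute(
    "SELECT taxon_id, ncbi_taxon_id FROM taxon ORDER BY taxon_id"
).fetchall()
print(ids)
```

The primary key and the NCBI ID differ for the first row and the second row has no NCBI ID at all, which is exactly the case a loader that assumes the two are identical cannot handle.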

Cheers,

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Mon Mar 17 12:08:43 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Mar 2008 16:08:43 +0000
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
	<CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
	<93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
	<320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com>
	<9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net>
Message-ID: <320fb6e00803170908x76f0b9a3he57f4653d2fd433@mail.gmail.com>

On Sun, Mar 16, 2008 at 10:54 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>  > Should we [Biopython] also try and add the lineage into the taxon/
>  > taxon_name tables, linking to existing entries based on matching scientific
>  > names where possible?  Or, should we just add a single taxonomy entry
>  > for the new species, with no lineage links at all?
>
>  This should probably depend on how good or complete the lineage
>  information is that you have. BioPerl parses this out of the sequence
>  files (for formats that have it, such as GenBank, EMBL, UniProt), and
>  so except for exotic clades that don't follow the typical patterns it
>  is usually in good shape (though one might say that the majority of
>  clades are exotic).

I'm currently testing with GenBank, EMBL and SwissProt/UniProt files.
Some of these files are several years old, and include horrible
multi-species SwissProt files with "species" names longer than 255
characters etc.  The good news is that as you pointed out on another
thread on the BioSQL mailing list earlier this month, they don't seem
to do this anymore.

>  Moreover, it's worth noting that the NCBI taxonomy often contains
>  more nodes in a lineage than are shown in the GenBank record. In this
>  case, unless you know which levels (ranks) to print and which not to,
>  having the full NCBI taxonomy information may in fact cause problems
>  for round-tripping.

I've come to accept that taxonomy information won't always survive a round trip.

>  > The old Biopython code also used to add taxon table entries for the
>  > full lineage - trying to reuse existing entries based on string
>  > matching to the scientific name field in the taxon_name table.  This
>  > strikes me as a little unreliable (which is why I used the term "guess
>  > work" in my earlier email).
>
>  It's pretty unreliable actually. There is not only synonymy but also
>  rampant homonymy in taxonomic names. There are plenty of examples for
>  the same scientific name in use for a plant and for some animal, for
>  example. So in order to be unambiguous you will need to know (and
>  check) the kingdom.

I don't think the current Biopython code for recording the lineages checks the
kingdom... could someone point me at the relevant bit of BioPerl and I'll see
if I can understand exactly what they do?

Hilmar Lapp wrote:
>  If I remember correctly, the script makes (and hence expects) the
>  primary key and the NCBI taxonomy ID to be identical.
>  ...
>  Doing that isn't a big deal but I guess this could also be fixed in
>  load_ncbi_taxonomy.pl so that it doesn't need to rely on this
>  assumption. Would someone mind filing the bug report? (We have a
>  BioSQL category now on bugzilla.open-bio.org.)

I've filed Bug 2470 on this, http://bugzilla.open-bio.org/show_bug.cgi?id=2470

Regards,

Peter

From hlapp at gmx.net  Tue Mar 18 08:30:34 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 18 Mar 2008 08:30:34 -0400
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <320fb6e00803170908x76f0b9a3he57f4653d2fd433@mail.gmail.com>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
	<CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
	<93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
	<320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com>
	<9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net>
	<320fb6e00803170908x76f0b9a3he57f4653d2fd433@mail.gmail.com>
Message-ID: <418EB160-7848-4F1A-A88B-99B00003F8A2@gmx.net>


On Mar 17, 2008, at 12:08 PM, Peter wrote:

>> [...]
>>  It's pretty unreliable actually. There is not only synonymy but also
>>  rampant homonymy in taxonomic names. There are plenty of examples for
>>  the same scientific name in use for a plant and for some animal, for
>>  example. So in order to be unambiguous you will need to know (and
>>  check) the kingdom.
>
> I don't think the current Biopython code for recording the lineages
> checks the kingdom... could someone point me at the relevant bit of
> BioPerl and I'll see if I can understand exactly what they do?

Bioperl-db locates by NCBI taxon id first and then by scientific  
name. It does not take kingdom into account.

You can find the persisted columns, unique key queries etc. in  
Bio/DB/BioSQL and then the respective adapter, in this case  
SpeciesAdapter.pm. The unique key queries are defined in  
get_unique_key_query().
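The lookup order described above (NCBI taxon id first, then scientific name, kingdom ignored) can be sketched as follows. This is Python with SQLite, not the Bioperl-db implementation; find_taxon_id is a hypothetical helper, the two tables are reduced to the columns needed, and falling back to the name when the id is given but not found is an assumption.

```python
import sqlite3


def find_taxon_id(conn, ncbi_taxon_id=None, scientific_name=None):
    """Look up a taxon primary key: by NCBI taxon id first, then by name."""
    if ncbi_taxon_id is not None:
        row = conn.execute(
            "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?",
            (ncbi_taxon_id,),
        ).fetchone()
        if row:
            return row[0]
    if scientific_name is not None:
        # Kingdom is not consulted, so homonyms would not be distinguished.
        row = conn.execute(
            "SELECT taxon_id FROM taxon_name "
            "WHERE name = ? AND name_class = 'scientific name'",
            (scientific_name,),
        ).fetchone()
        if row:
            return row[0]
    return None


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, ncbi_taxon_id INTEGER UNIQUE);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL,
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL
);
INSERT INTO taxon (taxon_id, ncbi_taxon_id) VALUES (1, 9606);
INSERT INTO taxon_name VALUES (1, 'Homo sapiens', 'scientific name');
""")

by_id = find_taxon_id(conn, ncbi_taxon_id=9606)
by_name = find_taxon_id(conn, scientific_name="Homo sapiens")
print(by_id, by_name)
```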

>
> Hilmar Lapp wrote:
>>  If I remember correctly, the script makes (and hence expects) the
>>  primary key and the NCBI taxonomy ID to be identical.
>>  ...
>>  Doing that isn't a big deal but I guess this could also be fixed in
>>  load_ncbi_taxonomy.pl so that it doesn't need to rely on this
>>  assumption. Would someone mind filing the bug report? (We have a
>>  BioSQL category now on bugzilla.open-bio.org.)
>
> I've filed Bug 2470 on this,
> http://bugzilla.open-bio.org/show_bug.cgi?id=2470

Thanks for the help, great, appreciated!

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at nescent.org  Sun Mar  9 19:36:26 2008
From: hlapp at nescent.org (Hilmar Lapp)
Date: Sun, 9 Mar 2008 19:36:26 -0400
Subject: [BioSQL-l] bioperl-db bugs
In-Reply-To: <C46D2101-D4FF-47E9-89BA-0D84114CCB7B@uiuc.edu>
References: <C46D2101-D4FF-47E9-89BA-0D84114CCB7B@uiuc.edu>
Message-ID: <E39A75D2-D93B-493E-BE5F-747BAB91EEFD@nescent.org>

Hi Chris,

I added comments to both bug reports. This belongs to BioPerl,  
though, as it has only to do with its language binding.

The tidbit that may be worth keeping in mind for a general BioSQL audience  
is that bioentry namespace (foreign key to biodatabase) is part of  
the (compound) bioentry unique keys. The identifier column used to be  
unique by itself (and could still be made such in a local instance,  
there's a comment to this effect in the DDL), but that was changed a  
while ago. (Also, if one uses any of the Bio* language bindings,  
changing a unique key constraint to something that differs from what  
the language binding assumes may be asking for a lot of trouble.  
Bioperl-db will expect the combination of primary_id() and  
namespace() to match if the latter is provided.)
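A short Python/SQLite sketch of the compound key just described: the same identifier may appear once per namespace, but not twice within one. The tables are stripped down to the relevant columns and are only an approximation of the BioSQL DDL; the namespace names are borrowed from the script quoted below.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE biodatabase (
    biodatabase_id INTEGER PRIMARY KEY,
    name           TEXT NOT NULL UNIQUE
);
CREATE TABLE bioentry (
    bioentry_id    INTEGER PRIMARY KEY,
    biodatabase_id INTEGER NOT NULL REFERENCES biodatabase (biodatabase_id),
    identifier     TEXT,
    UNIQUE (identifier, biodatabase_id)   -- namespace is part of the key
);
INSERT INTO biodatabase (biodatabase_id, name) VALUES (1, 'space1');
INSERT INTO biodatabase (biodatabase_id, name) VALUES (2, 'space2');
""")

insert = "INSERT INTO bioentry (biodatabase_id, identifier) VALUES (?, ?)"
conn.execute(insert, (1, "5456929"))
conn.execute(insert, (2, "5456929"))   # same identifier, other namespace: fine

try:
    conn.execute(insert, (1, "5456929"))   # same identifier, same namespace
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

count = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]
print(duplicate_rejected, count)
```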

	-hilmar

On Mar 5, 2008, at 6:24 PM, Chris Fields wrote:

> Hilmar,
>
> I think I have two bioperl-db bugs sorted out, but I'm trying to  
> determine whether the solution is a side-effect, a feature, or a  
> bug.  Dmitry has filed two bug reports which are somewhat related:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2280
> http://bugzilla.open-bio.org/show_bug.cgi?id=2281
>
> I have added my comments to it, but maybe you can shed some more  
> light on this.  What he is trying to do is copy a persistent Seq  
> object to a different namespace; load_seqdatabase.pl won't let him  
> do that directly using the same sequence file.  If he changes the  
> namespace() and store()s it using a script, the seq is moved to the  
> new namespace, not updated.
>
> My reasoning is this is a feature (by not changing the primary_key,  
> you don't store a new sequence but update the current one).   
> However, if the primary_key is unset (undef), then it appears you  
> can copy the sequence over (from Dmitry's script, with my addition  
> noted):
>
> ...
> my $ns1 = 'space1';
> my $ns2 = 'space2';
>
> my $seqadp = $db->get_object_adaptor('Bio::SeqI');
> my $aux_seq = Bio::Seq::RichSeq->new(
>     -accession_number => 'NC_005982',
>     -version => 1,
>     -namespace => $ns1);
> my $seq = $seqadp->find_by_unique_key($aux_seq);
>
> # store the found sequence in the second biodatabase:
> my $pseq = $seqadp->create_persistent($ns2);
> $pseq->namespace('bioperl2');
> $pseq->primary_key(undef);  # my addition, which appears to work
> $pseq->store();
> $seqadp->commit;
> ...
>
> My question: is this an intended effect?  The ability to assign  
> undef to primary_key seems intentional based on the method code,  
> but I'm a bit uncertain here.
>
> chris
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

-- 
===========================================================
: Hilmar Lapp  -:- Durham, NC -:- informatics.nescent.org :
===========================================================


From darin.london at duke.edu  Tue Mar 18 14:16:59 2008
From: darin.london at duke.edu (darin.london at duke.edu)
Date: Tue, 18 Mar 2008 13:16:59 -0500
Subject: [BioSQL-l] BOSC 2008 Announcement and Call For Submissions
Message-ID: <200803181816.m2IIGx2k007275@tenero.duhs.duke.edu>


BOSC 2008 Call for Abstracts

The 9th annual Bioinformatics Open Source Conference (BOSC 2008) will take place in Toronto, Ontario, Canada, as one of several Special Interest Group (SIG) meetings occurring in conjunction with the 16th annual Intelligent Systems for Molecular Biology Conference (ISMB 2008).

The Bioinformatics Open Source Conference (BOSC) is sponsored by the Open Bioinformatics Foundation (O|B|F), a non-profit group dedicated to promoting the practice and philosophy of Open Source software development within the biological research community. Many Open Source bioinformatics packages are widely used by the research community across many application areas and form a cornerstone in enabling research in the genomic and post-genomic era. Open source bioinformatics software has facilitated rapid innovation and dissemination of new computational methods as well as informatics infrastructure. Since the work of the Open Source Bioinformatics Community represents some of the most cutting edge of Bioinformatics in general, the overall theme for the conference this year is "Tackling Hard Problems with Emerging Technologies". Topics under this umbrella include cyberinfrastructure, grid computing and workflow management and discovery, and visualization. We will also have a series of update talks about the main Open Source Bioinformatics Software suites.

One of the hallmarks of BOSC is the coming together of the open source developer community in one location. A face-to-face meeting of this community creates synergy where participants can work together to create use cases, prototype working code, or run bootcamps for developers from other projects as short, informal, and hands-on tutorials in new software packages and emerging technologies. In short, BOSC is not just a conference for presentations of completed work, but is a dynamic meeting where collaborative work gets done.

This year, BOSC is accepting abstract submissions on the conference theme "Tackling Hard Problems with Emerging Technologies". The conference theme reflects that there are new technologies emerging on both the scientific front (new sequencing technologies, etc.) and the IT front (workflows, mashup/web 2.0, improvements in all of the major programming languages, etc.), which may allow the open source community to solve problems that were previously intractable. Abstracts may be submitted for the following topics.

1. Cyberinfrastructure - We are interested in presentations on topics dealing with the development of infrastructure on the web to facilitate software and data re-use (mashups, or traditional), interoperability and inter-process communication, system/service discovery, and data movement and modeling in distributed systems. This may include peer-to-peer systems of data transfer, Web Services, various flavors of data representation (SOAP, JSON, XML, others), and technologies commonly referred to under the Web 2.0 paradigm (e.g. folksonomies/tagging, user-based content generation, content feeds, and Social Networking).

2. Grid Computing and Workflow Management and Discovery - We particularly invite talks that report progress in making workflow systems easier to use and on how to do distributed-collaborative research, e.g. workflows that encompass the coordination of systems running in different parts of the world.

3. Visualization - Visualization is a maturing area of open source software development. We particularly invite talks that demonstrate innovative visualization systems in the context of workflows.

4. Open Source Software - Speakers will present talks on the use, development, or philosophy of open source software in bioinformatics.

5. Bio* Open Source Project Updates - We invite abstracts from the representatives of the open source projects sponsored by or affiliated to the O|B|F (see Projects).


Please consult the official BOSC 2008 website at http://www.open-bio.org/wiki/Upcoming_BOSC_conference  for all updates and extra information.

Submission Process:
All abstracts must be submitted through our Open Conference Systems site (http://events.open-bio.org/BOSC2008/openconf.php).
The form will ask for a small Abstract Text to be pasted into it, and a full paper.  The small Abstract text should be a summary, while the longer abstract should provide more details, including the open-source license requirement details.
Full-length abstracts are limited to one page with one inch (2.5 cm) margins on the top, sides, and bottom.  The full-length abstract should include the title, authors, and affiliations.  We prefer your abstract to be in PDF format, although plain t

Important Dates:
May 11: Abstract submission deadline.
June 2: Notification of accepted talks.
June 4: Early registration discount cut-off.
July 18-19: BOSC 2008!

We hope to see you at BOSC 2008!

Kam Dahlquist and Darin London
BOSC 2008 Co-organizers

			 
From er at xs4all.nl  Thu Mar 20 15:24:12 2008
From: er at xs4all.nl (Erik)
Date: Thu, 20 Mar 2008 20:24:12 +0100 (CET)
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
Message-ID: <5095.156.83.1.251.1206041052.squirrel@webmail.xs4all.nl>

Hi,

(latest BioSQL, bioperl-db, and bioperl-live installed.)

Postgres 8.3 will not auto-cast text (='character
varying') to integer any longer, which causes test
t/16odba.t to fail:


------------- EXCEPTION: Bio::Root::Exception -------------
MSG: error while executing query in
Bio::DB::BioSQL::SeqAdaptor::find_by_query: ERROR: 
operator does not exist: character varying = integer
LINE 1: ...eq.taxon_id FROM bioentry seq WHERE
seq.identifier = 5456929

It seems likely to cause many similar statements to fail;
how should this be solved?

I tried to fix it but I couldn't find the place where the
statement/clauses are put together.


Thanks,

Erik Rijkers


From hlapp at gmx.net  Thu Mar 20 18:49:41 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 20 Mar 2008 18:49:41 -0400
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
In-Reply-To: <5095.156.83.1.251.1206041052.squirrel@webmail.xs4all.nl>
References: <5095.156.83.1.251.1206041052.squirrel@webmail.xs4all.nl>
Message-ID: <0F80B40B-0232-4367-8433-992588B6E71B@gmx.net>

Hi Erik, thanks for the report. Given the error message, it looks  
more like the integer (which in reality is a string) can't be  
automatically converted to a string.

That would be equally interesting, though. DBI I thought used to bind  
all parameters as string by default, but maybe that has changed?

The parameter values are indeed all bound generically (and the query  
is created dynamically too), and I'm leaving it up to the DBD drivers  
to do the "Right Thing". I could obviously force everything into type  
string, but that is likely to have its own repercussions on various  
RDBMSs.

So could you file this as a bug report on bugzilla.open-bio.org  
(category bioperl-db, this is actually not a BioSQL problem), and run  
the following test on your 8.3 instance (which minor version actually?):

CREATE TABLE t1 (a varchar(10), b text, c integer);

SELECT * from t1 WHERE a = 1;
SELECT * from t1 WHERE b = 1;
SELECT * from t1 WHERE c = '1';

INSERT INTO t1 (a,b,c) VALUES ('a','b',1);

SELECT * from t1 WHERE a = 1;
SELECT * from t1 WHERE b = 1;
SELECT * from t1 WHERE c = '1';

SELECT * from t1 WHERE a = 1::text;
SELECT * from t1 WHERE b = 1::text;
SELECT * from t1 WHERE c = integer '1';

DROP TABLE t1;

These work all fine on my 8.1.4 instance.

	-hilmar

On Mar 20, 2008, at 3:24 PM, Erik wrote:
> Hi,
>
> (latest BioSQL, bioperl-db, and bioperl-live installed.)
>
> Postgres 8.3 will not auto-cast text (='character
> varying') to integer any longer, which causes test
> t/16odba.t to fail:
>
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: error while executing query in
> Bio::DB::BioSQL::SeqAdaptor::find_by_query: ERROR:
> operator does not exist: character varying = integer
> LINE 1: ...eq.taxon_id FROM bioentry seq WHERE
> seq.identifier = 5456929
>
> It seems likely to cause many similar statements to fail;
> how should this be solved?
>
> I tried to fix it but I couldn't find the place where the
> statement/clauses are put together.
>
>
> Thanks,
>
> Erik Rijkers

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From er at xs4all.nl  Thu Mar 20 19:30:03 2008
From: er at xs4all.nl (Erik)
Date: Fri, 21 Mar 2008 00:30:03 +0100 (CET)
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
Message-ID: <15786.156.83.1.157.1206055803.squirrel@webmail.xs4all.nl>

On Thu, March 20, 2008 23:49, Hilmar Lapp wrote:
> Hi Erik, thanks for the report. Given the error message,
> it looks
> more like the integer (which in reality is a string) can't
> be automatically converted to a string.

you are right, of course :)


Here is the postgres 8.3.1 result of your sql statements:

CREATE TABLE t1 (a varchar(10), b text, c integer);

SELECT * from t1 WHERE a = 1;   -- fails in 8.3.1
SELECT * from t1 WHERE b = 1;	  -- fails in 8.3.1
SELECT * from t1 WHERE c = '1'; -- ok

INSERT INTO t1 (a,b,c) VALUES ('a','b',1);

SELECT * from t1 WHERE a = 1;	  -- fails in 8.3.1
SELECT * from t1 WHERE b = 1;	  -- fails in 8.3.1
SELECT * from t1 WHERE c = '1'; -- ok

SELECT * from t1 WHERE a = 1::text;     -- ok
SELECT * from t1 WHERE b = 1::text;     -- ok
SELECT * from t1 WHERE c = integer '1'; -- ok

The failure is always (virtually) the same:
ERROR:  operator does not exist: character varying = integer
LINE 1: SELECT * from t1 WHERE a = 1;
                                 ^
HINT:  No operator matches the given name and argument
type(s). You might need to add explicit type casts.


Then there is the cast function: for instance, I can let
the test in t/16odba.t proceed faultlessly with

 $seq = $biodb->get_Seq_by_id( "cast(5456929 as text)" );


I am also doubtful/curious as to how this would affect the
various loading scripts which I was going to use - I want
to set up a GBrowse with human/mouse/flybase sequence
annotation to show ChipSeq data against.

But one thing at a time, I guess...


> So could you file this as a bug report on
> bugzilla.open-bio.org
> (category bioperl-db, this is actually not a BioSQL
> problem),

I'll make an entry in bugzilla/bioperl-db.


Thanks for your quick reply!


Erik Rijkers


From hlapp at gmx.net  Thu Mar 20 20:34:42 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 20 Mar 2008 20:34:42 -0400
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
In-Reply-To: <15786.156.83.1.157.1206055803.squirrel@webmail.xs4all.nl>
References: <15786.156.83.1.157.1206055803.squirrel@webmail.xs4all.nl>
Message-ID: <987C9C0E-840B-44AD-B3E9-0FC2809FF4F4@gmx.net>


On Mar 20, 2008, at 7:30 PM, Erik wrote:
> Here is the postgres 8.3.1 result of your sql statements:
>
> CREATE TABLE t1 (a varchar(10), b text, c integer);
>
> SELECT * from t1 WHERE a = 1;   -- fails in 8.3.1
> SELECT * from t1 WHERE b = 1;	  -- fails in 8.3.1
> SELECT * from t1 WHERE c = '1'; -- ok
>
> [...]
> The failure is always (virtually) the same:
> ERROR:  operator does not exist: character varying = integer
> LINE 1: SELECT * from t1 WHERE a = 1;
>                                  ^
> HINT:  No operator matches the given name and argument
> type(s). You might need to add explicit type casts.


So it's indeed the backend that changed behavior. It's actually  
documented as I see now:

http://www.postgresql.org/docs/8.3/static/release-8-3.html

scroll to section E.2.2. Migration to Version 8.3, E.2.2.1. General,  
and the first item there:

<quote>
Non-character data types are no longer automatically cast to TEXT  
(Peter, Tom)

Previously, if a non-character value was supplied to an operator or  
function that requires text input, it was automatically cast to text,  
for most (though not all) built-in data types. This no longer  
happens: an explicit cast to text is now required for all  
non-character-string types.
</quote>

I can see the arguments there but this will prevent upgrading to 8.3  
for many many applications, and the comments from the Pg developers  
('fix your SQL to use casts') that I've seen there on the mailing  
lists are just not helpful. Fixing SQL for many legacy  
applications is just not an option.

In the case of Bioperl-db it's very non-trivial, because all of a  
sudden we would be changing from a hands-off and  
let-the-driver-figure-it-out approach to forcing types everywhere.

So I think at this point with this change I have to declare  
Bioperl-db officially incompatible with PostgreSQL 8.3+ until we've found a  
solution to this, which is too bad because it seems 8.3 has some  
really nice performance features added.

One possible solution might be to create a CAST in the database  
(namely the one that was taken away, restoring behavior to pre-8.3).  
Another possibility is to move the parameter binding method into the  
driver adaptor which would then delegate to the DBI method but would  
be overridden for the PostgreSQL adapter to force all bindings to  
type string.
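The second option, forcing every bound value to a string before it reaches the driver, amounts to something like the following. This is a Python illustration of the idea only; bind_as_text is a hypothetical helper and does not correspond to any DBI, DBD::Pg or Bioperl-db call.

```python
def bind_as_text(params):
    """Return the parameters with every non-NULL value converted to a string.

    None is kept as None so that it is still bound as SQL NULL.
    """
    return tuple(None if p is None else str(p) for p in params)


# The identifier from the failing test would then be bound as '5456929'.
print(bind_as_text([5456929, "NC_005982", None]))
```

Per the test results earlier in this thread, PostgreSQL 8.3.1 still accepts a quoted literal compared against an integer column (c = '1'), which is what makes string-only binding a candidate workaround.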

Which leads me back to the surprise observation that the parameter  
was bound as an integer in the first place, when DBD::Pg used to bind  
everything as string unless you told it otherwise. Which DBD::Pg  
version is it that you are using? I would suspect (or hope) that  
maybe there is soon an update release of DBD::Pg that fixes this  
problem by going back to binding everything as string by default (and  
as the tests show PostgreSQL will still convert strings to integer if  
necessary).

Depending on what I (or can someone else update us on this?) find out  
for the DBD::Pg plans, I'll probably start looking into moving the  
parameter binding into the driver adapters. Though it does feel  
pathetic that this is now also not transparent between drivers.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From er at xs4all.nl  Thu Mar 20 20:51:43 2008
From: er at xs4all.nl (Erik)
Date: Fri, 21 Mar 2008 01:51:43 +0100 (CET)
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
Message-ID: <4483.156.83.1.157.1206060703.squirrel@webmail.xs4all.nl>

On Fri, March 21, 2008 01:34, Hilmar Lapp wrote:
>
> So I think at this point with this change I have to declare
> Bioperl-db officially incompatible with PostgreSQL 8.3+ until
> we've found a solution to this, which is too bad because it
> seems 8.3 has some really nice performance features added.

Pg 8.3 is indeed very noticeably faster, and it has other
excellent new features like full text indexing. (This also
means that downgrading is not really an option)


> Which DBD::Pg version is it that you are using?

DBD::Pg 2.3.0


Thanks,

Erik Rijkers


From hlapp at gmx.net  Thu Mar 20 21:36:50 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 20 Mar 2008 21:36:50 -0400
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
In-Reply-To: <4483.156.83.1.157.1206060703.squirrel@webmail.xs4all.nl>
References: <4483.156.83.1.157.1206060703.squirrel@webmail.xs4all.nl>
Message-ID: <071CB899-AB3E-40B8-9477-82AE98DB88B1@gmx.net>


On Mar 20, 2008, at 8:51 PM, Erik wrote:
> On Fri, March 21, 2008 01:34, Hilmar Lapp wrote:
>>
>> So I think at this point with this change I have to declare  
>> Bioperl-db officially incompatible with PostgreSQL 8.3+ until  
>> we've found a solution to this, which is too bad because it seems  
>> 8.3 has some really nice performance features added.
>
> Pg 8.3 is indeed very noticeably faster, and it has other
> excellent new features like full text indexing. (This also
> means that downgrading is not really an option)

Right, I saw that too. It is, however, just migrated from what was a  
contrib module before, so downgrading and using the contrib module is  
an option.

Furthermore, folding these new features together with a behavior  
change that is backwards incompatible was a choice the PostgreSQL  
people made, not we.

We also aren't doing poor typing that deserves fixing; we're just not  
doing any typing by treating everything as a string. This is the Perl  
paradigm.

At this point it's actually unclear to me how this new behavior is  
compatible with untyped scripting languages unless you know the type  
of each column that you're binding a value for, because if you  
actually force typecasts to string for everything you get an error if  
an integer is indeed what's needed.

I'm wondering what I'm missing.

	-hilmar

BTW what does the following query yield on your 8.3.1 database:

select s.typname as source, t.typname as target, f.proname as
function, c.castcontext from pg_cast c, pg_type s, pg_type t, pg_proc
f where c.castsource = s.oid and c.casttarget = t.oid and c.castfunc
= f.oid and t.typname = 'text';

On my 8.1.4 database I get:

   source    | target | function | castcontext
-------------+--------+----------+-------------
  bpchar      | text   | text     | i
  char        | text   | text     | i
  name        | text   | text     | i
  int8        | text   | text     | i
  int2        | text   | text     | i
  int4        | text   | text     | i
  oid         | text   | text     | i
  float4      | text   | text     | i
  float8      | text   | text     | i
  macaddr     | text   | text     | e
  cidr        | text   | text     | e
  inet        | text   | text     | e
  date        | text   | text     | i
  time        | text   | text     | i
  timestamp   | text   | text     | i
  timestamptz | text   | text     | i
  interval    | text   | text     | i
  timetz      | text   | text     | i
  numeric     | text   | text     | i
(19 rows)

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From greg at turnstep.com  Thu Mar 20 22:41:10 2008
From: greg at turnstep.com (Greg Sabino Mullane)
Date: Fri, 21 Mar 2008 02:41:10 -0000
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
In-Reply-To: <987C9C0E-840B-44AD-B3E9-0FC2809FF4F4@gmx.net>
Message-ID: <19ecb7a297f64722c4f63f10ed2ebdce@biglumber.com>


-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160


> Which leads me back to the surprise observation that the parameter
> was bound as an integer in the first place, when DBD::Pg used to bind
> everything as string unless you told it otherwise. Which DBD::Pg
> version is it that you are using? I would suspect (or hope) that
> maybe there is soon an update release of DBD::Pg that fixes this
> problem by going back to binding everything as string by default (and
> as the tests show PostgreSQL will still convert strings to integer if
> necessary).
>
> Depending on what I (or can someone else update us on this?) find out
> for the DBD::Pg plans, I'll probably start looking into moving the
> parameter binding into the driver adapters. Though it does feel
> pathetic that this is now also not transparent between drivers.

What you are probably looking for is already there, namely:

$dbh->{pg_server_prepare} = 0;

There are good reasons for the casting enforcement in 8.3, although I've
been a sharp critic of the change, and certainly of the suddenness
of it. Another solution to consider is adding the casts back in:

http://people.planetpostgresql.org/peter/index.php?/archives/2008/03.html
(the March 4th entry)

- --
Greg Sabino Mullane greg at turnstep.com
PGP Key: 0x14964AC8 200803202237
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAkfjIBYACgkQvJuQZxSWSsiamwCdEbNrC4F4oU7AGHrbHAm1YNXG
HbUAoIRJtGW4brvMKklxZYG6pusbcTqf
=Zawx
-----END PGP SIGNATURE-----


From hlapp at gmx.net  Fri Mar 21 08:52:39 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 21 Mar 2008 08:52:39 -0400
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
In-Reply-To: <19ecb7a297f64722c4f63f10ed2ebdce@biglumber.com>
References: <19ecb7a297f64722c4f63f10ed2ebdce@biglumber.com>
Message-ID: <C24DE5CA-F433-48A1-BF08-A6D056A2EBCE@gmx.net>

Hi Greg - thanks for your email, it's very helpful.

On Mar 20, 2008, at 10:41 PM, Greg Sabino Mullane wrote:
>>
>> Depending on what I (or can someone else update us on this?) find out
>> for the DBD::Pg plans, I'll probably start looking into moving the
>> parameter binding into the driver adapters. Though it does feel
>> pathetic that this is now also not transparent between drivers.
>
> What you are probably looking for is already there, namely:
>
> $dbh->{pg_server_prepare} = 0;

So disabling server-side prepares will leave values quoted? Having  
server-side prepares would be very useful though, especially for  
Bioperl-db with its many lookup queries that all use similar  
parameter values.

>
> There are good reasons for the casting enforcement in 8.3

I do understand that, but it's also a sharp contrast to other RDBMSs
that doesn't make it easier for people to choose Pg when they
should, and doesn't help writing cross-platform database applications
either.

> although I've been a sharp critic of the change, and certainly of
> the suddenness
> of it. Another solution to consider is adding the casts back in:
>
> http://people.planetpostgresql.org/peter/index.php?/archives/2008/03.html
> (the March 4th entry)


Thanks for this, that helps a lot.

Do you have links to some of the key threads showing what rationale  
went into the decision? (Or should I just search for your name?) I'd  
like to read up on that first before pouring more oil into the fire.  
I suspect that many of those who made the decision are never faced  
with needing to write cross-RDBMS code.

Also, I wonder why this wasn't made a configurable option so it can  
be disabled by a simple config file change (such as the move away  
from automatic OID columns). But obviously this is the wrong list for  
discussing this (though Bioperl-db *is* one of those pieces of  
software that must be cross-RDBMS).

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From er at xs4all.nl  Fri Mar 21 17:43:47 2008
From: er at xs4all.nl (Erik)
Date: Fri, 21 Mar 2008 22:43:47 +0100 (CET)
Subject: [BioSQL-l] [Bioperl-l] postgres 8.3 - load_seqdatabase.pl /
 swissprot
Message-ID: <16589.156.83.1.157.1206135827.squirrel@webmail.xs4all.nl>

Hi,

PostgreSQL 8.3.1
DBD::Pg 2.3.0
perl 5.8.8

(The following error may have to do with the 8.3 problems
that I reported yesterday (bug 2472) - I don't know)

 I ran biosql-schema/scripts/load_ncbi_taxonomy.pl without
problem.

Then I ran scripts/biosql/load_seqdatabase.pl as:

perl scripts/biosql/load_seqdatabase.pl \
  -driver Pg \
  -dbuser xxxxxxx \
  -dbname bioseqdb \
  -namespace swissprot \
  -format swiss \
   /DATA/ms/ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat

It took two hours to load 26504 records (7%) of
uniprot_sprot.dat (is it expected to be so slow?), then
failed with:

Could not store Q2UXW0:
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: create: object (Bio::Species) failed to insert or to
be found by unique key
STACK: Error::throw
STACK: Bio::Root::Root::throw
/home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/Root/Root.pm:357
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create
/home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK: Bio::DB::Persistent::PersistentObject::create
/home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:244
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create
/home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:169
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store
/home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK: Bio::DB::Persistent::PersistentObject::store
/home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271
STACK: scripts/biosql/load_seqdatabase.pl:630
-----------------------------------------------------------


I don't know if this is directly related to the 8.3
casting problems I reported yesterday (bug 2472), or a
separate Bio::Species issue


regards,

Erik Rijkers


From hlapp at gmx.net  Sat Mar 22 14:18:45 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 22 Mar 2008 14:18:45 -0400
Subject: [BioSQL-l] Call for Student Applications - NESCent participates in
	the Google Summer of Code
In-Reply-To: <0025B440-EF1E-4632-9DB4-B98489BF3550@duke.edu>
Message-ID: <5AC4F213-8D88-41C6-B380-59B2EF7831F0@gmx.net>

Hi all - just wanted to draw your attention to our Google Summer of  
Code participation this year. One of the projects deals directly with  
BioPerl, another one builds on BioSQL (and could be implemented  
taking advantage of BioPerl or Bio::Phylo, or Biojava).

Cheers,

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

Phyloinformatics Summer of Code 2008
http://phyloinformatics.net/Phyloinformatics_Summer_of_Code_2008

*** Please disseminate this announcement widely to appropriate students
at your institution ***

The National Evolutionary Synthesis Center (NESCent:
http://www.nescent.org/) is participating in 2008 for the second year as a
mentoring organization in the Google Summer of Code
(http://code.google.com/soc). Through this program, Google provides
undergraduate, masters, and PhD students with a unique opportunity to  
obtain hands-on experience writing and extending open-source software  
under the mentorship of experienced developers from around the world.

Our goal in participating is to train future researchers and  
developers to not only have awareness and understanding of the value  
of open-source and collaboratively developed software, but also to  
gain the programming and remote collaboration skills needed to  
successfully contribute to such projects. Students will receive a  
stipend from Google, and may work from their home, or home  
institution, for the duration of the 3 month program. Students will  
each have one or more dedicated mentors with expertise in  
phylogenetic methods and open-source software development.

NESCent is particularly targeting students interested in both  
evolutionary biology and software development. Project ideas (see URL  
below) range from visualizing phylogenetic data in R, to development  
of a Mesquite module, web-services for phylogenetic data providers or  
geophylogeny mashups, implementing phyloXML support, navigating  
databases of networks, topology queries for PhyloCode registries, to  
phylogenetic tree mining in a MapReduce framework, and more.

The project ideas are flexible and many can be adjusted in scope to  
match the skills of the student. If the program sounds interesting to  
you but you are unsure whether you have the necessary skills, please  
email the mentors at the address below.  We will work with you to  
find a project that fits your interests and skills.

INQUIRIES:
Email any questions, including self-proposed project ideas, to  
phylosoc {at}
nescent {dot} org.

TO APPLY:
Apply on-line at the Google Summer of Code website
(http://code.google.com/soc/2008), where you will also find GSoC program
rules and eligibility requirements.  The 1-week application period for
students opens on Monday March 24th and runs through Monday, March  
31st, 2008.

Hilmar Lapp and Todd Vision
US National Evolutionary Synthesis Center

=====
URLs:
=====

2008 NESCent Phyloinformatics Summer of Code:
http://phyloinformatics.net/Phyloinformatics_Summer_of_Code_2008

Eligibility requirements:
http://code.google.com/opensource/gsoc/2008/faqs.html#0.1_eligibility

Stipends:
http://code.google.com/opensource/gsoc/2008/faqs.html#0.1_administrivia

To sign up for quarterly NESCent newsletters with announcements about
upcoming programs at the Center:
http://www.nescent.org/about/contact.php


From hlapp at gmx.net  Sat Mar 22 16:01:51 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 22 Mar 2008 16:01:51 -0400
Subject: [BioSQL-l] [Bioperl-l] postgres 8.3 - load_seqdatabase.pl /
	swissprot
In-Reply-To: <16589.156.83.1.157.1206135827.squirrel@webmail.xs4all.nl>
References: <16589.156.83.1.157.1206135827.squirrel@webmail.xs4all.nl>
Message-ID: <69D3EA33-810B-40EA-8687-752FA1A34FBF@gmx.net>

Forgot to respond to this:

On Mar 21, 2008, at 5:43 PM, Erik wrote:
> It took two hours to load 26504 records (7%) of uniprot_sprot.dat  
> (is it expected to be so slow?)


The last time I used to load those regularly it was a bit faster (~ 5  
seqs/s) but it is in a ballpark that wouldn't raise a red flag for me.
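
For comparison, the figures from Erik's report can be turned into a
rate with a few lines of arithmetic. This is only a rough sketch: the
"7%" is Erik's approximate figure, so the estimated total and the
projected load times are ballpark values, and the variable names are
made up for illustration.

```python
# Figures taken from Erik's report; "7%" is approximate.
records_loaded = 26504
elapsed_seconds = 2 * 60 * 60
fraction_done = 0.07

# Observed throughput in records per second.
observed_rate = records_loaded / elapsed_seconds

# Rough size of the full file, extrapolated from the 7% figure.
estimated_total = records_loaded / fraction_done

# Projected time for a full load at the observed rate and at ~5 seqs/s.
hours_at_observed = estimated_total / observed_rate / 3600
hours_at_5_per_s = estimated_total / 5 / 3600

print(f"observed rate: {observed_rate:.2f} records/s")
print(f"estimated total: {estimated_total:.0f} records")
print(f"full load at observed rate: {hours_at_observed:.1f} h")
print(f"full load at 5 records/s: {hours_at_5_per_s:.1f} h")
```

So the observed rate is a little under 4 records/s, which is in the
same ballpark as ~5 seqs/s; either way a full initial load would take
on the order of a day.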

BTW you can make it print statistics using the --logchunk N option,  
where N is the number of seqs after which you want the current count  
and the #recs/s printed.

You may get it to be faster if you tune the database (e.g., make sure  
there is enough memory for index reorganization, transaction log and  
tablespace datafile are on separate disks, etc; fiddling with the  
query optimizer has probably little effect as almost all queries are  
simple lookups or inserts).

That all said, the strength of load_seqdatabase.pl isn't speed. It  
doesn't make use of any bulk upload optimizations, and therefore the  
initial load of a very large database will take its time. The power  
is more in subsequent updates where you can configure what you want  
to happen, and during which the database is never in an inconsistent  
state, so it can run in the background.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From greg at turnstep.com  Sun Mar 23 20:42:36 2008
From: greg at turnstep.com (Greg Sabino Mullane)
Date: Mon, 24 Mar 2008 00:42:36 -0000
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
In-Reply-To: <C24DE5CA-F433-48A1-BF08-A6D056A2EBCE@gmx.net>
Message-ID: <4ab14dcc59d7566b55ba87027055e9fd@biglumber.com>


-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160


>> Depending on what I (or can someone else update us on this?) find out
>> for the DBD::Pg plans, I'll probably start looking into moving the
>> parameter binding into the driver adapters. Though it does feel
>> pathetic that this is now also not transparent between drivers.
>
> What you are probably looking for is already there, namely:
>
> $dbh->{pg_server_prepare} = 0;

> So disabling server-side prepares will leave values quoted? Having
> server-side prepares would be very useful though, especially for
> Bioperl-db with its many lookup queries that all use similar
> parameter values.

Yes, it forces DBD::Pg to do the quoting itself, which basically means
that everything is shipped to the server as a single SQL string, and
no placeholders are used. In the grand scheme of things, the speed
difference is not large for most queries. Certainly one way would be
to turn this on for 8.3 and above, and slowly migrate the queries/schema
over time.
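
As a toy illustration of what "a single SQL string, no placeholders"
means, here is a small Python sketch. It is not DBD::Pg's code; the
quote_literal and interpolate helpers are hypothetical, and the naive
split on '?' would break on a question mark inside a literal. The
point is only that every value arrives as a quoted literal such as
'42', whose type the server can then work out from context.

```python
def quote_literal(value):
    """Toy quoting: render any value as a single-quoted SQL literal."""
    return "'" + str(value).replace("'", "''") + "'"


def interpolate(sql, params):
    """Replace each '?' placeholder, left to right, with a quoted literal."""
    parts = sql.split("?")
    if len(parts) - 1 != len(params):
        raise ValueError("placeholder/parameter count mismatch")
    out = [parts[0]]
    for value, tail in zip(params, parts[1:]):
        out.append(quote_literal(value))
        out.append(tail)
    return "".join(out)


# The values are plain strings, as they would be coming from Perl.
statement = interpolate(
    "SELECT name FROM taxon_name WHERE taxon_id = ? AND name_class = ?",
    ["42", "scientific name"],
)
print(statement)
```

With server-side prepares the statement text and the values travel
separately instead, which is where the declared parameter type starts
to matter.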

>> There are good reasons for the casting enforcement in 8.3

> I do understand that, but it's also a sharp contrast to other RDBMSs
> that doesn't make it easier for people to choose Pg when they
> should, and doesn't help writing cross-platform database applications
> either.

I'm not overly familiar with how other databases treat this, but I've
heard DB2 can be a stickler about this too. I've not dug into the bioperl
code in a while, to be honest, so I'm not sure what sort of queries we're
talking about. Certainly long-term the code and schema should move away
from implicit casting. Maybe a better short-term solution is adding
the more obvious casts (e.g. text<->int) back in.

> Do you have links to some of the key threads showing what rationale
> went into the decision? (Or should I just search for your name?) I'd
> like to read up on that first before pouring more oil into the fire.
> I suspect that many of those who made the decision are never faced
> with needing to write cross-RDBMS code.
>
> Also, I wonder why this wasn't made a configurable option so it can
> be disabled by a simple config file change (such as the move away
> from automatic OID columns). But obviously this is the wrong list for
> discussing this (though Bioperl-db *is* one of those pieces of
> software that must be cross-RDBMS).

I did ask about that, and was told it would not have been easy to do so.
But I agree, a phasing in period (heck, even a warning) would have been
nice. Feel free to pour some oil on the fire, I think this is one of
many apps that has been affected. (I've run across two other major
cross-DB apps (Interchange and MediaWiki) that are struggling with the
same pain. I managed to painfully fix the latter, but the former is way
too complex to tackle at the moment).

I could not find the thread(s?) I weighed in on, but you can find some
relevant discussions by googling "strict-typing benefits grokbase"

- --
Greg Sabino Mullane greg at turnstep.com
PGP Key: 0x14964AC8 200803232039
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAkfm+NAACgkQvJuQZxSWSsi4ogCdGNWvCJIzXxb+YKzdm6wwxQMv
p3AAnizkWXoo/rvxv4KVdC8tD0vF87k3
=dNYi
-----END PGP SIGNATURE-----


From biopython at maubp.freeserve.co.uk  Tue Mar 25 11:56:16 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 25 Mar 2008 15:56:16 +0000
Subject: [BioSQL-l] [BioPython] Concerns the update of BioSQL.taxon table
In-Reply-To: <320fb6e00803250853i629e59aj310ddc5667ea57d@mail.gmail.com>
References: <711039.40736.qm@web26505.mail.ukl.yahoo.com>
	<320fb6e00803250853i629e59aj310ddc5667ea57d@mail.gmail.com>
Message-ID: <320fb6e00803250856n1001d74dxeb8560652f594e51@mail.gmail.com>

On Tue, Mar 25, 2008 at 3:53 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi Eric,
>
>  Your issue is almost certainly due to switching from Biopython 1.44 to
>  1.45, rather than from a prerelease BioSQL to the recently released
>  BioSQL 1.0.0.
>
>  For background, you should read Bug 2422 and the BioSQL thread it points to.
>  http://bugzilla.open-bio.org/show_bug.cgi?id=2422
>
>  Biopython 1.44 never recorded the taxon id (and therefore didn't use
>  the taxon/taxon_name tables)
>  Biopython 1.45 does record the taxon id, and attempts to fill in
>  missing taxon/taxon_name entries
>
>  I'm a little unclear on what is going wrong for you.  Did you pre-load
>  the NCBI taxonomy for example?  The script you are talking about, is
>  this your own?
>
>  Peter
>

P.S. Did you mean to send your original message to the BioSQL list as well Eric?

You need biosql-l at lists.open-bio.org not biosql at lists.open-bio.org

Peter

From ericgibert at yahoo.fr  Wed Mar 26 07:29:24 2008
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Wed, 26 Mar 2008 11:29:24 +0000 (GMT)
Subject: [BioSQL-l] Concerns the update of BioSQL.taxon table
Message-ID: <290936.61510.qm@web26510.mail.ukl.yahoo.com>

Thank you Peter for the correct email of the BioSQL list.

No, it is not linked to the BioPython 1.45 upgrade: the behavior is the same as with 1.44. My problem is linked to the fact that the BioSQL schema version 1.0.0 defines a *unique* index on taxon.ncbi_taxon_id. I did not have this index before.

I have written a script that connects to the NCBI taxonomy database and gets the XML data for the species. Then it updates the taxon table, replacing the NULL ncbi_taxon_id and node_rank with their values for the whole lineage. I call it after loading the BioSeqs into the database.

Example:
I load a BioSeq for Nannophya pygmaea then I run my script to update the  ncbi_taxon_id and rank:
+----------+---------------+-----------------+--------------+
| taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
+----------+---------------+-----------------+--------------+
|       13 |          2759 |            NULL | superkingdom |
|       14 |         33208 |              13 | kingdom      |
|       15 |          6656 |              14 | phylum       |
|       16 |          6960 |              15 | superclass   |
|       17 |         50557 |              16 | class        |
|       18 |          7496 |              17 | no rank      |
|       19 |         33339 |              18 | subclass     |
|       20 |          6961 |              19 | order        |
|       21 |          6962 |              20 | suborder     |
|       22 |          6964 |              21 | family       |
|       23 |        229390 |              22 | genus        |
|       24 |        229391 |              23 | species      |

No problem.

Now I insert/load another Libellulidae (Orthetrum sabina): 'empty/NULL' taxon records are inserted by the db.load() BioPython function:
|       25 |          NULL |            NULL | NULL         |
|       26 |          NULL |              25 | NULL         |
|       27 |          NULL |              26 | NULL         |
|       28 |          NULL |              27 | NULL         |
|       29 |          NULL |              28 | NULL         |
|       30 |          NULL |              29 | NULL         |
|       31 |          NULL |              30 | NULL         |
|       32 |          NULL |              31 | NULL         |
|       33 |          NULL |              32 | NULL         |
|       34 |          NULL |              33 | NULL         |
|       35 |          NULL |              34 | genus        |
|       36 |        320892 |              35 | species      |

then I try to run my script: this time I have an update failure because the record 34 is the SAME family hence same ncbi_taxon_id as record 22: 'duplicate entry on key 2'.

Either this *unique* index is new and it is a BioSQL "issue" (as said, this index did not exist in my previous BioSQL db so I never encountered this issue before), OR the way BioPython "repeats" existing taxons is incorrect/not compatible. In that case, when inserting the second BioSeq, record 34 should not be created but record 35 (the genus) should "point" to the already existing family at record 22 as its father.

I would therefore like confirmation from the BioSQL team that the unique index is valid. If that is the case, then we can have a separate BioPython discussion about how to improve the management of the taxon table.


Best regards,

Eric



From holland at ebi.ac.uk  Wed Mar 26 08:00:03 2008
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 26 Mar 2008 12:00:03 +0000
Subject: [BioSQL-l] Concerns the update of BioSQL.taxon table
In-Reply-To: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
References: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
Message-ID: <47EA3AC3.20104@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Purely from a database perspective, the index is correct. There should
be no need to have a duplicate entry in ncbi_taxon_id. The implication
is that taxon_id is a 1:1 mapping to ncbi_taxon_id. There should be no
need to have two separate local taxon_id values referring to one NCBI taxon.

Ideally, when you run your update script, for each taxon_id record it
processes it should be checking for an existing entry with the same
ncbi_taxon_id, getting the taxon_id for that existing entry, then
removing the duplicate entry and updating the relevant parent_taxon_id
values in other records to refer to the existing taxon_id instead.
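
The check-and-merge step described above could be sketched roughly as
follows. This is a minimal sketch in Python with an in-memory SQLite
database, assuming a cut-down taxon table with only the four columns
shown in Eric's listing; assign_ncbi_id is a hypothetical helper name,
and a real BioSQL database would also need taxon_name and bioentry
rows that point at the duplicate to be re-pointed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the BioSQL taxon table (assumption).
conn.execute("""
    CREATE TABLE taxon (
        taxon_id        INTEGER PRIMARY KEY,
        ncbi_taxon_id   INTEGER UNIQUE,
        parent_taxon_id INTEGER,
        node_rank       TEXT
    )""")
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?, ?, ?)",
    [
        (21, 6962, 20, "suborder"),
        (22, 6964, 21, "family"),
        (34, None, 33, None),       # duplicate of the family, not yet labelled
        (35, None, 34, "genus"),
        (36, 320892, 35, "species"),
    ],
)


def assign_ncbi_id(conn, taxon_id, ncbi_taxon_id, node_rank):
    """Label a row with its NCBI ID, or merge it into the row that has it."""
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ? AND taxon_id <> ?",
        (ncbi_taxon_id, taxon_id),
    ).fetchone()
    if row is None:
        # No existing entry: a plain update is safe.
        conn.execute(
            "UPDATE taxon SET ncbi_taxon_id = ?, node_rank = ? WHERE taxon_id = ?",
            (ncbi_taxon_id, node_rank, taxon_id),
        )
        return taxon_id
    existing = row[0]
    # Re-point children of the duplicate, then remove the duplicate.
    conn.execute(
        "UPDATE taxon SET parent_taxon_id = ? WHERE parent_taxon_id = ?",
        (existing, taxon_id),
    )
    conn.execute("DELETE FROM taxon WHERE taxon_id = ?", (taxon_id,))
    return existing


merged_into = assign_ncbi_id(conn, 34, 6964, "family")
genus_parent = conn.execute(
    "SELECT parent_taxon_id FROM taxon WHERE taxon_id = 35"
).fetchone()[0]
print(merged_into, genus_parent)
```

In this toy data the genus at record 35 ends up pointing at the
existing family record 22, and record 34 is gone, which is the outcome
Eric described as the desired one.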

BioPython would need to be making similar checks when it inserts new
entries. If it isn't, then it needs to be fixed.

cheers,
Richard

Eric Gibert wrote:
> Thank you Peter for the correct email of the BioSQL list.
> 
> No, it is not linked to the BioPython 1.45 upgrade: the behavior is the same as with 1.44. My problem is linked to the fact that the BioSQL schema version 1.0.0 defines a *unique* index on taxon.ncbi_taxon_id. I did not have this index before.
> 
> I have written a script that connects to the NCBI taxonomy database and gets the XML data for the species. Then it updates the taxon table, replacing the NULL ncbi_taxon_id and node_rank with their values for the whole lineage. I call it after loading the BioSeqs into the database.
> 
> Example:
> I load a BioSeq for Nannophya pygmaea then I run my script to update the  ncbi_taxon_id and rank:
> +----------+---------------+-----------------+--------------+
> | taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
> +----------+---------------+-----------------+--------------+
> |       13 |          2759 |            NULL | superkingdom |
> |       14 |         33208 |              13 | kingdom      |
> |       15 |          6656 |              14 | phylum       |
> |       16 |          6960 |              15 | superclass   |
> |       17 |         50557 |              16 | class        |
> |       18 |          7496 |              17 | no rank      |
> |       19 |         33339 |              18 | subclass     |
> |       20 |          6961 |              19 | order        |
> |       21 |          6962 |              20 | suborder     |
> |       22 |          6964 |              21 | family       |
> |       23 |        229390 |              22 | genus        |
> |       24 |        229391 |              23 | species      |
> 
> No problem.
> 
> Now I insert/load another Libellulidae (Orthetrum sabina): 'empty/NULL' taxon records are inserted by the db.load() BioPython function:
> |       25 |          NULL |            NULL | NULL         |
> |       26 |          NULL |              25 | NULL         |
> |       27 |          NULL |              26 | NULL         |
> |       28 |          NULL |              27 | NULL         |
> |       29 |          NULL |              28 | NULL         |
> |       30 |          NULL |              29 | NULL         |
> |       31 |          NULL |              30 | NULL         |
> |       32 |          NULL |              31 | NULL         |
> |       33 |          NULL |              32 | NULL         |
> |       34 |          NULL |              33 | NULL         |
> |       35 |          NULL |              34 | genus        |
> |       36 |        320892 |              35 | species      |
> 
> then I try to run my script: this time I have an update failure because the record 34 is the SAME family hence same ncbi_taxon_id as record 22: 'duplicate entry on key 2'.
> 
> Either this *unique* index is new and it is a BioSQL "issue" (as said, this index did not exist in my previous BioSQL db so I never encountered this issue before), OR the way BioPython "repeats" existing taxons is incorrect/not compatible. In that case, when inserting the second BioSeq, record 34 should not be created but record 35 (the genus) should "point" to the already existing family at record 22 as its father.
> 
> I would therefore like confirmation from the BioSQL team that the unique index is valid. If that is the case, then we can have a separate BioPython discussion about how to improve the management of the taxon table.
> 
> 
> Best regards,
> 
> Eric
> 

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH6jrD4C5LeMEKA/QRAu7rAJ9TBYt0CeTTrPi0QN7Vm/UwiBANQwCfeoqz
0uTvcXXteholK+4xxuxjCXw=
=qhOf
-----END PGP SIGNATURE-----

From biopython at maubp.freeserve.co.uk  Wed Mar 26 08:30:50 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 26 Mar 2008 12:30:50 +0000
Subject: [BioSQL-l] [BioPython] Concerns the update of BioSQL.taxon table
In-Reply-To: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
References: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
Message-ID: <320fb6e00803260530w72cca900mc19654798d5d7e13@mail.gmail.com>

On Wed, Mar 26, 2008 at 11:29 AM, Eric Gibert <ericgibert at yahoo.fr> wrote:
> Thank you Peter for the correct email of the BioSQL list.
>
> No, it is not something linked to BioPython 1.45 upgrade: same behavior as 1.44.
> My problem is linked to the fact that the BioSQL schema version 1.0.0 defines a
> *unique* index on taxon.ncbi_taxon_id. I did not have this index before.
>
>  I have written a script that connects to the taxonomy database of NCBI and gets
>  the XML data for the species. Then it updates the taxon table, replacing the
>  ncbi_taxon_id and node_rank NULL by their values for all the lineage. I call it
>  after the loading of BioSeqs in the database.

So you wrote your own version of the BioSQL perl script load_ncbi_taxonomy.pl?

>  Example:
>  I load a BioSeq for Nannophya pygmaea then I run my script to update the  ncbi_taxon_id and rank:
>  +----------+---------------+-----------------+--------------+
>  | taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
>  +----------+---------------+-----------------+--------------+
>  |       13 |          2759 |            NULL | superkingdom |
>  |       14 |         33208 |              13 | kingdom      |
>  |       15 |          6656 |              14 | phylum       |
>  |       16 |          6960 |              15 | superclass   |
>  |       17 |         50557 |              16 | class        |
>  |       18 |          7496 |              17 | no rank      |
>  |       19 |         33339 |              18 | subclass     |
>  |       20 |          6961 |              19 | order        |
>  |       21 |          6962 |              20 | suborder     |
>  |       22 |          6964 |              21 | family       |
>  |       23 |        229390 |              22 | genus        |
>  |       24 |        229391 |              23 | species      |
>
>  No problem.
>
>  Now I insert/load another Libellulidae (Orthetrum sabina): 'empty/NULL'
>  taxon records are inserted by the db.load() BioPython function:

These records are "guess work" based on the lineage in the GenBank
file - we don't know the NCBI taxon ids, so they are NULL, nor the
rank, but there is a scientific name in the linked taxon_name table.  I
am open to the idea of not writing this guessed lineage, and just
writing one entry for the species and the given NCBI taxon ID.

However, as the new entry Orthetrum sabina should share some of its
lineage with Nannophya pygmaea, then I agree Biopython *should* be
re-using those existing taxon entries, if it can match them safely
using the scientific name.  Re-reading the relevant bit of old code,
it doesn't seem to do this.  I've filed bug 2475:
http://bugzilla.open-bio.org/show_bug.cgi?id=2475

This is actually a tricky problem, requiring some 'clever' parent
linkage as you said in your earlier email.  Hilmar wrote this about
the equivalent code in BioPerl:

>>  It's pretty unreliable actually. There is not only synonymy but also
>>  rampant homonymy in taxonomic names. There are plenty of examples
>>  for the same scientific name in use for a plant and for some animal, for
>>  example. So in order to be unambiguous you will need to know (and
>>  check) the kingdom.

See http://lists.open-bio.org/pipermail/biosql-l/2008-March/001207.html
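
To make the homonymy point concrete, here is a rough Python sketch of
a name lookup that also checks for an expected ancestor. It is not the
BioPerl or Biopython code; the tables are cut down to a few columns,
the lineages are shortened toy data, and find_taxon and ancestor_names
are hypothetical helpers. Morus is used because it is a genus name
both for mulberries (plants) and for gannets (birds).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-ins for the BioSQL taxon and taxon_name tables.
conn.executescript("""
    CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, parent_taxon_id INTEGER);
    CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT);
""")
toy_rows = [
    (1, None, "Eukaryota"),
    (2, 1, "Viridiplantae"),
    (3, 2, "Morus"),   # mulberries (intermediate ranks omitted)
    (4, 1, "Metazoa"),
    (5, 4, "Morus"),   # gannets (intermediate ranks omitted)
]
for taxon_id, parent_id, name in toy_rows:
    conn.execute("INSERT INTO taxon VALUES (?, ?)", (taxon_id, parent_id))
    conn.execute(
        "INSERT INTO taxon_name VALUES (?, ?, 'scientific name')",
        (taxon_id, name),
    )


def ancestor_names(conn, taxon_id):
    """Yield the scientific names of all ancestors, nearest first."""
    while True:
        row = conn.execute(
            "SELECT parent_taxon_id FROM taxon WHERE taxon_id = ?", (taxon_id,)
        ).fetchone()
        if row is None or row[0] is None:
            return
        taxon_id = row[0]
        name_row = conn.execute(
            "SELECT name FROM taxon_name "
            "WHERE taxon_id = ? AND name_class = 'scientific name'",
            (taxon_id,),
        ).fetchone()
        if name_row is not None:
            yield name_row[0]


def find_taxon(conn, name, kingdom):
    """Return the one taxon_id with this name under this kingdom, else None."""
    candidates = conn.execute(
        "SELECT taxon_id FROM taxon_name "
        "WHERE name = ? AND name_class = 'scientific name'",
        (name,),
    ).fetchall()
    matches = [tid for (tid,) in candidates
               if kingdom in ancestor_names(conn, tid)]
    return matches[0] if len(matches) == 1 else None


plant = find_taxon(conn, "Morus", "Viridiplantae")
bird = find_taxon(conn, "Morus", "Metazoa")
print(plant, bird)
```

A lookup by name alone would return two rows here; only the ancestor
check makes the match unambiguous, and anything other than exactly one
match is treated as "not found" rather than guessed.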

Eric wrote:
>  then I try to run my script: this time I have an update failure because the
> record 34 is the SAME family hence same ncbi_taxon_id as record 22:
> 'duplicate entry on key 2'.
>
>  Either this *unique* index is new and it is a BioSQL "issue" (as said, this index
> did not exist in my previous BioSQL db so I never encountered this issue before),

Hopefully Hilmar from BioSQL can answer this.

> OR the way BioPython "repeats" existing taxons is incorrect/not compatible.
> In that case, when inserting the second BioSeq, record 34 should not be created
> but record 35 (the genus) should "point" to the already existing family at record
> 22 as its father.

This example might be easier to follow if the scientific names from
the taxon_name table were included.  I would check the lineage but the
NCBI webpage is being very slow for me right now.

In the short term, as a quick fix, your script could first remove
taxon entries with a blank NCBI taxon ID (and clear any keys pointing
to them).  Not elegant, but it would work.
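
A rough sketch of that quick fix in Python follows. The three tables
are a simplified stand-in for the real BioSQL schema, and the function
name purge_guessed_taxa is invented for this example. Test it on a copy
of the database first, because clearing parent links leaves any
remaining child nodes without a parent.

```python
import sqlite3

DEMO_SCHEMA = """
CREATE TABLE taxon (
    taxon_id        INTEGER PRIMARY KEY,
    ncbi_taxon_id   INTEGER UNIQUE,
    parent_taxon_id INTEGER
);
CREATE TABLE taxon_name (taxon_id INTEGER NOT NULL, name TEXT NOT NULL);
CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, taxon_id INTEGER);
"""

# Sub-select naming every taxon row that has no NCBI taxon ID.
GUESSED = "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id IS NULL"


def purge_guessed_taxa(conn):
    """Delete taxon rows with a NULL NCBI taxon ID; return how many were removed."""
    with conn:  # one transaction: commit on success, roll back on error
        # Clear the keys pointing at the rows that are about to disappear.
        conn.execute(
            "UPDATE bioentry SET taxon_id = NULL WHERE taxon_id IN (%s)" % GUESSED
        )
        conn.execute(
            "UPDATE taxon SET parent_taxon_id = NULL "
            "WHERE parent_taxon_id IN (%s)" % GUESSED
        )
        conn.execute("DELETE FROM taxon_name WHERE taxon_id IN (%s)" % GUESSED)
        cur = conn.execute("DELETE FROM taxon WHERE ncbi_taxon_id IS NULL")
    return cur.rowcount
```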

Thanks Eric

Peter

From hlapp at gmx.net  Wed Mar 26 09:29:01 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 26 Mar 2008 09:29:01 -0400
Subject: [BioSQL-l] Concerns the update of BioSQL.taxon table
In-Reply-To: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
References: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
Message-ID: <EFDC1E5E-1379-435F-A6F6-79E0C382F18D@gmx.net>


On Mar 26, 2008, at 7:29 AM, Eric Gibert wrote:
> Either this *unique* index is new and it is a BioSQL "issue" (as  
> said, this index did not exist in my previous BioSQL db so I never  
> encountered this issue before)


The unique index has been there since Feb 2003 (the Singapore  
Biohackathon). I'm not sure how you got a version that doesn't have it.

The unique key constraint on the identifier column is also necessary  
- otherwise you cannot guarantee lookups by the NCBI taxonID to  
return either one or zero rows. Like Peter and Richard, I also don't  
understand what the point would be in allowing the same taxon (which  
in essence is a node), as identified by taxonID, to exist more than  
once.
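
To illustrate the point, a small Python sketch using SQLite and a
two-column stand-in for the taxon table (the real table has more
columns, and the helper names here are invented). With the unique
constraint in place a second insert of the same NCBI taxon ID is
rejected, so a lookup by that ID yields at most one row.

```python
import sqlite3


def make_taxon_table():
    """Create a minimal taxon table with a unique NCBI taxon ID (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon ("
        " taxon_id INTEGER PRIMARY KEY,"
        " ncbi_taxon_id INTEGER UNIQUE)"
    )
    return conn


def add_taxon(conn, ncbi_taxon_id):
    """Insert a node; return False if the unique key rejects a duplicate."""
    try:
        conn.execute(
            "INSERT INTO taxon (ncbi_taxon_id) VALUES (?)", (ncbi_taxon_id,)
        )
        return True
    except sqlite3.IntegrityError:
        return False


def lookup_taxon(conn, ncbi_taxon_id):
    """Return the taxon_id for an NCBI taxon ID, or None if it is absent."""
    rows = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?", (ncbi_taxon_id,)
    ).fetchall()
    # The unique key guarantees that this list has zero or one element.
    return rows[0][0] if rows else None
```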

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From pan.mueller at yahoo.de  Thu Mar 27 15:33:34 2008
From: pan.mueller at yahoo.de (Peter Müller)
Date: Thu, 27 Mar 2008 20:33:34 +0100 (CET)
Subject: [BioSQL-l] bioentries in a sequence cluster
Message-ID: <664425.11239.qm@web28203.mail.ukl.yahoo.com>


Dear list,

I have a few questions, but maybe with a working example, I can derive the rest.

With bioperl-db I can fetch a Bio::Cluster object with this query:
(I found no documentation about c::subject and p::object ...)

$query->datacollections(
          ["Bio::PrimarySeqI c::subject",
          "Bio::PrimarySeqI p::object",
         "Bio::PrimarySeqI<=>Bio::ClusterI<=>Bio::Ontology::TermI"]);

$query->where(["p.accession_number = 'NM_000015'"]);

my $adp = $db->get_object_adaptor('Bio::Cluster');
my $qres = $adp->find_by_query($query);


That's great - but here I ask for a sequence accession-number.

Is it possible to ask for the clone (IMAGE:4722596) or for an STS accession number, where the result is also a cluster object? For example:
"give me the cluster(s) whose SEQUENCE line has a clone entry with the number 'IMAGE:4722596'"
"give me the cluster(s) whose STS line has an accession number with the value 'PMC310725P3'"
PROTID and NID would also be interesting.

UniGene-snippet:
STS         ACC=PMC310725P3 UNISTS=272646
PROTSIM     ORG=10090; PROTGI=6754794; PROTID=NP_035004.1; PCT=76.55; ALN=288
SEQUENCE    ACC=BG569293.1; NID=g13576946; CLONE=IMAGE:4722596; END=5'; LID=6989; SEQTYPE=EST; TRACE=44157214

regards
pan




From hlapp at gmx.net  Sun Mar 30 01:00:25 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 30 Mar 2008 01:00:25 -0400
Subject: [BioSQL-l] bioentries in a sequence cluster
In-Reply-To: <664425.11239.qm@web28203.mail.ukl.yahoo.com>
References: <664425.11239.qm@web28203.mail.ukl.yahoo.com>
Message-ID: <8083537C-C721-48C2-A838-AAC2B178468A@gmx.net>


On Mar 27, 2008, at 3:33 PM, Peter Müller wrote:
>
>
> Dear list,
>
> I have a few questions, but maybe with a working example, I can  
> derive the rest.
>
> With bioperl-db I can fetch a Bio::Cluster object with this query:
> (I found no documentation about c::subject and p::object ...)

Yes, sorry, this needs a lot more documentation. The suffix that
follows the alias, separated from it by '::', is the 'context'. It is
needed when the same entity participates more than once in an
association. What confuses the issue further here is that at the
object level each entity (Bio::PrimarySeqI, Bio::ClusterI,
Bio::Ontology::TermI) participates only once, though in reality
Bio::ClusterI and Bio::PrimarySeqI both map to table bioentry.

>
> $query->datacollections(
>           ["Bio::PrimarySeqI c::subject",
>           "Bio::PrimarySeqI p::object",

I think that Bio::PrimarySeqI can be substituted with Bio::ClusterI  
in the second line. This would make the mapping clearer I guess. I'm  
not sure why I wrote the example that way, but I'd be surprised if  
Bio::ClusterI does not work here.

>          "Bio::PrimarySeqI<=>Bio::ClusterI<=>Bio::Ontology::TermI"]);
>
> $query->where(["p.accession_number = 'NM_000015'"]);

Actually I think you need to use c.accession_number to query by  
sequence accession. The c (child) alias is the cluster member, and  
the p (parent) alias is the cluster itself.

>
> my $adp = $db->get_object_adaptor('Bio::Cluster');
> my $qres = $adp->find_by_query($query);
>
>
> That's great - but here I ask for a sequence accession-number.
>
> Is it possible to ask for the Clone (IMAGE:4722596) or for an STS  
> accession-number where the result is also a cluster object?
> "give me the cluster(s) where in the sequence-line is a clone-entry  
> with this number 'IMAGE:4722596' ....
> "give me the cluster(s) where in the STS-line is an accession- 
> number with this value 'PMC310725P3'...
> PROTID and NID would be also interesting.

PID and NID should become the primary_id() of the sequence members.  
Hence, you would say c.primary_id where you have c.accession_number  
above.

Each STS line should be in a qualifier/value pair attached to the  
cluster bioentry, under the tag 'sts' (which from what I can see  
would consist of whole lines, not ACC= and UNISTS= values parsed out,  
though I may be mistaken). So you would add

"Bio::PrimarySeqI<=>Bio::Annotation::SimpleValue sv"

to the datacollections, and "sv.value = 'ACC=PMC310725P3  
UNISTS=272646'" and "sv.tagname = 'sts'" to the where() array.

The same goes for IMAGE clone IDs, except that the tag name is  
'clone' and the qualifier/value is attached to the member sequence,  
not the cluster; also here not the entire line is stored, but rather  
parsed into tokens.
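
As an illustration of "parsed into tokens", here is a small Python
sketch that splits a semicolon-separated UniGene line such as the
SEQUENCE line quoted above into key/value pairs. This is not the
BioPerl parser, and the function name is made up. It only handles
lines whose fields are separated by ';', so the STS line, which is
space-separated, is out of scope here.

```python
def parse_unigene_fields(line):
    """Split 'TAG  K1=V1; K2=V2; ...' into (tag, {key: value})."""
    tag, _, rest = line.partition(" ")
    fields = {}
    for token in rest.split(";"):
        token = token.strip()
        if not token:
            continue  # skip empty pieces, e.g. after a trailing ';'
        key, _, value = token.partition("=")
        fields[key] = value
    return tag, fields


# The SEQUENCE line from the UniGene snippet in this thread.
sequence_line = (
    "SEQUENCE    ACC=BG569293.1; NID=g13576946; CLONE=IMAGE:4722596; "
    "END=5'; LID=6989; SEQTYPE=EST; TRACE=44157214"
)
tag, fields = parse_unigene_fields(sequence_line)
print(tag, fields["CLONE"])
```

The CLONE token obtained this way corresponds to the value one would
expect to find under the 'clone' tag of the member sequence.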

Does this help?

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From cjfields at uiuc.edu  Sun Mar  2 01:16:24 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 1 Mar 2008 19:16:24 -0600
Subject: [BioSQL-l] multiple species for a sequence
Message-ID: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu>

I'm looking at a bioperl bug I filed a while back that deals with  
multiple species in a sequence file, such as found for AJ428955:

ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
XX
AC   AJ428955;
XX
DT   09-JUL-2002 (Rel. 72, Created)
DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
XX
DE   Hepatitis GB virus B subgenomic replicon neoRepB
XX
KW   core-neo fusion protein; core-neo gene; polyprotein.
XX
OS   Hepatitis GB virus B
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.

...

We could probably add support in bioperl fairly easily (Bio::Seq could  
just return an array or the first species object based on context),  
but would BioSQL support sequences like this?

chris


From hlapp at gmx.net  Sun Mar  2 17:33:23 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 12:33:23 -0500
Subject: [BioSQL-l] multiple species for a sequence
In-Reply-To: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu>
References: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu>
Message-ID: <86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net>


On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:

> I'm looking at a bioperl bug I filed a while back that deals with  
> multiple species in a sequence file, such as found for AJ428955:
>
> ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
> XX
> AC   AJ428955;
> XX
> DT   09-JUL-2002 (Rel. 72, Created)
> DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
> XX
> DE   Hepatitis GB virus B subgenomic replicon neoRepB
> XX
> KW   core-neo fusion protein; core-neo gene; polyprotein.
> XX
> OS   Hepatitis GB virus B
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
> XX
> OS   Encephalomyocarditis virus
> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
> OC   Cardiovirus.
>
> ...
>
> We could probably add support in bioperl fairly easily (Bio::Seq  
> could just return an array or the first species object based on  
> context), but would BioSQL support sequences like this?

No it wouldn't. There may only be one species (taxon) per sequence.

There has been a lot of discussion about this in the past mostly  
driven by the former SwissProt peculiarity of collapsing sequences by  
sequence identity into a single record. We held out and eventually  
UniProt dropped this practice.

I guess we never quite decided what to do about chimeric sequences  
like the above. Note that the GenBank record gives this differently:

http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885

Here, there's one taxon (ORGANISM line) reference, but two localized  
'source' features in the feature table. (I'm actually not 100% sure  
what the genbank parser would do with this - i.e., whether the second  
source feature will override the taxon_id found in the first.)  
Because seqfeatures (in BioSQL) don't have a link to taxon, you  
wouldn't be able to hit the sequence by its second (chimeric) taxon  
if that were your query criteria (though you could store it fine, and  
if you queried by dbxrefs of features of type 'source', you would  
find it).
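
A sketch of what such a query by dbxrefs of 'source' features could
look like, written in Python against SQLite. The tables are heavily
simplified versions of the BioSQL seqfeature, term, dbxref and
seqfeature_dbxref tables. That a loader stores the /db_xref="taxon:..."
qualifier with dbname 'taxon' is an assumption, and the IDs in the demo
data are arbitrary.

```python
import sqlite3

SCHEMA = """
CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    bioentry_id   INTEGER NOT NULL,
    type_term_id  INTEGER NOT NULL
);
CREATE TABLE dbxref (
    dbxref_id INTEGER PRIMARY KEY,
    dbname    TEXT NOT NULL,
    accession TEXT NOT NULL
);
CREATE TABLE seqfeature_dbxref (
    seqfeature_id INTEGER NOT NULL,
    dbxref_id     INTEGER NOT NULL
);
"""

QUERY = """
SELECT DISTINCT sf.bioentry_id
FROM seqfeature sf
JOIN term t ON t.term_id = sf.type_term_id
JOIN seqfeature_dbxref sd ON sd.seqfeature_id = sf.seqfeature_id
JOIN dbxref d ON d.dbxref_id = sd.dbxref_id
WHERE t.name = 'source' AND d.dbname = ? AND d.accession = ?
ORDER BY sf.bioentry_id
"""


def bioentries_with_source_taxon(conn, taxon_accession, dbname="taxon"):
    """Return bioentry IDs that have a 'source' feature cross-referencing the taxon."""
    rows = conn.execute(QUERY, (dbname, str(taxon_accession)))
    return [row[0] for row in rows]


def make_demo_db():
    """One chimeric bioentry (ID 10) with two source features, arbitrary taxa."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO term VALUES (1, 'source')")
    conn.executemany(
        "INSERT INTO seqfeature VALUES (?, ?, ?)", [(1, 10, 1), (2, 10, 1)]
    )
    conn.executemany(
        "INSERT INTO dbxref VALUES (?, ?, ?)",
        [(1, "taxon", "111"), (2, "taxon", "222")],
    )
    conn.executemany(
        "INSERT INTO seqfeature_dbxref VALUES (?, ?)", [(1, 1), (2, 2)]
    )
    return conn
```

With the demo data, the entry is found through either of its two source
features, even though bioentry itself could only point at one taxon.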

At the end of the day, BioSQL will evolve (hopefully) quickly to  
support what the Bio* toolkits support, and will be much slower to  
change in ways that Bio* wouldn't be able to take advantage of  
anyway. At least that's my current vision of it, and of course is up  
for debate as to whether that's a useful vision as much as anything  
else.

So, as you say, right now BioPerl, and AFAIAA any of the other Bio*  
toolkits, doesn't support more than one species per sequence, but as  
soon as that changes, there's a clear need for BioSQL to follow along.

Does that make sense?

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sun Mar  2 17:39:17 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 12:39:17 -0500
Subject: [BioSQL-l] BioSQL bug in bugzilla
In-Reply-To: <C4A1873F-C433-492A-8282-CDE6F54B0493@uiuc.edu>
References: <C4A1873F-C433-492A-8282-CDE6F54B0493@uiuc.edu>
Message-ID: <CCBB04C4-5436-4742-94DE-450511C5C628@gmx.net>

I don't think it's a good idea to just replace all varchar() types  
with type text.

First of all, having reasonable constraints is a Good Thing(tm) in my  
book as the majority of times I found them violated it revealed a  
parsing error, rather than the constraints not fitting the data.  
Second, this won't solve the problem for the other RDBMS versions for  
which there is a real performance penalty and other implications when  
having unreasonably large column widths.

That said, if the constraint is indeed not compatible with current  
data (such as Uniprot) we have a problem that needs to be fixed. So,  
what I would like to find out is

1) is this in reality a parsing error, or is there indeed a value for  
a column that in BioSQL is constrained to 40 chars, and

2) if so, which column in which table is the problem.

Erik - would you mind sending me the full error stack if you still  
have it? Usually load_seqdatabase.pl will also print an extra warning  
message saying what it couldn't store. That message would be great  
too. If you don't have either anymore, do you remember vaguely what  
those messages said? Alternatively, do you have the offending  
uniprot entry (or its accession)?

I suspect that it's actually the constraint on dbxref.accession. Does  
that ring a bell?
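
One way to narrow down questions 1) and 2) without the error stack is
to check candidate values against the column widths before loading.
The following Python sketch does that. Only the 40-character limit
comes from this thread; the list of columns it is applied to is a
guess, and the helper name and sample values are invented.

```python
# Hypothetical width limits: the 40 is from the discussion above, the
# choice of columns is an assumption and should be checked against the
# schema actually installed.
COLUMN_LIMITS = {
    ("dbxref", "accession"): 40,
    ("dbxref", "dbname"): 40,
}


def find_oversized(values, limits=COLUMN_LIMITS):
    """Return (table, column, length) for every value longer than its limit."""
    problems = []
    for (table, column), value in values.items():
        limit = limits.get((table, column))
        if limit is not None and len(value) > limit:
            problems.append((table, column, len(value)))
    return problems


# Made-up record: the accession is 45 characters long, the dbname is short.
record = {
    ("dbxref", "accession"): "X" * 45,
    ("dbxref", "dbname"): "EMBL",
}
print(find_oversized(record))
```

A value that is far longer than any plausible accession usually hints
at a parsing problem rather than at a constraint that is too tight.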

	-hilmar


On Mar 1, 2008, at 3:42 PM, Chris Fields wrote:

> Hilmar,
>
> Just wanted to point out a bug which I thought was bioperl-db-related
> but is really BioSQL.  Could you take a look to see what you think?
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2389
>
> chris

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From cjfields at uiuc.edu  Sun Mar  2 18:00:50 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 2 Mar 2008 12:00:50 -0600
Subject: [BioSQL-l] multiple species for a sequence
In-Reply-To: <86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net>
References: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu>
	<86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net>
Message-ID: <BFA173E9-1DB8-47F8-BC5F-25C35324C69B@uiuc.edu>


On Mar 2, 2008, at 11:33 AM, Hilmar Lapp wrote:

> On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:
>
>> I'm looking at a bioperl bug I filed a while back that deals with  
>> multiple species in a sequence file, such as found for AJ428955:
>>
>> ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
>> XX
>> AC   AJ428955;
>> XX
>> DT   09-JUL-2002 (Rel. 72, Created)
>> DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
>> XX
>> DE   Hepatitis GB virus B subgenomic replicon neoRepB
>> XX
>> KW   core-neo fusion protein; core-neo gene; polyprotein.
>> XX
>> OS   Hepatitis GB virus B
>> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
>> XX
>> OS   Encephalomyocarditis virus
>> OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
>> OC   Cardiovirus.
>>
>> ...
>>
>> We could probably add support in bioperl fairly easily (Bio::Seq  
>> could just return an array or the first species object based on  
>> context), but would BioSQL support sequences like this?
>
> No it wouldn't. There may only be one species (taxon) per sequence.
>
> There has been a lot of discussion about this in the past mostly  
> driven by the former SwissProt peculiarity of collapsing sequences  
> by sequence identity into a single record. We held out and  
> eventually UniProt dropped this practice.

I'm unsure how often these pop up.  The behavior of both EMBL and  
GenBank parsers assumes one species (as does Bio::Seq); the embl  
parser picks up both and just replaces the first with the second:

...
DE   Hepatitis GB virus B subgenomic replicon neoRepB
XX
KW   core-neo fusion protein; core-neo gene; polyprotein.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
XX
RN   [1]
...
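
For comparison, a minimal Python sketch of a parser step that keeps
every OS line instead of overwriting the first with the second. This is
not the BioPerl EMBL parser and the function name is made up. It treats
each OS line as a separate organism, which is a simplification.

```python
def organism_names(embl_text):
    """Collect the organism name from every OS line of an EMBL record."""
    names = []
    for line in embl_text.splitlines():
        if line.startswith("OS   "):
            # Append instead of assigning, so earlier organisms are kept.
            names.append(line[5:].strip())
    return names


# Excerpt of AJ428955 as quoted earlier in this thread.
record = """\
DE   Hepatitis GB virus B subgenomic replicon neoRepB
XX
OS   Hepatitis GB virus B
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Flaviviridae.
XX
OS   Encephalomyocarditis virus
OC   Viruses; ssRNA positive-strand viruses, no DNA stage; Picornaviridae;
OC   Cardiovirus.
"""
print(organism_names(record))
```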

> I guess we never quite decided what to do about chimeric sequences  
> like the above. Note that the GenBank record gives this differently:
>
> http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885
>
> Here, there's one taxon (ORGANISM line) reference, but two localized  
> 'source' features in the feature table. (I'm actually not 100% sure  
> what the genbank parser would do with this - i.e., whether the  
> second source feature will override the taxon_id found in the  
> first.) Because seqfeatures (in BioSQL) don't have a link to taxon,  
> you wouldn't be able to hit the sequence by its second (chimeric)  
> taxon if that were your query criteria (though you could store it  
> fine, and if you queried by dbxrefs of features of type 'source',  
> you would find it).

The genbank parser gets the taxon and tax ID correct; I would think  
when it hit the next source feature key it would assign the wrong tax  
ID to the species object but maybe there's a secondary check.  Both  
output the source in feature tables just fine.

> At the end of the day, BioSQL will evolve (hopefully) quickly to  
> support what the Bio* toolkits support, and will be much slower to  
> change in ways that Bio* wouldn't be able to take advantage of  
> anyway. At least that's my current vision of it, and of course is up  
> for debate as to whether that's a useful vision as much as anything  
> else.
>
> So, as you say, right now BioPerl, and AFAIAA any of the other Bio*  
> toolkits, doesn't support more than one species per sequence, but as  
> soon as that changes, there's a clear need for BioSQL to follow along.
>
> Does that make sense?
>
> 	-hilmar

Yes.  I think we could add in support for multiple species fairly  
easily but I'll probably hold off on anything until after a 1.6  
release (i.e. push it to the next developer series, which gives us  
more time to think on how to implement this in a BioSQL-friendly way).

chris


From er at xs4all.nl  Sun Mar  2 18:34:10 2008
From: er at xs4all.nl (Erik Rijkers)
Date: Sun, 2 Mar 2008 19:34:10 +0100 (CET)
Subject: [BioSQL-l] BioSQL bug in bugzilla
Message-ID: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl>

Hi Hilmar,

Sorry, it's too long ago.  I can run it again (with new
versions) sometime next week.  I don't remember which of
the two problems (parser or data size) it was in my case.

If it is true what you say (that most errors are due to
the parser), it might indeed be better to leave those
constraints in until such time that the parser has become
more trustworthy, and use the database as a test
instrument...

What is really needed of course is a place to run these
loading scripts continually against any newly appearing
versions of parsable text, and against the different
database backends.

Does that already happen somewhere?

Should we consider such a bioperl buildfarm / loadfarm?

(I might be able to help with any postgres loading tests.)

Thanks,

Erik Rijkers

On Sun, March 2, 2008 18:39, Hilmar Lapp wrote:
> I don't think it's a good idea to just replace all varchar() types
> with type text.
>
> First of all, having reasonable constraints is a Good Thing(tm) in my
> book as the majority of times I found them violated it revealed a
> parsing error, rather than the constraints not fitting the data.
> Second, this won't solve the problem for the other RDBMS versions for
> which there is a real performance penalty and other implications when
> having unreasonably large column widths.
>
> That said, if the constraint is indeed not compatible with current
> data (such as Uniprot) we have a problem that needs to be fixed. So,
> what I would like to find out is
>
> 1) is this in reality a parsing error, or is there indeed a value for
> a column that in BioSQL is constrained to 40 chars, and
>
> 2) if so, which column in which table is the problem.
>
> Erik - would you mind sending me the full error stack if you still
> have it? Usually load_seqdatabase.pl will also print an extra warning
> message saying what it couldn't store. That message would be great
> too. If you don't have either anymore, do you remember vaguely what
> those messages said? Alternatively, do you have the offending
> uniprot entry (or its accession)?
>
> I suspect that it's actually the constraint on dbxref.accession. Does
> that ring a bell?
>
> 	-hilmar
>
>
> On Mar 1, 2008, at 3:42 PM, Chris Fields wrote:
>
>> Hilmar,
>>
>> Just wanted to point out a bug which I thought was bioperl-db-related
>> but is really BioSQL.  Could you take a look to see what you think?
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2389
>>
>> chris
>
> --
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================


From hlapp at gmx.net  Sun Mar  2 19:20:21 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 14:20:21 -0500
Subject: [BioSQL-l] database loading test server (was: BioSQL bug in
	bugzilla)
In-Reply-To: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl>
References: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl>
Message-ID: <52F7E02B-7192-4F3C-9B48-887583B25CCE@gmx.net>

Hi Erik,

On Mar 2, 2008, at 1:34 PM, Erik Rijkers wrote:

> What is really needed of course is a place to run these
> loading scripts continually against any appearing new
> versions of parsable text, and against the different
> database backends.

very true indeed.

>
> Does that already happen somewhere?
>
> Should we consider such a bioperl buildfarm / loadfarm?
>
> (I might be able to help with any postgres loading tests.)


Coincidentally, we have been batting around the idea of having an OBF  
machine dedicated to testing and proof-of-concept  
demonstrations of OBF projects. Indeed, one of the services we had  
thought about setting up is a BioSQL database, and it's reassuring to  
hear independently that that would be useful.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From er at xs4all.nl  Sun Mar  2 20:01:46 2008
From: er at xs4all.nl (Erik Rijkers)
Date: Sun, 2 Mar 2008 21:01:46 +0100 (CET)
Subject: [BioSQL-l] database loading test server (was: BioSQL bug in
 bugzilla)
Message-ID: <9081.156.83.0.185.1204488106.squirrel@webmail.xs4all.nl>

Maybe we can use some ideas from the way the PostgreSQL
project has set up a distributed buildfarm (conceived by
Andrew Dunstan, I think):

  see: http://www.pgbuildfarm.org/

It lets members of the community use a standardized setup
for building PostgreSQL on their own machines and
automates all steps involved.

I know the projects and the communities are different, but
the general idea to have a standard process to set up
machines for whomever wants to dedicate some hardware and
time seems like a good idea.


Erik Rijkers

On Sun, March 2, 2008 20:20, Hilmar Lapp wrote:
> Hi Erik,
>
> On Mar 2, 2008, at 1:34 PM, Erik Rijkers wrote:
>
>> What is really needed of course is a place to run these
>> loading scripts continually against any appearing new
>> versions of parsable text, and against the different
>> database backends.
>
> very true indeed.
>
>>
>> Does that already happen somewhere?
>>
>> Should we consider such a bioperl buildfarm / loadfarm?
>>
>> (I might be able to help with any postgres loading
>> tests.)
>
>
> Coincidentally we have been batting around the idea to have an OBF
> machine dedicated to serve for testing and proof-of-concept
> demonstrations of OBF projects. Indeed one of the services we had
> thought about setting up is a BioSQL database, and it's reassuring to
> hear independently that that would be useful.
>
> 	-hilmar
> --
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
>
>
>
>


From hlapp at gmx.net  Sun Mar  2 20:38:27 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 15:38:27 -0500
Subject: [BioSQL-l] enhancement request scheduling
Message-ID: <5D2BC733-9A44-4EEA-B1D7-6DF90116B50E@gmx.net>

FYI, I have added the chimeric sequence problem and the character  
column width issue to the Enhancement Requests page on the wiki:

http://www.biosql.org/wiki/Enhancement_Requests

I've also started to arrange individual requests in a first draft  
towards scheduling them for implementation. This is very much up for  
debate, so let me know any feedback or disagreement you have or votes  
you might want to put in.

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Sun Mar  2 22:53:34 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 2 Mar 2008 22:53:34 +0000
Subject: [BioSQL-l] database loading test server (was: BioSQL bug in
	bugzilla)
In-Reply-To: <52F7E02B-7192-4F3C-9B48-887583B25CCE@gmx.net>
References: <25296.156.83.0.185.1204482850.squirrel@webmail.xs4all.nl>
	<52F7E02B-7192-4F3C-9B48-887583B25CCE@gmx.net>
Message-ID: <320fb6e00803021453h553a5c2ay8c50381ef39d0b6a@mail.gmail.com>

>
>  Coincidentally we have been batting around the idea to have an OBF
>  machine dedicated to serve for testing and proof-of-concept
>  demonstrations of OBF projects. Indeed one of the services we had
>  thought about setting up is a BioSQL database, and it's reassuring to
>  hear independently that that would be useful.
>

The BioSQL test database would be especially useful if we had all the
Bio* projects hooked up to it, to automatically check that they can all
read records written by each other.  I still haven't made time to get
BioPerl set up on my machine to check the BioSQL compatibility with
Biopython...

Peter


From hlapp at gmx.net  Mon Mar  3 03:18:47 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 22:18:47 -0500
Subject: [BioSQL-l] small "bug" correction in package BioSql
In-Reply-To: <473455BE.6040807@ebi.ac.uk>
References: <762277.43372.qm@web26507.mail.ukl.yahoo.com>
	<ECAC265E-DDD2-4403-800E-8B150A980093@gmx.net>
	<473336E6.6000100@ebi.ac.uk>
	<9FF48B4B-74F1-4371-BBB5-541F1A70D88F@gmx.net>
	<473455BE.6040807@ebi.ac.uk>
Message-ID: <4C9ACC1A-8C61-4611-8083-EFAD34D186EF@gmx.net>

Just FYI, I added a section to this effect to the Enhancement Requests:

http://www.biosql.org/wiki/Enhancement_Requests#Check_constraint_on_biosequence.alphabet

Feel free to fix/add as appropriate.

	-hilmar

On Nov 9, 2007, at 7:42 AM, Richard Holland wrote:

> I did a bit of poking around in our code and internally BioJava
> represents all the default alphabet names (Protein, DNA, etc.) in
> upper case. It also allows for mixed case alphabet names.
>
> It's not quite as easy as I thought to change these to lower case as
> they are often referenced by text name, meaning other people's code
> might break if I change them.
>
> Also, as it allows for mixed-case alphabet names, I can't do a
> toUpper/toLower fudge on persistence to BioSQL, as I wouldn't
> necessarily get out what I put in!
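
The round-trip problem can be shown in a few lines of Python. This is
only an illustration of the argument, not BioJava code. The function
name and the mixed-case alphabet name are invented.

```python
def round_trips_after_lowercasing(alphabet_name):
    """True if a name survives being lowercased on storage and read back as-is."""
    stored = alphabet_name.lower()  # the "toLower fudge" applied on persistence
    return stored == alphabet_name


for name in ("dna", "DNA", "MyCustomAlphabet"):
    print(name, round_trips_after_lowercasing(name))
```

Only names that were already lower case come back unchanged, which is
why a blanket case conversion is unsafe once mixed-case names are
allowed.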

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Mon Mar  3 03:38:59 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 2 Mar 2008 22:38:59 -0500
Subject: [BioSQL-l] Fwd: error on insert new sequences from GenBank: no
	annotations saved in BioSQL database
References: <A1A51FDB-C4A6-4894-8C9C-12A210B73C0D@gmx.net>
Message-ID: <917E3FCD-5FDB-460D-8F63-B897ACA5CDD2@gmx.net>

FYI, I used this to start a page on the recommended mapping of  
sequence annotation to BioSQL:

http://www.biosql.org/wiki/Annotation_Mapping

Obviously, this is very rudimentary, but everyone is welcome to add  
to it or comment with further questions. Also, one of the most  
important questions, namely a consistent vocabulary for annotation  
(qualifier) tags, isn't mentioned there (yet).

	-hilmar

Begin forwarded message:

> From: Hilmar Lapp <hlapp at gmx.net>
> Date: November 8, 2007 3:28:19 PM EST
> To: Eric Gibert <ericgibert at yahoo.fr>
> Cc: biopython at lists.open-bio.org, BioJava <biojava-l at biojava.org>
> Subject: Re: [Biojava-l] [BioPython] error on insert new sequences  
> from GenBank: no annotations saved in BioSQL database
>
> Maybe we need to hold some mini-hackathon to make the different
> toolkits compatible in how they map annotation to the schema.
> Obviously I don't know whether you have the latest Biojava setup
> here, but I'll just comment how BioPerl/Bioperl-db would map this:
>
> 'ORIGIN' - if I'm not mistaken this is only a token that introduces
> the actual sequence. I'm not sure what Biojava is storing as value  
> here.
>
> 'DIVISION' - this maps to column division in table bioentry (though I
> agree that if we perfectly followed the weak typing principle this
> should be a tag/value association, but at present it's still an actual
> column)
>
> 'genbank_accessions' - secondary accession numbers indeed go into the
> qualifier value table. The primary accession maps to column accession
> in table bioentry
>
> 'TITLE' - this is part of a publication reference, and should map to
> column title in table reference (which it does in bioperl-db)
>
> 'cross_references' - not sure where these would be coming from in
> GenBank format; for EMBL this will map to the dbxref table
>
> 'data_file_division' - not sure what this is (same as DIVISION?)
>
> 'VERSION' - in BioPerl we parse this apart into a version for the
> accession (which is column version in table bioentry) and the GI
> number, which maps to column identifier in table bioentry
>
> 'references' - these map to table reference (and bioentry_reference
> for association with the bioentry)
>
> 'KEYWORDS' - indeed these map to bioentry_qualifier_value
>
> 'GI' - maps to column identifier in table bioentry
>
> 'SIZE' - not sure what size that is. If it is the length of the
> sequence, it should (and in BioPerl/bioperl-db does) map to column
> length in table biosequence
>
> 'DEFINITION' - maps to column description in table bioentry
>
> 'REFERENCE' - should be the same as for 'references'
>
> 'MDAT' - not sure what this is
>
> 'ORGANISM' - this is the organism and maps to the table taxon (and
> taxon_name), with a foreign key in bioentry pointing to the taxon
>
> 'JOURNAL' - this is part of a reference, see 'references'
>
> 'ACCESSION' - the primary accession, maps to column accession in
> table bioentry
>
> 'LOCUS' - in the file itself this is an entire line consisting of
> multiple fields; BioPerl/bioperl-db maps the locus name (the first
> token after the literal token LOCUS) to column name in table bioentry
>
> 'SOURCE' - this is the organism, see 'ORGANISM'
>
> 'PUBMED' - this is part of a literature reference, and maps to a
> foreign key in the reference table (reference.dbxref) to a dbxref
> entry with PUBMED or PMID as the database and the pubmed ID as the
> accession
>
> 'AUTHORS' - part of a literature reference, maps to column authors in
> table reference
>
> 'TYPE' - not sure what this is. If it's the alphabet, it maps to
> table biosequence, column alphabet
>
> 'CIRCULAR' - this at present indeed maps to bioentry_qualifier_value,
> though there have been plans to make it a column in table biosequence.
>
> Note that this could in fact be the way Biojava stores it too, but
> upon retrieval represents it in the way you are seeing it.
>
> Hth,
>
> 	-hilmar
>
> On Nov 8, 2007, at 12:50 PM, Eric Gibert wrote:
>
>> Dear all,
>>
>> When I retrieve a BioSQL.BioSeq.DBSeqRecord which was inserted
>> previously by my BioJava application, I have:
>>
>> print "Debug on Seq:", Seq.id, "=", Seq.annotations.keys()
>>
>> Debug on Seq: AJ459190.1 = ['ORIGIN', 'DIVISION',
>> 'genbank_accessions', 'TITLE', 'cross_references',
>> 'data_file_division', 'VERSION', 'references', 'KEYWORDS', 'GI',
>> 'SIZE', 'DEFINITION', 'REFERENCE', 'MDAT', 'ORGANISM', 'JOURNAL',
>> 'ACCESSION', 'LOCUS', 'SOURCE', 'PUBMED', 'AUTHORS', 'TYPE',
>> 'CIRCULAR']
>>
>> but a freshly inserted BioSeq by BioPython 1.44 only gives me:
>> Debug on Seq: EF631597.1 =  ['cross_references', 'dates',
>> 'references', 'gi', 'data_file_division']
>>
>>
>> Once I look in the table bioentry_qualifier_value
>>
>> * 20 records for a Sequence imported by BioJava
>> * 1 only for a Sequence inserted by BioPython: the date which
>> should be inserted by "_load_bioentry_date" in BioSQL/Loader.py
>>
>> Quite a few annotations missing, no?
>>
>> Any idea?
>>
>> Eric
>>
>>
>>
>>
>>
>> _______________________________________________
>> BioPython mailing list  -  BioPython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>
> -- 
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
>
>
>
>
>
> _______________________________________________
> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biojava-l
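
For reference, the column mappings listed in the forwarded message can be
written down as a small lookup table. This is only a sketch: the names
ANNOTATION_MAP and biosql_target are made up for illustration, the
(table, column) layout is an arbitrary choice, and the entries merely
restate what the message says bioperl-db does. Keys the message was
unsure about (SIZE, TYPE, MDAT, ...) and keys that map to whole tables
rather than a single column are deliberately left out.

```python
# Hypothetical lookup table: GenBank annotation key -> (BioSQL table, column).
# Entries restate the bioperl-db mapping described in the forwarded message.
ANNOTATION_MAP = {
    "DIVISION": ("bioentry", "division"),
    "ACCESSION": ("bioentry", "accession"),      # primary accession only
    "VERSION": ("bioentry", "version"),          # the accession's version part
    "GI": ("bioentry", "identifier"),
    "DEFINITION": ("bioentry", "description"),
    "LOCUS": ("bioentry", "name"),               # the locus name token
    "TITLE": ("reference", "title"),
    "AUTHORS": ("reference", "authors"),
}


def biosql_target(key):
    """Return (table, column) for an annotation key, or None if not listed."""
    return ANNOTATION_MAP.get(key.upper())
```

Calling biosql_target("definition") gives ("bioentry", "description");
a key that is not in the table, such as "MDAT", gives None.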

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From cjfields at uiuc.edu  Mon Mar  3 04:36:56 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Sun, 2 Mar 2008 22:36:56 -0600
Subject: [BioSQL-l] Fwd: error on insert new sequences from GenBank: no
	annotations saved in BioSQL database
In-Reply-To: <917E3FCD-5FDB-460D-8F63-B897ACA5CDD2@gmx.net>
References: <A1A51FDB-C4A6-4894-8C9C-12A210B73C0D@gmx.net>
	<917E3FCD-5FDB-460D-8F63-B897ACA5CDD2@gmx.net>
Message-ID: <E90DD156-E6DE-47F8-AE0C-5FC1D039E377@uiuc.edu>


On Mar 2, 2008, at 9:38 PM, Hilmar Lapp wrote:

> FYI, I used this to start a page on the recommended mapping of  
> sequence annotation to BioSQL:
>
> http://www.biosql.org/wiki/Annotation_Mapping
>
> Obviously, this is very rudimentary, but everyone is welcome to add  
> to it or comment with further questions. Also, one of the most  
> important questions, namely a consistent vocabulary for annotation  
> (qualifier) tags, isn't mentioned there (yet).
>
> 	-hilmar
>
>> ...
>> Maybe we need to hold some mini-hackathon to make the different
>> toolkits compatible in how they map annotation to the schema.
>> Obviously I don't know whether you have the latest Biojava setup
>> here, but I'll just comment how BioPerl/Bioperl-db would map this:

These are the ones I know of:

>> 'cross_references' - not sure where these would be coming from in
>> GenBank format; for EMBL this will map to the dbxref table

GenPept has DBSOURCE, so maybe from there?

>> 'data_file_division' - not sure what this is (same as DIVISION?)

Not sure about that one, but division sounds right.

>> 'MDAT' - not sure what this is

Modification Date, I think.  'MDAT' is a field name used for limits in  
Entrez searches:

	Field code: MDAT
	      name: Modification Date
	      desc: Date of last update
	     count: 4012
	Attributes: is_date,is_singletoken

chris


From markjschreiber at gmail.com  Wed Mar  5 02:06:17 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Wed, 5 Mar 2008 10:06:17 +0800
Subject: [BioSQL-l] multiple species for a sequence
In-Reply-To: <BFA173E9-1DB8-47F8-BC5F-25C35324C69B@uiuc.edu>
References: <0CFEE9C6-4012-4E6F-B81F-33EE7379036F@uiuc.edu>
	<86EDE863-287A-48BF-9ED8-13219C3D342E@gmx.net>
	<BFA173E9-1DB8-47F8-BC5F-25C35324C69B@uiuc.edu>
Message-ID: <93b45ca50803041806o6f802548g4e408339d1a40c27@mail.gmail.com>

BioJava doesn't support multiple taxa per sequence.  It's something to
consider though.

Philosophically you really have to wonder about the meaning of species
when you have a chimera : )  Should it not be a hybrid species all on
its own?  I wonder what they will do when Craig Venter produces
Craigus ventus...

- Mark

On Mon, Mar 3, 2008 at 2:00 AM, Chris Fields <cjfields at uiuc.edu> wrote:
>
>
>  On Mar 2, 2008, at 11:33 AM, Hilmar Lapp wrote:
>
>  > On Mar 1, 2008, at 8:16 PM, Chris Fields wrote:
>  >
>  >> I'm looking at a bioperl bug I filed a while back that deals with
>  >> multiple species in a sequence file, such as found for AJ428955:
>  >>
>  >> ID   AJ428955; SV 1; linear; mRNA; STD; VRL; 8027 BP.
>  >> XX
>  >> AC   AJ428955;
>  >> XX
>  >> DT   09-JUL-2002 (Rel. 72, Created)
>  >> DT   15-APR-2005 (Rel. 83, Last updated, Version 4)
>  >> XX
>  >> DE   Hepatitis GB virus B subgenomic replicon neoRepB
>  >> XX
>  >> KW   core-neo fusion protein; core-neo gene; polyprotein.
>  >> XX
>  >> OS   Hepatitis GB virus B
>  >> OC   Viruses; ssRNA positive-strand viruses, no DNA stage;
>  >> Flaviviridae.
>  >> XX
>  >> OS   Encephalomyocarditis virus
>  >> OC   Viruses; ssRNA positive-strand viruses, no DNA stage;
>  >> Picornaviridae;
>  >> OC   Cardiovirus.
>  >>
>  >> ...
>  >>
>  >> We could probably add support in bioperl fairly easily (Bio::Seq
>  >> could just return an array or the first species object based on
>  >> context), but would BioSQL support sequences like this?
>  >
>  > No it wouldn't. There may only be one species (taxon) per sequence.
>  >
>  > There has been a lot of discussion about this in the past mostly
>  > driven by the former SwissProt peculiarity of collapsing sequences
>  > by sequence identity into a single record. We held out and
>  > eventually UniProt dropped this practice.
>
>  I'm unsure how often these pop up.  The behavior of both EMBL and
>  GenBank parsers assumes one species (as does Bio::Seq); the embl
>  parser picks up both and just replaces the first with the second:
>
>  ...
>
> DE   Hepatitis GB virus B subgenomic replicon neoRepB
>  XX
>  KW   core-neo fusion protein; core-neo gene; polyprotein.
>  XX
>
> OS   Encephalomyocarditis virus
>  OC   Viruses; ssRNA positive-strand viruses, no DNA stage;
>  Picornaviridae;
>  OC   Cardiovirus.
>  XX
>  RN   [1]
>  ...
>
>
>  > I guess we never quite decided what to do about chimeric sequences
>  > like the above. Note that the GenBank record gives this differently:
>  >
>  > http://www.ncbi.nlm.nih.gov/entrez/viewer.fcgi?db=nuccore&id=21727885
>  >
>  > Here, there's one taxon (ORGANISM line) reference, but two localized
>  > 'source' features in the feature table. (I'm actually not 100% sure
>  > what the genbank parser would do with this - i.e., whether the
>  > second source feature will override the taxon_id found in the
>  > first.) Because seqfeatures (in BioSQL) don't have a link to taxon,
>  > you wouldn't be able to hit the sequence by its second (chimeric)
>  > taxon if that were your query criteria (though you could store it
>  > fine, and if you queried by dbxrefs of features of type 'source',
>  > you would find it).
>
>  The genbank parser gets the taxon and tax ID correct; I would think
>  when it hit the next source feature key it would assign the wrong tax
>  ID to the species object but maybe there's a secondary check.  Both
>  output the source in feature tables just fine.
>
>
>  > At the end of the day, BioSQL will evolve (hopefully) quickly to
>  > support what the Bio* toolkits support, and will be much slower to
>  > change in ways that Bio* wouldn't be able to take advantage of
>  > anyway. At least that's my current vision of it, and of course is up
>  > for debate as to whether that's a useful vision as much as anything
>  > else.
>  >
>  > So, as you say, right now BioPerl, and AFAIAA any of the other Bio*
>  > toolkits, doesn't support more than one species per sequence, but as
>  > soon as that changes, there's a clear need for BioSQL to follow along.
>  >
>  > Does that make sense?
>  >
>  >       -hilmar
>
>  Yes.  I think we could add in support for multiple species fairly
>  easily but I'll probably hold off on anything until after a 1.6
>  release (i.e. push it to the next developer series, which gives us
>  more time to think on how to implement this in a BioSQL-friendly way).
>
>  chris
>
>
> _______________________________________________
>  BioSQL-l mailing list
>  BioSQL-l at lists.open-bio.org
>  http://lists.open-bio.org/mailman/listinfo/biosql-l
>
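
To make the difference between "keep both organisms" and "the second
replaces the first" concrete, here is a minimal sketch in Python. It is
not the BioPerl EMBL parser: the function name species_from_embl is
invented, the record is an abridged copy of the AJ428955 lines quoted
above, and the sketch assumes each OS line names exactly one organism
(it ignores OS continuation lines).

```python
def species_from_embl(lines):
    """Collect the organism from every OS line, in file order."""
    species = []
    for line in lines:
        # EMBL line codes are two letters followed by three spaces.
        if line.startswith("OS   "):
            species.append(line[5:].strip())
    return species


# Abridged from the AJ428955 record quoted in this thread.
record = [
    "DE   Hepatitis GB virus B subgenomic replicon neoRepB",
    "OS   Hepatitis GB virus B",
    "OC   Viruses; ssRNA positive-strand viruses, no DNA stage;",
    "OC   Flaviviridae.",
    "OS   Encephalomyocarditis virus",
    "OC   Viruses; ssRNA positive-strand viruses, no DNA stage;",
    "OC   Picornaviridae; Cardiovirus.",
]

all_species = species_from_embl(record)  # both organisms are kept
last_species = all_species[-1]           # what overwriting would leave behind
```

A schema with one taxon per bioentry can only hold something like
last_species; holding all_species would need a one-to-many link.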


From cjfields at uiuc.edu  Wed Mar  5 23:24:03 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Wed, 5 Mar 2008 17:24:03 -0600
Subject: [BioSQL-l] bioperl-db bugs
Message-ID: <C46D2101-D4FF-47E9-89BA-0D84114CCB7B@uiuc.edu>

Hilmar,

I think I have two bioperl-db bugs sorted out, but I'm trying to  
determine whether the solution is a side-effect, a feature, or a bug.   
Dmitry has filed two bug reports which are somewhat related:

http://bugzilla.open-bio.org/show_bug.cgi?id=2280
http://bugzilla.open-bio.org/show_bug.cgi?id=2281

I have added my comments to it, but maybe you can shed some more light  
on this.  What he is trying to do is copy a persistent Seq object to a  
different namespace; load_seqdatabase.pl won't let him do that  
directly using the same sequence file.  If he changes the namespace()  
and store()s it using a script, the seq is moved to the new namespace,  
not updated.

My reasoning is this is a feature (by not changing the primary_key,  
you don't store a new sequence but update the current one).  However,  
if the primary_key is unset (undef), then it appears you can copy the  
sequence over (from Dmitry's script, with my addition noted):

...
my $ns1 = 'space1';
my $ns2 = 'space2';

my $seqadp = $db->get_object_adaptor('Bio::SeqI');
my $aux_seq = Bio::Seq::RichSeq->new(
     -accession_number => 'NC_005982',
     -version => 1,
     -namespace => $ns1);
my $seq = $seqadp->find_by_unique_key($aux_seq);

# store the found sequence in the second biodatabase:
my $pseq = $seqadp->create_persistent($seq);  # wrap the sequence found above
$pseq->namespace($ns2);
$pseq->primary_key(undef);  # my addition, which appears to work
$pseq->store();
$seqadp->commit;
...

My question: is this an intended effect?  The ability to assign undef  
to primary_key seems intentional based on the method code, but I'm a  
bit uncertain here.

chris
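
The behaviour described above (store() updates the existing row while
the primary key is set, and inserts a fresh row once it is undef) can be
mimicked outside of bioperl-db. The following is a toy sketch in Python
with sqlite3, not the bioperl-db implementation: the single-table
layout, the store() helper, and the dict standing in for a persistent
object are all invented for illustration.

```python
import sqlite3


def store(conn, row):
    """Insert when the row has no primary key yet, otherwise update in place."""
    if row["bioentry_id"] is None:
        cur = conn.execute(
            "INSERT INTO bioentry (namespace, accession) VALUES (?, ?)",
            (row["namespace"], row["accession"]),
        )
        row["bioentry_id"] = cur.lastrowid
    else:
        conn.execute(
            "UPDATE bioentry SET namespace = ?, accession = ? "
            "WHERE bioentry_id = ?",
            (row["namespace"], row["accession"], row["bioentry_id"]),
        )


def count(conn):
    return conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry ("
    "bioentry_id INTEGER PRIMARY KEY, namespace TEXT, accession TEXT)"
)

seq = {"bioentry_id": None, "namespace": "space1", "accession": "NC_005982"}
store(conn, seq)                 # first store: a new row in space1

seq["namespace"] = "space2"
store(conn, seq)                 # key still set: the row is moved
moved_count = count(conn)

copy = dict(seq, bioentry_id=None, namespace="space1")
store(conn, copy)                # key unset: a second row is created
copied_count = count(conn)
```

After the move there is still one row; only after the key is cleared do
two rows exist, one per namespace.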


From hlapp at gmx.net  Thu Mar  6 05:03:26 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 6 Mar 2008 00:03:26 -0500
Subject: [BioSQL-l] Announcement: BioSQL v1.0.0 released
Message-ID: <B496379B-74F4-4A95-9F50-3DC13770EE5F@gmx.net>

BioSQL v1.0.0 Release
=====================

I am extremely pleased to announce the release of version 1.0.0
(code-named Tokyo, see below) of BioSQL. The release can be
downloaded at the following location, in the following formats:

http://biosql.org/DIST/biosql-1.0.0.tar.gz
http://biosql.org/DIST/biosql-1.0.0.tar.bz2
http://biosql.org/DIST/biosql-1.0.0.zip (has Windows-style EOL)

MD5 signatures (http://biosql.org/DIST/SIGNATURES.md5):
MD5(biosql-1.0.0.tar.bz2)= 2b09a821b9d94bb1e94c3c79dc2f4cff
MD5(biosql-1.0.0.tar.gz)= e47982d979ddb98aae640b5ab55ce2c6
MD5(biosql-1.0.0.zip)= 06913c8639ca4fe7f9000b556d8a04ed
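
A downloaded archive can be checked against these signatures with a few
lines of Python. This is a sketch rather than part of the release: the
helper names md5sum and matches_announcement are invented, and the
dictionary simply copies the three checksums listed above.

```python
import hashlib
import os

# Checksums copied from the announcement above.
EXPECTED_MD5 = {
    "biosql-1.0.0.tar.bz2": "2b09a821b9d94bb1e94c3c79dc2f4cff",
    "biosql-1.0.0.tar.gz": "e47982d979ddb98aae640b5ab55ce2c6",
    "biosql-1.0.0.zip": "06913c8639ca4fe7f9000b556d8a04ed",
}


def md5sum(path, chunk_size=65536):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_announcement(path):
    """True if the file's MD5 equals the checksum listed for its file name."""
    expected = EXPECTED_MD5.get(os.path.basename(path))
    return expected is not None and md5sum(path) == expected
```

For example, matches_announcement("biosql-1.0.0.tar.gz") should be true
for an intact download saved under its original name.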

The core BioSQL schema is a generic, extensible relational model for
sequences, sequence features, their annotation, and ontology terms. It
is also designed as the interoperable persistence interface between
the Bio* projects.

This version of the schema has essentially been the same since
November 2004. Software that worked with schema versions downloaded
from CVS (or, as of lately, svn) after November 2004 should work with
all 1.0.x releases.

This release contains
  - the core BioSQL schema as DDL (Data Definition Language) for the
    following RDBMSs: MySQL, PostgreSQL, Oracle, HSQLDB, and
    Apache Derby,
  - ancillary (but optional) schema files for PostgreSQL,
  - documentation and an ERD (Entity-Relationship Diagram), and
  - a Perl script that can pre-load (and update) a BioSQL instance with
    the NCBI taxonomy.

Installation instructions for MySQL and PostgreSQL are in the file
INSTALL, and the file doc/bj_and_bsql_oracle_howto.htm has
instructions for installing the Oracle version.

Additional information regarding BioSQL, including links to language
bindings, a roadmap to future releases and enhancements, and possible
local optimizations is available from the BioSQL website at
http://biosql.org.

On behalf of the BioSQL developers,

       Hilmar Lapp

Acknowledgments
---------------

BioSQL in general and this release in particular owe enormously to a
number of people and would not exist without their
contributions, the contributions of people on the biosql-l mailing
list, and the support of other developers and users from the Bio*
community.

Ewan Birney created the first version of the schema and during the
2003 BioHackathon in Singapore tested and wrote much of the INSTALL
document. Elia Stupka and Chris Mungall made significant changes at
the 2002 BioHackathons in Tucson, AZ, and Cape Town, South
Africa. Aaron Mackey was instrumental in the changes made at the
Singapore BioHackathon, which set the path to the version (code-named
'post-Singapore') that eventually stabilized as v1.0. Matthew Pocock
and Thomas Down provided important input for the ontology model.

This release and the accompanying work on cleaning up, updating
documentation, and jump-starting a useful (wiki) website was
irreversibly set in motion at the BioHackathon 2008 in Tokyo, and
would not have happened without the active encouragement from
several participants, especially Heikki Lehvahslaiho, Mark Schreiber,
Richard Holland, and Raoul Bonnal. Finally, without the superb and
prompt help from Mauricio Herrera Cuadra and Jason Stajich with
various wiki and other admin issues that occasionally reared their
heads we wouldn't have made it to this point.

In recognition of the role the BioHackathon 2008 played in getting
this release out the door, and in keeping with an informal tradition
held up since the first BioHackathon, I am code-naming the 1.0.x
release series the Tokyo release series of BioSQL.

Thank you to everyone!

License
-------

BioSQL is free software: you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published
by the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.


-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sun Mar  9 23:38:18 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 9 Mar 2008 19:38:18 -0400
Subject: [BioSQL-l] bioperl-db bugs
In-Reply-To: <C46D2101-D4FF-47E9-89BA-0D84114CCB7B@uiuc.edu>
References: <C46D2101-D4FF-47E9-89BA-0D84114CCB7B@uiuc.edu>
Message-ID: <DBCA4AD5-9C5D-499F-9506-5E8F3245DB28@gmx.net>

Hi Chris,

I added comments to both bug reports. This belongs to BioPerl,  
though, as it has only to do with its language binding.

The tidbit that may be worth keeping in mind for a general BioSQL
audience is that the bioentry namespace (foreign key to biodatabase) is
part of the (compound) bioentry unique keys. The identifier column used
to be unique by itself (and could still be made such in a local
instance; there's a comment to this effect in the DDL), but that was
changed a while ago. (Also, if one uses any of the Bio* language
bindings, changing a unique key constraint to something that differs
from what the language binding assumes may be asking for a lot of
trouble. Bioperl-db will expect the combination of primary_id() and
namespace() to match if the latter is provided.)

	-hilmar
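
What a compound unique key that includes the namespace means in practice
can be tried out with a toy table. The sketch below uses Python with
sqlite3; the table is a heavily simplified stand-in for bioentry, the
exact constraint columns are an assumption to be checked against the DDL
of the release, and the identifier value is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for bioentry; the namespace is the biodatabase_id.
conn.execute(
    """
    CREATE TABLE bioentry (
        bioentry_id    INTEGER PRIMARY KEY,
        biodatabase_id INTEGER NOT NULL,
        accession      TEXT NOT NULL,
        identifier     TEXT,
        version        INTEGER NOT NULL,
        UNIQUE (accession, biodatabase_id, version),
        UNIQUE (identifier, biodatabase_id)
    )
    """
)


def insert(biodatabase_id, accession, identifier, version):
    """Try to insert a row; return False if a unique key is violated."""
    try:
        conn.execute(
            "INSERT INTO bioentry "
            "(biodatabase_id, accession, identifier, version) "
            "VALUES (?, ?, ?, ?)",
            (biodatabase_id, accession, identifier, version),
        )
        return True
    except sqlite3.IntegrityError:
        return False


first = insert(1, "NC_005982", "12345", 1)        # new entry in namespace 1
other_ns = insert(2, "NC_005982", "12345", 1)     # same entry, namespace 2
duplicate = insert(1, "NC_005982", "12345", 1)    # same entry, same namespace
```

The same accession and identifier are accepted once per namespace, and
rejected the second time within the same namespace.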

On Mar 5, 2008, at 6:24 PM, Chris Fields wrote:

> Hilmar,
>
> I think I have two bioperl-db bugs sorted out, but I'm trying to  
> determine whether the solution is a side-effect, a feature, or a  
> bug.  Dmitry has filed two bug reports which are somewhat related:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2280
> http://bugzilla.open-bio.org/show_bug.cgi?id=2281
>
> I have added my comments to it, but maybe you can shed some more  
> light on this.  What he is trying to do is copy a persistent Seq  
> object to a different namespace; load_seqdatabase.pl won't let him  
> do that directly using the same sequence file.  If he changes the  
> namespace() and store()s it using a script, the seq is moved to the  
> new namespace, not updated.
>
> My reasoning is this is a feature (by not changing the primary_key,  
> you don't store a new sequence but update the current one).   
> However, if the primary_key is unset (undef), then it appears you  
> can copy the sequence over (from Dmitry's script, with my addition  
> noted):
>
> ...
> my $ns1 = 'space1';
> my $ns2 = 'space2';
>
> my $seqadp = $db->get_object_adaptor('Bio::SeqI');
> my $aux_seq = Bio::Seq::RichSeq->new(
>     -accession_number => 'NC_005982',
>     -version => 1,
>     -namespace => $ns1);
> my $seq = $seqadp->find_by_unique_key($aux_seq);
>
> # store the found sequence in the second biodatabase:
> my $pseq = $seqadp->create_persistent($seq);  # wrap the sequence found above
> $pseq->namespace($ns2);
> $pseq->primary_key(undef);  # my addition, which appears to work
> $pseq->store();
> $seqadp->commit;
> ...
>
> My question: is this an intended effect?  The ability to assign  
> undef to primary_key seems intentional based on the method code,  
> but I'm a bit uncertain here.
>
> chris
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From jswetnam at gmail.com  Mon Mar 10 19:27:46 2008
From: jswetnam at gmail.com (James Swetnam)
Date: Mon, 10 Mar 2008 15:27:46 -0400
Subject: [BioSQL-l] Possible Mysql 5.x bug
Message-ID: <2ABD56A9-9632-4AB1-BC54-B0AF71037DC8@gmail.com>

First off, thank you very much to the developers for creating and
maintaining such a useful and interesting project.  I think I have
found a small syntactical bug; as a caveat, however, I am not a
database developer and have very little experience in these matters.
I do know how to read documentation though, which I've relied heavily
on to write this email.

As per the biopython setup tutorial I'm attempting to run the
biosqldb-mysql.sql file on Mac OS X Leopard.  Here is my mysql version
string:

cardozo13:sql james$ mysql -V
mysql  Ver 14.12 Distrib 5.0.54-20071214, for apple-darwin9.1.0
(powerpc) using  EditLine wrapper

And my procedure (after grabbing the biosql source via CVS).

cardozo13:sql james$ mysqladmin -u root -p create bioseqdb
Enter password:
cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql
Enter password:
ERROR 1064 (42000) at line 169: You have an error in your SQL syntax;
check the manual that corresponds to your MySQL server version for the
right syntax to use near '--CREATE INDEX ontrel_subjectid ON
term_relationship(subject_term_id)' at line 1

Interesting.  Let's take a look at line 169:

--CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);

And an excerpt from the documentation for my version of MySQL (5.0
reference manual), section 1.8.5.6. '--' as the Start of a Comment:

Standard SQL uses "--" as a start-comment sequence. MySQL Server uses
"#" as the start comment character. MySQL Server 3.23.3 and up also
supports a variant of the "--" comment style. That is, the "--"
start-comment sequence must be followed by a space (or by a control
character such as a newline). The space is required to prevent
problems with automatically generated SQL queries that use constructs
such as the following, where we automatically insert the value of the
payment for payment:

OK. So after replacing all the lines in which -- is not followed by a
space (thank you regexps), it works beautifully.

cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql
Enter password:

Should this change be implemented?  Or am I missing something?

James Swetnam
Research Technician
New York University School of Medicine


From hlapp at gmx.net  Tue Mar 11 03:05:32 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Mon, 10 Mar 2008 23:05:32 -0400
Subject: [BioSQL-l] Possible Mysql 5.x bug
In-Reply-To: <2ABD56A9-9632-4AB1-BC54-B0AF71037DC8@gmail.com>
References: <2ABD56A9-9632-4AB1-BC54-B0AF71037DC8@gmail.com>
Message-ID: <9051AFFE-8660-4E21-B25F-93D1FB70D98B@gmx.net>

Hi James,

thanks for reporting this. Sebastian Bassi beat you to it, though,  
and it has actually been fixed in svn, and is also fixed in the 1.0.0  
release.

BioSQL has meanwhile moved to svn; the anonymous cvs server is still up,
but has not been updated since the switch-over to svn. Instructions for
downloading from svn and the download location of the 1.0.0 release are
on the BioSQL wiki at http://biosql.org.

Let us know if you encounter any difficulties. And great that you're  
finding the project useful!

	-hilmar

On Mar 10, 2008, at 3:27 PM, James Swetnam wrote:

> First off, thank you very much to the developers for creating and
> maintaining such a useful and interesting project.  I think I have
> found a small syntactical bug; as a caveat, however, I am not a
> database developer and have very little experience in these matters.
> I do know how to read documentation though, which I've relied heavily
> on to write this email.
>
> As per the biopython setup tutorial I'm attempting to run the
> biosqldb-mysql.sql file on Mac OS X Leopard.  Here is my mysql version
> string:
>
> cardozo13:sql james$ mysql -V
> mysql  Ver 14.12 Distrib 5.0.54-20071214, for apple-darwin9.1.0
> (powerpc) using  EditLine wrapper
>
> And my procedure (after grabbing the biosql source via CVS).
>
> cardozo13:sql james$ mysqladmin -u root -p create bioseqdb
> Enter password:
> cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql
> Enter password:
> ERROR 1064 (42000) at line 169: You have an error in your SQL syntax;
> check the manual that corresponds to your MySQL server version for the
> right syntax to use near '--CREATE INDEX ontrel_subjectid ON
> term_relationship(subject_term_id)' at line 1
>
> Interesting.  Let's take a look at line 169:
>
> --CREATE INDEX ontrel_subjectid ON term_relationship(subject_term_id);
>
> And an excerpt from the documentation for my version of MySQL (5.0
> reference manual), section 1.8.5.6. '--' as the Start of a Comment:
>
> Standard SQL uses "--" as a start-comment sequence. MySQL Server uses
> "#" as the start comment character. MySQL Server 3.23.3 and up also
> supports a variant of the "--" comment style. That is, the "--"
> start-comment sequence must be followed by a space (or by a control
> character such as a newline). The space is required to prevent
> problems with automatically generated SQL queries that use constructs
> such as the following, where we automatically insert the value of the
> payment for payment:
>
> OK. So after replacing all the lines in which -- is not followed by a
> space (thank you regexps), it works beautifully.
>
> cardozo13:sql james$ mysql -u root bioseqdb -p < biosqldb-mysql.sql
> Enter password:
>
> Should this change be implemented?  Or am I missing something?
>
> James Swetnam
> Research Technician
> New York University School of Medicine
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Tue Mar 11 18:51:47 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 11 Mar 2008 18:51:47 +0000
Subject: [BioSQL-l] Biopython documentation in BioSQL SVN
In-Reply-To: <320fb6e00803111148k52150fdcu3b3c22f72f59f514@mail.gmail.com>
References: <320fb6e00803111148k52150fdcu3b3c22f72f59f514@mail.gmail.com>
Message-ID: <320fb6e00803111151q645c1f0cgead7842e8ab0d0d@mail.gmail.com>

Hello,

Over on the Biopython mailing list, James Swetnam drew my
attention to the fact that we still had documentation referring to
installing BioSQL from CVS (predating both the move to SVN
and the official 1.0 release).

I've updated our wiki page, http://biopython.org/wiki/BioSQL

However, there is some older LaTeX based documentation on our webpage,
http://biopython.org/DIST/docs/biosql/python_biosql_basic.html
http://biopython.org/DIST/docs/biosql/python_biosql_basic.pdf

These are currently living in the BioSQL repository, which I don't
think I have access to.
http://code.open-bio.org/svnweb/index.cgi/biosql/browse/biosql-schema/trunk/doc/biopython/

Does it make sense to have this documentation separate from the
Biopython code it refers to (which lives in the Biopython repository)?
For one thing, it complicates access rights for developers.

What I would suggest is just to:

(*) Add a disclaimer to the top of python_biosql_basic.tex saying this
    document is deprecated, and giving a link to the wiki page,
    http://biopython.org/wiki/BioSQL
(*) regenerate the PDF and HTML files.
(*) Update these three files in BioSQL's SVN repository.
(*) Copy the new PDF and HTML files over to the Biopython webserver.

Thanks

Peter


From hlapp at gmx.net  Tue Mar 11 19:57:16 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 11 Mar 2008 15:57:16 -0400
Subject: [BioSQL-l] Biopython documentation in BioSQL SVN
In-Reply-To: <320fb6e00803111151q645c1f0cgead7842e8ab0d0d@mail.gmail.com>
References: <320fb6e00803111148k52150fdcu3b3c22f72f59f514@mail.gmail.com>
	<320fb6e00803111151q645c1f0cgead7842e8ab0d0d@mail.gmail.com>
Message-ID: <B66B13C8-6520-46F8-AD96-DABB4D06F91D@gmx.net>


On Mar 11, 2008, at 2:51 PM, Peter wrote:

> However, there is some older LaTeX based documentation on our webpage,
> http://biopython.org/DIST/docs/biosql/python_biosql_basic.html
> http://biopython.org/DIST/docs/biosql/python_biosql_basic.pdf
>
> These are currently living in the BioSQL repository,

You mean that the originals are, i.e., the source .tex file, right?  
The files in the BioSQL repository have been updated, and the updates  
should be in the v1.0.0 release.

> [...]
> Does it make sense to have this documentation separate from the
> Biopython code it refers to (which lives in the Biopython repository)?
> For one thing, it complicates access rights for developers.

Indeed. You can have write access, but that doesn't mean it would then
be easy for you folks to maintain (its being in a non-Biopython
repository would likely make it slip your minds again).

However, at the end of the day it is your call. I'm happy to leave it  
there, especially if there is continuing interest from Biopython  
folks to keep it updated (if there isn't, I may schedule it for  
deletion for one of the 1.1 or higher releases).

>
> What I would suggest is just to:
>
> (*) add a disclaimer to the top of python_biosql_basic.tex saying this
>     document is deprecated, and giving a link to the wiki page,
>     http://biopython.org/wiki/BioSQL

Just send me a patch of the change you would like to make.

> (*) regenerate the PDF and HTML files.

Those have been regenerated already, before the v1.0.0 release (by  
me, under some pains trying to get HeVeA to do what the original  
creators seemed to have gotten it to do).

> (*) Update these three files in BioSQL's SVN repository.

Done already as far as the change to svn is concerned. Actually, some  
Biopythonist (Sebastian?) walked through the file and made sure  
everything works as described, giving rise to an additional change.

> (*) Copy the new PDF and HTML files over to the Biopython webserver.


Feel free to grab them from svn (or from the BioSQL 1.0.0 release,  
there haven't been any changes since the release).

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Thu Mar 13 15:06:18 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Mar 2008 15:06:18 +0000
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
Message-ID: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>

Dear list,

One of the unresolved issues with Biopython's BioSQL interface is
dealing with the NCBI taxon ID when loading sequences into the
database.

As I understand it, ideally before loading any sequences, the user
will have loaded in the entire NCBI taxonomy using the
load_ncbi_taxonomy.pl script, as I described here:
http://biopython.org/wiki/BioSQL#NCBI_Taxonomy

When a new sequence is added to the database with a known taxon id,
there is no problem.  But what happens if it's a recently sequenced organism
which isn't defined yet in the BioSQL taxonomy tables?  Could/should
the user re-run load_ncbi_taxonomy.pl, and then load in their new
sequence?

Right now in Biopython, due to what appears to have been intended as a
short-term hack, we simply don't record the taxon id at all (!), and I
would like to fix this (bug 2422).
http://bugzilla.open-bio.org/show_bug.cgi?id=2422

How do BioPerl et al. deal with this issue?  Do they try to update the
taxonomy tables using the available information in the new record's
annotation (i.e. the new taxon id and the species name)?  Do they
look up the NCBI taxonomy definition via the internet?  Do they throw
an error and halt?

Thanks,

Peter
(Biopython)


From hlapp at gmx.net  Thu Mar 13 22:51:13 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 13 Mar 2008 18:51:13 -0400
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
Message-ID: <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>

(this is more of a bioperl question than a biosql one)

The load_ncbi_taxonomy.pl script is designed to update the taxon
tables in a non-disruptive way, and if there weren't many changes
it shouldn't actually take that long (except that recalculating the
nested set values may take a couple of minutes).

Bioperl-db will store the taxon information it finds in the  
Bio::Species object if it can't locate the taxon by lookup, and will  
not raise an error. The problem with this is that it relies on the  
Bio::SeqIO parser to have gotten the species and lineage information  
correct, which is sometimes a wrong assumption for exotic species.  
Most often the error will not manifest itself at the time of storing  
the erroneously parsed information, but when it is re-retrieved and  
used to populate a Bio::Species object.

For the SymAtlas project we had this situation (new species in
sequence updates that the last NCBI taxonomy update hadn't yet
brought in) quite regularly. I wrote a SQL script that would fix those
'haphazard' additions such that load_ncbi_taxonomy would update them
to their correct values come the next NCBI taxonomy update. I can
send you the script (it would be for the Oracle version), but I'm not
sure this is a widely viable strategy.

	-hilmar


-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Thu Mar 13 23:13:32 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 13 Mar 2008 23:13:32 +0000
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
Message-ID: <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>

On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
> (this is more of a bioperl question than a biosql one)

Well, yes and no.  And I'm not subscribed to the Bioperl list, nor the
BioJava one, nor the BioRuby one.

>  The load_ncbi_taxonomy.pl script is designed to update the taxon
>  tables in a non-disruptive way, and if there weren't many changes
>  it shouldn't actually take that long (except that recalculating the
>  nested set values may take a couple of minutes).

Do you think when faced with a novel taxon id, Biopython/BioPerl/...
could write some minimal taxonomy entry (without any guess work based
on the species name), in order to record the sequence's taxon - and
then running an improved load_ncbi_taxonomy.pl at a later date would
sort out the proper taxonomy?

>  Bioperl-db will store the taxon information it finds in the
>  Bio::Species object if it can't locate the taxon by lookup, and will
>  not raise an error. The problem with this is that it relies on the
>  Bio::SeqIO parser to have gotten the species and lineage information
>  correct, which is sometimes a wrong assumption for exotic species.
>  Most often the error will not manifest itself at the time of storing
>  the erroneously parsed information, but when it is re-retrieved and
>  used to populate a Bio::Species object.

This is what I would like to avoid with Biopython.

>  For the SymAtlas project we had this situation (new species in
>  sequence updates that the last NCBI taxonomy update hadn't yet
>  brought in) quite regularly. I wrote a SQL script that would fix those
>  'haphazard' additions such that load_ncbi_taxonomy would update them
>  to their correct values come the next NCBI taxonomy update. I can
>  send you the script (it would be for the Oracle version), but I'm not
>  sure this is a widely viable strategy.

So this wasn't integrated with load_ncbi_taxonomy.pl at all?

Peter


From hlapp at gmx.net  Thu Mar 13 23:41:43 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 13 Mar 2008 19:41:43 -0400
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
Message-ID: <CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>


On Mar 13, 2008, at 7:13 PM, Peter wrote:

> On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>> [...]
>>  The load_ncbi_taxonomy.pl script is designed to update the taxon
>>  tables in a non-disruptive way, and if there weren't many changes
>>  it shouldn't actually take that long (except that recalculating the
>>  nested set values may take a couple of minutes).
>
> Do you think when faced with a novel taxon id, Biopython/BioPerl/...
> could write some minimal taxonomy entry (without any guess work based
> on the species name), in order to record the sequence's taxon

This is what Bioperl-db does. There isn't any guesswork. If  
Bio::Species has lineage information it will also insert the lineage  
information, though.

> - and then running an improved load_ncbi_taxonomy.pl at a later date
> would sort out the proper taxonomy?

If I remember correctly, the script makes (and hence expects) the
primary key and the NCBI taxonomy ID to be identical. If your loading
procedure can achieve that already, then load_ncbi_taxonomy.pl should
pick them up and fix them. You can try that as follows: load the
taxonomy through the script, arbitrarily choose a taxon, create a stub
bioentry whose taxon_id foreign key points to the chosen taxon, change
the taxon's taxon_name.name to some bogus value (for the 'scientific
name' class, for example; feel free to change the left_id and right_id
values in taxon too), and rerun the script. It should fix the change
you made, and your bioentry should still point to the same taxon
(because the taxon's primary key did not change and the row did not
get deleted either; otherwise the bioentry would now have a null value
in the foreign key).

The Bioperl-db way of storing things does not give control over  
primary key assignment to Bioperl-db, so the database will assign it.
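
The experiment described above can be mimicked in miniature with the
sketch below. It is only an illustration under several assumptions: it
uses an in-memory SQLite database with a heavily simplified stand-in for
the taxon, taxon_name and bioentry tables (the column names, including
left_value and right_value, are from memory and may not match a real
BioSQL installation), and simulate_taxonomy_update() is a hypothetical
placeholder for what load_ncbi_taxonomy.pl would correct, not the
script's actual logic. The accession is made up; 9606 is the NCBI taxon
ID for Homo sapiens.

```python
import sqlite3

# Simplified stand-in for the BioSQL taxon tables (assumed column names).
SCHEMA = """
CREATE TABLE taxon (
    taxon_id      INTEGER PRIMARY KEY,
    ncbi_taxon_id INTEGER UNIQUE,
    left_value    INTEGER,
    right_value   INTEGER
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL REFERENCES taxon(taxon_id),
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL
);
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL,
    taxon_id    INTEGER REFERENCES taxon(taxon_id)
);
"""


def add_stub_taxon(conn, ncbi_taxon_id, placeholder_name):
    """Insert a minimal taxon whose primary key equals the NCBI taxon ID."""
    conn.execute(
        "INSERT INTO taxon (taxon_id, ncbi_taxon_id) VALUES (?, ?)",
        (ncbi_taxon_id, ncbi_taxon_id),
    )
    conn.execute(
        "INSERT INTO taxon_name (taxon_id, name, name_class) VALUES (?, ?, ?)",
        (ncbi_taxon_id, placeholder_name, "scientific name"),
    )


def simulate_taxonomy_update(conn, ncbi_taxon_id, name, left, right):
    """Hypothetical stand-in for the corrections a taxonomy reload makes."""
    conn.execute(
        "UPDATE taxon SET left_value = ?, right_value = ? WHERE taxon_id = ?",
        (left, right, ncbi_taxon_id),
    )
    conn.execute(
        "UPDATE taxon_name SET name = ? "
        "WHERE taxon_id = ? AND name_class = 'scientific name'",
        (name, ncbi_taxon_id),
    )


def run_demo():
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    # Stub taxon with a bogus name, plus a stub bioentry pointing at it.
    add_stub_taxon(conn, 9606, "bogus name")
    conn.execute(
        "INSERT INTO bioentry (accession, taxon_id) VALUES (?, ?)",
        ("STUB0001", 9606),
    )
    # Later "reload": the row is corrected in place, the key stays the same.
    simulate_taxonomy_update(conn, 9606, "Homo sapiens", 1, 2)
    row = conn.execute(
        "SELECT b.taxon_id, n.name FROM bioentry b "
        "JOIN taxon_name n ON n.taxon_id = b.taxon_id "
        "WHERE b.accession = 'STUB0001'"
    ).fetchone()
    conn.close()
    return row


if __name__ == "__main__":
    print(run_demo())
```

The point of the sketch is only that the bioentry keeps pointing at the
same taxon row, because the update changes the row's contents and not
its primary key.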

> [...]
>>  For the SymAtlas project we had this situation (new species in
>>  sequence updates that the last NCBI taxonomy update hadn't yet
>>  brought in) quite regularly. I wrote a SQL script that would fix those
>>  'haphazard' additions such that load_ncbi_taxonomy would update them
>>  to their correct values come the next NCBI taxonomy update. I can
>>  send you the script (it would be for the Oracle version), but I'm not
>>  sure this is a widely viable strategy.
>
> So this wasn't integrated with load_ncbi_taxonomy.pl at all?

No, but now that you say it I don't see any reason why I couldn't.  
Maybe that's just what I should do.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From mrphysh at juno.com  Fri Mar 14 01:58:25 2008
From: mrphysh at juno.com (mrphysh at juno.com)
Date: Fri, 14 Mar 2008 01:58:25 GMT
Subject: [BioSQL-l] bioperl basics
Message-ID: <20080313.195825.6855.0@webmail20.vgs.untd.com>

I am a molecular biologist studying bioinformatics from a Perl background and making progress.  I am realizing that without tapping into the existing infrastructure, I will be writing code forever.  Bioperl is the path for me.  I am moving forward.

The error I encounter is

can't locate Cache/FileCache in @INC (@INC contains /etc/perl/ /usr/local/lib/perl/5.8.8 .....)    and so forth.

I found the files in a home directory.  I must have told the install to put them there...?


Anyway:  How do I edit this environmental variable..... @INC?  I cannot find anything in my book.

thanks
john brigham




From barry.moore at genetics.utah.edu  Fri Mar 14 03:08:19 2008
From: barry.moore at genetics.utah.edu (Barry Moore)
Date: Thu, 13 Mar 2008 21:08:19 -0600
Subject: [BioSQL-l] bioperl basics
In-Reply-To: <20080313.195825.6855.0@webmail20.vgs.untd.com>
References: <20080313.195825.6855.0@webmail20.vgs.untd.com>
Message-ID: <E6BF1E75-E367-4F99-B910-FF8D4C307E86@genetics.utah.edu>

John,

@INC is not an environment variable; it is a Perl variable whose
search path can be extended through the environment variable PERL5LIB.
You would normally set that environment variable with something like

    export PERL5LIB='/path/to/perl/libraries':$PERL5LIB

if you use the bash shell, or

    setenv PERL5LIB "/path/to/perl/libraries:$PERL5LIB"

if you use the C shell. You'll want to put the appropriate line into
your shell's startup file so that it gets set every time you log in.
This will be different on a Windows system, but I'm afraid I can't help
with that.

If you are having trouble installing bioperl I would encourage you to
read the installation documentation at
http://www.bioperl.org/wiki/Installing_BioPerl.  Beyond that you will
find a wealth of help with your beginning perl questions by searching
the web with Google, asking at perlmonks.org or joining one of the many
perl mailing lists that you can find at http://lists.cpan.org/.

The bioperl mailing list and this mailing list (BioSQL) are devoted  
specifically to discussions directly related to Bioperl and BioSQL  
respectively.  You should search for answers to questions like this  
one first on the web, then on one of the general perl mailing lists  
or web sites mentioned above.  When you have questions (even beginner
ones) that are specific to Bioperl or BioSQL you are welcome to post to
those lists at any time.

Barry




From markjschreiber at gmail.com  Fri Mar 14 13:48:38 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Fri, 14 Mar 2008 21:48:38 +0800
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
	<CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
Message-ID: <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>

From memory, BioJava will add it if it is not already in there. If the
taxid can be found then the system connects you with whatever is in
that taxid, it doesn't overwrite it.

This has two curious side effects. Because the details associated with
a taxid sometimes change (e.g. common name changes a lot) you can get
connected to an outdated version (if your record is newer than your
NCBI taxonomy) or you can get connected with a version that is newer
than your record which means when you round-trip you don't get
complete identity.
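
A rough sketch of that "add if missing, never overwrite" behaviour and
of the resulting round-trip mismatch is shown below. It is not BioJava's
actual code, only an illustration of the behaviour as described: the
single-table layout, the function name find_or_add_taxon() and the taxon
ID 12345 with its names are all made up for the example.

```python
import sqlite3


def find_or_add_taxon(conn, ncbi_taxon_id, name_from_record):
    """Return the stored name for a taxon, adding it only if it is absent."""
    row = conn.execute(
        "SELECT name FROM taxon WHERE ncbi_taxon_id = ?", (ncbi_taxon_id,)
    ).fetchone()
    if row is not None:
        # The existing entry wins; the record's own details are not stored.
        return row[0]
    conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id, name) VALUES (?, ?)",
        (ncbi_taxon_id, name_from_record),
    )
    return name_from_record


def run_round_trip_demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE taxon (ncbi_taxon_id INTEGER PRIMARY KEY, name TEXT NOT NULL)"
    )
    # The first record introduces the taxon with the name it carries.
    first = find_or_add_taxon(conn, 12345, "old common name")
    # A later record carries a changed name for the same taxon ID.
    second = find_or_add_taxon(conn, 12345, "new common name")
    conn.close()
    return first, second


if __name__ == "__main__":
    print(run_round_trip_demo())
```

The second record comes back with the name already in the database
rather than the one it was loaded with, which is the loss of identity on
round-tripping mentioned above.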

For compatibility across the projects some kind of consensus would be good.

- Mark



From cjfields at uiuc.edu  Fri Mar 14 14:31:09 2008
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 14 Mar 2008 09:31:09 -0500
Subject: [BioSQL-l] [Bioperl-l] Loading sequences with novel NCBI taxon
	id
In-Reply-To: <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
	<CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
	<93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
Message-ID: <CE3675B2-2AFD-46AA-A348-16C9FEA51E0E@uiuc.edu>

The counter to that perspective (using new sequences with old tax  
info) would be to regularly update NCBI taxonomy, particularly in  
circumstances prior to adding new sequences.  Hilmar mentioned that  
once tax is loaded it doesn't take as long to update, so you could set  
up a cron job to update regularly.

I remember someone mentioning weekly or monthly updates on the list  
quite a while ago, but I'm unsure how often NCBI updates tax  
information (i.e. with every release, monthly, weekly, etc).  I can
see instances popping up where you used an up-to-date taxonomy but
a new sequence contains a tax ID not present in it.  I think bioperl-db
handles these but I'm not sure what the other Bio* projects do.

chris


Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign


From markjschreiber at gmail.com  Sat Mar 15 00:56:37 2008
From: markjschreiber at gmail.com (Mark Schreiber)
Date: Sat, 15 Mar 2008 08:56:37 +0800
Subject: [BioSQL-l] [Bioperl-l] Loading sequences with novel NCBI taxon
	id
In-Reply-To: <CE3675B2-2AFD-46AA-A348-16C9FEA51E0E@uiuc.edu>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
	<CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
	<93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
	<CE3675B2-2AFD-46AA-A348-16C9FEA51E0E@uiuc.edu>
Message-ID: <93b45ca50803141756m3d7f022cnb57bd39f37270682@mail.gmail.com>

I agree. A regular update would be best.

Of course if your BioSQL db is limited to one or a few organisms you can
just keep a fragment of the db.

- Mark

On Fri, Mar 14, 2008 at 10:31 PM, Chris Fields <cjfields at uiuc.edu> wrote:

> The counter to that perspective (using new sequences with old tax
> info) would be to regularly update NCBI taxonomy, particularly in
> circumstances prior to adding new sequences.  Hilmar mentioned that
> once tax is loaded it doesn't take as long to update, so you could set
> up a cron job to update regularly.
>
> I remember someone mentioning weekly or monthly updates on the list
> quite a while ago, but I'm unsure how often NCBI updates tax
> information (i.e. with every release, monthly, weekly, etc).  I can
> see instances popping up where you used the an up-to-date taxonomy but
> a new sequence contains a tax ID not present.  I think bioperl-db
> handles these but I'm not sure what other Bio* do.
>
> chris
>
> On Mar 14, 2008, at 8:48 AM, Mark Schreiber wrote:
>
> >> From memory BioJava will add it if it is not already in there. If the
> > taxid can be found then the system connects you with whatever is in
> > that taxid, it doesn't overwrite it.
> >
> > This has two curious side effects. Because the details associated with
> > a taxid sometimes change (eg common name changes a lot) you can get
> > connected to an outdated version (if your record is newer than your
> > NCBI taxonomy) or you can get connected with a version that is newer
> > than your record which means when you round-trip you don't get
> > complete identity.
> >
> > For compatibility across the projects some kind of consensus would
> > be good.
> >
> > - Mark
> > On Fri, Mar 14, 2008 at 7:41 AM, Hilmar Lapp <hlapp at gmx.net> wrote:
> >>
> >>
> >> On Mar 13, 2008, at 7:13 PM, Peter wrote:
> >>
> >>> On Thu, Mar 13, 2008 at 10:51 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
> >>>> [...]
> >>
> >>>> The load_ncbi_taxonomy.pl script is designed to update the taxon
> >>>> tables in a non-disruptive way, and if there weren't many changes
> >>>> shouldn't actually take that long (except that recalculating the
> >>>> nested set values may take a couple of minutes).
> >>>
> >>> Do you think when faced with a novel taxon id, Biopython/BioPerl/...
> >>> could write some minimal taxonomy entry (without any guess work
> >>> based
> >>> on the species name), in order to record the sequence's taxon
> >>
> >> This is what Bioperl-db does. There isn't any guesswork. If
> >> Bio::Species has lineage information it will also insert the lineage
> >> information, though.
> >>
> >>
> >>> - and then running an improved load_ncbi_taxonomy.pl at a later
> >>> date would
> >>> sort out the proper taxonomy?
> >>
> >> If I remember correctly, the script makes (and hence expects) the
> >> primary key and the NCBI taxonomy ID to be identical. If your loading
> >> procedure can achieve that already then load_ncbi_taxonomy.pl should
> >> pick them up and fix them. You can try that by loading the taxonomy
> >> through the script, then arbitrarily choose a taxon, create a stub
> >> bioentry for it and set its taxon_id foreign key to the chosen
> >> taxon,  change its taxon_name.name to some bogus value (for the
> >> 'scientific name' class, for example) (and feel free to change the
> >> left_id and right_id values in taxon too), and rerun the script. It
> >> should fix the change you made, and your bioentry should still point
> >> to the same taxon (because its primary key did not change, and did
> >> not get deleted either; otherwise the bioentry would now have a null
> >> value in the foreign key).
> >>
> >> The Bioperl-db way of storing things does not give control over
> >> primary key assignment to Bioperl-db, so the database will assign it.
> >>
> >>> [...]
> >>
> >>>> For the SymAtlas project we had this situation (new species in
> >>>> sequence updates that the last NCBI taxonomy update hadn't yet
> > >>>> brought in) quite regularly. I wrote a SQL script that would fix those
> >>>> 'haphazard' additions such that load_ncbi_taxonomy would update
> >>>> them
> >>>> to their correct values come the next NCBI taxonomy update. I can
> >>>> send you the script (it would be for the Oracle version), but I'm
> >>>> not
> >>>> sure this is a widely viable strategy.
> >>>
> >>> So this wasn't integrated with load_ncbi_taxonomy.pl at all?
> >>
> >> No, but now that you say it I don't see any reason why I couldn't.
> >> Maybe that's just what I should do.
> >>
> >>       -hilmar
> >>
> >> --
> >> ===========================================================
> >> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> >> ===========================================================
> >>
> >>
> >>
>
> Christopher Fields
> Postdoctoral Researcher
> Lab of Dr. Robert Switzer
> Dept of Biochemistry
> University of Illinois Urbana-Champaign
>
>
>
>


From biopython at maubp.freeserve.co.uk  Sun Mar 16 19:16:04 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 16 Mar 2008 19:16:04 +0000
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
	<CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
	<93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
Message-ID: <320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com>

On Fri, Mar 14, 2008 Mark Schreiber wrote:
> From memory BioJava will add it if it is not already in there. If the
> taxid can be found then the system connects you with whatever is in
> that taxid, it doesn't overwrite it.

BioPerl does this too, so there is consensus on this at least.  But see
below regarding the lineage.

>  This has two curious side effects. Because the details associated with
>  a taxid sometimes change (eg common name changes a lot) you can get
>  connected to an outdated version (if your record is newer than your
>  NCBI taxonomy) or you can get connected with a version that is newer
>  than your record which means when you round-trip you don't get
>  complete identity.

This is understandable, even if a little unexpected.

I (Peter) wrote:
>  > > Do you think when faced with a novel taxon id, Biopython/BioPerl/...
>  > > could write some minimal taxonomy entry (without any guess work based
>  > > on the species name), in order to record the sequence's taxon

Hilmar Lapp replied:
>  > This is what Bioperl-db does. There isn't any guesswork. If
>  > Bio::Species has lineage information it will also insert the lineage
>  > information, though.

I am planning to fix Biopython so that once again, it will record the
taxon id against new sequences if the species is already in the table,
and add it to the taxonomy if it isn't there already.

Should we also try and add the lineage into the taxon/taxon_name
tables, linking to existing entries based on matching scientific names
where possible?  Or, should we just add a single taxonomy entry for
the new species, with no lineage links at all?

The old Biopython code also used to add taxon table entries for the
full lineage - trying to reuse existing entries based on string
matching to the scientific name field in the taxon_name table.  This
strikes me as a little unreliable (which is why I used the term "guess
work" in my earlier email).  I am also concerned that this complicates
the clean up operation for load_ncbi_taxonomy.pl, but have not looked
into this.

Hilmar Lapp wrote:
>  > If I remember correctly, the script makes (and hence expects) the
>  > primary key and the NCBI taxonomy ID to be identical.

Really? Perhaps I have misunderstood you.  That would cause problems
if we want to record a new sequence entry with species information but
no NCBI taxonomy ID (e.g. an in house sequencing project).  The
Biopython code doesn't seem to assume the taxon table ID bears any
resemblance to the NCBI taxonomy ID.  When creating new taxon
table entries, we let the database assign the taxon table id
(primary key).
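
A minimal sketch of the "record a minimal entry, no lineage" option, in
Python with sqlite3 standing in for the real database. The two tables are
simplified assumptions modelled on taxon and taxon_name, and the helper
name get_or_create_taxon is made up for illustration; this is not the
Biopython or Bioperl-db code.

```python
import sqlite3

# Simplified stand-ins for the BioSQL taxon tables (assumption: the real
# schema has more columns and constraints than shown here).
SCHEMA = """
CREATE TABLE taxon (
    taxon_id      INTEGER PRIMARY KEY,  -- assigned by the database
    ncbi_taxon_id INTEGER UNIQUE
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL REFERENCES taxon(taxon_id),
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL
);
"""


def get_or_create_taxon(conn, ncbi_taxon_id, scientific_name):
    """Return the taxon primary key, inserting a lineage-free stub if needed."""
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?", (ncbi_taxon_id,)
    ).fetchone()
    if row is not None:
        return row[0]  # reuse the existing entry without overwriting it
    cur = conn.execute(
        "INSERT INTO taxon (ncbi_taxon_id) VALUES (?)", (ncbi_taxon_id,)
    )
    taxon_id = cur.lastrowid  # primary key chosen by the database
    conn.execute(
        "INSERT INTO taxon_name (taxon_id, name, name_class) VALUES (?, ?, ?)",
        (taxon_id, scientific_name, "scientific name"),
    )
    return taxon_id


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
first = get_or_create_taxon(conn, 9606, "Homo sapiens")
second = get_or_create_taxon(conn, 9606, "Homo sapiens")
```

The second call finds the row written by the first, so only one taxon
entry exists and no parent or lineage links are guessed.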

Peter


From hlapp at gmx.net  Sun Mar 16 22:00:12 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 16 Mar 2008 18:00:12 -0400
Subject: [BioSQL-l] BioSQL + Embl + Comments
In-Reply-To: <1205182450.18769.20.camel@Graco>
References: <1205182450.18769.20.camel@Graco>
Message-ID: <D9EE6AD3-AC97-4ACD-85AE-4C7E87C1A7A9@gmx.net>

Hi Raoul,

On Mar 10, 2008, at 4:54 PM, Raoul Jean Pierre Bonnal wrote:

> Dear Hilmar,
> I'm writing to ask you for some help.
>
> The BioRuby guys chose as an example for round-trip tests the sequence ID
> AJ224122; SV 3; linear; genomic DNA; STD; PLN; 3827 BP.
>
> I have a problem with the reference/comment information.
> In BioSQL, "comment" seems to be something generic, not directly bound to
> a reference.

Comment in BioSQL is a piece of annotation of type comment. The  
schema at present only allows you to attach those to bioentries, and  
in fact one particular comment can be assigned to only one bioentry  
(1:n relationship).

> If you look at AJ224122 in EMBL format, a comment is
> connected with the reference.

You're referring to the following line, right?

RC   revised by [3]

> There is no problem with GenBank because there is only a generic comment,
> and BioSQL works correctly in this case.
> So, how can I manage the problem with EMBL? I was thinking of adding a
> "comment_id" column to "bioentry_reference" as a foreign key to the
> "comment" table, so that a bioentry_reference can have more comments.

One question here is whether the comment is specific to the  
association of the reference with the bioentry, or to the reference  
in general.

The next thing to note is that the comment above is not just text, it  
actually establishes a relationship to another reference (or to  
another reference to bioentry association). So to really capture it  
you would want a typed link between bioentry_reference rows (in this  
case the relationship type would be 'revises' or 'revised by',  
depending on direction).

The question is whether this depth of modeling is needed or useful,  
aside from the fact that I'm pretty sure that none of the Bio*  
libraries supports it (but maybe they want to?).

So if not, I guess this goes back to the use-case of round-tripping?  
Maybe to satisfy that a bioentry_reference_qualifier table would  
suffice (assuming that the comment does apply rather to the reference/ 
bioentry association than directly to the reference).
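
As a rough illustration of the qualifier-table idea, here is a sketch in
Python with sqlite3. The bioentry_reference_qualifier table is
hypothetical (it is not part of the BioSQL schema), its column names are
assumptions, bioentry_reference is reduced to its key columns, and the
ids are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-in for the existing association table.
CREATE TABLE bioentry_reference (
    bioentry_id  INTEGER NOT NULL,
    reference_id INTEGER NOT NULL,
    rank         INTEGER NOT NULL,
    PRIMARY KEY (bioentry_id, reference_id, rank)
);
-- Hypothetical table: one row per qualifier attached to an association.
CREATE TABLE bioentry_reference_qualifier (
    bioentry_id  INTEGER NOT NULL,
    reference_id INTEGER NOT NULL,
    rank         INTEGER NOT NULL,
    qualifier    TEXT NOT NULL,  -- e.g. the EMBL line code 'RC'
    value        TEXT NOT NULL,
    FOREIGN KEY (bioentry_id, reference_id, rank)
        REFERENCES bioentry_reference (bioentry_id, reference_id, rank)
);
""")

# Made-up ids: bioentry 1 cites reference 10 as its first reference.
conn.execute("INSERT INTO bioentry_reference VALUES (1, 10, 1)")
conn.execute(
    "INSERT INTO bioentry_reference_qualifier VALUES (1, 10, 1, ?, ?)",
    ("RC", "revised by [3]"),
)

rows = conn.execute(
    "SELECT qualifier, value FROM bioentry_reference_qualifier"
    " WHERE bioentry_id = 1 AND reference_id = 10"
).fetchall()
```

This keeps the comment as plain text for round-tripping; it does not
model the 'revised by' relationship between references.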

>
> PS: I don't know if this stuff should be emailed to biosql list

Yes, I actually hadn't realized that you hadn't posted this to the  
list. Should have forwarded right away, sorry for sitting on it.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Sun Mar 16 22:54:45 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 16 Mar 2008 18:54:45 -0400
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
	<CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
	<93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
	<320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com>
Message-ID: <9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net>


On Mar 16, 2008, at 3:16 PM, Peter wrote:

> [...] I (Peter) wrote:
>>>> Do you think when faced with a novel taxon id, Biopython/ 
>>>> BioPerl/...
>>>> could write some minimal taxonomy entry (without any guess work  
>>>> based
>>>> on the species name), in order to record the sequence's taxon
>
> Hilmar Lapp replied:
>>> This is what Bioperl-db does. There isn't any guesswork. If
>>> Bio::Species has lineage information it will also insert the lineage
>>> information, though.
>
> I am planning to fix Biopython so that once again, it will record the
> taxon id against new sequences if the species is already in the table,
> and add it to the taxonomy if it isn't there already.
>
> Should we also try and add the lineage into the taxon/taxon_name
> tables, linking to existing entries based on matching scientific names
> where possible?  Or, should we just add a single taxonomy entry for
> the new species, with no lineage links at all?

This should probably depend on how good or complete the lineage  
information is that you have. BioPerl parses this out of the sequence  
files (for formats that have it, such as GenBank, EMBL, UniProt), and  
so except for exotic clades that don't follow the typical patterns it  
is usually in good shape (though one might say that the majority of  
clades are exotic).

Moreover, it's worth noting that the NCBI taxonomy often contains  
more nodes in a lineage than are shown in the GenBank record. In this  
case, unless you know which levels (ranks) to print and which not to,  
having the full NCBI taxonomy information may in fact cause problems  
for round-tripping.

>
> The old Biopython code also used to add taxon table entries for the
> full lineage - trying to reuse existing entries based on string
> matching to the scientific name field in the taxon_name table.  This
> strikes me as a little unreliable (which is why I used the term "guess
> work" in my earlier email).

It's pretty unreliable actually. There is not only synonymy but also  
rampant homonymy in taxonomic names. There are plenty of examples for  
the same scientific name in use for a plant and for some animal, for  
example. So in order to be unambiguous you will need to know (and  
check) the kingdom.
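
A small sketch of why matching on the scientific name alone is ambiguous,
in Python with sqlite3. The taxon_name table is a simplified assumption
and the primary keys are made up; Morus is one genus name that is used
both for mulberries (plants) and for gannets (birds).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon_name ("
    " taxon_id INTEGER NOT NULL, name TEXT NOT NULL, name_class TEXT NOT NULL)"
)
# Made-up primary keys; two different taxa share the name 'Morus'.
conn.executemany(
    "INSERT INTO taxon_name VALUES (?, ?, 'scientific name')",
    [(101, "Morus"), (202, "Morus"), (303, "Homo")],
)


def taxa_named(conn, name):
    """Return every taxon primary key carrying this scientific name."""
    rows = conn.execute(
        "SELECT taxon_id FROM taxon_name"
        " WHERE name = ? AND name_class = 'scientific name'"
        " ORDER BY taxon_id",
        (name,),
    ).fetchall()
    return [r[0] for r in rows]


matches = taxa_named(conn, "Morus")
```

A loader that reuses "the" entry for a lineage name has to decide what to
do when a lookup like this returns more than one row.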

> I am also concerned that this complicates the clean up operation  
> for load_ncbi_taxonomy.pl, but have not looked into this.

It shouldn't. The script makes no difference between tip (species or  
subspecies) nodes or internal nodes.

>
> Hilmar Lapp wrote:
>>> If I remember correctly, the script makes (and hence expects) the
>>> primary key and the NCBI taxonomy ID to be identical.
>
> Really? Perhaps I have misunderstood you.  That would cause problems
> if we want to record a new sequence entry with species information but
> no NCBI taxonomy ID (e.g. an in house sequencing project).  The
> Biopython code doesn't seem to assume the taxon table ID bears any
> resemblance to the NCBI taxonomy ID.  When creating new taxon
> table entries, we let the database assign the taxon table id
> (primary key).

Right, that's what I said Bioperl-db does too, and is the reason I  
had to regularly run that SQL script that would migrate the primary  
keys.

Doing that isn't a big deal but I guess this could also be fixed in  
load_ncbi_taxonomy.pl so that it doesn't need to rely on this  
assumption. Would someone mind filing the bug report? (We have a  
BioSQL category now on bugzilla.open-bio.org.)
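
For illustration only, a sketch of what migrating the primary keys could
look like, in Python with sqlite3. This is not the SQL script mentioned
above; the tables are simplified assumptions, the helper name
migrate_taxon_keys is made up, and it assumes that no target NCBI taxon
ID is already in use as another row's primary key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon (
    taxon_id      INTEGER PRIMARY KEY,
    ncbi_taxon_id INTEGER UNIQUE
);
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    taxon_id    INTEGER
);
-- A stub whose database-assigned key (1) differs from its NCBI ID (9606).
INSERT INTO taxon (taxon_id, ncbi_taxon_id) VALUES (1, 9606);
INSERT INTO bioentry (bioentry_id, taxon_id) VALUES (7, 1);
""")


def migrate_taxon_keys(conn):
    """Renumber taxon rows so that taxon_id equals ncbi_taxon_id."""
    stale = conn.execute(
        "SELECT taxon_id, ncbi_taxon_id FROM taxon"
        " WHERE ncbi_taxon_id IS NOT NULL AND taxon_id <> ncbi_taxon_id"
    ).fetchall()
    with conn:  # one transaction: commit on success, roll back on error
        for old_id, ncbi_id in stale:
            conn.execute(
                "UPDATE taxon SET taxon_id = ? WHERE taxon_id = ?",
                (ncbi_id, old_id),
            )
            # Keep the foreign key in bioentry pointing at the same taxon.
            conn.execute(
                "UPDATE bioentry SET taxon_id = ? WHERE taxon_id = ?",
                (ncbi_id, old_id),
            )
    return len(stale)


moved = migrate_taxon_keys(conn)
```

In a real instance every table with a foreign key to taxon (taxon_name,
for example) would need the same treatment.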

Cheers,

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Mon Mar 17 16:08:43 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 17 Mar 2008 16:08:43 +0000
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
	<CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
	<93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
	<320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com>
	<9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net>
Message-ID: <320fb6e00803170908x76f0b9a3he57f4653d2fd433@mail.gmail.com>

On Sun, Mar 16, 2008 at 10:54 PM, Hilmar Lapp <hlapp at gmx.net> wrote:
>  > Should we [Biopython] also try and add the lineage into the taxon/
>  > taxon_name tables, linking to existing entries based on matching scientific
>  > names where possible?  Or, should we just add a single taxonomy entry
>  > for the new species, with no lineage links at all?
>
>  This should probably depend on how good or complete the lineage
>  information is that you have. BioPerl parses this out of the sequence
>  files (for formats that have it, such as GenBank, EMBL, UniProt), and
>  so except for exotic clades that don't follow the typical patterns it
>  is usually in good shape (though one might say that the majority of
>  clades are exotic).

I'm currently testing with GenBank, EMBL and SwissProt/UniProt files.
Some of these files are several years old, and include horrible
multi-species SwissProt files with "species" names longer than 255
characters etc.  The good news is that as you pointed out on another
thread on the BioSQL mailing list earlier this month, they don't seem
to do this anymore.

>  Moreover, it's worth noting that the NCBI taxonomy often contains
>  more nodes in a lineage than are shown in the GenBank record. In this
>  case, unless you know which levels (ranks) to print and which not to,
>  having the full NCBI taxonomy information may in fact cause problems
>  for round-tripping.

I've come to accept that taxonomy information won't always survive a round trip.

>  > The old Biopython code also used to add taxon table entries for the
>  > full lineage - trying to reuse existing entries based on string
>  > matching to the scientific name field in the taxon_name table.  This
>  > strikes me as a little unreliable (which is why I used the term "guess
>  > work" in my earlier email).
>
>  It's pretty unreliable actually. There is not only synonymy but also
>  rampant homonymy in taxonomic names. There are plenty of examples for
>  the same scientific name in use for a plant and for some animal, for
>  example. So in order to be unambiguous you will need to know (and
>  check) the kingdom.

I don't think the current Biopython code for recording the lineages checks the
kingdom... could someone point me at the relevant bit of BioPerl and I'll see
if I can understand exactly what they do?

Hilmar Lapp wrote:
>  If I remember correctly, the script makes (and hence expects) the
>  primary key and the NCBI taxonomy ID to be identical.
>  ...
>  Doing that isn't a big deal but I guess this could also be fixed in
>  load_ncbi_taxonomy.pl so that it doesn't need to rely on this
>  assumption. Would someone mind filing the bug report? (We have a
>  BioSQL category now on bugzilla.open-bio.org.)

I've filed Bug 2470 on this, http://bugzilla.open-bio.org/show_bug.cgi?id=2470

Regards,

Peter


From hlapp at gmx.net  Tue Mar 18 12:30:34 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Tue, 18 Mar 2008 08:30:34 -0400
Subject: [BioSQL-l] Loading sequences with novel NCBI taxon id
In-Reply-To: <320fb6e00803170908x76f0b9a3he57f4653d2fd433@mail.gmail.com>
References: <320fb6e00803130806w46148bacm54c3ead9a50b038f@mail.gmail.com>
	<32EB5B0C-4CC8-4C33-9F41-5D4465B6AC48@gmx.net>
	<320fb6e00803131613o20eae2b7y325814ef26d2738f@mail.gmail.com>
	<CEA4F4E7-A66B-4C62-AE32-511E177BC485@gmx.net>
	<93b45ca50803140648s5098a7d0sec621f448ef03040@mail.gmail.com>
	<320fb6e00803161216q51488e5bo8d1538fb616edb88@mail.gmail.com>
	<9BC888F9-1DB1-40CC-93DA-C27E30019C04@gmx.net>
	<320fb6e00803170908x76f0b9a3he57f4653d2fd433@mail.gmail.com>
Message-ID: <418EB160-7848-4F1A-A88B-99B00003F8A2@gmx.net>


On Mar 17, 2008, at 12:08 PM, Peter wrote:

>> [...]
>>  It's pretty unreliable actually. There is not only synonymy but also
>>  rampant homonymy in taxonomic names. There are plenty of examples  
>> for
>>  the same scientific name in use for a plant and for some animal, for
>>  example. So in order to be unambiguous you will need to know (and
>>  check) the kingdom.
>
> I don't think the current Biopython code for recording the lineages  
> checks the
> kingdom... could someone point me at the relevant bit of BioPerl  
> and I'll see
> if I can understand exactly what they do?

Bioperl-db locates by NCBI taxon id first and then by scientific  
name. It does not take kingdom into account.

You can find the persisted columns, unique key queries etc. in
Bio/DB/BioSQL and then the respective adaptor, in this case
SpeciesAdaptor.pm. The unique key queries are defined in
get_unique_key_query().
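
The lookup order described above (NCBI taxon id first, then scientific
name, no kingdom check) can be sketched as follows in Python with
sqlite3. This is not the SpeciesAdaptor code; the tables are simplified
assumptions and find_taxon is a made-up helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE taxon (
    taxon_id      INTEGER PRIMARY KEY,
    ncbi_taxon_id INTEGER UNIQUE
);
CREATE TABLE taxon_name (
    taxon_id   INTEGER NOT NULL,
    name       TEXT NOT NULL,
    name_class TEXT NOT NULL
);
INSERT INTO taxon VALUES (1, 9606);
INSERT INTO taxon VALUES (2, NULL);  -- e.g. an in-house species
INSERT INTO taxon_name VALUES (1, 'Homo sapiens', 'scientific name');
INSERT INTO taxon_name VALUES (2, 'In-house isolate X', 'scientific name');
""")


def find_taxon(conn, ncbi_taxon_id=None, scientific_name=None):
    """Try the NCBI taxon id first, then fall back to the scientific name."""
    if ncbi_taxon_id is not None:
        row = conn.execute(
            "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ?",
            (ncbi_taxon_id,),
        ).fetchone()
        if row is not None:
            return row[0]
    if scientific_name is not None:
        row = conn.execute(
            "SELECT taxon_id FROM taxon_name"
            " WHERE name = ? AND name_class = 'scientific name'",
            (scientific_name,),
        ).fetchone()
        if row is not None:
            return row[0]
    return None


by_id = find_taxon(conn, ncbi_taxon_id=9606, scientific_name="some other name")
by_name = find_taxon(conn, scientific_name="In-house isolate X")
```

Because the name fallback ignores kingdom, it picks whichever row the
database returns first when two taxa share a scientific name.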

>
> Hilmar Lapp wrote:
>>  If I remember correctly, the script makes (and hence expects) the
>>  primary key and the NCBI taxonomy ID to be identical.
>>  ...
>>  Doing that isn't a big deal but I guess this could also be fixed in
>>  load_ncbi_taxonomy.pl so that it doesn't need to rely on this
>>  assumption. Would someone mind filing the bug report? (We have a
>>  BioSQL category now on bugzilla.open-bio.org.)
>
> I've filed Bug 2470 on this,
> http://bugzilla.open-bio.org/show_bug.cgi?id=2470

Thanks for the help, great, appreciated!

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at nescent.org  Sun Mar  9 23:36:26 2008
From: hlapp at nescent.org (Hilmar Lapp)
Date: Sun, 9 Mar 2008 19:36:26 -0400
Subject: [BioSQL-l] bioperl-db bugs
In-Reply-To: <C46D2101-D4FF-47E9-89BA-0D84114CCB7B@uiuc.edu>
References: <C46D2101-D4FF-47E9-89BA-0D84114CCB7B@uiuc.edu>
Message-ID: <E39A75D2-D93B-493E-BE5F-747BAB91EEFD@nescent.org>

Hi Chris,

I added comments to both bug reports. This belongs to BioPerl,  
though, as it has only to do with its language binding.

The tidbit that may be worth keeping in mind for a general BioSQL audience  
is that bioentry namespace (foreign key to biodatabase) is part of  
the (compound) bioentry unique keys. The identifier column used to be  
unique by itself (and could still be made such in a local instance,  
there's a comment to this effect in the DDL), but that was changed a  
while ago. (Also, if one uses any of the Bio* language bindings,  
changing a unique key constraint to something that differs from what  
the language binding assumes may be asking for a lot of trouble.
Bioperl-db will expect the combination of primary_id() and
namespace() to match if the latter is provided.)
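
A short sketch of what the compound unique key means in practice, in
Python with sqlite3. The tables are reduced to the columns involved and
are assumptions, not the actual DDL; the namespace names are taken from
the script quoted below and the identifier value is arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE biodatabase (
    biodatabase_id INTEGER PRIMARY KEY,
    name           TEXT NOT NULL UNIQUE
);
CREATE TABLE bioentry (
    bioentry_id    INTEGER PRIMARY KEY,
    biodatabase_id INTEGER NOT NULL REFERENCES biodatabase(biodatabase_id),
    identifier     TEXT,
    UNIQUE (identifier, biodatabase_id)  -- unique per namespace only
);
INSERT INTO biodatabase VALUES (1, 'space1');
INSERT INTO biodatabase VALUES (2, 'space2');
""")

insert = "INSERT INTO bioentry (biodatabase_id, identifier) VALUES (?, ?)"
conn.execute(insert, (1, "5456929"))
conn.execute(insert, (2, "5456929"))  # same identifier, other namespace

try:
    conn.execute(insert, (1, "5456929"))  # duplicate within one namespace
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

count = conn.execute("SELECT COUNT(*) FROM bioentry").fetchone()[0]
```

The same identifier can live in two namespaces, but a second copy within
one namespace violates the constraint.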

	-hilmar

On Mar 5, 2008, at 6:24 PM, Chris Fields wrote:

> Hilmar,
>
> I think I have two bioperl-db bugs sorted out, but I'm trying to  
> determine whether the solution is a side-effect, a feature, or a  
> bug.  Dmitry has filed two bug reports which are somewhat related:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=2280
> http://bugzilla.open-bio.org/show_bug.cgi?id=2281
>
> I have added my comments to it, but maybe you can shed some more  
> light on this.  What he is trying to do is copy a persistent Seq  
> object to a different namespace; load_seqdatabase.pl won't let him  
> do that directly using the same sequence file.  If he changes the  
> namespace() and store()s it using a script, the seq is moved to the  
> new namespace, not updated.
>
> My reasoning is this is a feature (by not changing the primary_key,  
> you don't store a new sequence but update the current one).   
> However, if the primary_key is unset (undef), then it appears you  
> can copy the sequence over (from Dmitry's script, with my addition  
> noted):
>
> ...
> my $ns1 = 'space1';
> my $ns2 = 'space2';
>
> my $seqadp = $db->get_object_adaptor('Bio::SeqI');
> my $aux_seq = Bio::Seq::RichSeq->new(
>     -accession_number => 'NC_005982',
>     -version => 1,
>     -namespace => $ns1);
> my $seq = $seqadp->find_by_unique_key($aux_seq);
>
> # store the found sequence in the second biodatabase:
> my $pseq = $seqadp->create_persistent($ns2);
> $pseq->namespace('bioperl2');
> $pseq->primary_key(undef);  # my addition, which appears to work
> $pseq->store();
> $seqadp->commit;
> ...
>
> My question: is this an intended effect?  The ability to assign  
> undef to primary_key seems intentional based on the method code,  
> but I'm a bit uncertain here.
>
> chris

-- 
===========================================================
: Hilmar Lapp  -:- Durham, NC -:- informatics.nescent.org :
===========================================================


From darin.london at duke.edu  Tue Mar 18 18:16:59 2008
From: darin.london at duke.edu (darin.london at duke.edu)
Date: Tue, 18 Mar 2008 13:16:59 -0500
Subject: [BioSQL-l] BOSC 2008 Announcement and Call For Submissions
Message-ID: <200803181816.m2IIGx2k007275@tenero.duhs.duke.edu>


BOSC 2008 Call for Abstracts

The 9th annual Bioinformatics Open Source Conference (BOSC 2008) will take place in Toronto, Ontario, Canada, as one of several Special Interest Group (SIG) meetings occurring in conjunction with the 16th annual Intelligent Systems for Molecular Biology Conference (ISMB 2008).

The Bioinformatics Open Source Conference (BOSC) is sponsored by the Open Bioinformatics Foundation (O|B|F), a non-profit group dedicated to promoting the practice and philosophy of Open Source software development within the biological research community. Many Open Source bioinformatics packages are widely used by the research community across many application areas and form a cornerstone in enabling research in the genomic and post-genomic era. Open source bioinformatics software has facilitated rapid innovation and dissemination of new computational methods as well as informatics infrastructure. Since the work of the Open Source Bioinformatics Community represents some of the most cutting edge of Bioinformatics in general, the overall theme for the conference this year is "Tackling Hard Problems with Emerging Technologies". Topics under this umbrella include cyberinfrastructure, grid computing and workflow management and discovery, and visualization. We will also have a series of update talks about the main Open Source Bioinformatics Software suites.

One of the hallmarks of BOSC is the coming together of the open source developer community in one location. A face-to-face meeting of this community creates synergy where participants can work together to create use cases, prototype working code, or run bootcamps for developers from other projects as short, informal, and hands-on tutorials in new software packages and emerging technologies. In short, BOSC is not just a conference for presentations of completed work, but is a dynamic meeting where collaborative work gets done.

This year, BOSC is accepting abstract submissions on the conference theme "Tackling Hard Problems with Emerging Technologies". The conference theme reflects that there are new technologies emerging on both the scientific front (new sequencing technologies, etc.) and the IT front (workflows, mashup/web 2.0, improvements in all of the major programming languages, etc.), which may allow the open source community to solve problems that were previously intractable. Abstracts may be submitted for the following topics.

1. Cyberinfrastructure - We are interested in presentations on topics dealing with the development of infrastructure on the web to facilitate software and data re-use (mashups, or traditional), interoperability and inter-process communication, system/service discovery, and data movement and modeling in distributed systems. This may include peer-to-peer systems of data transfer, Web Services, various flavors of data representation (SOAP, JSON, XML, others), and technologies commonly referred to under the Web 2.0 paradigm (e.g. folksonomies/tagging, user-based content generation, content feeds, and Social Networking).

2. Grid Computing and Workflow Management and Discovery - We particularly invite talks that report progress in making workflow systems easier to use and on how to do distributed-collaborative research, e.g. workflows that encompass the coordination of systems running in different parts of the world.

3. Visualization - Visualization is a maturing area of open source software development. We particularly invite talks that demonstrate innovative visualization systems in the context of workflows.

4. Open Source Software - Speakers will present talks on the use, development, or philosophy of open source software in bioinformatics.

5. Bio* Open Source Project Updates - We invite abstracts from the representatives of the open source projects sponsored by or affiliated to the O|B|F (see Projects).


Please consult the official BOSC 2008 website at http://www.open-bio.org/wiki/Upcoming_BOSC_conference  for all updates and extra information.

Submission Process:
All abstracts must be submitted through our Open Conference Systems site (http://events.open-bio.org/BOSC2008/openconf.php).
The form will ask for a small Abstract Text to be pasted into it, and a full paper.  The small Abstract text should be a summary, while the longer abstract should provide more details, including the open-source license requirement details.
Full-length abstracts are limited to one page with one inch (2.5 cm) margins on the top, sides, and bottom.  The full-length abstract should include the title, authors, and affiliations.  We prefer your abstract to be in PDF format, although plain t

Important Dates:
May 11: Abstract submission deadline.
June 2: Notification of accepted talks.
June 4: Early registration discount cut-off.
July 18-19: BOSC 2008!

We hope to see you at BOSC 2008!

Kam Dahlquist and Darin London
BOSC 2008 Co-organizers

			 
From er at xs4all.nl  Thu Mar 20 19:24:12 2008
From: er at xs4all.nl (Erik)
Date: Thu, 20 Mar 2008 20:24:12 +0100 (CET)
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
Message-ID: <5095.156.83.1.251.1206041052.squirrel@webmail.xs4all.nl>

Hi,

(latest BioSQL, bioperl-db, and bioperl-live installed.)

Postgres 8.3 will not auto-cast text (='character
varying') to integer any longer, which causes test
t/16odba.t to fail:


------------- EXCEPTION: Bio::Root::Exception -------------
MSG: error while executing query in
Bio::DB::BioSQL::SeqAdaptor::find_by_query: ERROR: 
operator does not exist: character varying = integer
LINE 1: ...eq.taxon_id FROM bioentry seq WHERE
seq.identifier = 5456929

It seems likely to cause many similar statements to fail;
how should this be solved?

I tried to fix it but I couldn't find the place where the
statement/clauses are put together.


Thanks,

Erik Rijkers


From hlapp at gmx.net  Thu Mar 20 22:49:41 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 20 Mar 2008 18:49:41 -0400
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
In-Reply-To: <5095.156.83.1.251.1206041052.squirrel@webmail.xs4all.nl>
References: <5095.156.83.1.251.1206041052.squirrel@webmail.xs4all.nl>
Message-ID: <0F80B40B-0232-4367-8433-992588B6E71B@gmx.net>

Hi Erik, thanks for the report. Given the error message, it looks  
more like the integer (which in reality is a string) can't be  
automatically converted to a string.

That would be equally interesting, though. DBI I thought used to bind  
all parameters as string by default, but maybe that has changed?

The parameter values are indeed all bound generically (and the query  
is created dynamically too), and I'm leaving it up to the DBD drivers  
to do the "Right Thing". I could obviously force everything into type  
string, but that is likely to have it's own repercussions on various  
RDBMSs.

So could you file this as a bug report on bugzilla.open-bio.org  
(category bioperl-db, this is actually not a BioSQL problem), and run  
the following test on your 8.3 instance (which minor version actually?):

CREATE TABLE t1 (a varchar(10), b text, c integer);

SELECT * from t1 WHERE a = 1;
SELECT * from t1 WHERE b = 1;
SELECT * from t1 WHERE c = '1';

INSERT INTO t1 (a,b,c) VALUES ('a','b',1);

SELECT * from t1 WHERE a = 1;
SELECT * from t1 WHERE b = 1;
SELECT * from t1 WHERE c = '1';

SELECT * from t1 WHERE a = 1::text;
SELECT * from t1 WHERE b = 1::text;
SELECT * from t1 WHERE c = integer '1';

DROP TABLE t1;

These all work fine on my 8.1.4 instance.

	-hilmar

On Mar 20, 2008, at 3:24 PM, Erik wrote:
> Hi,
>
> (latest BioSQL, bioperl-db, and bioperl-live installed.)
>
> Postgres 8.3 will not auto-cast text (='character
> varying') to integer any longer, which causes test
> t/16odba.t to fail:
>
>
> ------------- EXCEPTION: Bio::Root::Exception -------------
> MSG: error while executing query in
> Bio::DB::BioSQL::SeqAdaptor::find_by_query: ERROR:
> operator does not exist: character varying = integer
> LINE 1: ...eq.taxon_id FROM bioentry seq WHERE
> seq.identifier = 5456929
>
> It seems likely to cause many similar statements to fail;
> how should this be solved?
>
> I tried to fix it but I couldn't find the place where the
> statement/clauses are put together.
>
>
> Thanks,
>
> Erik Rijkers
>
>

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From er at xs4all.nl  Thu Mar 20 23:30:03 2008
From: er at xs4all.nl (Erik)
Date: Fri, 21 Mar 2008 00:30:03 +0100 (CET)
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
Message-ID: <15786.156.83.1.157.1206055803.squirrel@webmail.xs4all.nl>

On Thu, March 20, 2008 23:49, Hilmar Lapp wrote:
> Hi Erik, thanks for the report. Given the error message,
> it looks
> more like the integer (which in reality is a string) can't
> be automatically converted to a string.

you are right, of course :)


Here is the postgres 8.3.1 result of your sql statements:

CREATE TABLE t1 (a varchar(10), b text, c integer);

SELECT * from t1 WHERE a = 1;   -- fails in 8.3.1
SELECT * from t1 WHERE b = 1;	  -- fails in 8.3.1
SELECT * from t1 WHERE c = '1'; -- ok

INSERT INTO t1 (a,b,c) VALUES ('a','b',1);

SELECT * from t1 WHERE a = 1;	  -- fails in 8.3.1
SELECT * from t1 WHERE b = 1;	  -- fails in 8.3.1
SELECT * from t1 WHERE c = '1'; -- ok

SELECT * from t1 WHERE a = 1::text;     -- ok
SELECT * from t1 WHERE b = 1::text;     -- ok
SELECT * from t1 WHERE c = integer '1'; -- ok

The failure is always (virtually) the same:
ERROR:  operator does not exist: character varying = integer
LINE 1: SELECT * from t1 WHERE a = 1;
                                 ^
HINT:  No operator matches the given name and argument
type(s). You might need to add explicit type casts.


Then there is the cast function: for instance, I can let
the test in t/16odba.t proceed faultlessly with

 $seq = $biodb->get_Seq_by_id( "cast(5456929 as text)" );
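
The same idea, making the comparison text = text explicitly, can be
sketched on the query side in Python. sqlite3 is only a stand-in here: it
is lenient about types and does not reproduce the 8.3 error, so this
shows the shape of the query rather than the failure. The table is a
simplified assumption and find_by_identifier is a made-up helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, identifier TEXT)"
)
conn.execute("INSERT INTO bioentry (identifier) VALUES ('5456929')")


def find_by_identifier(conn, identifier):
    """Look up a text column without relying on implicit casts."""
    rows = conn.execute(
        # The explicit cast keeps both sides of '=' textual no matter how
        # the driver chooses to bind the parameter.
        "SELECT bioentry_id FROM bioentry WHERE identifier = CAST(? AS TEXT)",
        (str(identifier),),  # also bind the value as a string
    ).fetchall()
    return [r[0] for r in rows]


hits = find_by_identifier(conn, 5456929)
```

Passing the identifier as an integer or as a string gives the same
result, because the value is converted before it reaches the comparison.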


I am also doubtful/curious as to how this would affect the
various loading scripts which I was going to use - I want
to set up a GBrowse with human/mouse/flybase sequence
annotation to show ChipSeq data against.

But one thing at a time, I guess...


> So could you file this as a bug report on
> bugzilla.open-bio.org
> (category bioperl-db, this is actually not a BioSQL
> problem),

I'll make an entry in bugzilla/bioperl-db.


Thanks for your quick reply!


Erik Rijkers


From hlapp at gmx.net  Fri Mar 21 00:34:42 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 20 Mar 2008 20:34:42 -0400
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
In-Reply-To: <15786.156.83.1.157.1206055803.squirrel@webmail.xs4all.nl>
References: <15786.156.83.1.157.1206055803.squirrel@webmail.xs4all.nl>
Message-ID: <987C9C0E-840B-44AD-B3E9-0FC2809FF4F4@gmx.net>


On Mar 20, 2008, at 7:30 PM, Erik wrote:
> Here is the postgres 8.3.1 result of your sql statements:
>
> CREATE TABLE t1 (a varchar(10), b text, c integer);
>
> SELECT * from t1 WHERE a = 1;   -- fails in 8.3.1
> SELECT * from t1 WHERE b = 1;	  -- fails in 8.3.1
> SELECT * from t1 WHERE c = '1'; -- ok
>
> [...]
> The failure is always (virtually) the same:
> ERROR:  operator does not exist: character varying = integer
> LINE 1: SELECT * from t1 WHERE a = 1;
>                                  ^
> HINT:  No operator matches the given name and argument
> type(s). You might need to add explicit type casts.


So it's indeed the backend that changed behavior. It's actually  
documented as I see now:

http://www.postgresql.org/docs/8.3/static/release-8-3.html

scroll to section E.2.2. Migration to Version 8.3, E.2.2.1. General,  
and the first item there:

<quote>
Non-character data types are no longer automatically cast to TEXT  
(Peter, Tom)

Previously, if a non-character value was supplied to an operator or  
function that requires text input, it was automatically cast to text,  
for most (though not all) built-in data types. This no longer  
happens: an explicit cast to text is now required for all non- 
character-string types.
</quote>

I can see the arguments there but this will prevent upgrading to 8.3  
for many many applications, and the comments from the Pg developers  
('fix your SQL to use casts') that I've seen there on the mailing  
lists are just not helpful. Fixing the SQL is, for many legacy
applications, just not an option.

In the case of Bioperl-db it's very non-trivial, because all of a  
sudden we would be changing from a hands-off and let-the-driver- 
figure-it-out approach to forcing types everywhere.

So I think at this point with this change I have to declare Bioperl- 
db officially incompatible with PostgreSQL 8.3+ until we've found a  
solution to this, which is too bad because it seems 8.3 has some  
really nice performance features added.

One possible solution might be to create a CAST in the database  
(namely the one that was taken away, restoring behavior to pre-8.3).  
Another possibility is to move the parameter binding method into the  
driver adaptor which would then delegate to the DBI method but would  
be overridden for the PostgreSQL adapter to force all bindings to  
type string.
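
The second option above can be sketched roughly as follows. This is only
an illustration of the override idea, written in Python rather than Perl;
the class and method names (DriverAdapter, PgDriverAdapter, bind_values)
are hypothetical and are not part of Bioperl-db or DBI.

```python
class DriverAdapter:
    """Hypothetical generic adapter: hands bind values on unchanged."""

    def bind_values(self, values):
        # Default behavior: let the database driver decide how to type each value.
        return list(values)


class PgDriverAdapter(DriverAdapter):
    """Hypothetical PostgreSQL adapter: binds every non-NULL value as a string."""

    def bind_values(self, values):
        # Force string typing so the server converts string -> integer as needed,
        # instead of being handed an integer for a text column.
        return [None if v is None else str(v) for v in values]


generic = DriverAdapter().bind_values([5456929, "P12345", None])
forced = PgDriverAdapter().bind_values([5456929, "P12345", None])
print(generic)
print(forced)
```

The point of the design is that calling code stays untyped and only the
PostgreSQL-specific adapter knows about the string forcing.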

Which leads me back to the surprise observation that the parameter  
was bound as an integer in the first place, when DBD::Pg used to bind  
everything as string unless you told it otherwise. Which DBD::Pg  
version is it that you are using? I would suspect (or hope) that  
maybe there is soon an update release of DBD::Pg that fixes this  
problem by going back to binding everything as string by default (and  
as the tests show PostgreSQL will still convert strings to integer if  
necessary).

Depending on what I (or can someone else update us on this?) find out  
for the DBD::Pg plans, I'll probably start looking into moving the  
parameter binding into the driver adapters. Though it does feel  
pathetic that this is now also not transparent between drivers.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From er at xs4all.nl  Fri Mar 21 00:51:43 2008
From: er at xs4all.nl (Erik)
Date: Fri, 21 Mar 2008 01:51:43 +0100 (CET)
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
Message-ID: <4483.156.83.1.157.1206060703.squirrel@webmail.xs4all.nl>

On Fri, March 21, 2008 01:34, Hilmar Lapp wrote:
>
> So I think at this point with this change I have to
> declare Bioperl-
> db officially incompatible with PostgreSQL 8.3+ until
> we've found a
> solution to this, which is too bad because it seems 8.3
> has some
> really nice performance features added.

Pg 8.3 is indeed very noticeably faster, and it has other
excellent new features like full text indexing. (This also
means that downgrading is not really an option.)


> Which DBD::Pg version is it that you are using?

DBD::Pg 2.3.0


Thanks,

Erik Rijkers


From hlapp at gmx.net  Fri Mar 21 01:36:50 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Thu, 20 Mar 2008 21:36:50 -0400
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
In-Reply-To: <4483.156.83.1.157.1206060703.squirrel@webmail.xs4all.nl>
References: <4483.156.83.1.157.1206060703.squirrel@webmail.xs4all.nl>
Message-ID: <071CB899-AB3E-40B8-9477-82AE98DB88B1@gmx.net>


On Mar 20, 2008, at 8:51 PM, Erik wrote:
> On Fri, March 21, 2008 01:34, Hilmar Lapp wrote:
>>
>> So I think at this point with this change I have to declare  
>> Bioperl-db officially incompatible with PostgreSQL 8.3+ until  
>> we've found a solution to this, which is too bad because it seems  
>> 8.3 has some really nice performance features added.
>
> Pg 8.3 is indeed very noticeably faster, and it has other
> excellent new features like full text indexing. (This also
> means that downgrading is not really an option.)

Right, I saw that too. It is, however, just migrated from what was a  
contrib module before, so downgrading and using the contrib module is  
an option.

Furthermore, folding these new features together with a behavior  
change that is backwards incompatible was a choice the PostgreSQL  
people made, not we.

We also aren't doing poor typing that deserves fixing; we're just not  
doing any typing by treating everything as a string. This is the Perl  
paradigm.

At this point it's actually unclear to me how this new behavior is  
compatible with untyped scripting languages unless you know the type  
of each column that you're binding a value for, because if you  
actually force typecasts to string for everything you get an error if  
an integer is indeed what's needed.

I'm wondering what I'm missing.

	-hilmar

BTW what does the following query yield on your 8.3.1 database:

select s.typname as source, t.typname as target, f.proname as function,
       c.castcontext
from pg_cast c, pg_type s, pg_type t, pg_proc f
where c.castsource = s.oid and c.casttarget = t.oid
  and c.castfunc = f.oid and t.typname = 'text';

On my 8.1.4 database I get:

   source    | target | function | castcontext
-------------+--------+----------+-------------
  bpchar      | text   | text     | i
  char        | text   | text     | i
  name        | text   | text     | i
  int8        | text   | text     | i
  int2        | text   | text     | i
  int4        | text   | text     | i
  oid         | text   | text     | i
  float4      | text   | text     | i
  float8      | text   | text     | i
  macaddr     | text   | text     | e
  cidr        | text   | text     | e
  inet        | text   | text     | e
  date        | text   | text     | i
  time        | text   | text     | i
  timestamp   | text   | text     | i
  timestamptz | text   | text     | i
  interval    | text   | text     | i
  timetz      | text   | text     | i
  numeric     | text   | text     | i
(19 rows)

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From greg at turnstep.com  Fri Mar 21 02:41:10 2008
From: greg at turnstep.com (Greg Sabino Mullane)
Date: Fri, 21 Mar 2008 02:41:10 -0000
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
In-Reply-To: <987C9C0E-840B-44AD-B3E9-0FC2809FF4F4@gmx.net>
Message-ID: <19ecb7a297f64722c4f63f10ed2ebdce@biglumber.com>


-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160


> Which leads me back to the surprise observation that the parameter
> was bound as an integer in the first place, when DBD::Pg used to bind
> everything as string unless you told it otherwise. Which DBD::Pg
> version is it that you are using? I would suspect (or hope) that
> maybe there is soon an update release of DBD::Pg that fixes this
> problem by going back to binding everything as string by default (and
> as the tests show PostgreSQL will still convert strings to integer if
> necessary).
>
> Depending on what I (or can someone else update us on this?) find out
> for the DBD::Pg plans, I'll probably start looking into moving the
> parameter binding into the driver adapters. Though it does feel
> pathetic that this is now also not transparent between drivers.

What you are probably looking for is already there, namely:

$dbh->{pg_server_prepare} = 0;

There are good reasons for the casting enforcement in 8.3, although I've
been a sharp critic of the change, and certainly of the suddenness
of it. Another solution to consider is adding the casts back in:

http://people.planetpostgresql.org/peter/index.php?/archives/2008/03.html
(the March 4th entry)

- --
Greg Sabino Mullane greg at turnstep.com
PGP Key: 0x14964AC8 200803202237
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAkfjIBYACgkQvJuQZxSWSsiamwCdEbNrC4F4oU7AGHrbHAm1YNXG
HbUAoIRJtGW4brvMKklxZYG6pusbcTqf
=Zawx
-----END PGP SIGNATURE-----


From hlapp at gmx.net  Fri Mar 21 12:52:39 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Fri, 21 Mar 2008 08:52:39 -0400
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
In-Reply-To: <19ecb7a297f64722c4f63f10ed2ebdce@biglumber.com>
References: <19ecb7a297f64722c4f63f10ed2ebdce@biglumber.com>
Message-ID: <C24DE5CA-F433-48A1-BF08-A6D056A2EBCE@gmx.net>

Hi Greg - thanks for your email, it's very helpful.

On Mar 20, 2008, at 10:41 PM, Greg Sabino Mullane wrote:
>>
>> Depending on what I (or can someone else update us on this?) find out
>> for the DBD::Pg plans, I'll probably start looking into moving the
>> parameter binding into the driver adapters. Though it does feel
>> pathetic that this is now also not transparent between drivers.
>
> What you are probably looking for is already there, namely:
>
> $dbh->{pg_server_prepare} = 0;

So disabling server-side prepares will leave values quoted? Having  
server-side prepares would be very useful though, especially for  
Bioperl-db with its many lookup queries that all use similar  
parameter values.

>
> There's good reasons for the casting enforcement in 8.3

I do understand that, but it's also a sharp contrast to other RDBMSs
that doesn't make it easier for people to choose Pg when they
should, and doesn't help writing cross-platform database applications
either.

> although I've been a sharp critic of the change, and certainly of
> the suddenness of it. Another solution to consider is adding the
> casts back in:
>
> http://people.planetpostgresql.org/peter/index.php?/archives/2008/03.html
> (the March 4th entry)


Thanks for this, that helps a lot.

Do you have links to some of the key threads showing what rationale  
went into the decision? (Or should I just search for your name?) I'd  
like to read up on that first before pouring more oil into the fire.  
I suspect that many of those who made the decision are never faced  
with needing to write cross-RDBMS code.

Also, I wonder why this wasn't made a configurable option so it can  
be disabled by a simple config file change (such as the move away  
from automatic OID columns). But obviously this is the wrong list for  
discussing this (though Bioperl-db *is* one of those pieces of  
software that must be cross-RDBMS).

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From er at xs4all.nl  Fri Mar 21 21:43:47 2008
From: er at xs4all.nl (Erik)
Date: Fri, 21 Mar 2008 22:43:47 +0100 (CET)
Subject: [BioSQL-l] [Bioperl-l] postgres 8.3 - load_seqdatabase.pl /
 swissprot
Message-ID: <16589.156.83.1.157.1206135827.squirrel@webmail.xs4all.nl>

Hi,

PostgreSQL 8.3.1
DBD::Pg 2.3.0
perl 5.8.8

(The following error may have to do with the 8.3 problems
that I reported yesterday (bug 2472) - I don't know)

 I ran biosql-schema/scripts/load_ncbi_taxonomy.pl without
problem.

Then I ran scripts/biosql/load_seqdatabase.pl as:

perl scripts/biosql/load_seqdatabase.pl \
  -driver Pg \
  -dbuser xxxxxxx \
  -dbname bioseqdb \
  -namespace swissprot \
  -format swiss \
   /DATA/ms/ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat

It took two hours to load 26504 records (7%) of
uniprot_sprot.dat (is it expected to be so slow?), then
failed with:

Could not store Q2UXW0:
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: create: object (Bio::Species) failed to insert or to
be found by unique key
STACK: Error::throw
STACK: Bio::Root::Root::throw
/home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/Root/Root.pm:357
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create
/home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:206
STACK: Bio::DB::Persistent::PersistentObject::create
/home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:244
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::create
/home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:169
STACK: Bio::DB::BioSQL::BasePersistenceAdaptor::store
/home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:251
STACK: Bio::DB::Persistent::PersistentObject::store
/home/aardvark/bin/perl/lib/site_perl/5.8.8/Bio/DB/Persistent/PersistentObject.pm:271
STACK: scripts/biosql/load_seqdatabase.pl:630
-----------------------------------------------------------


I don't know if this is directly related to the 8.3
casting problems I reported yesterday (bug 2472), or a
separate Bio::Species issue


regards,

Erik Rijkers


From hlapp at gmx.net  Sat Mar 22 18:18:45 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 22 Mar 2008 14:18:45 -0400
Subject: [BioSQL-l] Call for Student Applications - NESCent participates in
	the Google Summer of Code
In-Reply-To: <0025B440-EF1E-4632-9DB4-B98489BF3550@duke.edu>
Message-ID: <5AC4F213-8D88-41C6-B380-59B2EF7831F0@gmx.net>

Hi all - just wanted to draw your attention to our Google Summer of  
Code participation this year. One of the projects deals directly with  
BioPerl, another one builds on BioSQL (and could be implemented  
taking advantage of BioPerl or Bio::Phylo, or Biojava).

Cheers,

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================

Phyloinformatics Summer of Code 2008
http://phyloinformatics.net/Phyloinformatics_Summer_of_Code_2008

*** Please disseminate this announcement widely to appropriate students
at your institution ***

The National Evolutionary Synthesis Center (NESCent:
http://www.nescent.org/) is participating in 2008 for the second year as a
mentoring organization in the Google Summer of Code
(http://code.google.com/soc). Through this program, Google provides
undergraduate, masters, and PhD students with a unique opportunity to  
obtain hands-on experience writing and extending open-source software  
under the mentorship of experienced developers from around the world.

Our goal in participating is to train future researchers and  
developers to not only have awareness and understanding of the value  
of open-source and collaboratively developed software, but also to  
gain the programming and remote collaboration skills needed to  
successfully contribute to such projects. Students will receive a  
stipend from Google, and may work from their home, or home  
institution, for the duration of the 3 month program. Students will  
each have one or more dedicated mentors with expertise in  
phylogenetic methods and open-source software development.

NESCent is particularly targeting students interested in both  
evolutionary biology and software development. Project ideas (see URL  
below) range from visualizing phylogenetic data in R, to development  
of a Mesquite module, web-services for phylogenetic data providers or  
geophylogeny mashups, implementing phyloXML support, navigating  
databases of networks, topology queries for PhyloCode registries, to  
phylogenetic tree mining in a MapReduce framework, and more.

The project ideas are flexible and many can be adjusted in scope to  
match the skills of the student. If the program sounds interesting to  
you but you are unsure whether you have the necessary skills, please  
email the mentors at the address below.  We will work with you to  
find a project that fits your interests and skills.

INQUIRIES:
Email any questions, including self-proposed project ideas, to  
phylosoc {at}
nescent {dot} org.

TO APPLY:
Apply on-line at the Google Summer of Code website
(http://code.google.com/soc/2008), where you will also find GSoC program
rules and eligibility requirements.  The 1-week application period for
students opens on Monday March 24th and runs through Monday, March  
31st, 2008.

Hilmar Lapp and Todd Vision
US National Evolutionary Synthesis Center

=====
URLs:
=====

2008 NESCent Phyloinformatics Summer of Code:
http://phyloinformatics.net/Phyloinformatics_Summer_of_Code_2008

Eligibility requirements:
http://code.google.com/opensource/gsoc/2008/faqs.html#0.1_eligibility

Stipends:
http://code.google.com/opensource/gsoc/2008/faqs.html#0.1_administrivia

To sign up for quarterly NESCent newsletters with announcements about
upcoming programs at the Center:
http://www.nescent.org/about/contact.php


From hlapp at gmx.net  Sat Mar 22 20:01:51 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sat, 22 Mar 2008 16:01:51 -0400
Subject: [BioSQL-l] [Bioperl-l] postgres 8.3 - load_seqdatabase.pl /
	swissprot
In-Reply-To: <16589.156.83.1.157.1206135827.squirrel@webmail.xs4all.nl>
References: <16589.156.83.1.157.1206135827.squirrel@webmail.xs4all.nl>
Message-ID: <69D3EA33-810B-40EA-8687-752FA1A34FBF@gmx.net>

Forgot to respond to this:

On Mar 21, 2008, at 5:43 PM, Erik wrote:
> It took two hours to load 26504 records (7%) of uniprot_sprot.dat  
> (is it expected to be so slow?)


The last time I used to load those regularly it was a bit faster (~ 5  
seqs/s) but it is in a ballpark that wouldn't raise a red flag for me.

BTW you can make it print statistics using the --logchunk N option,  
where N is the number of seqs after which you want the current count  
and the #recs/s printed.
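
For orientation, the figures reported in this thread (26504 records in two
hours, said to be about 7% of the file) can be turned into a rough rate and
a projected total load time. This is a back-of-the-envelope sketch; the 7%
is a rounded figure, so the projection is approximate.

```python
# Figures quoted in the thread.
records_loaded = 26504
elapsed_seconds = 2 * 60 * 60
fraction_done = 0.07  # rounded, so everything derived from it is an estimate

# Observed throughput in records per second.
rate = records_loaded / elapsed_seconds

# Estimated size of the whole file and time to load it.
total_records = records_loaded / fraction_done
hours_at_observed_rate = total_records / rate / 3600
hours_at_5_per_second = total_records / 5 / 3600

print(f"observed rate: {rate:.2f} records/s")
print(f"estimated total records: {total_records:.0f}")
print(f"full load at observed rate: {hours_at_observed_rate:.1f} h")
print(f"full load at 5 records/s: {hours_at_5_per_second:.1f} h")
```

That works out to a little under 4 records/s, and a full initial load on
the order of a day at either rate.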

You may get it to be faster if you tune the database (e.g., make sure  
there is enough memory for index reorganization, transaction log and  
tablespace datafile are on separate disks, etc; fiddling with the  
query optimizer has probably little effect as almost all queries are  
simple lookups or inserts).

That all said, the strength of load_seqdatabase.pl isn't speed. It  
doesn't make use of any bulk upload optimizations, and therefore the  
initial load of a very large database will take its time. The power  
is more in subsequent updates where you can configure what you want  
to happen, and during which the database is never in an inconsistent  
state, so it can run in the background.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From greg at turnstep.com  Mon Mar 24 00:42:36 2008
From: greg at turnstep.com (Greg Sabino Mullane)
Date: Mon, 24 Mar 2008 00:42:36 -0000
Subject: [BioSQL-l] postgres 8.3 will not cast text to integer any longer
In-Reply-To: <C24DE5CA-F433-48A1-BF08-A6D056A2EBCE@gmx.net>
Message-ID: <4ab14dcc59d7566b55ba87027055e9fd@biglumber.com>


-----BEGIN PGP SIGNED MESSAGE-----
Hash: RIPEMD160


>> Depending on what I (or can someone else update us on this?) find out
>> for the DBD::Pg plans, I'll probably start looking into moving the
>> parameter binding into the driver adapters. Though it does feel
>> pathetic that this is now also not transparent between drivers.
>
> What you are probably looking for is already there, namely:
>
> $dbh->{pg_server_prepare} = 0;

> So disabling server-side prepares will leave values quoted? Having
> server-side prepares would be very useful though, especially for
> Bioperl-db with its many lookup queries that all use similar
> parameter values.

Yes, it forces DBD::Pg to do the quoting itself, which basically means
that everything is shipped to the server as a single SQL string, and
no placeholders are used. In the grand scheme of things, the speed
difference is not large for most queries. Certainly one way would be
to turn this on for 8.3 and above, and slowly migrate the queries/schema
over time.

>> There's good reasons for the casting enforcement in 8.3

> I do understand that, but it's also a sharp contrast to other RDBMSs
> that doesn't make it easier for people to choose Pg when they
> should, and doesn't help writing cross-platform database applications
> either.

I'm not overly familiar with how other databases treat this, but I've
heard DB2 can be a stickler about this too. I've not dug into the bioperl
code in a while, to be honest, so I'm not sure what sort of queries we're
talking about. Certainly long-term the code and schema should move away
from implicit casting. Maybe a better short-term solution is adding
the more obvious casts (e.g. text<->int) back in.

> Do you have links to some of the key threads showing what rationale
> went into the decision? (Or should I just search for your name?) I'd
> like to read up on that first before pouring more oil into the fire.
> I suspect that many of those who made the decision are never faced
> with needing to write cross-RDBMS code.
>
> Also, I wonder why this wasn't made a configurable option so it can
> be disabled by a simple config file change (such as the move away
> from automatic OID columns). But obviously this is the wrong list for
> discussing this (though Bioperl-db *is* one of those pieces of
> software that must be cross-RDBMS).

I did ask about that, and was told it would not have been easy to do so.
But I agree, a phasing in period (heck, even a warning) would have been
nice. Feel free to pour some oil on the fire; I think this is one of
many apps that have been affected. (I've run across two other major
cross-DB apps (Interchange and MediaWiki) that are struggling with the
same pain. I managed to painfully fix the latter, but the former is way
too complex to tackle at the moment).

I could not find the thread(s?) I weighed in on, but you can find some
relevant discussions by googling "strict-typing benefits grokbase"

- --
Greg Sabino Mullane greg at turnstep.com
PGP Key: 0x14964AC8 200803232039
http://biglumber.com/x/web?pk=2529DF6AB8F79407E94445B4BC9B906714964AC8
-----BEGIN PGP SIGNATURE-----

iEYEAREDAAYFAkfm+NAACgkQvJuQZxSWSsi4ogCdGNWvCJIzXxb+YKzdm6wwxQMv
p3AAnizkWXoo/rvxv4KVdC8tD0vF87k3
=dNYi
-----END PGP SIGNATURE-----


From biopython at maubp.freeserve.co.uk  Tue Mar 25 15:56:16 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 25 Mar 2008 15:56:16 +0000
Subject: [BioSQL-l] [BioPython] Concerns the update of BioSQL.taxon table
In-Reply-To: <320fb6e00803250853i629e59aj310ddc5667ea57d@mail.gmail.com>
References: <711039.40736.qm@web26505.mail.ukl.yahoo.com>
	<320fb6e00803250853i629e59aj310ddc5667ea57d@mail.gmail.com>
Message-ID: <320fb6e00803250856n1001d74dxeb8560652f594e51@mail.gmail.com>

On Tue, Mar 25, 2008 at 3:53 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Hi Eric,
>
>  Your issue is almost certainly due to switching from Biopython 1.44 to
>  1.45, rather than from a prerelease BioSQL to the recently released
>  BioSQL 1.0.0.
>
>  For background, you should read Bug 2422 and the BioSQL thread it points to.
>  http://bugzilla.open-bio.org/show_bug.cgi?id=2422
>
>  Biopython 1.44 never recorded the taxon id (and therefore didn't use
>  the taxon/taxon_name tables)
>  Biopython 1.45 does record the taxon id, and attempts to fill in
>  missing taxon/taxon_name entries
>
>  I'm a little unclear on what is going wrong for you.  Did you pre-load
>  the NCBI taxonomy for example?  The script you are talking about, is
>  this your own?
>
>  Peter
>

P.S. Did you mean to send your original message to the BioSQL list as well Eric?

You need biosql-l at lists.open-bio.org not biosql at lists.open-bio.org

Peter


From ericgibert at yahoo.fr  Wed Mar 26 11:29:24 2008
From: ericgibert at yahoo.fr (Eric Gibert)
Date: Wed, 26 Mar 2008 11:29:24 +0000 (GMT)
Subject: [BioSQL-l] Concerns the update of BioSQL.taxon table
Message-ID: <290936.61510.qm@web26510.mail.ukl.yahoo.com>

Thank you Peter for the correct email of the BioSQL list.

No, it is not something linked to the BioPython 1.45 upgrade: the behavior is the same as with 1.44. My problem is linked to the fact that the BioSQL schema version 1.0.0 defines a *unique* index on taxon.ncbi_taxon_id. I did not have this index before.

I have written a script that connects to the NCBI taxonomy database and gets the XML data for the species. Then it updates the taxon table, replacing the NULL ncbi_taxon_id and node_rank with their values for the whole lineage. I call it after loading the BioSeqs into the database.

Example:
I load a BioSeq for Nannophya pygmaea then I run my script to update the  ncbi_taxon_id and rank:
+----------+---------------+-----------------+--------------+
| taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
+----------+---------------+-----------------+--------------+
|       13 |          2759 |            NULL | superkingdom |
|       14 |         33208 |              13 | kingdom      |
|       15 |          6656 |              14 | phylum       |
|       16 |          6960 |              15 | superclass   |
|       17 |         50557 |              16 | class        |
|       18 |          7496 |              17 | no rank      |
|       19 |         33339 |              18 | subclass     |
|       20 |          6961 |              19 | order        |
|       21 |          6962 |              20 | suborder     |
|       22 |          6964 |              21 | family       |
|       23 |        229390 |              22 | genus        |
|       24 |        229391 |              23 | species      |

No problem.

Now I insert/load another Libellulidae (Orthetrum sabina): 'empty/NULL' taxon records are inserted by the db.load() BioPython function:
|       25 |          NULL |            NULL | NULL         |
|       26 |          NULL |              25 | NULL         |
|       27 |          NULL |              26 | NULL         |
|       28 |          NULL |              27 | NULL         |
|       29 |          NULL |              28 | NULL         |
|       30 |          NULL |              29 | NULL         |
|       31 |          NULL |              30 | NULL         |
|       32 |          NULL |              31 | NULL         |
|       33 |          NULL |              32 | NULL         |
|       34 |          NULL |              33 | NULL         |
|       35 |          NULL |              34 | genus        |
|       36 |        320892 |              35 | species      |

then I try to run my script: this time I have an update failure because the record 34 is the SAME family hence same ncbi_taxon_id as record 22: 'duplicate entry on key 2'.
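
The failure can be reproduced in miniature. The sketch below uses an
in-memory SQLite database as a stand-in (an assumption; the database in
this report is a different one) and a simplified taxon table that keeps
only the four columns shown above. Several rows with a NULL ncbi_taxon_id
coexist under the unique index, but filling in a value that another row
already holds is rejected.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Simplified stand-in for the BioSQL taxon table (only the columns shown above).
conn.execute(
    "CREATE TABLE taxon ("
    " taxon_id INTEGER PRIMARY KEY,"
    " ncbi_taxon_id INTEGER UNIQUE,"
    " parent_taxon_id INTEGER,"
    " node_rank TEXT)"
)
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?, ?, ?)",
    [
        (22, 6964, 21, "family"),  # family row from the first load
        (33, None, 32, None),      # placeholder rows from the second load
        (34, None, 33, None),
    ],
)

try:
    # Row 34 is the same family as row 22, so this violates the unique index.
    conn.execute(
        "UPDATE taxon SET ncbi_taxon_id = 6964, node_rank = 'family' "
        "WHERE taxon_id = 34"
    )
    update_failed = False
except sqlite3.IntegrityError as exc:
    update_failed = True
    print("update rejected:", exc)
```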

Either this *unique* index is new and it is a BioSQL "issue" (as said, this index did not exist in my previous BioSQL db so I never encountered this issue before), OR the way BioPython "repeats" existing taxons is incorrect/not compatible. In that case, when inserting the second BioSeq, record 34 should not be created but record 35 (the genus) should "point" to the already existing family at record 22 as its father.

Thus I would like confirmation from the BioSQL team that the unique index is valid. If that is the case, then we can have a separate BioPython discussion about how to improve the management of the taxon table.


Best regards,

Eric




From holland at ebi.ac.uk  Wed Mar 26 12:00:03 2008
From: holland at ebi.ac.uk (Richard Holland)
Date: Wed, 26 Mar 2008 12:00:03 +0000
Subject: [BioSQL-l] Concerns the update of BioSQL.taxon table
In-Reply-To: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
References: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
Message-ID: <47EA3AC3.20104@ebi.ac.uk>

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Purely from a database perspective, the index is correct. There should
be no need to have a duplicate entry in ncbi_taxon_id. The implication
is that taxon_id is a 1:1 mapping to ncbi_taxon_id. There should be no
need to have two separate local taxon_id values referring to one NCBI taxon.

Ideally, when you run your update script, for each taxon_id record it
processes it should be checking for an existing entry with the same
ncbi_taxon_id, getting the taxon_id for that existing entry, then
removing the duplicate entry and updating the relevant parent_taxon_id
values in other records to refer to the existing taxon_id instead.

BioPython would need to be making similar checks when it inserts new
entries. If it isn't, then it needs to be fixed.
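
The check-and-merge step described above might look roughly like this.
It is a minimal sketch against a simplified taxon table in SQLite, not
the actual update script or BioPython code; the function name
assign_ncbi_id is hypothetical. In a real BioSQL database, other tables
that reference taxon_id (for example taxon_name and bioentry) would need
the same repointing before the duplicate row is deleted.

```python
import sqlite3


def assign_ncbi_id(conn, taxon_id, ncbi_taxon_id, node_rank):
    """Set ncbi_taxon_id on a row, or merge the row into an existing one.

    Returns the taxon_id that ends up representing this NCBI taxon.
    """
    row = conn.execute(
        "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = ? AND taxon_id <> ?",
        (ncbi_taxon_id, taxon_id),
    ).fetchone()
    if row is None:
        # No other row has this NCBI id yet: a plain update is safe.
        conn.execute(
            "UPDATE taxon SET ncbi_taxon_id = ?, node_rank = ? WHERE taxon_id = ?",
            (ncbi_taxon_id, node_rank, taxon_id),
        )
        return taxon_id
    existing = row[0]
    # Repoint children of the duplicate to the existing row, then drop the duplicate.
    conn.execute(
        "UPDATE taxon SET parent_taxon_id = ? WHERE parent_taxon_id = ?",
        (existing, taxon_id),
    )
    conn.execute("DELETE FROM taxon WHERE taxon_id = ?", (taxon_id,))
    return existing


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY, ncbi_taxon_id INTEGER UNIQUE,"
    " parent_taxon_id INTEGER, node_rank TEXT)"
)
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?, ?, ?)",
    [(22, 6964, 21, "family"), (34, None, 33, None), (35, None, 34, "genus")],
)

kept = assign_ncbi_id(conn, 34, 6964, "family")
parent_of_genus = conn.execute(
    "SELECT parent_taxon_id FROM taxon WHERE taxon_id = 35"
).fetchone()[0]
print(kept, parent_of_genus)
```

With the rows from the example in this thread, the duplicate family row 34
is removed and genus row 35 ends up pointing at the existing row 22.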

cheers,
Richard

Eric Gibert wrote:
> Thank you Peter for the correct email of the BioSQL list.
> 
> No, it is not something linked to BioPython 1.45 upgrade: same behavior as 1.44. My problem is linked to the fact  that the BioSQl schema version 1.0.0 defines a *unique* index on taxon.ncbi_taxon_id. I did not have this index before.
> 
> I have written a script that connects to the taxonomy database of NCBI and get the XML data for the species. Then it updates the taxon table, replacing the ncbi_taxon_id and node_rank NULL by their values for all the lineage. I call it after the loading of BioSeqs in the database.
> 
> Example:
> I load a BioSeq for Nannophya pygmaea then I run my script to update the  ncbi_taxon_id and rank:
> +----------+---------------+-----------------+--------------+
> | taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
> +----------+---------------+-----------------+--------------+
> |       13 |          2759 |            NULL | superkingdom |
> |       14 |         33208 |              13 | kingdom      |
> |       15 |          6656 |              14 | phylum       |
> |       16 |          6960 |              15 | superclass   |
> |       17 |         50557 |              16 | class        |
> |       18 |          7496 |              17 | no rank      |
> |       19 |         33339 |              18 | subclass     |
> |       20 |          6961 |              19 | order        |
> |       21 |          6962 |              20 | suborder     |
> |       22 |          6964 |              21 | family       |
> |       23 |        229390 |              22 | genus        |
> |       24 |        229391 |              23 | species      |
> 
> No problem.
> 
> Now I insert/load another Libellulideae (Orthetrum sabina ): 'empty/NULL' taxons records are inserted by the db.load() BioPython function:
> |       25 |          NULL |            NULL | NULL         |
> |       26 |          NULL |              25 | NULL         |
> |       27 |          NULL |              26 | NULL         |
> |       28 |          NULL |              27 | NULL         |
> |       29 |          NULL |              28 | NULL         |
> |       30 |          NULL |              29 | NULL         |
> |       31 |          NULL |              30 | NULL         |
> |       32 |          NULL |              31 | NULL         |
> |       33 |          NULL |              32 | NULL         |
> |       34 |          NULL |              33 | NULL         |
> |       35 |          NULL |              34 | genus        |
> |       36 |        320892 |              35 | species      |
> 
> then I try to run my script: this time I have an update failure because the record 34 is the SAME family hence same ncbi_taxon_id as record 22: 'duplicate entry on key 2'.
> 
> Either this *unique* index is new and it is a BioSQL "issue" (as said, this index did not exist in my previous BioSQL db so I never encountered this issue before), OR the way BioPython "repeats" existing taxons is incorrect/not compatible. In that case, when inserting the second BioSeq, record 34 should not be created but record 35 (the genus) should "point" to the already existing family at record 22 as its father.
> 
> Thus I would like confirmation from the BioSQL team that the unique index is valid. If that is the case, then we can have a separate BioPython discussion about how to improve the management of the taxon table.
> 
> 
> Best regards,
> 
> Eric

- --
Richard Holland (BioMart)
EMBL EBI, Wellcome Trust Genome Campus,
Hinxton, Cambridgeshire CB10 1SD, UK
Tel. +44 (0)1223 494416

http://www.biomart.org/
http://www.biojava.org/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2.2 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFH6jrD4C5LeMEKA/QRAu7rAJ9TBYt0CeTTrPi0QN7Vm/UwiBANQwCfeoqz
0uTvcXXteholK+4xxuxjCXw=
=qhOf
-----END PGP SIGNATURE-----


From biopython at maubp.freeserve.co.uk  Wed Mar 26 12:30:50 2008
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 26 Mar 2008 12:30:50 +0000
Subject: [BioSQL-l] [BioPython] Concerns the update of BioSQL.taxon table
In-Reply-To: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
References: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
Message-ID: <320fb6e00803260530w72cca900mc19654798d5d7e13@mail.gmail.com>

On Wed, Mar 26, 2008 at 11:29 AM, Eric Gibert <ericgibert at yahoo.fr> wrote:
> Thank you Peter for the correct email of the BioSQL list.
>
> No, it is not something linked to the BioPython 1.45 upgrade: same behavior as 1.44.
> My problem is linked to the fact that the BioSQL schema version 1.0.0 defines a
> *unique* index on taxon.ncbi_taxon_id. I did not have this index before.
>
>  I have written a script that connects to the taxonomy database of NCBI and gets
>  the XML data for the species. Then it updates the taxon table, replacing the
>  NULL ncbi_taxon_id and node_rank with their values for the whole lineage. I call
>  it after loading the BioSeqs into the database.

So you wrote your own version of the BioSQL perl script load_ncbi_taxonomy.pl?

>  Example:
>  I load a BioSeq for Nannophya pygmaea then I run my script to update the  ncbi_taxon_id and rank:
>  +----------+---------------+-----------------+--------------+
>  | taxon_id | ncbi_taxon_id | parent_taxon_id | node_rank    |
>  +----------+---------------+-----------------+--------------+
>  |       13 |          2759 |            NULL | superkingdom |
>  |       14 |         33208 |              13 | kingdom      |
>  |       15 |          6656 |              14 | phylum       |
>  |       16 |          6960 |              15 | superclass   |
>  |       17 |         50557 |              16 | class        |
>  |       18 |          7496 |              17 | no rank      |
>  |       19 |         33339 |              18 | subclass     |
>  |       20 |          6961 |              19 | order        |
>  |       21 |          6962 |              20 | suborder     |
>  |       22 |          6964 |              21 | family       |
>  |       23 |        229390 |              22 | genus        |
>  |       24 |        229391 |              23 | species      |
>
>  No problem.
>
>  Now I insert/load another Libellulidae (Orthetrum sabina): 'empty/NULL'
>  taxon records are inserted by the db.load() BioPython function:

These records are "guess work" based on the lineage in the GenBank
file - we don't know the NCBI taxon ids, so they are NULL, nor the
rank, but there is a scientific name in the linked taxon_name table.  I
am open to the idea of not writing this guessed lineage, and just
writing one entry for the species and the given NCBI taxon ID.

However, as the new entry Orthetrum sabina should share some of its
lineage with Nannophya pygmaea, then I agree Biopython *should* be
re-using those existing taxon entries, if it can match them safely
using the scientific name.  Re-reading the relevant bit of old code,
it doesn't seem to do this.  I've filed bug 2475:
http://bugzilla.open-bio.org/show_bug.cgi?id=2475

This is actually a tricky problem, requiring some 'clever' parent
linkage as you said in your earlier email.  Hilmar wrote this about
the equivalent code in BioPerl:

>>  It's pretty unreliable actually. There is not only synonymy but also
>>  rampant homonymy in taxonomic names. There are plenty of examples
>>  for the same scientific name in use for a plant and for some animal, for
>>  example. So in order to be unambiguous you will need to know (and
>>  check) the kingdom.

See http://lists.open-bio.org/pipermail/biosql-l/2008-March/001207.html
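
One way the name-based reuse could work is to walk the lineage from the root down and match each name only under the parent already matched, which narrows (but does not eliminate) the homonym problem Hilmar describes. The following is a rough sketch only, not Biopython's actual loader code: the function load_lineage, the cut-down tables and the abbreviated four-name lineages are all assumptions made for illustration.

```python
import sqlite3


def load_lineage(conn, names):
    """Insert a root-to-leaf list of names, reusing nodes that already exist.

    A name is only matched under the parent found in the previous step, so the
    same name under a different parent is treated as a different taxon.
    """
    parent = None
    for name in names:
        row = conn.execute(
            "SELECT t.taxon_id FROM taxon t "
            "JOIN taxon_name n ON n.taxon_id = t.taxon_id "
            "WHERE n.name = ? AND n.name_class = 'scientific name' "
            "AND t.parent_taxon_id IS ?",  # IS also matches a NULL parent
            (name, parent),
        ).fetchone()
        if row is None:
            cur = conn.execute(
                "INSERT INTO taxon (parent_taxon_id) VALUES (?)", (parent,)
            )
            taxon_id = cur.lastrowid
            conn.execute(
                "INSERT INTO taxon_name (taxon_id, name, name_class) "
                "VALUES (?, ?, 'scientific name')",
                (taxon_id, name),
            )
        else:
            taxon_id = row[0]  # existing node: reuse it
        parent = taxon_id
    return parent  # taxon_id of the leaf (the species)


# Simplified stand-ins for the BioSQL taxon and taxon_name tables.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY,"
    " ncbi_taxon_id INTEGER UNIQUE, parent_taxon_id INTEGER, node_rank TEXT)"
)
conn.execute("CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT)")

# Abbreviated lineages for the two species in Eric's example.
first = load_lineage(conn, ["Eukaryota", "Libellulidae", "Nannophya", "Nannophya pygmaea"])
second = load_lineage(conn, ["Eukaryota", "Libellulidae", "Orthetrum", "Orthetrum sabina"])
node_count = conn.execute("SELECT COUNT(*) FROM taxon").fetchone()[0]
```

With this approach the second load adds only the genus and species nodes; the shared root and family rows are reused.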

Eric wrote:
>  then I try to run my script: this time I have an update failure because the
> record 34 is the SAME family hence same ncbi_taxon_id as record 22:
> 'duplicate entry on key 2'.
>
>  Either this *unique* index is new and it is a BioSQL "issue" (as said, this index
> did not exist in my previous BioSQL db so I never encountered this issue before),

Hopefully Hilmar from BioSQL can answer this.

> OR the way BioPython "repeats" existing taxons is incorrect/not compatible.
> In that case, when inserting the second BioSeq, record 34 should not be created
> but record 35 (the genus) should "point" to the already existing family at record
> 22 as its father.

This example might be easier to follow if the scientific names from
the taxon_name table were included.  I would check the lineage but the
NCBI webpage is being very slow for me right now.

In the short term, as a quick fix, your script could first remove
taxon entries with a blank NCBI taxon ID (and clear any keys pointing
to them).  Not elegant - but it would work.
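
A rough sketch of that quick fix, assuming simplified taxon, taxon_name and bioentry tables in SQLite (the function name drop_guessed_taxa is hypothetical, and a real BioSQL database may have further tables or constraints to consider):

```python
import sqlite3


def drop_guessed_taxa(conn):
    """Remove taxon rows without an NCBI taxon id and clear keys pointing to them."""
    guessed = "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id IS NULL"
    # Clear references first so nothing points at a deleted row.
    conn.execute(f"UPDATE bioentry SET taxon_id = NULL WHERE taxon_id IN ({guessed})")
    conn.execute(
        f"UPDATE taxon SET parent_taxon_id = NULL WHERE parent_taxon_id IN ({guessed})"
    )
    conn.execute(f"DELETE FROM taxon_name WHERE taxon_id IN ({guessed})")
    conn.execute("DELETE FROM taxon WHERE ncbi_taxon_id IS NULL")


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY,"
    " ncbi_taxon_id INTEGER UNIQUE, parent_taxon_id INTEGER, node_rank TEXT)"
)
conn.execute("CREATE TABLE taxon_name (taxon_id INTEGER, name TEXT, name_class TEXT)")
conn.execute("CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, taxon_id INTEGER)")

# A few rows modelled on Eric's example: 34 and 35 are guessed, 22 and 36 are not.
conn.executemany(
    "INSERT INTO taxon VALUES (?, ?, ?, ?)",
    [
        (22, 6964, None, "family"),
        (34, None, None, None),
        (35, None, 34, "genus"),
        (36, 320892, 35, "species"),
    ],
)
conn.execute("INSERT INTO bioentry VALUES (1, 36)")

drop_guessed_taxa(conn)
remaining = conn.execute(
    "SELECT taxon_id, parent_taxon_id FROM taxon ORDER BY taxon_id"
).fetchall()
```

After the cleanup the species row (36) has no parent, so the update script would then need to rebuild the lineage from the NCBI data and link it to the existing family row.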

Thanks Eric

Peter


From hlapp at gmx.net  Wed Mar 26 13:29:01 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 26 Mar 2008 09:29:01 -0400
Subject: [BioSQL-l] Concerns the update of BioSQL.taxon table
In-Reply-To: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
References: <290936.61510.qm@web26510.mail.ukl.yahoo.com>
Message-ID: <EFDC1E5E-1379-435F-A6F6-79E0C382F18D@gmx.net>


On Mar 26, 2008, at 7:29 AM, Eric Gibert wrote:
> Either this *unique* index is new and it is a BioSQL "issue" (as  
> said, this index did not exist in my previous BioSQL db so I never  
> encountered this issue before)


The unique index has been there since Feb 2003 (the Singapore  
Biohackathon). I'm not sure how you got a version that doesn't have it.

The unique key constraint on the identifier column is also necessary  
- otherwise you cannot guarantee lookups by the NCBI taxonID to  
return either one or zero rows. Like Peter and Richard, I also don't  
understand what the point would be in allowing the same taxon (which  
in essence is a node), as identified by taxonID, to exist more than  
once.
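
The effect of the constraint can be reproduced with a small sketch. It assumes a cut-down taxon table in SQLite rather than the real BioSQL DDL, and the index name is made up; the row ids follow Eric's example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE taxon (taxon_id INTEGER PRIMARY KEY,"
    " ncbi_taxon_id INTEGER, parent_taxon_id INTEGER, node_rank TEXT)"
)
# Hypothetical index name; the point is the UNIQUE constraint itself.
conn.execute("CREATE UNIQUE INDEX taxon_ncbi_idx ON taxon (ncbi_taxon_id)")

conn.executemany(
    "INSERT INTO taxon (taxon_id, ncbi_taxon_id) VALUES (?, ?)",
    [(22, 6964), (33, None), (34, None)],  # several NULLs are accepted
)

try:
    # Giving the guessed row 34 the id already held by row 22 must fail.
    conn.execute("UPDATE taxon SET ncbi_taxon_id = 6964 WHERE taxon_id = 34")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# A lookup by NCBI taxon id therefore returns at most one row.
matches = conn.execute(
    "SELECT taxon_id FROM taxon WHERE ncbi_taxon_id = 6964"
).fetchall()
```

The rejected update corresponds to the 'duplicate entry' error Eric reported, while the NULL placeholder rows can coexist until they are filled in or removed.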

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From pan.mueller at yahoo.de  Thu Mar 27 19:33:34 2008
From: pan.mueller at yahoo.de (=?iso-8859-1?Q?Peter_M=FCller?=)
Date: Thu, 27 Mar 2008 20:33:34 +0100 (CET)
Subject: [BioSQL-l] bioentries in a sequence cluster
Message-ID: <664425.11239.qm@web28203.mail.ukl.yahoo.com>


Dear list,

I have a few questions, but maybe with a working example, I can derive the rest.

With perl-db I can fetch a Bio::Cluster Object with this query:
(I found no documentation about c::subject and p::object ...)

$query->datacollections(
          ["Bio::PrimarySeqI c::subject",
          "Bio::PrimarySeqI p::object",
         "Bio::PrimarySeqI<=>Bio::ClusterI<=>Bio::Ontology::TermI"]);

$query->where(["p.accession_number = 'NM_000015'"]);

my $adp = $db->get_object_adaptor('Bio::Cluster');
my $qres = $adp->find_by_query($query);


That's great - but here I ask for a sequence accession-number.

Is it possible to ask for the Clone (IMAGE:4722596) or for an STS accession-number where the result is also a cluster object?
"give me the cluster(s) where in the sequence-line is a clone-entry with this number 'IMAGE:4722596'" ...
"give me the cluster(s) where in the STS-line is an accession-number with this value 'PMC310725P3'" ...
PROTID and NID would also be interesting.

UniGene-snippet:
STS         ACC=PMC310725P3 UNISTS=272646
PROTSIM     ORG=10090; PROTGI=6754794; PROTID=NP_035004.1; PCT=76.55; ALN=288
SEQUENCE    ACC=BG569293.1; NID=g13576946; CLONE=IMAGE:4722596; END=5'; LID=6989; SEQTYPE=EST; TRACE=44157214

regards
pan




From hlapp at gmx.net  Sun Mar 30 05:00:25 2008
From: hlapp at gmx.net (Hilmar Lapp)
Date: Sun, 30 Mar 2008 01:00:25 -0400
Subject: [BioSQL-l] bioentries in a sequence cluster
In-Reply-To: <664425.11239.qm@web28203.mail.ukl.yahoo.com>
References: <664425.11239.qm@web28203.mail.ukl.yahoo.com>
Message-ID: <8083537C-C721-48C2-A838-AAC2B178468A@gmx.net>


On Mar 27, 2008, at 3:33 PM, Peter Müller wrote:
>
>
> Dear list,
>
> I have a few questions, but maybe with a working example, I can  
> derive the rest.
>
> With perl-db I can fetch a Bio::Cluster Object with this query:
> (I found no documentation about c::subject and p::object ...)

Yes, sorry, this needs a lot more documentation. The suffix of the  
alias, separated from it by '::', is the 'context'. It is needed if  
the same entity participates more than once in an association. What  
confuses the issue further here is that at the object level each  
object entity (Bio::PrimarySeqI, Bio::ClusterI, Bio::Ontology::TermI)  
participates only once, though in reality Bio::ClusterI and  
Bio::PrimarySeqI both map to table bioentry.

>
> $query->datacollections(
>           ["Bio::PrimarySeqI c::subject",
>           "Bio::PrimarySeqI p::object",

I think that Bio::PrimarySeqI can be substituted with Bio::ClusterI  
in the second line. This would make the mapping clearer I guess. I'm  
not sure why I wrote the example that way, but I'd be surprised if  
Bio::ClusterI does not work here.

>          "Bio::PrimarySeqI<=>Bio::ClusterI<=>Bio::Ontology::TermI"]);
>
> $query->where(["p.accession_number = 'NM_000015'"]);

Actually I think you need to use c.accession_number to query by  
sequence accession. The c (child) alias is the cluster member, and  
the p (parent) alias is the cluster itself.

>
> my $adp = $db->get_object_adaptor('Bio::Cluster');
> my $qres = $adp->find_by_query($query);
>
>
> That's great - but here I ask for a sequence accession-number.
>
> Is it possible to ask for the Clone (IMAGE:4722596) or for an STS  
> accession-number where the result is also a cluster object?
> "give me the cluster(s) where in the sequence-line is a clone-entry  
> with this number 'IMAGE:4722596'" ...
> "give me the cluster(s) where in the STS-line is an accession-number  
> with this value 'PMC310725P3'" ...
> PROTID and NID would also be interesting.

PID and NID should become the primary_id() of the sequence members.  
Hence, you would say c.primary_id where you have c.accession_number  
above.

Each STS line should be in a qualifier/value pair attached to the  
cluster bioentry, under the tag 'sts' (which from what I can see  
would consist of whole lines, not ACC= and UNISTS= values parsed out,  
though I may be mistaken). So you would add

"Bio::PrimarySeqI<=>Bio::Annotation::SimpleValue sv"

to the datacollections, and "sv.value = 'ACC=PMC310725P3  
UNISTS=272646'" and "sv.tagname = 'sts'" to the where() array.

The same goes for IMAGE clone IDs, except that the tag name is  
'clone' and the qualifier/value is attached to the member sequence,  
not the cluster; also here not the entire line is stored, but rather  
parsed into tokens.
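
At the table level, the two lookups described above amount to joins over bioentry, since clusters and their member sequences both map to that table. The following is only a sketch under stated assumptions: it uses heavily simplified versions of the BioSQL tables in SQLite, it assumes the tag/value pairs end up in bioentry_qualifier_value with the tag as a term name, and the cluster accession 'Hs.EXAMPLE' and both function names are hypothetical. It is not the SQL that bioperl-db generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE bioentry (bioentry_id INTEGER PRIMARY KEY, accession TEXT);
    CREATE TABLE bioentry_relationship (object_bioentry_id INTEGER,
                                        subject_bioentry_id INTEGER);
    CREATE TABLE term (term_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bioentry_qualifier_value (bioentry_id INTEGER, term_id INTEGER,
                                           value TEXT);
    -- Cluster (object/parent, alias p) and one member (subject/child, alias c).
    INSERT INTO bioentry VALUES (1, 'Hs.EXAMPLE'), (2, 'BG569293');
    INSERT INTO bioentry_relationship VALUES (1, 2);
    INSERT INTO term VALUES (1, 'sts'), (2, 'clone');
    -- Whole STS line on the cluster, parsed clone token on the member.
    INSERT INTO bioentry_qualifier_value VALUES
        (1, 1, 'ACC=PMC310725P3 UNISTS=272646'),
        (2, 2, 'IMAGE:4722596');
    """
)


def clusters_by_own_qualifier(conn, tag, value):
    """Clusters carrying the tag/value pair themselves (e.g. tag 'sts')."""
    rows = conn.execute(
        "SELECT DISTINCT p.accession FROM bioentry p "
        "JOIN bioentry_relationship r ON r.object_bioentry_id = p.bioentry_id "
        "JOIN bioentry_qualifier_value qv ON qv.bioentry_id = p.bioentry_id "
        "JOIN term t ON t.term_id = qv.term_id "
        "WHERE t.name = ? AND qv.value = ?",
        (tag, value),
    )
    return [r[0] for r in rows]


def clusters_by_member_qualifier(conn, tag, value):
    """Clusters with a member sequence carrying the pair (e.g. tag 'clone')."""
    rows = conn.execute(
        "SELECT DISTINCT p.accession FROM bioentry p "
        "JOIN bioentry_relationship r ON r.object_bioentry_id = p.bioentry_id "
        "JOIN bioentry c ON c.bioentry_id = r.subject_bioentry_id "
        "JOIN bioentry_qualifier_value qv ON qv.bioentry_id = c.bioentry_id "
        "JOIN term t ON t.term_id = qv.term_id "
        "WHERE t.name = ? AND qv.value = ?",
        (tag, value),
    )
    return [r[0] for r in rows]


by_sts = clusters_by_own_qualifier(conn, "sts", "ACC=PMC310725P3 UNISTS=272646")
by_clone = clusters_by_member_qualifier(conn, "clone", "IMAGE:4722596")
```

The only difference between the two lookups is which bioentry the qualifier value is joined to: the cluster itself for 'sts', the member sequence for 'clone'.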

Does this help?

	-hilmar

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================