From jijibio at gmail.com  Mon Nov  9 07:47:20 2009
From: jijibio at gmail.com (»»»Jiji Kurup«««)
Date: Mon, 9 Nov 2009 18:17:20 +0530
Subject: [BioSQL-l] Project participation
Message-ID: <82611efb0911090447y2aa1ae16x1106a41b7a0e2a1@mail.gmail.com>

Hi Lapp,

I am very interested in taking part in the "JEE5 webservice interface
to BioSQL" and "BioSQL web interface and API on Google App Engine"
projects. However, I am not a student at the moment; I work at a
bioinformatics company. Is it still possible for me to participate in
either of these projects?

Kindly let me know if there is any provision for this, and what the
procedure is.


-- 
Regards,

Jiji Kurup
Application Scientist

From jay at jays.net  Mon Nov 16 17:20:33 2009
From: jay at jays.net (Jay Hannah)
Date: Mon, 16 Nov 2009 16:20:33 -0600
Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed
	load_ncbi_taxonomy.pl
Message-ID: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>


Can someone activate my ('jhannah') commit bit for biosql-schema? I can commit to bioperl-live, but not to biosql-schema.

Or apply the patch below for me?

Thanks,

j
http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah


jhannah at jaysnet-MacBook:~/src/biosql-schema$ svn info
Path: .
URL: svn://code.open-bio.org/biosql/biosql-schema/trunk
Repository Root: svn://code.open-bio.org/biosql
Repository UUID: 77cd3915-3943-0410-8c1d-86e8b156039c
Revision: 316
Node Kind: directory
Schedule: normal
Last Changed Author: lapp
Last Changed Rev: 316
Last Changed Date: 2009-07-18 13:24:56 -0500 (Sat, 18 Jul 2009)


jhannah at jaysnet-MacBook:~/src/biosql-schema$ svn diff
Index: doc/schema-overview.txt
===================================================================
--- doc/schema-overview.txt	(revision 316)
+++ doc/schema-overview.txt	(working copy)
@@ -150,7 +150,7 @@
 structure of NCBI's taxonomy database. Each bioentry can be
 associated with only one taxon, but many bioentries can be associated
 with the same taxon. In order to get the most value from these tables
-it's recommended that you use the BioSQL script load_taxonomy.pl
+it's recommended that you use the BioSQL script load_ncbi_taxonomy.pl
 to populate them.
 
 The taxon_name.taxon_id field is meant to store an NCBI
@@ -165,7 +165,7 @@
 parent_taxon_id contains the taxon id of the parent taxon, since there
 should only be one parent in the taxonomic tree. The right_value and
 left_value fields store values that are calculated and entered by the 
-load_taxonomy.pl script. These arbitrary values are the upper and
+load_ncbi_taxonomy.pl script. These arbitrary values are the upper and
 lower bounds of "nested sets", one set for each taxa, where the set 
 of the child taxa is contained within the larger set of the parent 
 taxon. An example would be the set for the species Procyon lotor,  
Index: INSTALL
===================================================================
--- INSTALL	(revision 316)
+++ INSTALL	(working copy)
@@ -449,7 +449,7 @@
 
 With bioperl and bioperl-db installed you are ready to load some data.
 It is advisable to pre-load the NCBI taxonomy database (use
-scripts/load_taxonomy.pl in the biosql-schema package, the details are
+scripts/load_ncbi_taxonomy.pl in the biosql-schema package, the details are
 in its documentation). Otherwise you'll see errors from misparsed
 organisms. 
 

jhannah at jaysnet-MacBook:~/src/biosql-schema$ svn commit
svn: Commit failed (details follow):
svn: Authorization failed


From jay at jays.net  Mon Nov 16 18:22:28 2009
From: jay at jays.net (Jay Hannah)
Date: Mon, 16 Nov 2009 17:22:28 -0600
Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed
	load_ncbi_taxonomy.pl
In-Reply-To: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>
References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>
Message-ID: <044213EE-E980-4E48-870D-1F2896E937B3@jays.net>

On Nov 16, 2009, at 4:20 PM, Jay Hannah wrote:
> URL: svn://code.open-bio.org/biosql/biosql-schema/trunk

Oh, oops. I think I was using the wrong repo address for committing. 

I think I'm using the right address now. Now getting the error below.

Thanks,

j
http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah


jhannah at jaysnet-MacBook:~/src/biosql-schema-committer$ svn info
Path: .
URL: svn+ssh://jhannah at dev.open-bio.org/home/svn-repositories/biosql/biosql-schema/trunk
Repository Root: svn+ssh://jhannah at dev.open-bio.org/home/svn-repositories/biosql
Repository UUID: 77cd3915-3943-0410-8c1d-86e8b156039c
Revision: 316
Node Kind: directory
Schedule: normal
Last Changed Author: lapp
Last Changed Rev: 316
Last Changed Date: 2009-07-18 13:24:56 -0500 (Sat, 18 Jul 2009)


jhannah at jaysnet-MacBook:~/src/biosql-schema-committer$ svn commit
===========================================
 dev.open-bio.org - Authorized Access Only
===========================================
Sending        INSTALL
Sending        doc/schema-overview.txt
Transmitting file data ..svn: Commit failed (details follow):
svn: Can't create directory '/home/svn-repositories/biosql/db/transactions/316-1.txn': Permission denied
svn: Your commit message was left in a temporary file:
svn:    '/Users/jhannah/src/biosql-schema-committer/svn-commit.tmp'


From mauricio at open-bio.org  Tue Nov 17 00:02:32 2009
From: mauricio at open-bio.org (Mauricio Herrera Cuadra)
Date: Mon, 16 Nov 2009 23:02:32 -0600
Subject: [BioSQL-l] [patch] load_taxonomy.pl script was
	renamed	load_ncbi_taxonomy.pl
In-Reply-To: <044213EE-E980-4E48-870D-1F2896E937B3@jays.net>
References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>
	<044213EE-E980-4E48-870D-1F2896E937B3@jays.net>
Message-ID: <4B022E68.2080304@open-bio.org>

I added you to the biosql group in the SVN server. You should be able to 
commit the patch now.

Cheers,
Mauricio.


From jay at jays.net  Tue Nov 17 08:00:01 2009
From: jay at jays.net (Jay Hannah)
Date: Tue, 17 Nov 2009 07:00:01 -0600
Subject: [BioSQL-l] [patch] load_taxonomy.pl script was
	renamed	load_ncbi_taxonomy.pl
In-Reply-To: <4B022E68.2080304@open-bio.org>
References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>
	<044213EE-E980-4E48-870D-1F2896E937B3@jays.net>
	<4B022E68.2080304@open-bio.org>
Message-ID: <BE51C7D4-F886-47AF-AA45-9B887F426C40@jays.net>

On Nov 16, 2009, at 11:02 PM, Mauricio Herrera Cuadra wrote:
> I added you to the biosql group in the SVN server. You should be able to commit the patch now.

Thanks! r317 committed.  :)

j


------------------------------------------------------------------------
r317 | jhannah | 2009-11-17 06:58:07 -0600 (Tue, 17 Nov 2009) | 2 lines
Changed paths:
   M /biosql-schema/trunk/INSTALL
   M /biosql-schema/trunk/doc/schema-overview.txt

load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl.
------------------------------------------------------------------------


From biopython at maubp.freeserve.co.uk  Wed Nov 18 06:06:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 18 Nov 2009 11:06:51 +0000
Subject: [BioSQL-l] Treating GenBank source features as top level annotation
Message-ID: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>

Hello all,

Something we've just been discussing on the Biopython mailing list
is a possible change to how we parse the source features in GenBank
(or EMBL) files. This could have knock-on implications for how we use
BioSQL. For anyone interested, the thread is here:
http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html

The basic observation is that GenBank files do not have any extensible
annotation block for the whole sequence. There are a few fields like
the comment, organism and taxonomy - but nothing general and
structured. Instead, it seems the NCBI etc decided to use the feature
table for this task by inventing the "source" feature. In every single
GenBank file I have ever seen with a source feature, there is only
one feature of this type and it spans the full sequence.

For example, NC_005816, Yersinia pestis biovar Microtus str. 91001
plasmid pPCP1, complete sequence:

 source      1..9609
             /organism="Yersinia pestis biovar Microtus str. 91001"
             /mol_type="genomic DNA"
             /strain="91001"
             /db_xref="taxon:229193"
             /plasmid="pPCP1"
             /biovar="Microtus"

(I reduced the white space for emailing). All of that information
makes sense as annotation for the whole sequence. In fact, the
"organism" entry is duplicated on the ORGANISM line in the
GenBank header (and the SOURCE line too).
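
To make the structure of that example concrete, here is a minimal
sketch that reads the abbreviated "source" block quoted above into a
key, a location and a dictionary of qualifiers. The helper name
parse_source_block is hypothetical and not part of Biopython or
BioPerl; it assumes a simple start..end location and one quoted
qualifier per line, and real GenBank files also contain multi-line and
unquoted values that it does not handle.

```python
import re

# Matches one qualifier line of the form /name="value" (assumed single-line).
QUALIFIER_RE = re.compile(r'^\s*/(\w+)="(.*)"\s*$')

# The abbreviated source feature from the example above.
SOURCE_BLOCK = '''\
 source      1..9609
             /organism="Yersinia pestis biovar Microtus str. 91001"
             /mol_type="genomic DNA"
             /strain="91001"
             /db_xref="taxon:229193"
             /plasmid="pPCP1"
             /biovar="Microtus"
'''


def parse_source_block(text):
    """Hypothetical helper: return (key, start, end, qualifiers)."""
    lines = text.splitlines()
    # First line holds the feature key and a simple "start..end" location.
    key, location = lines[0].split()
    start, end = (int(part) for part in location.split(".."))
    qualifiers = {}
    for line in lines[1:]:
        match = QUALIFIER_RE.match(line)
        if match:
            # A qualifier name may repeat, so collect values in a list.
            qualifiers.setdefault(match.group(1), []).append(match.group(2))
    return key, start, end, qualifiers


key, start, end, qualifiers = parse_source_block(SOURCE_BLOCK)
```

Every qualifier ends up as a list of values because names such as
db_xref can occur more than once on a feature.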

Currently we (Biopython, BioPerl etc) store this annotation in BioSQL
using the seqfeature_qualifier_value and seqfeature_dbxref tables,
associated with a "source" feature in the seqfeature table.

I am suggesting it could make more sense to store the "source"
feature annotation at the sequence level, using instead the
bioentry_qualifier_value and bioentry_dbxref tables.
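
As a rough illustration of that suggestion, the sketch below splits the
qualifiers of a "source" feature into the two groups that would go to
the bioentry-level tables. The Feature class and the function name are
hypothetical, and the tuples are simplified stand-ins for rows rather
than the real column layout of bioentry_qualifier_value and
bioentry_dbxref.

```python
from dataclasses import dataclass, field


@dataclass
class Feature:
    """Hypothetical stand-in for a parsed feature (not a Biopython class)."""
    key: str
    start: int
    end: int
    qualifiers: dict = field(default_factory=dict)


def split_source_annotation(feature):
    """Split qualifiers by the bioentry-level table they would target."""
    qualifier_values = []  # simplified (name, value) rows for bioentry_qualifier_value
    dbxrefs = []           # simplified (dbname, accession) rows for bioentry_dbxref
    for name, values in feature.qualifiers.items():
        for value in values:
            if name == "db_xref":
                # "taxon:229193" -> ("taxon", "229193")
                dbname, _, accession = value.partition(":")
                dbxrefs.append((dbname, accession))
            else:
                qualifier_values.append((name, value))
    return qualifier_values, dbxrefs


source = Feature(
    key="source",
    start=1,
    end=9609,
    qualifiers={
        "organism": ["Yersinia pestis biovar Microtus str. 91001"],
        "strain": ["91001"],
        "db_xref": ["taxon:229193"],
    },
)
qualifier_values, dbxrefs = split_source_annotation(source)
```

Writing GenBank output would need the reverse step, rebuilding a
"source" feature from these sequence-level rows.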

This is a slight shift from the origins of BioSQL as a schema to
hold GenBank files - but to me at least it is more logical.

What does everyone else think? Things work as they are...
and "if it ain't broken don't fix it"?

Peter

[Even if Biopython changes its internal object structure to treat
the "source" feature annotation as sequence level annotation,
we *could* continue to use a "source" feature when loading
GenBank files to/from BioSQL if required for compatibility with
the other Bio* projects. It would be more work though. In any
case, we'd also need to recreate a "source" feature when
writing GenBank output files.]

From biopython at maubp.freeserve.co.uk  Wed Nov 18 07:27:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 18 Nov 2009 12:27:12 +0000
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
	<8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
Message-ID: <320fb6e00911180427q79961f6ci5a43ebac9ff70f7a@mail.gmail.com>

On Wed, Nov 18, 2009 at 12:08 PM, Richard Holland
<holland at eaglegenomics.com> wrote:
>
> BioJava's latest parsers do the following:
> ...

Without checking all the details, that is broadly what Biopython does
at the moment.

> The main reason why we still use the source feature and don't go to sequence
> level is because when converting between formats it's hard to tell which
> sequence-level qualifier_values are from the source feature and which are
> from other places.

Makes sense.

> The main reason why we rely entirely on the source feature for organism
> and taxon ID info is because it's much easier to parse than the SOURCE
> and ORGANISM tags.

From memory, Biopython also uses the taxon table here too.

Peter

From holland at eaglegenomics.com  Wed Nov 18 07:08:48 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 18 Nov 2009 12:08:48 +0000
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
Message-ID: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>

BioJava's latest parsers do the following:

On read:

  SOURCE and ORGANISM top-level tags are completely ignored
  For each tag in each feature, including source:
    If it's a dbxref 
       If it's taxon, set the taxon ID in the BioEntry table (if no /taxon is specified in the source feature the taxonomy does not get stored at all)
       Otherwise set dbxref as a feature CrossRef table entry
    If it's organism
       Add the organism name to the taxon ID in the Taxon table using the scientific taxon name type (if no /organism tag is specified in the source feature, the taxon gets the default name from NCBI, but only if the NCBI taxonomy data is already present in BioSQL) (if no /taxon is specified in the source feature, then the taxonomy does not get stored at all)
    Otherwise
       All other tags get mapped as feature qualifier values, including the source feature
   
On write:

   SOURCE and ORGANISM tags are generated from the BioEntry taxon ID entry for the sequence,
   All features get qualifier values output plus /db_xref tags for all entries from the CrossRef table for the feature,
   The source feature is output as per a normal feature, plus /organism and /db_xref="taxon:..." tags generated as per the SOURCE and ORGANISM tags
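
The on-read rules above can be paraphrased as a short Python sketch.
This is not BioJava code: the function name, the dictionaries standing
in for the BioEntry, Taxon and CrossRef tables, and the "GeneID:0000"
cross-reference in the usage lines are all made up for illustration.

```python
def apply_read_rules(features):
    """features: list of (feature_key, [(qualifier_name, value), ...])."""
    entry = {"taxon_id": None, "organism": None}  # stands in for BioEntry/Taxon
    feature_rows = []
    for key, qualifiers in features:
        crossrefs, values = [], []
        for name, value in qualifiers:
            if name == "db_xref":
                dbname, _, accession = value.partition(":")
                if dbname == "taxon":
                    # A taxon dbxref sets the taxon ID on the entry itself.
                    entry["taxon_id"] = int(accession)
                else:
                    # Any other dbxref becomes a feature-level cross-reference.
                    crossrefs.append((dbname, accession))
            elif name == "organism":
                entry["organism"] = value
            else:
                # Everything else stays a feature qualifier value.
                values.append((name, value))
        feature_rows.append(
            {"key": key, "crossrefs": crossrefs, "qualifiers": values}
        )
    # Without a taxon dbxref the taxonomy is not stored at all.
    if entry["taxon_id"] is None:
        entry["organism"] = None
    return entry, feature_rows


entry, feature_rows = apply_read_rules([
    ("source", [
        ("organism", "Yersinia pestis biovar Microtus str. 91001"),
        ("mol_type", "genomic DNA"),
        ("db_xref", "taxon:229193"),
    ]),
    # Made-up second feature, only to show the non-taxon dbxref branch.
    ("CDS", [("db_xref", "GeneID:0000")]),
])
```

The sketch does not cover the fallback to the default NCBI name when
no /organism tag is present, since that depends on preloaded taxonomy
data.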

The main reason why we still use the source feature and don't go to sequence level is because when converting between formats it's hard to tell which sequence-level qualifier_values are from the source feature and which are from other places. 

The main reason why we rely entirely on the source feature for organism and taxon ID info is because it's much easier to parse than the SOURCE and ORGANISM tags.

cheers,
Richard


--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From hlapp at gmx.net  Wed Nov 18 08:13:05 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 18 Nov 2009 08:13:05 -0500
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
Message-ID: <670ED558-7FBD-4219-A449-0D7E63BE0766@gmx.net>

I agree completely with your interpretation of the "source" feature  
tag, and in fact what you outline below is what I implemented as a  
"SeqProcessor" module for use within the SymAtlas data integration  
project (BioPerl supports 'pipes' of I/O and processing modules, where  
the latter can modify the sequence objects coming out of the I/O  
module).

I'm not sure I would want to hard-code this behavior into the BioPerl  
genbank parser. However, it would be easy enough to code it into a  
processing module that comes standard with the distribution to the  
extent that it can be enabled as simply as a format variant to SeqIO.

It sounds useful enough that I guess I should post it to the BioPerl  
list ...

	-hilmar


-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Wed Nov 18 08:14:35 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 18 Nov 2009 08:14:35 -0500
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
	<8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
Message-ID: <618B39A2-1CA2-405F-A8D7-4947356BF7A5@gmx.net>


On Nov 18, 2009, at 7:08 AM, Richard Holland wrote:

>  For each tag in each feature, including source:
>    If it's a dbxref
>       If it's taxon, set the taxon ID in the BioEntry table (if no
> /taxon is specified in the source feature the taxonomy does not get
> stored at all)


That's what the BioPerl Genbank parser does too, though only if it's a  
"source" feature. I don't know of any other feature key that would  
have a taxon dbxref entry.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Wed Nov 18 08:16:40 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 18 Nov 2009 08:16:40 -0500
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
Message-ID: <72A7ED62-7213-4136-90F9-74F4691F3003@gmx.net>


On Nov 18, 2009, at 6:06 AM, Peter wrote:

> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001
> plasmid pPCP1, complete sequence:
>
> source      1..9609
>             /organism="Yersinia pestis biovar Microtus str. 91001"
>             /mol_type="genomic DNA"
>             /strain="91001"
>             /db_xref="taxon:229193"
>             /plasmid="pPCP1"
>             /biovar="Microtus"


Just FYI, the sequences coming out of the barcoding projects will have  
the lat/long coordinates here, too. Those obviously pertain to the  
specimen (and hence to the whole sequence).

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Wed Nov 18 08:34:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 18 Nov 2009 13:34:38 +0000
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <D8574860-3891-4D4E-ACD8-18D07FBB6E81@illinois.edu>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
	<8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
	<D8574860-3891-4D4E-ACD8-18D07FBB6E81@illinois.edu>
Message-ID: <320fb6e00911180534o2cd0126fp62527db04e0c346f@mail.gmail.com>

On Wed, Nov 18, 2009 at 1:10 PM, Chris Fields <cjfields at illinois.edu> wrote:
>
> Just to note, there are a few cases where there are two or more source features.
> This pops up mainly with chimeric sequences, for example:
>
> http://www.ncbi.nlm.nih.gov/nuccore/21727885
>
> We have run into this a couple of times on the bioperl list. In this case, each
> feature is limited to specific locations on the sequence and doesn't pertain to
> the entire sequence. NCBI only notes the first source on the ORGANISM line;
> last time I checked, EMBL used both.
>
> chris

Wow - cool example. It was worth starting this thread just to learn about
this interesting corner case. I wonder if this is a common enough case to
warrant leaving the source features as they are?

Peter


From cjfields at illinois.edu  Wed Nov 18 08:10:36 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 18 Nov 2009 07:10:36 -0600
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
	<8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
Message-ID: <D8574860-3891-4D4E-ACD8-18D07FBB6E81@illinois.edu>

Just to note, there are a few cases where there are two or more source features.  This pops up mainly with chimeric sequences, for example:

http://www.ncbi.nlm.nih.gov/nuccore/21727885

We have run into this a couple of times on the bioperl list.  In this case, each feature is limited to specific locations on the sequence and doesn't pertain to the entire sequence.  NCBI only notes the first source on the ORGANISM line; last time I checked, EMBL used both.

chris



From hlapp at gmx.net  Wed Nov 18 09:28:01 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 18 Nov 2009 09:28:01 -0500
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <D8574860-3891-4D4E-ACD8-18D07FBB6E81@illinois.edu>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
	<8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
	<D8574860-3891-4D4E-ACD8-18D07FBB6E81@illinois.edu>
Message-ID: <73839DF2-F9D2-441C-8AA3-649D619D2B1E@gmx.net>

True - for chimeric sequences you can have multiple sources. That  
should be recognizable though from the length (and span) of the source  
feature location?
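
A minimal sketch of such a check, assuming each source location has
already been reduced to a 1-based inclusive (start, end) pair; the
function name and the chimeric coordinates in the usage lines are
hypothetical.

```python
def whole_sequence_source(sources, seq_length):
    """Return the single source span covering the whole sequence, else None.

    sources: list of (start, end) pairs, 1-based and inclusive (assumed).
    """
    if len(sources) != 1:
        # Zero or several source features: do not treat as whole-sequence.
        return None
    start, end = sources[0]
    if start == 1 and end == seq_length:
        return sources[0]
    return None


# The NC_005816 example: one source spanning 1..9609.
single = whole_sequence_source([(1, 9609)], 9609)

# A made-up chimeric layout with two partial sources.
chimeric = whole_sequence_source([(1, 100), (101, 250)], 250)
```

Only a record that passes this check would have its source qualifiers
promoted to the sequence level; anything else would keep its source
features as ordinary features.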

	-hilmar

On Nov 18, 2009, at 8:10 AM, Chris Fields wrote:

> Just to note, there are a few cases where there are two or more  
> source features.  This pops up mainly with chimeric sequences, for  
> example:
>
> http://www.ncbi.nlm.nih.gov/nuccore/21727885
>
> We have run into this a couple of times on the bioperl list.  In  
> this case, each feature is limited to specific locations on the  
> sequence and doesn't pertain to the entire sequence.  NCBI only  
> notes the first source on the ORGANISM line; last time I checked,  
> EMBL used both.
>
> chris
>
> On Nov 18, 2009, at 6:08 AM, Richard Holland wrote:
>
>> BioJava's latest parsers do the following:
>>
>> On read:
>>
>> SOURCE and ORGANISM top-level tags are completely ignored
>> For each tag in each feature, including source:
>>   If it's a dbxref
>>      If it's taxon, set the taxon ID in the BioEntry table (if no
>>      /taxon is specified in the source feature the taxonomy does not
>>      get stored at all)
>>      Otherwise set dbxref as a feature CrossRef table entry
>>   If it's organism
>>      Add the organism name to the taxon ID in the Taxon table using
>>      the scientific taxon name type (if no /organism tag is specified
>>      in the source feature, the taxon gets the default name from NCBI,
>>      but only if the NCBI taxonomy data is already present in BioSQL)
>>      (if no /taxon is specified in the source feature, then the
>>      taxonomy does not get stored at all)
>>   Otherwise
>>      All other tags get mapped as feature qualifier values,
>>      including the source feature
>>
>> On write:
>>
>>  SOURCE and ORGANISM tags are generated from the BioEntry taxon ID
>>  entry for the sequence,
>>  All features get qualifier values output plus /db_xref tags for
>>  all entries from the CrossRef table for the feature,
>>  The source feature is output as per a normal feature, plus
>>  /organism and /db_xref="taxon:..." tags generated as per the SOURCE
>>  and ORGANISM tags
>>
>> The main reason why we still use the source feature and don't go to  
>> sequence level is because when converting between formats it's hard  
>> to tell which sequence-level qualifier_values are from the source  
>> feature and which are from other places.
>>
>> The main reason why we rely entirely on the source feature for  
>> organism and taxon ID info is because it's much easier to parse  
>> than the SOURCE and ORGANISM tags.
>>
>> cheers,
>> Richard
>>
>> On 18 Nov 2009, at 11:06, Peter wrote:
>>
>>> Hello all,
>>>
>>> Something we've just been discussing on the Biopython mailing list
>>> is a possible change to how we parse the source features in GenBank
>>> (or EMBL) files. This could have knock on implications for how we use
>>> BioSQL. For anyone interested, the thread is here:
>>> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html
>>>
>>> The basic observation is that GenBank files do not have any extensible
>>> annotation block for the whole sequence. There are a few fields like
>>> the comment, organism and taxonomy - but nothing general and
>>> structured. Instead, it seems the NCBI etc decided to use the feature
>>> table for this task by inventing the "source" feature. In every single
>>> GenBank file I have ever seen with a source feature, there is only
>>> one feature of this type and it spans the full sequence.
>>>
>>> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001
>>> plasmid pPCP1, complete sequence:
>>>
>>> source      1..9609
>>>           /organism="Yersinia pestis biovar Microtus str. 91001"
>>>           /mol_type="genomic DNA"
>>>           /strain="91001"
>>>           /db_xref="taxon:229193"
>>>           /plasmid="pPCP1"
>>>           /biovar="Microtus"
>>>
>>> (I reduced the white space for emailing). All of that information
>>> makes sense as annotation for the whole sequence. In fact, the
>>> "organism" entry is duplicated on the ORGANISM line in the
>>> GenBank header (and the SOURCE line too).
>>>
>>> Currently we (Biopython, BioPerl etc) store this annotation in BioSQL
>>> using the seqfeature_qualifier_value and seqfeature_dbxref tables,
>>> associated with a "source" feature in the seqfeature table.
>>>
>>> I am suggesting it could make more sense to store the "source"
>>> feature annotation at the sequence level, using instead the
>>> bioentry_qualifier_value and bioentry_dbxref tables.
>>>
>>> This is a slight shift from the origins of BioSQL as a schema to
>>> hold GenBank files - but to me at least it is more logical.
>>>
>>> What does everyone else think? Things work as they are...
>>> and "if it ain't broken don't fix it"?
>>>
>>> Peter
>>>
>>> [Even if Biopython changes its internal object structure to treat
>>> the "source" feature annotation as sequence level annotation,
>>> we *could* continue to use a "source" feature when loading
>>> GenBank files to/from BioSQL if required for compatibility with
>>> the other Bio* projects. It would be more work though. In any
>>> case, we'd also need to recreate a "source" feature when
>>> writing GenBank output files.]
>>> _______________________________________________
>>> BioSQL-l mailing list
>>> BioSQL-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biosql-l
>>
>> --
>> Richard Holland, BSc MBCS
>> Operations and Delivery Director, Eagle Genomics Ltd
>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>> http://www.eaglegenomics.com/
>>
>>
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biosql-l
>
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
>

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Wed Nov 18 11:50:04 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 18 Nov 2009 11:50:04 -0500
Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed
	load_ncbi_taxonomy.pl
In-Reply-To: <BE51C7D4-F886-47AF-AA45-9B887F426C40@jays.net>
References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>
	<044213EE-E980-4E48-870D-1F2896E937B3@jays.net>
	<4B022E68.2080304@open-bio.org>
	<BE51C7D4-F886-47AF-AA45-9B887F426C40@jays.net>
Message-ID: <03730DF5-706B-47F6-A191-8352E38CA42C@gmx.net>

Hi Jay - thanks much for the patch, highly appreciated! -hilmar

On Nov 17, 2009, at 8:00 AM, Jay Hannah wrote:

> On Nov 16, 2009, at 11:02 PM, Mauricio Herrera Cuadra wrote:
>> I added you to the biosql group in the SVN server. You should be  
>> able to commit the patch now.
>
> Thanks! r317 committed.  :)
>
> j
>
>
>
> ------------------------------------------------------------------------
> r317 | jhannah | 2009-11-17 06:58:07 -0600 (Tue, 17 Nov 2009) | 2 lines
> Changed paths:
>   M /biosql-schema/trunk/INSTALL
>   M /biosql-schema/trunk/doc/schema-overview.txt
>
> load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl.
> ------------------------------------------------------------------------
>
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
>

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From maruco at gmail.com  Mon Nov  9 09:35:31 2009
From: maruco at gmail.com (Thiago Satake)
Date: Mon, 9 Nov 2009 09:35:31 -0500
Subject: [BioSQL-l] Project participation
In-Reply-To: <82611efb0911090447y2aa1ae16x1106a41b7a0e2a1@mail.gmail.com>
References: <82611efb0911090447y2aa1ae16x1106a41b7a0e2a1@mail.gmail.com>
Message-ID: <eff79fb40911090635n7f9dc07bx252810784727a732@mail.gmail.com>

Hi guys,

It sounds like a very interesting project!

Where can I find more information about it?

Thanks,


2009/11/9 »»»Jiji Kurup««« <jijibio at gmail.com>:
> Hi Lapp,
>
> I am very much interested in taking part in the "JEE5 webservice
> interface to BioSQL" and "BioSQL web interface and API on Google App
> Engine" projects. But I am not a student now; I am working in a
> bioinformatics company, so is it possible for me to participate in
> either of these projects?
>
> Kindly let me know if there is any provision for this, and tell me the
> procedure as well.
>
>
>
>
> --
> Regards,
>
> Jiji Kurup
> Application Scientist
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
>


-- 
Thiago Seito Satake
Tel: +55(011) 6588-8045


From cjfields at illinois.edu  Wed Nov 18 12:40:21 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 18 Nov 2009 11:40:21 -0600
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <73839DF2-F9D2-441C-8AA3-649D619D2B1E@gmx.net>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
	<8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
	<D8574860-3891-4D4E-ACD8-18D07FBB6E81@illinois.edu>
	<73839DF2-F9D2-441C-8AA3-649D619D2B1E@gmx.net>
Message-ID: <BE72D1D3-74FE-4AFF-BD2F-BA5110B511BD@illinois.edu>

Yes; the location appears to specify regions of sequence originating from the indicated source.  

chris

On Nov 18, 2009, at 8:28 AM, Hilmar Lapp wrote:

> True - for chimeric sequences you can have multiple sources. That should be recognizable though from the length (and span) of the source feature location?
> 
> 	-hilmar
> 
> On Nov 18, 2009, at 8:10 AM, Chris Fields wrote:
> 
>> Just to note, there are a few cases where there are two or more source features.  This pops up mainly with chimeric sequences, for example:
>> 
>> http://www.ncbi.nlm.nih.gov/nuccore/21727885
>> 
>> We have run into this a couple of times on the bioperl list.  In this case, each feature is limited to specific locations on the sequence and doesn't pertain to the entire sequence.  NCBI only notes the first source on the ORGANISM line; last time I checked, EMBL used both.
>> 
>> chris
>> 
>> On Nov 18, 2009, at 6:08 AM, Richard Holland wrote:
>> 
>>> BioJava's latest parsers do the following:
>>> 
>>> On read:
>>> 
>>> SOURCE and ORGANISM top-level tags are completely ignored
>>> For each tag in each feature, including source:
>>>  If it's a dbxref
>>>     If it's taxon, set the taxon ID in the BioEntry table (if no /taxon is specified in the source feature the taxonomy does not get stored at all)
>>>     Otherwise set dbxref as a feature CrossRef table entry
>>>  If it's organism
>>>     Add the organism name to the taxon ID in the Taxon table using the scientific taxon name type (if no /organism tag is specified in the source feature, the taxon gets the default name from NCBI, but only if the NCBI taxonomy data is already present in BioSQL) (if no /taxon is specified in the source feature, then the taxonomy does not get stored at all)
>>>  Otherwise
>>>     All other tags get mapped as feature qualifier values, including the source feature
>>> 
>>> On write:
>>> 
>>> SOURCE and ORGANISM tags are generated from the BioEntry taxon ID entry for the sequence,
>>> All features get qualifier values output plus /db_xref tags for all entries from the CrossRef table for the feature,
>>> The source feature is output as per a normal feature, plus /organism and /db_xref="taxon:..." tags generated as per the SOURCE and ORGANISM tags
>>> 
>>> The main reason why we still use the source feature and don't go to sequence level is because when converting between formats it's hard to tell which sequence-level qualifier_values are from the source feature and which are from other places.
>>> 
>>> The main reason why we rely entirely on the source feature for organism and taxon ID info is because it's much easier to parse than the SOURCE and ORGANISM tags.
>>> 
>>> cheers,
>>> Richard
>>> 
>>> On 18 Nov 2009, at 11:06, Peter wrote:
>>> 
>>>> Hello all,
>>>> 
>>>> Something we've just been discussing on the Biopython mailing list
>>>> is a possible change to how we parse the source features in GenBank
>>>> (or EMBL) files. This could have knock on implications for how we use
>>>> BioSQL. For anyone interested, the thread is here:
>>>> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html
>>>> 
>>>> The basic observation is that GenBank files do not have any extensible
>>>> annotation block for the whole sequence. There are a few fields like
>>>> the comment, organism and taxonomy - but nothing general and
>>>> structured. Instead, it seems the NCBI etc decided to use the feature
>>>> table for this task by inventing the "source" feature. In every single
>>>> GenBank file I have ever seen with a source feature, there is only
>>>> one feature of this type and it spans the full sequence.
>>>> 
>>>> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001
>>>> plasmid pPCP1, complete sequence:
>>>> 
>>>> source      1..9609
>>>>          /organism="Yersinia pestis biovar Microtus str. 91001"
>>>>          /mol_type="genomic DNA"
>>>>          /strain="91001"
>>>>          /db_xref="taxon:229193"
>>>>          /plasmid="pPCP1"
>>>>          /biovar="Microtus"
>>>> 
>>>> (I reduced the white space for emailing). All of that information
>>>> makes sense as annotation for the whole sequence. In fact, the
>>>> "organism" entry is duplicated on the ORGANISM line in the
>>>> GenBank header (and the SOURCE line too).
>>>> 
>>>> Currently we (Biopython, BioPerl etc) store this annotation in BioSQL
>>>> using the seqfeature_qualifier_value and seqfeature_dbxref tables,
>>>> associated with a "source" feature in the seqfeature table.
>>>> 
>>>> I am suggesting it could make more sense to store the "source"
>>>> feature annotation at the sequence level, using instead the
>>>> bioentry_qualifier_value and bioentry_dbxref tables.
>>>> 
>>>> This is a slight shift from the origins of BioSQL as a schema to
>>>> hold GenBank files - but to me at least it is more logical.
>>>> 
>>>> What does everyone else think? Things work as they are...
>>>> and "if it ain't broken don't fix it"?
>>>> 
>>>> Peter
>>>> 
>>>> [Even if Biopython changes its internal object structure to treat
>>>> the "source" feature annotation as sequence level annotation,
>>>> we *could* continue to use a "source" feature when loading
>>>> GenBank files to/from BioSQL if required for compatibility with
>>>> the other Bio* projects. It would be more work though. In any
>>>> case, we'd also need to recreate a "source" feature when
>>>> writing GenBank output files.]
>>>> _______________________________________________
>>>> BioSQL-l mailing list
>>>> BioSQL-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/biosql-l
>>> 
>>> --
>>> Richard Holland, BSc MBCS
>>> Operations and Delivery Director, Eagle Genomics Ltd
>>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>>> http://www.eaglegenomics.com/
>>> 
>>> 
>>> _______________________________________________
>>> BioSQL-l mailing list
>>> BioSQL-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biosql-l
>> 
>> 
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biosql-l
>> 
> 
> -- 
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
> 
> 
> 


From biopython at maubp.freeserve.co.uk  Tue Nov 24 09:27:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Nov 2009 14:27:39 +0000
Subject: [BioSQL-l] SQLite support
In-Reply-To: <320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com>
References: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com>
	<320fb6e00812150243w4b0dc223g40abcf684af1ccf5@mail.gmail.com>
	<320fb6e00907050324i6d64d3abreb4d0c256bf1bdc4@mail.gmail.com>
	<320fb6e00907090529t61239952y1c86963f13c1db78@mail.gmail.com>
	<320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com>
Message-ID: <320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com>

On Tue, Jul 28, 2009 at 11:58 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Jul 9, 2009 at 1:29 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>> Hi Hilmar,
>>
>> I've filed a BioSQL enhancement bug 2870 for adding an SQLite
>> schema to BioSQL, and Brad has attached his proposed schema
>> (converted from that for MySQL) to the bug:
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2870
>>
>> Could you take a look at this please? If you are happy with it, we'd like
>> to have it included in BioSQL v1.0.2 (even if Biopython is initially the
>> only Bio* project to support it).
>
> Have you had a chance to look at this yet Hilmar? Brad is keen to
> include BioSQL support for SQLite in the next release of Biopython
> (hopefully within the next week or two), but to do this I'd like your
> blessing, and for the proposed SQLite BioSQL schema to be added
> to the BioSQL SVN repository.

Hi again Hilmar,

Just a reminder about the BioSQL on SQLite proposals - we'd still
like to ship this with the *next* Biopython release (having skipped it
for Biopython 1.52 a couple of months back).

Regards,

Peter

From cjfields at illinois.edu  Tue Nov 24 11:36:33 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 24 Nov 2009 10:36:33 -0600
Subject: [BioSQL-l] SQLite support
In-Reply-To: <320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com>
References: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com>
	<320fb6e00812150243w4b0dc223g40abcf684af1ccf5@mail.gmail.com>
	<320fb6e00907050324i6d64d3abreb4d0c256bf1bdc4@mail.gmail.com>
	<320fb6e00907090529t61239952y1c86963f13c1db78@mail.gmail.com>
	<320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com>
	<320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com>
Message-ID: <070E8BA8-B2C1-4E44-AA2D-9934B3742406@illinois.edu>

On Nov 24, 2009, at 8:27 AM, Peter wrote:

> On Tue, Jul 28, 2009 at 11:58 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> On Thu, Jul 9, 2009 at 1:29 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>>> Hi Hilmar,
>>> 
>>> I've filed a BioSQL enhancement bug 2870 for adding an SQLite
>>> schema to BioSQL, and Brad has attached his proposed schema
>>> (converted from that for MySQL) to the bug:
>>> 
>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2870
>>> 
>>> Could you take a look at this please? If you are happy with it, we'd like
>>> to have it included in BioSQL v1.0.2 (even if Biopython is initially the
>>> only Bio* project to support it).
>> 
>> Have you had a chance to look at this yet Hilmar? Brad is keen to
>> include BioSQL support for SQLite in the next release of Biopython
>> (hopefully within the next week or two), but to do this I'd like your
>> blessing, and for the proposed SQLite BioSQL schema to be added
>> to the BioSQL SVN repository.
> 
> Hi again Hilmar,
> 
> Just a reminder about the BioSQL on SQLite proposals - we'd still
> like to ship this with the *next* Biopython release (having skipped it
> for Biopython 1.52 a couple of months back).
> 
> Regards,
> 
> Peter

Just want to add that I would like to see SQLite support as well (I might even feel the need to implement the necessary bioperl-db bits).

chris

From biopython at maubp.freeserve.co.uk  Tue Nov 24 12:07:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Nov 2009 17:07:19 +0000
Subject: [BioSQL-l] SQLite support
In-Reply-To: <070E8BA8-B2C1-4E44-AA2D-9934B3742406@illinois.edu>
References: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com>
	<320fb6e00812150243w4b0dc223g40abcf684af1ccf5@mail.gmail.com>
	<320fb6e00907050324i6d64d3abreb4d0c256bf1bdc4@mail.gmail.com>
	<320fb6e00907090529t61239952y1c86963f13c1db78@mail.gmail.com>
	<320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com>
	<320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com>
	<070E8BA8-B2C1-4E44-AA2D-9934B3742406@illinois.edu>
Message-ID: <320fb6e00911240907u32dca751ldb488cbc38f0e035@mail.gmail.com>

On Tue, Nov 24, 2009 at 4:36 PM, Chris Fields <cjfields at illinois.edu> wrote:
>
> Just want to add that I would like to see SQLite support as well
> (I might even feel the need to implement the necessary bioperl-db bits).

Excellent :)

Peter

From desouza at ncbi.nlm.nih.gov  Thu Nov 19 16:58:45 2009
From: desouza at ncbi.nlm.nih.gov (De souza, Robson (NIH/NLM/NCBI) [F])
Date: Thu, 19 Nov 2009 16:58:45 -0500
Subject: [BioSQL-l] update ontology
Message-ID: <340BA68A9AA4F548881B3CCC5435061B45A6A2@NIHCESMLBX15.nih.gov>

Hi guys,

I'm trying to use a BioSQL database to store some protein annotation and
need to know a few things:

- First thing: I want to be able to update the ontologies we are working
on automatically, but I could not get bp_load_ontology.pl to do it.
What I want is to replace any changed terms and their relationships with
new ones without losing the association between unmodified terms and the
annotated proteins. I still don't know what a scriptlet for --mergeobjs
should look like, or whether such a scriptlet is the way to go in this case.

- How do I represent protein domains in BioSQL?
I was thinking of writing code to add domains as bioentries and then use
bioentry_relationship to associate sequence and domain, but I would also
like to store the coordinates of each domain in a protein, which would
imply associating bioentries with seqfeatures. Does anyone have
another suggestion in this direction?

Thanks!
Robson


From biopython at maubp.freeserve.co.uk  Wed Nov 25 16:39:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 25 Nov 2009 21:39:39 +0000
Subject: [BioSQL-l] update ontology
In-Reply-To: <340BA68A9AA4F548881B3CCC5435061B45A6A2@NIHCESMLBX15.nih.gov>
References: <340BA68A9AA4F548881B3CCC5435061B45A6A2@NIHCESMLBX15.nih.gov>
Message-ID: <320fb6e00911251339j700517cckac3ddb0adc323f00@mail.gmail.com>

On Thu, Nov 19, 2009 at 9:58 PM, De souza, Robson (NIH/NLM/NCBI) [F]
<desouza at ncbi.nlm.nih.gov> wrote:
>
>
> - How do I represent protein domains in BioSQL?
> I was thinking of writing code to add domains as bioentries and then use
> bioentry_relationship to associate sequence and domain, but I would also
> like to store the coordinates of each domain in a protein, which would
> imply associating bioentries with seqfeatures. Does anyone have
> another suggestion in this direction?

I would do what Biopython/BioPerl/Bio* would do on loading
a GenPept file into BioSQL - have a bioentry for each protein,
with its amino acid sequence, and for each domain a seqfeature
entry (which records the location within the parent protein).
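
As a rough illustration of that layout, here is a sketch using Python's sqlite3 module and a deliberately cut-down stand-in for the bioentry, seqfeature and location tables. The column sets are simplified assumptions (the real BioSQL schema has more columns and uses term IDs rather than a plain-text type), and the accession and domain names are invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-ins, NOT the real BioSQL DDL.
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL
);
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    bioentry_id   INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    type_name     TEXT NOT NULL,      -- real schema: type_term_id
    display_name  TEXT
);
CREATE TABLE location (
    location_id   INTEGER PRIMARY KEY,
    seqfeature_id INTEGER NOT NULL REFERENCES seqfeature(seqfeature_id),
    start_pos     INTEGER,
    end_pos       INTEGER
);
""")


def add_domain(conn, bioentry_id, name, start, end):
    """Store one domain as a seqfeature plus its location on the protein."""
    cur = conn.execute(
        "INSERT INTO seqfeature (bioentry_id, type_name, display_name) "
        "VALUES (?, 'domain', ?)", (bioentry_id, name))
    conn.execute(
        "INSERT INTO location (seqfeature_id, start_pos, end_pos) "
        "VALUES (?, ?, ?)", (cur.lastrowid, start, end))


# One protein (hypothetical accession) carrying two domains.
protein_id = conn.execute(
    "INSERT INTO bioentry (accession) VALUES ('P_EXAMPLE')").lastrowid
add_domain(conn, protein_id, "kinase", 10, 260)
add_domain(conn, protein_id, "SH2", 300, 390)

# Fetch every domain of the protein together with its coordinates.
domains = conn.execute(
    "SELECT f.display_name, l.start_pos, l.end_pos "
    "FROM seqfeature f JOIN location l USING (seqfeature_id) "
    "WHERE f.bioentry_id = ? ORDER BY l.start_pos", (protein_id,)).fetchall()
```

The coordinates live in the location rows, so no bioentry-to-seqfeature association beyond the existing bioentry_id foreign key is needed in this sketch.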

Peter

From jay at jays.net  Mon Nov 16 22:20:33 2009
From: jay at jays.net (Jay Hannah)
Date: Mon, 16 Nov 2009 16:20:33 -0600
Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed
	load_ncbi_taxonomy.pl
Message-ID: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>


Can someone activate my ('jhannah') commit bit for this biosql-schema? I can commit to bioperl-live, but not biosql-schema.

Or apply the patch below for me?

Thanks,

j
http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah


jhannah at jaysnet-MacBook:~/src/biosql-schema$ svn info
Path: .
URL: svn://code.open-bio.org/biosql/biosql-schema/trunk
Repository Root: svn://code.open-bio.org/biosql
Repository UUID: 77cd3915-3943-0410-8c1d-86e8b156039c
Revision: 316
Node Kind: directory
Schedule: normal
Last Changed Author: lapp
Last Changed Rev: 316
Last Changed Date: 2009-07-18 13:24:56 -0500 (Sat, 18 Jul 2009)


jhannah at jaysnet-MacBook:~/src/biosql-schema$ svn diff
Index: doc/schema-overview.txt
===================================================================
--- doc/schema-overview.txt	(revision 316)
+++ doc/schema-overview.txt	(working copy)
@@ -150,7 +150,7 @@
 structure of NCBI's taxonomy database. Each bioentry can be
 associated with only one taxon, but many bioentries can be associated
 with the same taxon. In order to get the most value from these tables
-it's recommended that you use the BioSQL script load_taxonomy.pl
+it's recommended that you use the BioSQL script load_ncbi_taxonomy.pl
 to populate them.
 
 The taxon_name.taxon_id field is meant to store an NCBI
@@ -165,7 +165,7 @@
 parent_taxon_id contains the taxon id of the parent taxon, since there
 should only be one parent in the taxonomic tree. The right_value and
 left_value fields store values that are calculated and entered by the 
-load_taxonomy.pl script. These arbitrary values are the upper and
+load_ncbi_taxonomy.pl script. These arbitrary values are the upper and
 lower bounds of "nested sets", one set for each taxa, where the set 
 of the child taxa is contained within the larger set of the parent 
 taxon. An example would be the set for the species Procyon lotor,  
Index: INSTALL
===================================================================
--- INSTALL	(revision 316)
+++ INSTALL	(working copy)
@@ -449,7 +449,7 @@
 
 With bioperl and bioperl-db installed you are ready to load some data.
 It is advisable to pre-load the NCBI taxonomy database (use
-scripts/load_taxonomy.pl in the biosql-schema package, the details are
+scripts/load_ncbi_taxonomy.pl in the biosql-schema package, the details are
 in its documentation). Otherwise you'll see errors from misparsed
 organisms. 
 

jhannah at jaysnet-MacBook:~/src/biosql-schema$ svn commit
svn: Commit failed (details follow):
svn: Authorization failed
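
The schema-overview text quoted in the diff describes left_value and right_value as the bounds of nested sets. As a sketch of the idea only (this is not the code in load_ncbi_taxonomy.pl, and the tiny tree below is made up for illustration), the numbering can be produced with a depth-first walk:

```python
def assign_nested_sets(children, root):
    """Return {node: (left_value, right_value)} for a tree.

    children -- dict mapping each node to a list of its child nodes
    root     -- the node to start numbering from
    """
    bounds = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        left = counter              # number taken on the way down
        for child in children.get(node, []):
            visit(child)
        counter += 1
        bounds[node] = (left, counter)  # number taken on the way back up

    visit(root)
    return bounds


def descendants(bounds, node):
    """All nodes whose interval lies strictly inside that of `node`."""
    left, right = bounds[node]
    return sorted(n for n, (lo, hi) in bounds.items() if left < lo and hi < right)


# Toy tree, not real NCBI data.
tree = {"Procyonidae": ["Procyon"], "Procyon": ["Procyon lotor"]}
bounds = assign_nested_sets(tree, "Procyonidae")
```

With these bounds, finding every taxon below a parent becomes a range comparison instead of a recursive walk. The recursive helper is fine for a toy tree; a very deep tree would need an iterative version.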


From jay at jays.net  Mon Nov 16 23:22:28 2009
From: jay at jays.net (Jay Hannah)
Date: Mon, 16 Nov 2009 17:22:28 -0600
Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed
	load_ncbi_taxonomy.pl
In-Reply-To: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>
References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>
Message-ID: <044213EE-E980-4E48-870D-1F2896E937B3@jays.net>

On Nov 16, 2009, at 4:20 PM, Jay Hannah wrote:
> URL: svn://code.open-bio.org/biosql/biosql-schema/trunk

Oh, oops. I think I was using the wrong repo address for committing. 

I think I'm using the right address now. Now getting the error below.

Thanks,

j
http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah


jhannah at jaysnet-MacBook:~/src/biosql-schema-committer$ svn info
Path: .
URL: svn+ssh://jhannah at dev.open-bio.org/home/svn-repositories/biosql/biosql-schema/trunk
Repository Root: svn+ssh://jhannah at dev.open-bio.org/home/svn-repositories/biosql
Repository UUID: 77cd3915-3943-0410-8c1d-86e8b156039c
Revision: 316
Node Kind: directory
Schedule: normal
Last Changed Author: lapp
Last Changed Rev: 316
Last Changed Date: 2009-07-18 13:24:56 -0500 (Sat, 18 Jul 2009)


jhannah at jaysnet-MacBook:~/src/biosql-schema-committer$ svn commit
===========================================
 dev.open-bio.org - Authorized Access Only
===========================================
Sending        INSTALL
Sending        doc/schema-overview.txt
Transmitting file data ..svn: Commit failed (details follow):
svn: Can't create directory '/home/svn-repositories/biosql/db/transactions/316-1.txn': Permission denied
svn: Your commit message was left in a temporary file:
svn:    '/Users/jhannah/src/biosql-schema-committer/svn-commit.tmp'


From mauricio at open-bio.org  Tue Nov 17 05:02:32 2009
From: mauricio at open-bio.org (Mauricio Herrera Cuadra)
Date: Mon, 16 Nov 2009 23:02:32 -0600
Subject: [BioSQL-l] [patch] load_taxonomy.pl script was
	renamed	load_ncbi_taxonomy.pl
In-Reply-To: <044213EE-E980-4E48-870D-1F2896E937B3@jays.net>
References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>
	<044213EE-E980-4E48-870D-1F2896E937B3@jays.net>
Message-ID: <4B022E68.2080304@open-bio.org>

I added you to the biosql group in the SVN server. You should be able to 
commit the patch now.

Cheers,
Mauricio.

Jay Hannah wrote:
> On Nov 16, 2009, at 4:20 PM, Jay Hannah wrote:
>> URL: svn://code.open-bio.org/biosql/biosql-schema/trunk
> 
> Oh, oops. I think I was using the wrong repo address for committing. 
> 
> I think I'm using the right address now. Now getting the error below.
> 
> Thanks,
> 
> j
> http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah
> 
> 
> 
> 
> jhannah at jaysnet-MacBook:~/src/biosql-schema-committer$ svn info
> Path: .
> URL: svn+ssh://jhannah at dev.open-bio.org/home/svn-repositories/biosql/biosql-schema/trunk
> Repository Root: svn+ssh://jhannah at dev.open-bio.org/home/svn-repositories/biosql
> Repository UUID: 77cd3915-3943-0410-8c1d-86e8b156039c
> Revision: 316
> Node Kind: directory
> Schedule: normal
> Last Changed Author: lapp
> Last Changed Rev: 316
> Last Changed Date: 2009-07-18 13:24:56 -0500 (Sat, 18 Jul 2009)
> 
> 
> jhannah at jaysnet-MacBook:~/src/biosql-schema-committer$ svn commit
> ===========================================
>  dev.open-bio.org - Authorized Access Only
> ===========================================
> Sending        INSTALL
> Sending        doc/schema-overview.txt
> Transmitting file data ..svn: Commit failed (details follow):
> svn: Can't create directory '/home/svn-repositories/biosql/db/transactions/316-1.txn': Permission denied
> svn: Your commit message was left in a temporary file:
> svn:    '/Users/jhannah/src/biosql-schema-committer/svn-commit.tmp'
> 
> 
> 
> 
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
> 


From jay at jays.net  Tue Nov 17 13:00:01 2009
From: jay at jays.net (Jay Hannah)
Date: Tue, 17 Nov 2009 07:00:01 -0600
Subject: [BioSQL-l] [patch] load_taxonomy.pl script was
	renamed	load_ncbi_taxonomy.pl
In-Reply-To: <4B022E68.2080304@open-bio.org>
References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>
	<044213EE-E980-4E48-870D-1F2896E937B3@jays.net>
	<4B022E68.2080304@open-bio.org>
Message-ID: <BE51C7D4-F886-47AF-AA45-9B887F426C40@jays.net>

On Nov 16, 2009, at 11:02 PM, Mauricio Herrera Cuadra wrote:
> I added you to the biosql group in the SVN server. You should be able to commit the patch now.

Thanks! r317 committed.  :)

j


------------------------------------------------------------------------
r317 | jhannah | 2009-11-17 06:58:07 -0600 (Tue, 17 Nov 2009) | 2 lines
Changed paths:
   M /biosql-schema/trunk/INSTALL
   M /biosql-schema/trunk/doc/schema-overview.txt

load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl.
------------------------------------------------------------------------


From biopython at maubp.freeserve.co.uk  Wed Nov 18 11:06:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 18 Nov 2009 11:06:51 +0000
Subject: [BioSQL-l] Treating GenBank source features as top level annotation
Message-ID: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>

Hello all,

Something we've just been discussing on the Biopython mailing list
is a possible change to how we parse the source features in GenBank
(or EMBL) files. This could have knock on implications for how we use
BioSQL. For anyone interested, the thread is here:
http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html

The basic observation is that GenBank files do not have any extensible
annotation block for the whole sequence. There are a few fields like
the comment, organism and taxonomy - but nothing general and
structured. Instead, it seems the NCBI etc decided to use the feature
table for this task by inventing the "source" feature. In every single
GenBank file I have ever seen with a source feature, there is only
one feature of this type and it spans the full sequence.

For example, NC_005816, Yersinia pestis biovar Microtus str. 91001
plasmid pPCP1, complete sequence:

 source      1..9609
             /organism="Yersinia pestis biovar Microtus str. 91001"
             /mol_type="genomic DNA"
             /strain="91001"
             /db_xref="taxon:229193"
             /plasmid="pPCP1"
             /biovar="Microtus"

(I reduced the white space for emailing). All of that information
makes sense as annotation for the whole sequence. In fact, the
"organism" entry is duplicated on the ORGANISM line in the
GenBank header (and the SOURCE line too).
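
To make the example concrete, here is a minimal sketch of pulling the qualifiers out of a source feature like the one above. The helper name `parse_qualifiers` is made up for illustration and is not Biopython API; it only handles quoted `/key="value"` qualifiers, which is all this example needs.

```python
import re

# Hypothetical helper (not part of Biopython): matches /key="value" pairs.
# Unquoted qualifiers such as /codon_start=1 are deliberately not handled.
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')


def parse_qualifiers(feature_text):
    """Return a dict mapping each qualifier name to a list of its values."""
    qualifiers = {}
    for key, value in QUALIFIER.findall(feature_text):
        qualifiers.setdefault(key, []).append(value)
    return qualifiers


# The source feature of NC_005816, copied from the example above.
SOURCE_FEATURE = '''
 source      1..9609
             /organism="Yersinia pestis biovar Microtus str. 91001"
             /mol_type="genomic DNA"
             /strain="91001"
             /db_xref="taxon:229193"
             /plasmid="pPCP1"
             /biovar="Microtus"
'''

source_qualifiers = parse_qualifiers(SOURCE_FEATURE)
```

Each value is kept in a list because a qualifier such as db_xref can in principle occur more than once on a feature.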

Currently we (Biopython, BioPerl etc) store this annotation in BioSQL
using the seqfeature_qualifier_value and seqfeature_dbxref tables,
associated with a "source" feature in the seqfeature table.

I am suggesting it could make more sense to store the "source"
feature annotation at the sequence level, using instead the
bioentry_qualifier_value and bioentry_dbxref tables.
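
As a rough sketch of what that would mean in code, assuming a simplified record layout of plain dicts (the function name `promote_source` and the dict keys are hypothetical, and no database is involved): the db_xref values go to a sequence-level cross-reference list, mirroring bioentry_dbxref, and everything else becomes sequence-level annotation, mirroring bioentry_qualifier_value.

```python
def promote_source(features):
    """Split features into (annotations, dbxrefs, remaining_features).

    `features` is a list of dicts with the keys "type" and "qualifiers".
    This layout is an assumption made for the sketch, not a BioSQL or
    Biopython structure.
    """
    sources = [f for f in features if f["type"] == "source"]
    if len(sources) != 1:
        # Zero or several source features: be conservative, change nothing.
        return {}, [], list(features)

    source = sources[0]
    annotations = {}
    dbxrefs = []
    for key, values in source["qualifiers"].items():
        if key == "db_xref":
            dbxrefs.extend(values)           # would go to bioentry_dbxref
        else:
            annotations[key] = list(values)  # would go to bioentry_qualifier_value
    remaining = [f for f in features if f is not source]
    return annotations, dbxrefs, remaining


# Usage with the NC_005816 source feature plus one placeholder feature.
example_features = [
    {"type": "source",
     "qualifiers": {"organism": ["Yersinia pestis biovar Microtus str. 91001"],
                    "strain": ["91001"],
                    "db_xref": ["taxon:229193"]}},
    {"type": "CDS", "qualifiers": {"note": ["placeholder feature"]}},
]
annotations, dbxrefs, remaining = promote_source(example_features)
```

Only promoting when there is exactly one source feature is a design choice of this sketch; anything else is passed through untouched.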

This is a slight shift from the origins of BioSQL as a schema to
hold GenBank files - but to me at least it is more logical.

What does everyone else think? Things work as they are...
and "if it ain't broken don't fix it"?

Peter

[Even if Biopython changes its internal object structure to treat
the "source" feature annotation as sequence level annotation,
we *could* continue to use a "source" feature when loading
GenBank files to/from BioSQL if required for compatibility with
the other Bio* projects. It would be more work though. In any
case, we'd also need to recreate a "source" feature when
writing GenBank output files.]


From biopython at maubp.freeserve.co.uk  Wed Nov 18 12:27:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 18 Nov 2009 12:27:12 +0000
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
	<8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
Message-ID: <320fb6e00911180427q79961f6ci5a43ebac9ff70f7a@mail.gmail.com>

On Wed, Nov 18, 2009 at 12:08 PM, Richard Holland
<holland at eaglegenomics.com> wrote:
>
> BioJava's latest parsers do the following:
> ...

Without checking all the details, that is broadly what Biopython does
at the moment.

> The main reason why we still use the source feature and don't go to sequence
> level is because when converting between formats it's hard to tell which
> sequence-level qualifier_values are from the source feature and which are
> from other places.

Makes sense.

> The main reason why we rely entirely on the source feature for organism
> and taxon ID info is because it's much easier to parse than the SOURCE
> and ORGANISM tags.

From memory, Biopython also uses the taxon table here.

Peter


From holland at eaglegenomics.com  Wed Nov 18 12:08:48 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 18 Nov 2009 12:08:48 +0000
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
Message-ID: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>

BioJava's latest parsers do the following:

On read:

  SOURCE and ORGANISM top-level tags are completely ignored
  For each tag in each feature, including source:
    If it's a dbxref 
       If it's taxon, set the taxon ID in the BioEntry table (if no /taxon is specified in the source feature the taxonomy does not get stored at all)
       Otherwise set dbxref as a feature CrossRef table entry
    If it's organism
       Add the organism name to the taxon ID in the Taxon table using the scientific taxon name type (if no /organism tag is specified in the source feature, the taxon gets the default name from NCBI, but only if the NCBI taxonomy data is already present in BioSQL) (if no /taxon is specified in the source feature, then the taxonomy does not get stored at all)
    Otherwise
       All other tags get mapped as feature qualifier values, including the source feature
   
On write:

   SOURCE and ORGANISM tags are generated from the BioEntry taxon ID entry for the sequence,
   All features get qualifier values output plus /db_xref tags for all entries from the CrossRef table for the feature,
   The source feature is output as per a normal feature, plus /organism and /db_xref="taxon:..." tags generated as per the SOURCE and ORGANISM tags
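
The read-side rules above can be paraphrased roughly as follows. This is a Python sketch for illustration only, not the BioJava code; the function name `route_feature_tags` and the returned dict keys are hypothetical, and the fallback to the default NCBI name is not modelled.

```python
def route_feature_tags(qualifiers):
    """Sort one feature's tags into the destinations listed above.

    `qualifiers` maps a tag name to a list of values. The result has the
    keys "taxon_id", "organism", "crossrefs" and "qualifiers".
    """
    routed = {"taxon_id": None, "organism": None,
              "crossrefs": [], "qualifiers": {}}
    for key, values in qualifiers.items():
        for value in values:
            if key == "db_xref":
                database, _, accession = value.partition(":")
                if database == "taxon":
                    routed["taxon_id"] = int(accession)
                else:
                    routed["crossrefs"].append((database, accession))
            elif key == "organism":
                routed["organism"] = value
            else:
                routed["qualifiers"].setdefault(key, []).append(value)
    if routed["taxon_id"] is None:
        # Without a taxon db_xref the taxonomy is not stored at all.
        routed["organism"] = None
    return routed


routed_source = route_feature_tags({
    "organism": ["Yersinia pestis biovar Microtus str. 91001"],
    "mol_type": ["genomic DNA"],
    "strain": ["91001"],
    "db_xref": ["taxon:229193"],
    "plasmid": ["pPCP1"],
    "biovar": ["Microtus"],
})
```

For the NC_005816 source feature this yields the taxon ID and organism name separately, with the remaining four tags left as ordinary feature qualifiers.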

The main reason why we still use the source feature and don't go to sequence level is because when converting between formats it's hard to tell which sequence-level qualifier_values are from the source feature and which are from other places. 

The main reason why we rely entirely on the source feature for organism and taxon ID info is because it's much easier to parse than the SOURCE and ORGANISM tags.

cheers,
Richard

On 18 Nov 2009, at 11:06, Peter wrote:

> Hello all,
> 
> Something we've just been discussing on the Biopython mailing list
> is a possible change to how we parse the source features in GenBank
> (or EMBL) files. This could have knock on implications for how we use
> BioSQL. For anyone interested, the thread is here:
> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html
> 
> The basic observation is that GenBank files do not have any extensible
> annotation block for the whole sequence. There are a few fields like
> the comment, organism and taxonomy - but nothing general and
> structured. Instead, it seems the NCBI etc decided to use the feature
> table for this task by inventing the "source" feature. In every single
> GenBank file I have ever seen with a source feature, there is only
> one feature of this type and it spans the full sequence.
> 
> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001
> plasmid pPCP1, complete sequence:
> 
> source      1..9609
>             /organism="Yersinia pestis biovar Microtus str. 91001"
>             /mol_type="genomic DNA"
>             /strain="91001"
>             /db_xref="taxon:229193"
>             /plasmid="pPCP1"
>             /biovar="Microtus"
> 
> (I reduced the white space for emailing). All of that information
> makes sense as annotation for the whole sequence. In fact, the
> "organism" entry is duplicated on the ORGANISM line in the
> GenBank header (and the SOURCE line too).
> 
> Currently we (Biopython, BioPerl etc) store this annotation in BioSQL
> using the seqfeature_qualifier_value and seqfeature_dbxref tables,
> associated with a "source" feature in the seqfeature table.
> 
> I am suggesting it could make more sense to store the "source"
> feature annotation at the sequence level, using instead the
> bioentry_qualifier_value and bioentry_dbxref tables.
> 
> This is a slight shift from the origins of BioSQL as a schema to
> hold GenBank files - but to me at least it is more logical.
> 
> What does everyone else think? Things work as they are...
> and "if it ain't broken don't fix it"?
> 
> Peter
> 
> [Even if Biopython changes its internal object structure to treat
> the "source" feature annotation as sequence level annotation,
> we *could* continue to use a "source" feature when loading
> GenBank files to/from BioSQL if required for compatibility with
> the other Bio* projects. It would be more work though. In any
> case, we'd also need to recreate a "source" feature when
> writing GenBank output files.]
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l

--
Richard Holland, BSc MBCS
Operations and Delivery Director, Eagle Genomics Ltd
T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/


From hlapp at gmx.net  Wed Nov 18 13:13:05 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 18 Nov 2009 08:13:05 -0500
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
Message-ID: <670ED558-7FBD-4219-A449-0D7E63BE0766@gmx.net>

I agree completely with your interpretation of the "source" feature  
tag, and in fact what you outline below is what I implemented as a  
"SeqProcessor" module for use within the SymAtlas data integration  
project (BioPerl supports 'pipes' of I/O and processing modules, where  
the latter can modify the sequence objects coming out of the I/O  
module).

I'm not sure I would want to hard-code this behavior into the BioPerl  
genbank parser. However, it would be easy enough to code it into a  
processing module that comes standard with the distribution to the  
extent that it can be enabled as simply as a format variant to SeqIO.
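
For readers who don't know the BioPerl side, the idea of such a pipe can be sketched in Python roughly like this. It is an analogy only, not the BioPerl SeqProcessor interface; `run_pipeline`, `lift_source_qualifiers` and the dict-based record layout are all made up for the illustration.

```python
def run_pipeline(records, processors):
    """Yield each record after passing it through every processor in order."""
    for record in records:
        for processor in processors:
            record = processor(record)
        yield record


def lift_source_qualifiers(record):
    """Processing step: move a lone source feature's qualifiers to the record."""
    sources = [f for f in record["features"] if f["type"] == "source"]
    if len(sources) != 1:
        return record  # nothing to do, or ambiguous: pass through unchanged
    lifted = dict(record)  # shallow copy so the input record is not modified
    lifted["annotations"] = {**record.get("annotations", {}),
                             **sources[0]["qualifiers"]}
    lifted["features"] = [f for f in record["features"] if f["type"] != "source"]
    return lifted


parsed_records = [
    {"id": "NC_005816",
     "features": [{"type": "source", "qualifiers": {"strain": ["91001"]}}]},
]
processed = list(run_pipeline(parsed_records, [lift_source_qualifiers]))
```

The parser itself stays untouched; the behaviour is switched on simply by adding the step to the list of processors.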

It sounds useful enough that I guess I should post it to the BioPerl  
list ...

	-hilmar

On Nov 18, 2009, at 6:06 AM, Peter wrote:

> Hello all,
>
> Something we've just been discussing on the Biopython mailing list
> is a possible change to how we parse the source features in GenBank
> (or EMBL) files. This could have knock on implications for how we use
> BioSQL. For anyone interested, the thread is here:
> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html
>
> The basic observation is that GenBank files do not have any extensible
> annotation block for the whole sequence. There are a few fields like
> the comment, organism and taxonomy - but nothing general and
> structured. Instead, it seems the NCBI etc decided to use the feature
> table for this task by inventing the "source" feature. In every single
> GenBank file I have ever seen with a source feature, there is only
> one feature of this type and it spans the full sequence.
>
> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001
> plasmid pPCP1, complete sequence:
>
> source      1..9609
>             /organism="Yersinia pestis biovar Microtus str. 91001"
>             /mol_type="genomic DNA"
>             /strain="91001"
>             /db_xref="taxon:229193"
>             /plasmid="pPCP1"
>             /biovar="Microtus"
>
> (I reduced the white space for emailing). All of that information
> makes sense as annotation for the whole sequence. In fact, the
> "organism" entry is duplicated on the ORGANISM line in the
> GenBank header (and the SOURCE line too).
>
> Currently we (Biopython, BioPerl etc) store this annotation in BioSQL
> using the seqfeature_qualifier_value and seqfeature_dbxref tables,
> associated with a "source" feature in the seqfeature table.
>
> I am suggesting it could make more sense to store the "source"
> feature annotation at the sequence level, using instead the
> bioentry_qualifier_value and bioentry_dbxref tables.
>
> This is a slight shift from the origins of BioSQL as a schema to
> hold GenBank files - but to me at least it is more logical.
>
> What does everyone else think? Things work as they are...
> and "if it ain't broken don't fix it"?
>
> Peter
>
> [Even if Biopython changes its internal object structure to treat
> the "source" feature annotation as sequence level annotation,
> we *could* continue to use a "source" feature when loading
> GenBank files to/from BioSQL if required for compatibility with
> the other Bio* projects. It would be more work though. In any
> case, we'd also need to recreate a "source" feature when
> writing GenBank output files.]

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Wed Nov 18 13:14:35 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 18 Nov 2009 08:14:35 -0500
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
	<8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
Message-ID: <618B39A2-1CA2-405F-A8D7-4947356BF7A5@gmx.net>


On Nov 18, 2009, at 7:08 AM, Richard Holland wrote:

>  For each tag in each feature, including source:
>    If it's a dbxref
>       If it's taxon, set the taxon ID in the BioEntry table (if no /taxon
> is specified in the source feature the taxonomy does not get stored at all)


That's what the BioPerl Genbank parser does too, though only if it's a  
"source" feature. I don't know of any other feature key that would  
have a taxon dbxref entry.

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Wed Nov 18 13:16:40 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 18 Nov 2009 08:16:40 -0500
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
Message-ID: <72A7ED62-7213-4136-90F9-74F4691F3003@gmx.net>


On Nov 18, 2009, at 6:06 AM, Peter wrote:

> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001
> plasmid pPCP1, complete sequence:
>
> source      1..9609
>             /organism="Yersinia pestis biovar Microtus str. 91001"
>             /mol_type="genomic DNA"
>             /strain="91001"
>             /db_xref="taxon:229193"
>             /plasmid="pPCP1"
>             /biovar="Microtus"


Just FYI, the sequences coming out of the barcoding projects will have  
the lat/long coordinates here, too. Those obviously pertain to the  
specimen (and hence to the whole sequence).

	-hilmar
-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From biopython at maubp.freeserve.co.uk  Wed Nov 18 13:34:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 18 Nov 2009 13:34:38 +0000
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <D8574860-3891-4D4E-ACD8-18D07FBB6E81@illinois.edu>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
	<8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
	<D8574860-3891-4D4E-ACD8-18D07FBB6E81@illinois.edu>
Message-ID: <320fb6e00911180534o2cd0126fp62527db04e0c346f@mail.gmail.com>

On Wed, Nov 18, 2009 at 1:10 PM, Chris Fields <cjfields at illinois.edu> wrote:
>
> Just to note, there are a few cases where there are two or more source features.
> This pops up mainly with chimeric sequences, for example:
>
> http://www.ncbi.nlm.nih.gov/nuccore/21727885
>
> We have run into this a couple of times on the bioperl list. In this case, each
> feature is limited to specific locations on the sequence and doesn't pertain to
> the entire sequence. NCBI only notes the first source on the ORGANISM line;
> last time I checked, EMBL used both.
>
> chris

Wow - cool example. It was worth starting this thread just to learn about
this interesting corner case. I wonder if this is a common enough case to
warrant leaving the source features as they are?

Peter


From cjfields at illinois.edu  Wed Nov 18 13:10:36 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 18 Nov 2009 07:10:36 -0600
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
	<8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
Message-ID: <D8574860-3891-4D4E-ACD8-18D07FBB6E81@illinois.edu>

Just to note, there are a few cases where there are two or more source features.  This pops up mainly with chimeric sequences, for example:

http://www.ncbi.nlm.nih.gov/nuccore/21727885

We have run into this a couple of times on the bioperl list.  In this case, each feature is limited to specific locations on the sequence and doesn't pertain to the entire sequence.  NCBI only notes the first source on the ORGANISM line; last time I checked, EMBL used both.

chris


From hlapp at gmx.net  Wed Nov 18 14:28:01 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 18 Nov 2009 09:28:01 -0500
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <D8574860-3891-4D4E-ACD8-18D07FBB6E81@illinois.edu>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
	<8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
	<D8574860-3891-4D4E-ACD8-18D07FBB6E81@illinois.edu>
Message-ID: <73839DF2-F9D2-441C-8AA3-649D619D2B1E@gmx.net>

True - for chimeric sequences you can have multiple sources. That  
should be recognizable though from the length (and span) of the source  
feature location?
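
As a rough sketch of that check, assuming the source feature locations are available as simple (start, end) pairs, 1-based and inclusive as in GenBank locations. The function names are made up, and the coordinates in the second example are invented purely to illustrate the chimeric case.

```python
def spans_whole_sequence(start, end, sequence_length):
    """True if a 1-based inclusive location covers the entire sequence."""
    return start == 1 and end == sequence_length


def whole_sequence_sources(source_spans, sequence_length):
    """Return the source spans that cover the entire sequence."""
    return [span for span in source_spans
            if spans_whole_sequence(span[0], span[1], sequence_length)]


# NC_005816: a single source feature, 1..9609, on a 9609 bp sequence.
plasmid_sources = whole_sequence_sources([(1, 9609)], 9609)

# Invented coordinates for a chimeric record with two partial sources.
chimeric_sources = whole_sequence_sources([(1, 300), (301, 1000)], 1000)
```

A source feature would then only be treated as sequence-level annotation when it appears in the returned list; partial sources stay as ordinary features.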

	-hilmar

On Nov 18, 2009, at 8:10 AM, Chris Fields wrote:

> Just to note, there are a few cases where there are two or more  
> source features.  This pops up mainly with chimeric sequences, for  
> example:
>
> http://www.ncbi.nlm.nih.gov/nuccore/21727885
>
> We have run into this a couple of times on the bioperl list.  In  
> this case, each feature is limited to specific locations on the  
> sequence and doesn't pertain to the entire sequence.  NCBI only  
> notes the first source on the ORGANISM line; last time I checked,  
> EMBL used both.
>
> chris

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From hlapp at gmx.net  Wed Nov 18 16:50:04 2009
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed, 18 Nov 2009 11:50:04 -0500
Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed
	load_ncbi_taxonomy.pl
In-Reply-To: <BE51C7D4-F886-47AF-AA45-9B887F426C40@jays.net>
References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>
	<044213EE-E980-4E48-870D-1F2896E937B3@jays.net>
	<4B022E68.2080304@open-bio.org>
	<BE51C7D4-F886-47AF-AA45-9B887F426C40@jays.net>
Message-ID: <03730DF5-706B-47F6-A191-8352E38CA42C@gmx.net>

Hi Jay - thanks much for the patch, highly appreciated! -hilmar

On Nov 17, 2009, at 8:00 AM, Jay Hannah wrote:

> On Nov 16, 2009, at 11:02 PM, Mauricio Herrera Cuadra wrote:
>> I added you to the biosql group in the SVN server. You should be  
>> able to commit the patch now.
>
> Thanks! r317 committed.  :)
>
> j
>
>
>
> ------------------------------------------------------------------------
> r317 | jhannah | 2009-11-17 06:58:07 -0600 (Tue, 17 Nov 2009) | 2 lines
> Changed paths:
>   M /biosql-schema/trunk/INSTALL
>   M /biosql-schema/trunk/doc/schema-overview.txt
>
> load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl.
> ------------------------------------------------------------------------

-- 
===========================================================
: Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
===========================================================


From maruco at gmail.com  Mon Nov  9 14:35:31 2009
From: maruco at gmail.com (Thiago Satake)
Date: Mon, 9 Nov 2009 09:35:31 -0500
Subject: [BioSQL-l] Project participation
In-Reply-To: <82611efb0911090447y2aa1ae16x1106a41b7a0e2a1@mail.gmail.com>
References: <82611efb0911090447y2aa1ae16x1106a41b7a0e2a1@mail.gmail.com>
Message-ID: <eff79fb40911090635n7f9dc07bx252810784727a732@mail.gmail.com>

Hi guys,

It sounds like a very interesting project!

Where can I find more information about it?

Thanks,


2009/11/9 »»»Jiji Kurup««« <jijibio at gmail.com>:
> Hi Lapp,
>
> I am very much interested to be a part of  "JEE5 webservice interface to
> BioSQL" and "BioSQL web interface and API on Google App Engine"  project.
> But i am not a student now, i am working in a bioinformatics company, so
> whether it is possible to do participate in any of this projects.
>
> Kindly let me known if there is any provision for it and tell me the
> procedure also.
>
>
>
>
> --
> Regards,
>
> Jiji Kurup
> Application Scientist


-- 
Thiago Seito Satake
Tel: +55(011) 6588-8045


From cjfields at illinois.edu  Wed Nov 18 17:40:21 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Wed, 18 Nov 2009 11:40:21 -0600
Subject: [BioSQL-l] Treating GenBank source features as top level
	annotation
In-Reply-To: <73839DF2-F9D2-441C-8AA3-649D619D2B1E@gmx.net>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
	<8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
	<D8574860-3891-4D4E-ACD8-18D07FBB6E81@illinois.edu>
	<73839DF2-F9D2-441C-8AA3-649D619D2B1E@gmx.net>
Message-ID: <BE72D1D3-74FE-4AFF-BD2F-BA5110B511BD@illinois.edu>

Yes; the location appears to specify regions of sequence originating from the indicated source.  

chris

On Nov 18, 2009, at 8:28 AM, Hilmar Lapp wrote:

> True - for chimeric sequences you can have multiple sources. That should be recognizable though from the length (and span) of the source feature location?
> 
> 	-hilmar
> 
> On Nov 18, 2009, at 8:10 AM, Chris Fields wrote:
> 
>> Just to note, there are a few cases where there are two or more source features.  This pops up mainly with chimeric sequences, for example:
>> 
>> http://www.ncbi.nlm.nih.gov/nuccore/21727885
>> 
>> We have run into this a couple of times on the bioperl list.  In this case, each feature is limited to specific locations on the sequence and doesn't pertain to the entire sequence.  NCBI only notes the first source on the ORGANISM line; last time I checked, EMBL used both.
>> 
>> chris
>> 
>> On Nov 18, 2009, at 6:08 AM, Richard Holland wrote:
>> 
>>> BioJava's latest parsers do the following:
>>> 
>>> On read:
>>> 
>>> SOURCE and ORGANISM top-level tags are completely ignored
>>> For each tag in each feature, including source:
>>>  If it's a dbxref
>>>     If it's taxon, set the taxon ID in the BioEntry table (if no /taxon is specified in the source feature the taxonomy does not get stored at all)
>>>     Otherwise set dbxref as a feature CrossRef table entry
>>>  If it's organism
>>>     Add the organism name to the taxon ID in the Taxon table using the scientific taxon name type (if no /organism tag is specified in the source feature, the taxon gets the default name from NCBI, but only if the NCBI taxonomy data is already present in BioSQL) (if no /taxon is specified in the source feature, then the taxonomy does not get stored at all)
>>>  Otherwise
>>>     All other tags get mapped as feature qualifier values, including the source feature
>>> 
>>> On write:
>>> 
>>> SOURCE and ORGANISM tags are generated from the BioEntry taxon ID entry for the sequence,
>>> All features get qualifier values output plus /db_xref tags for all entries from the CrossRef table for the feature,
>>> The source feature is output as per a normal feature, plus /organism and /db_xref="taxon:..." tags generated as per the SOURCE and ORGANISM tags
>>> 
>>> The main reason why we still use the source feature and don't go to sequence level is because when converting between formats it's hard to tell which sequence-level qualifier_values are from the source feature and which are from other places.
>>> 
>>> The main reason why we rely entirely on the source feature for organism and taxon ID info is because it's much easier to parse than the SOURCE and ORGANISM tags.
>>> 
>>> cheers,
>>> Richard
>>> 
>>> On 18 Nov 2009, at 11:06, Peter wrote:
>>> 
>>>> Hello all,
>>>> 
>>>> Something we've just been discussing on the Biopython mailing list
>>>> is a possible change to how we parse the source features in GenBank
>>>> (or EMBL) files. This could have knock-on implications for how we use
>>>> BioSQL. For anyone interested, the thread is here:
>>>> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html
>>>> 
>>>> The basic observation is that GenBank files do not have any extensible
>>>> annotation block for the whole sequence. There are a few fields like
>>>> the comment, organism and taxonomy - but nothing general and
>>>> structured. Instead, it seems the NCBI etc decided to use the feature
>>>> table for this task by inventing the "source" feature. In every single
>>>> GenBank file I have ever seen with a source feature, there is only
>>>> one feature of this type and it spans the full sequence.
>>>> 
>>>> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001
>>>> plasmid pPCP1, complete sequence:
>>>> 
>>>> source      1..9609
>>>>          /organism="Yersinia pestis biovar Microtus str. 91001"
>>>>          /mol_type="genomic DNA"
>>>>          /strain="91001"
>>>>          /db_xref="taxon:229193"
>>>>          /plasmid="pPCP1"
>>>>          /biovar="Microtus"
>>>> 
>>>> (I reduced the white space for emailing). All of that information
>>>> makes sense as annotation for the whole sequence. In fact, the
>>>> "organism" entry is duplicated on the ORGANISM line in the
>>>> GenBank header (and the SOURCE line too).
>>>> 
>>>> Currently we (Biopython, BioPerl etc) store this annotation in BioSQL
>>>> using the seqfeature_qualifier_value and seqfeature_dbxref tables,
>>>> associated with a "source" feature in the seqfeature table.
>>>> 
>>>> I am suggesting it could make more sense to store the "source"
>>>> feature annotation at the sequence level, using instead the
>>>> bioentry_qualifier_value and bioentry_dbxref tables.
>>>> 
>>>> This is a slight shift from the origins of BioSQL as a schema to
>>>> hold GenBank files - but to me at least it is more logical.
>>>> 
>>>> What does everyone else think? Things work as they are...
>>>> and "if it ain't broken don't fix it"?
>>>> 
>>>> Peter
>>>> 
>>>> [Even if Biopython changes its internal object structure to treat
>>>> the "source" feature annotation as sequence level annotation,
>>>> we *could* continue to use a "source" feature when loading
>>>> GenBank files to/from BioSQL if required for compatibility with
>>>> the other Bio* projects. It would be more work though. In any
>>>> case, we'd also need to recreate a "source" feature when
>>>> writing GenBank output files.]
>>> 
>>> --
>>> Richard Holland, BSc MBCS
>>> Operations and Delivery Director, Eagle Genomics Ltd
>>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com
>>> http://www.eaglegenomics.com/
>>> 
>>> 
>> 
>> 
>> 
> 
> -- 
> ===========================================================
> : Hilmar Lapp  -:-  Durham, NC  -:-  hlapp at gmx dot net :
> ===========================================================
> 
> 
> 


From biopython at maubp.freeserve.co.uk  Tue Nov 24 14:27:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Nov 2009 14:27:39 +0000
Subject: [BioSQL-l] SQLite support
In-Reply-To: <320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com>
References: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com>
	<320fb6e00812150243w4b0dc223g40abcf684af1ccf5@mail.gmail.com>
	<320fb6e00907050324i6d64d3abreb4d0c256bf1bdc4@mail.gmail.com>
	<320fb6e00907090529t61239952y1c86963f13c1db78@mail.gmail.com>
	<320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com>
Message-ID: <320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com>

On Tue, Jul 28, 2009 at 11:58 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Jul 9, 2009 at 1:29 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>> Hi Hilmar,
>>
>> I've filed a BioSQL enhancement bug 2870 for adding an SQLite
>> schema to BioSQL, and Brad has attached his proposed schema
>> (converted from that for MySQL) to the bug:
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2870
>>
>> Could you take a look at this please? If you are happy with it, we'd like
>> to have it included in BioSQL v1.0.2 (even if Biopython is initially the
>> only Bio* project to support it).
>
> Have you had a chance to look at this yet Hilmar? Brad is keen to
> include BioSQL support for SQLite in the next release of Biopython
> (hopefully within the next week or two), but to do this I'd like your
> blessing, and for the proposed SQLite BioSQL schema to be added
> to the BioSQL SVN repository.

Hi again Hilmar,

Just a reminder about the BioSQL on SQLite proposals - we'd still
like to ship this with the *next* Biopython release (having skipped it
for Biopython 1.52 a couple of months back).

Regards,

Peter


From cjfields at illinois.edu  Tue Nov 24 16:36:33 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 24 Nov 2009 10:36:33 -0600
Subject: [BioSQL-l] SQLite support
In-Reply-To: <320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com>
References: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com>
	<320fb6e00812150243w4b0dc223g40abcf684af1ccf5@mail.gmail.com>
	<320fb6e00907050324i6d64d3abreb4d0c256bf1bdc4@mail.gmail.com>
	<320fb6e00907090529t61239952y1c86963f13c1db78@mail.gmail.com>
	<320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com>
	<320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com>
Message-ID: <070E8BA8-B2C1-4E44-AA2D-9934B3742406@illinois.edu>

On Nov 24, 2009, at 8:27 AM, Peter wrote:

> On Tue, Jul 28, 2009 at 11:58 AM, Peter <biopython at maubp.freeserve.co.uk> wrote:
>> On Thu, Jul 9, 2009 at 1:29 PM, Peter<biopython at maubp.freeserve.co.uk> wrote:
>>> Hi Hilmar,
>>> 
>>> I've filed a BioSQL enhancement bug 2870 for adding an SQLite
>>> schema to BioSQL, and Brad has attached his proposed schema
>>> (converted from that for MySQL) to the bug:
>>> 
>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2870
>>> 
>>> Could you take a look at this please? If you are happy with it, we'd like
>>> to have it included in BioSQL v1.0.2 (even if Biopython is initially the
>>> only Bio* project to support it).
>> 
>> Have you had a chance to look at this yet Hilmar? Brad is keen to
>> include BioSQL support for SQLite in the next release of Biopython
>> (hopefully within the next week or two), but to do this I'd like your
>> blessing, and for the proposed SQLite BioSQL schema to be added
>> to the BioSQL SVN repository.
> 
> Hi again Hilmar,
> 
> Just a reminder about the BioSQL on SQLite proposals - we'd still
> like to ship this with the *next* Biopython release (having skipped it
> for Biopython 1.52 a couple of months back).
> 
> Regards,
> 
> Peter

Just want to add that I would like to see SQLite support as well (I might even feel the need to implement the necessary bioperl-db bits).

chris


From biopython at maubp.freeserve.co.uk  Tue Nov 24 17:07:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Nov 2009 17:07:19 +0000
Subject: [BioSQL-l] SQLite support
In-Reply-To: <070E8BA8-B2C1-4E44-AA2D-9934B3742406@illinois.edu>
References: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com>
	<320fb6e00812150243w4b0dc223g40abcf684af1ccf5@mail.gmail.com>
	<320fb6e00907050324i6d64d3abreb4d0c256bf1bdc4@mail.gmail.com>
	<320fb6e00907090529t61239952y1c86963f13c1db78@mail.gmail.com>
	<320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com>
	<320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com>
	<070E8BA8-B2C1-4E44-AA2D-9934B3742406@illinois.edu>
Message-ID: <320fb6e00911240907u32dca751ldb488cbc38f0e035@mail.gmail.com>

On Tue, Nov 24, 2009 at 4:36 PM, Chris Fields <cjfields at illinois.edu> wrote:
>
> Just want to add that I would like to see SQLite support as well
> (I might even feel the need to implement the necessary bioperl-db bits).

Excellent :)

Peter


From desouza at ncbi.nlm.nih.gov  Thu Nov 19 21:58:45 2009
From: desouza at ncbi.nlm.nih.gov (De souza, Robson (NIH/NLM/NCBI) [F])
Date: Thu, 19 Nov 2009 16:58:45 -0500
Subject: [BioSQL-l] update ontology
Message-ID: <340BA68A9AA4F548881B3CCC5435061B45A6A2@NIHCESMLBX15.nih.gov>

Hi guys,

I'm trying to use a BioSQL database to store some protein annotation and
need to know a few things:

- First thing: I want to be able to update the ontologies we are working
on automatically, but I could not get bp_load_ontology.pl to do it.
What I want is to replace any changed terms and their relationships with
new ones without losing the association between unmodified terms and the
annotated proteins. I still don't know what a scriptlet for --mergeobjs
should look like, or whether such a scriptlet is the way to go in this case.

- How do I represent protein domains in BioSQL?
I was thinking of writing code to add domains as bioentries and then use
bioentry_relationship to associate sequence and domain, but I would also
like to store the coordinates of each domain in a protein, which would
imply associating bioentries with seqfeatures. Does any of you have
another suggestion in this direction?

Thanks!
Robson


From biopython at maubp.freeserve.co.uk  Wed Nov 25 21:39:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 25 Nov 2009 21:39:39 +0000
Subject: [BioSQL-l] update ontology
In-Reply-To: <340BA68A9AA4F548881B3CCC5435061B45A6A2@NIHCESMLBX15.nih.gov>
References: <340BA68A9AA4F548881B3CCC5435061B45A6A2@NIHCESMLBX15.nih.gov>
Message-ID: <320fb6e00911251339j700517cckac3ddb0adc323f00@mail.gmail.com>

On Thu, Nov 19, 2009 at 9:58 PM, De souza, Robson (NIH/NLM/NCBI) [F]
<desouza at ncbi.nlm.nih.gov> wrote:
>
>
> - How do I represent protein domains in BioSQL?
> I was thinking of writing code to add domains as bioentries and then use
> bioentry_relationship to associate sequence and domain, but I would also
> like to store the coordinates of each domain in a protein, which would
> imply associating bioentries with seqfeatures. Does any of you have
> another suggestion in this direction?

I would do what Biopython/BioPerl/Bio* would do on loading
a GenPept file into BioSQL - have a bioentry for each protein,
with its amino acid sequence, and for each domain a seqfeature
entry (which records the location within the parent protein).
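
As a rough illustration of that layout, here is a sketch using Python's sqlite3 module with a heavily cut-down set of tables. The table names follow BioSQL (bioentry, biosequence, seqfeature, location), but the column lists are simplified assumptions rather than the real DDL: in the actual schema the feature type is stored through term table keys, which is replaced here by a plain display_name. The accession, sequence and domain coordinates are made up.

```python
import sqlite3

# Simplified stand-in for part of the BioSQL schema (NOT the real DDL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL
);
CREATE TABLE biosequence (
    bioentry_id INTEGER PRIMARY KEY REFERENCES bioentry(bioentry_id),
    alphabet    TEXT,
    seq         TEXT
);
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    bioentry_id   INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    display_name  TEXT
);
CREATE TABLE location (
    location_id   INTEGER PRIMARY KEY,
    seqfeature_id INTEGER NOT NULL REFERENCES seqfeature(seqfeature_id),
    start_pos     INTEGER,
    end_pos       INTEGER
);
""")
cur = conn.cursor()

# One bioentry per protein, with its amino acid sequence (made-up data).
cur.execute("INSERT INTO bioentry (accession) VALUES (?)", ("EXAMPLE_1",))
protein_id = cur.lastrowid
cur.execute(
    "INSERT INTO biosequence (bioentry_id, alphabet, seq) VALUES (?, ?, ?)",
    (protein_id, "protein", "MKTAYIAKQRQISFVKSHFSRQ"),
)

# One seqfeature per domain; its location holds the coordinates
# within the parent protein.
for name, start, end in [("domain A", 2, 10), ("domain B", 12, 20)]:
    cur.execute(
        "INSERT INTO seqfeature (bioentry_id, display_name) VALUES (?, ?)",
        (protein_id, name),
    )
    cur.execute(
        "INSERT INTO location (seqfeature_id, start_pos, end_pos) "
        "VALUES (?, ?, ?)",
        (cur.lastrowid, start, end),
    )

# List the domains of the protein together with their coordinates.
rows = cur.execute(
    "SELECT f.display_name, l.start_pos, l.end_pos "
    "FROM seqfeature f JOIN location l ON l.seqfeature_id = f.seqfeature_id "
    "WHERE f.bioentry_id = ? ORDER BY l.start_pos",
    (protein_id,),
).fetchall()
print(rows)
```

With this layout the coordinates live with the feature, so no link between separate domain bioentries and seqfeatures is needed.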

Peter