From jijibio at gmail.com Mon Nov 9 07:47:20 2009
From: jijibio at gmail.com (»»»Jiji Kurup«««)
Date: Mon, 9 Nov 2009 18:17:20 +0530
Subject: [BioSQL-l] Project participation
Message-ID: <82611efb0911090447y2aa1ae16x1106a41b7a0e2a1@mail.gmail.com>

Hi Lapp,

I am very interested in taking part in the "JEE5 webservice interface to BioSQL" and "BioSQL web interface and API on Google App Engine" projects. However, I am not a student now; I am working in a bioinformatics company. Is it still possible for me to participate in either of these projects? Kindly let me know if there is any provision for this, and what the procedure is.

--
Regards,
Jiji Kurup
Application Scientist

From jay at jays.net Mon Nov 16 17:20:33 2009
From: jay at jays.net (Jay Hannah)
Date: Mon, 16 Nov 2009 16:20:33 -0600
Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl
Message-ID: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>

Can someone activate my ('jhannah') commit bit for this biosql-schema? I can commit to bioperl-live, but not biosql-schema.

Or apply the patch below for me?

Thanks,

j
http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah

jhannah at jaysnet-MacBook:~/src/biosql-schema$ svn info
Path: .
URL: svn://code.open-bio.org/biosql/biosql-schema/trunk
Repository Root: svn://code.open-bio.org/biosql
Repository UUID: 77cd3915-3943-0410-8c1d-86e8b156039c
Revision: 316
Node Kind: directory
Schedule: normal
Last Changed Author: lapp
Last Changed Rev: 316
Last Changed Date: 2009-07-18 13:24:56 -0500 (Sat, 18 Jul 2009)

jhannah at jaysnet-MacBook:~/src/biosql-schema$ svn diff
Index: doc/schema-overview.txt
===================================================================
--- doc/schema-overview.txt (revision 316)
+++ doc/schema-overview.txt (working copy)
@@ -150,7 +150,7 @@
 structure of NCBI's taxonomy database. Each bioentry can be associated with only one taxon, but many bioentries can be associated with the same taxon.
 In order to get the most value from these tables
-it's recommended that you use the BioSQL script load_taxonomy.pl
+it's recommended that you use the BioSQL script load_ncbi_taxonomy.pl
 to populate them. The taxon_name.taxon_id field is meant to store an NCBI
@@ -165,7 +165,7 @@
 parent_taxon_id contains the taxon id of the parent taxon, since there should only be one parent in the taxonomic tree. The right_value and left_value fields store values that are calculated and entered by the
-load_taxonomy.pl script. These arbitrary values are the upper and
+load_ncbi_taxonomy.pl script. These arbitrary values are the upper and
 lower bounds of "nested sets", one set for each taxa, where the set of the child taxa is contained within the larger set of the parent taxon. An example would be the set for the species Procyon lotor,
Index: INSTALL
===================================================================
--- INSTALL (revision 316)
+++ INSTALL (working copy)
@@ -449,7 +449,7 @@
 With bioperl and bioperl-db installed you are ready to load some data. It is advisable to pre-load the NCBI taxonomy database (use
-scripts/load_taxonomy.pl in the biosql-schema package, the details are
+scripts/load_ncbi_taxonomy.pl in the biosql-schema package, the details are
 in its documentation). Otherwise you'll see errors from misparsed organisms.

jhannah at jaysnet-MacBook:~/src/biosql-schema$ svn commit
svn: Commit failed (details follow):
svn: Authorization failed

From jay at jays.net Mon Nov 16 18:22:28 2009
From: jay at jays.net (Jay Hannah)
Date: Mon, 16 Nov 2009 17:22:28 -0600
Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl
In-Reply-To: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>
References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net>
Message-ID: <044213EE-E980-4E48-870D-1F2896E937B3@jays.net>

On Nov 16, 2009, at 4:20 PM, Jay Hannah wrote:
> URL: svn://code.open-bio.org/biosql/biosql-schema/trunk

Oh, oops.
I think I was using the wrong repo address for committing.

I think I'm using the right address now. Now getting the error below.

Thanks,

j
http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah

jhannah at jaysnet-MacBook:~/src/biosql-schema-committer$ svn info
Path: .
URL: svn+ssh://jhannah at dev.open-bio.org/home/svn-repositories/biosql/biosql-schema/trunk
Repository Root: svn+ssh://jhannah at dev.open-bio.org/home/svn-repositories/biosql
Repository UUID: 77cd3915-3943-0410-8c1d-86e8b156039c
Revision: 316
Node Kind: directory
Schedule: normal
Last Changed Author: lapp
Last Changed Rev: 316
Last Changed Date: 2009-07-18 13:24:56 -0500 (Sat, 18 Jul 2009)

jhannah at jaysnet-MacBook:~/src/biosql-schema-committer$ svn commit
===========================================
dev.open-bio.org - Authorized Access Only
===========================================
Sending INSTALL
Sending doc/schema-overview.txt
Transmitting file data ..svn: Commit failed (details follow):
svn: Can't create directory '/home/svn-repositories/biosql/db/transactions/316-1.txn': Permission denied
svn: Your commit message was left in a temporary file:
svn: '/Users/jhannah/src/biosql-schema-committer/svn-commit.tmp'

From mauricio at open-bio.org Tue Nov 17 00:02:32 2009
From: mauricio at open-bio.org (Mauricio Herrera Cuadra)
Date: Mon, 16 Nov 2009 23:02:32 -0600
Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl
In-Reply-To: <044213EE-E980-4E48-870D-1F2896E937B3@jays.net>
References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net> <044213EE-E980-4E48-870D-1F2896E937B3@jays.net>
Message-ID: <4B022E68.2080304@open-bio.org>

I added you to the biosql group in the SVN server. You should be able to commit the patch now.

Cheers,
Mauricio.

Jay Hannah wrote:
> On Nov 16, 2009, at 4:20 PM, Jay Hannah wrote:
>> URL: svn://code.open-bio.org/biosql/biosql-schema/trunk
>
> Oh, oops. I think I was using the wrong repo address for committing.

From jay at jays.net Tue Nov 17 08:00:01 2009
From: jay at jays.net (Jay Hannah)
Date: Tue, 17 Nov 2009 07:00:01 -0600
Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl
In-Reply-To: <4B022E68.2080304@open-bio.org>
References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net> <044213EE-E980-4E48-870D-1F2896E937B3@jays.net> <4B022E68.2080304@open-bio.org>
Message-ID:

On Nov 16, 2009, at 11:02 PM, Mauricio Herrera Cuadra wrote:
> I added you to the biosql group in the SVN server. You should be able to commit the patch now.

Thanks! r317 committed.
:)

j

------------------------------------------------------------------------
r317 | jhannah | 2009-11-17 06:58:07 -0600 (Tue, 17 Nov 2009) | 2 lines
Changed paths:
   M /biosql-schema/trunk/INSTALL
   M /biosql-schema/trunk/doc/schema-overview.txt

load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl.
------------------------------------------------------------------------

From biopython at maubp.freeserve.co.uk Wed Nov 18 06:06:51 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 18 Nov 2009 11:06:51 +0000
Subject: [BioSQL-l] Treating GenBank source features as top level annotation
Message-ID: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>

Hello all,

Something we've just been discussing on the Biopython mailing list is a possible change to how we parse the source features in GenBank (or EMBL) files. This could have knock on implications for how we use BioSQL. For anyone interested, the thread is here:
http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html

The basic observation is that GenBank files do not have any extensible annotation block for the whole sequence. There are a few fields like the comment, organism and taxonomy - but nothing general and structured. Instead, it seems the NCBI etc decided to use the feature table for this task by inventing the "source" feature. In every single GenBank file I have ever seen with a source feature, there is only one feature of this type and it spans the full sequence.

For example, NC_005816, Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence:

source 1..9609
  /organism="Yersinia pestis biovar Microtus str. 91001"
  /mol_type="genomic DNA"
  /strain="91001"
  /db_xref="taxon:229193"
  /plasmid="pPCP1"
  /biovar="Microtus"

(I reduced the white space for emailing). All of that information makes sense as annotation for the whole sequence. In fact, the "organism" entry is duplicated on the ORGANISM line in the GenBank header (and the SOURCE line too).
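
To make the example concrete, here is a minimal Python sketch of reading that source feature as annotation for the whole sequence. It uses only the standard library and is not how Biopython's parser works; the helper name `split_source_feature` and the regular expression are made up for illustration, and the sketch only handles simple one-line `/key="value"` qualifiers. The /db_xref entries are kept apart from the other qualifiers because cross-references and qualifier values are stored separately.

```python
import re

# The "source" feature from NC_005816, as quoted above (white space reduced).
SOURCE_FEATURE = '''\
source 1..9609
/organism="Yersinia pestis biovar Microtus str. 91001"
/mol_type="genomic DNA"
/strain="91001"
/db_xref="taxon:229193"
/plasmid="pPCP1"
/biovar="Microtus"
'''

# Matches a simple one-line qualifier such as /strain="91001".
QUALIFIER = re.compile(r'^/(\w+)="(.*)"$')


def split_source_feature(text):
    """Hypothetical helper: return (location, qualifiers, dbxrefs)."""
    lines = [line.strip() for line in text.strip().splitlines()]
    key, location = lines[0].split()
    if key != "source":
        raise ValueError("expected a source feature, got %r" % key)
    qualifiers = {}
    dbxrefs = []
    for line in lines[1:]:
        match = QUALIFIER.match(line)
        if match is None:
            continue  # this sketch skips continuation lines and flag qualifiers
        name, value = match.groups()
        if name == "db_xref":
            # "taxon:229193" becomes ("taxon", "229193")
            db, _, accession = value.partition(":")
            dbxrefs.append((db, accession))
        else:
            qualifiers.setdefault(name, []).append(value)
    return location, qualifiers, dbxrefs


location, qualifiers, dbxrefs = split_source_feature(SOURCE_FEATURE)
print(location)
print(qualifiers["organism"])
print(dbxrefs)
```

Everything in `qualifiers` and `dbxrefs` describes the record as a whole, which is the point of the proposal.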

Currently we (Biopython, BioPerl etc) store this annotation in BioSQL using the seqfeature_qualifier_value and seqfeature_dbxref tables, associated with a "source" feature in the seqfeature table.

I am suggesting it could make more sense to store the "source" feature annotation at the sequence level, using instead the bioentry_qualifier_value and bioentry_dbxref tables.

This is a slight shift from the origins of BioSQL as a schema to hold GenBank files - but to me at least it is more logical.

What does everyone else think? Things work as they are... and "if it ain't broken don't fix it"?

Peter

[Even if Biopython changes its internal object structure to treat the "source" feature annotation as sequence level annotation, we *could* continue to use a "source" feature when loading GenBank files to/from BioSQL if required for compatibility with the other Bio* projects. It would be more work though. In any case, we'd also need to recreate a "source" feature when writing GenBank output files.]

From biopython at maubp.freeserve.co.uk Wed Nov 18 07:27:12 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 18 Nov 2009 12:27:12 +0000
Subject: [BioSQL-l] Treating GenBank source features as top level annotation
In-Reply-To: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
Message-ID: <320fb6e00911180427q79961f6ci5a43ebac9ff70f7a@mail.gmail.com>

On Wed, Nov 18, 2009 at 12:08 PM, Richard Holland wrote:
>
> BioJava's latest parsers do the following:
> ...

Without checking all the details, that is broadly what Biopython does at the moment.

> The main reason why we still use the source feature and don't go to sequence
> level is because when converting between formats it's hard to tell which
> sequence-level qualifier_values are from the source feature and which are
> from other places.

Makes sense.

> The main reason why we rely entirely on the source feature for organism
> and taxon ID info is because it's much easier to parse than the SOURCE
> and ORGANISM tags.

From memory, Biopython also uses the taxon table here too.

Peter

From holland at eaglegenomics.com Wed Nov 18 07:08:48 2009
From: holland at eaglegenomics.com (Richard Holland)
Date: Wed, 18 Nov 2009 12:08:48 +0000
Subject: [BioSQL-l] Treating GenBank source features as top level annotation
In-Reply-To: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
Message-ID: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>

BioJava's latest parsers do the following:

On read:

SOURCE and ORGANISM top-level tags are completely ignored
For each tag in each feature, including source:
  If it's a dbxref
    If it's taxon, set the taxon ID in the BioEntry table (if no /taxon is specified in the source feature the taxonomy does not get stored at all)
    Otherwise set dbxref as a feature CrossRef table entry
  If it's organism
    Add the organism name to the taxon ID in the Taxon table using the scientific taxon name type (if no /organism tag is specified in the source feature, the taxon gets the default name from NCBI, but only if the NCBI taxonomy data is already present in BioSQL) (if no /taxon is specified in the source feature, then the taxonomy does not get stored at all)
  Otherwise
    All other tags get mapped as feature qualifier values, including the source feature

On write:

SOURCE and ORGANISM tags are generated from the BioEntry taxon ID entry for the sequence,
All features get qualifier values output plus /db_xref tags for all entries from the CrossRef table for the feature,
The source feature is output as per a normal feature, plus /organism and /db_xref="taxon:..."
tags generated as per the SOURCE and ORGANISM tags The main reason why we still use the source feature and don't go to sequence level is because when converting between formats it's hard to tell which sequence-level qualifier_values are from the source feature and which are from other places. The main reason why we rely entirely on the source feature for organism and taxon ID info is because it's much easier to parse than the SOURCE and ORGANISM tags. cheers, Richard On 18 Nov 2009, at 11:06, Peter wrote: > Hello all, > > Something we've just been discussing on the Biopython mailing list > is a possible change to how we parse the source features in GenBank > (or EMBL) files. This could have knock on implications for how we use > BioSQL. For anyone interested, the thread is here: > http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html > > The basic observation is that GenBank files do not have any extensible > annotation block for the whole sequence. There are a few fields like > the comment, organism and taxonomy - but nothing general and > structured. Instead, it seems the NCBI etc decided to use the feature > table for this task by inventing the "source" feature. In every single > GenBank file I have ever seen with a source feature, there is only > one feature of this type and it spans the full sequence. > > For example, NC_005816, Yersinia pestis biovar Microtus str. 91001 > plasmid pPCP1, complete sequence: > > source 1..9609 > /organism="Yersinia pestis biovar Microtus str. 91001" > /mol_type="genomic DNA" > /strain="91001" > /db_xref="taxon:229193" > /plasmid="pPCP1" > /biovar="Microtus" > > (I reduced the white space for emailing). All of that information > makes sense as annotation for the whole sequence. In fact, the > "organism" entry is duplicated on the ORGANISM line in the > GenBank header (and the SOURCE line too). 
> _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From hlapp at gmx.net Wed Nov 18 08:13:05 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 18 Nov 2009 08:13:05 -0500 Subject: [BioSQL-l] Treating GenBank source features as top level annotation In-Reply-To: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> Message-ID: <670ED558-7FBD-4219-A449-0D7E63BE0766@gmx.net> I agree completely with your interpretation of the "source" feature tag, and in fact what you outline below is what I implemented as a "SeqProcessor" module for use within the SymAtlas data integration project (BioPerl supports 'pipes' of I/O and processing modules, where the latter can modify the sequence objects coming out of the I/O module). I'm not sure I would want to hard-code this behavior into the BioPerl genbank parser. However, it would be easy enough to code it into a processing module that comes standard with the distribution to the extent that it can be enabled as simply as a format variant to SeqIO. It sounds useful enough that I guess I should post it to the BioPerl list ... -hilmar On Nov 18, 2009, at 6:06 AM, Peter wrote: > Hello all, > > Something we've just been discussing on the Biopython mailing list > is a possible change to how we parse the source features in GenBank > (or EMBL) files. This could have knock on implications for how we use > BioSQL. For anyone interested, the thread is here: > http://lists.open-bio.org/pipermail/biopython/2009-November/ > 005826.html > > The basic observation is that GenBank files do not have any extensible > annotation block for the whole sequence. 
> _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Wed Nov 18 08:14:35 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 18 Nov 2009 08:14:35 -0500 Subject: [BioSQL-l] Treating GenBank source features as top level annotation In-Reply-To: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> Message-ID: <618B39A2-1CA2-405F-A8D7-4947356BF7A5@gmx.net> On Nov 18, 2009, at 7:08 AM, Richard Holland wrote: > For each tag in each feature, including source: > If it's a dbxref > If it's taxon, set the taxon ID in the BioEntry table (if no / > taxon is specified in the source feature the taxonomy does not get > stored at all) That's what the BioPerl Genbank parser does too, though only if it's a "source" feature. I don't know of any other feature key that would have a taxon dbxref entry. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Wed Nov 18 08:16:40 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 18 Nov 2009 08:16:40 -0500 Subject: [BioSQL-l] Treating GenBank source features as top level annotation In-Reply-To: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> Message-ID: <72A7ED62-7213-4136-90F9-74F4691F3003@gmx.net> On Nov 18, 2009, at 6:06 AM, Peter wrote: > For example, NC_005816, Yersinia pestis biovar Microtus str. 
91001
> plasmid pPCP1, complete sequence:
>
> source 1..9609
>   /organism="Yersinia pestis biovar Microtus str. 91001"
>   /mol_type="genomic DNA"
>   /strain="91001"
>   /db_xref="taxon:229193"
>   /plasmid="pPCP1"
>   /biovar="Microtus"

Just FYI, the sequences coming out of the barcoding projects will have the lat/long coordinates here, too. Those obviously pertain to the specimen (and hence to the whole sequence).

-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Wed Nov 18 08:34:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 18 Nov 2009 13:34:38 +0000
Subject: [BioSQL-l] Treating GenBank source features as top level annotation
In-Reply-To:
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
Message-ID: <320fb6e00911180534o2cd0126fp62527db04e0c346f@mail.gmail.com>

On Wed, Nov 18, 2009 at 1:10 PM, Chris Fields wrote:
>
> Just to note, there are a few cases where there are two or more source features.
> This pops up mainly with chimeric sequences, for example:
>
> http://www.ncbi.nlm.nih.gov/nuccore/21727885
>
> We have run into this a couple of times on the bioperl list. In this case, each
> feature is limited to specific locations on the sequence and doesn't pertain to
> the entire sequence. NCBI only notes the first source on the ORGANISM line;
> last time I checked, EMBL used both.
>
> chris

Wow - cool example. It was worth starting this thread just to learn about this interesting corner case. I wonder if this is a common enough case to warrant leaving the source features as they are?
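
One way a loader might tell the two situations apart is sketched below in Python. This is only an assumption about how such a check could look, not code from any of the Bio* projects: the function name `single_full_length_source` is hypothetical, spans are taken to be 1-based inclusive (start, end) pairs as in the feature table, and the coordinates of the chimeric case are made up.

```python
def single_full_length_source(source_spans, sequence_length):
    """Return True if exactly one source feature covers 1..sequence_length.

    source_spans is a list of (start, end) pairs, 1-based and inclusive.
    Hypothetical helper, for illustration only.
    """
    if len(source_spans) != 1:
        return False
    start, end = source_spans[0]
    return start == 1 and end == sequence_length


# NC_005816: a single source feature, 1..9609, spanning the full sequence.
plasmid_ok = single_full_length_source([(1, 9609)], 9609)

# A chimeric record with two partial source features (made-up coordinates).
chimera_ok = single_full_length_source([(1, 500), (501, 1200)], 1200)

print(plasmid_ok, chimera_ok)
```

Only when the check passes would the source qualifiers be promoted to sequence level; otherwise the features would be left as they are.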
Peter From cjfields at illinois.edu Wed Nov 18 08:10:36 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 18 Nov 2009 07:10:36 -0600 Subject: [BioSQL-l] Treating GenBank source features as top level annotation In-Reply-To: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> Message-ID: Just to note, there are a few cases where there are two or more source features. This pops up mainly with chimeric sequences, for example: http://www.ncbi.nlm.nih.gov/nuccore/21727885 We have run into this a couple of times on the bioperl list. In this case, each feature is limited to specific locations on the sequence and doesn't pertain to the entire sequence. NCBI only notes the first source on the ORGANISM line; last time I checked, EMBL used both. chris On Nov 18, 2009, at 6:08 AM, Richard Holland wrote: > BioJava's latest parsers do the following: > > On read: > > SOURCE and ORGANISM top-level tags are completely ignored > For each tag in each feature, including source: > If it's a dbxref > If it's taxon, set the taxon ID in the BioEntry table (if no /taxon is specified in the source feature the taxonomy does not get stored at all) > Otherwise set dbxref as a feature CrossRef table entry > If it's organism > Add the organism name to the taxon ID in the Taxon table using the scientific taxon name type (if no /organism tag is specified in the source feature, the taxon gets the default name from NCBI, but only if the NCBI taxonomy data is already present in BioSQL) (if no /taxon is specified in the source feature, then the taxonomy does not get stored at all) > Otherwise > All other tags get mapped as feature qualifier values, including the source feature > > On write: > > SOURCE and ORGANISM tags are generated from the BioEntry taxon ID entry for the sequence, > All features get qualifier values output plus /db_xref tags for all 
entries from the CrossRef table for the feature, > The source feature is output as per a normal feature, plus /organism and /db_xref="taxon:..." tags generated as per the SOURCE and ORGANISM tags > > The main reason why we still use the source feature and don't go to sequence level is because when converting between formats it's hard to tell which sequence-level qualifier_values are from the source feature and which are from other places. > > The main reason why we rely entirely on the source feature for organism and taxon ID info is because it's much easier to parse than the SOURCE and ORGANISM tags. > > cheers, > Richard > > On 18 Nov 2009, at 11:06, Peter wrote: > >> Hello all, >> >> Something we've just been discussing on the Biopython mailing list >> is a possible change to how we parse the source features in GenBank >> (or EMBL) files. This could have knock on implications for how we use >> BioSQL. For anyone interested, the thread is here: >> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html >> >> The basic observation is that GenBank files do not have any extensible >> annotation block for the whole sequence. There are a few fields like >> the comment, organism and taxonomy - but nothing general and >> structured. Instead, it seems the NCBI etc decided to use the feature >> table for this task by inventing the "source" feature. In every single >> GenBank file I have ever seen with a source feature, there is only >> one feature of this type and it spans the full sequence. >> >> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001 >> plasmid pPCP1, complete sequence: >> >> source 1..9609 >> /organism="Yersinia pestis biovar Microtus str. 91001" >> /mol_type="genomic DNA" >> /strain="91001" >> /db_xref="taxon:229193" >> /plasmid="pPCP1" >> /biovar="Microtus" >> >> (I reduced the white space for emailing). All of that information >> makes sense as annotation for the whole sequence. 
>> _______________________________________________ >> BioSQL-l mailing list >> BioSQL-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biosql-l > > -- > Richard Holland, BSc MBCS > Operations and Delivery Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l From hlapp at gmx.net Wed Nov 18 09:28:01 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 18 Nov 2009 09:28:01 -0500 Subject: [BioSQL-l] Treating GenBank source features as top level annotation In-Reply-To: References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> Message-ID: <73839DF2-F9D2-441C-8AA3-649D619D2B1E@gmx.net> True - for chimeric sequences you can have multiple sources. That should be recognizable though from the length (and span) of the source feature location? -hilmar On Nov 18, 2009, at 8:10 AM, Chris Fields wrote: > Just to note, there are a few cases where there are two or more > source features. This pops up mainly with chimeric sequences, for > example: > > http://www.ncbi.nlm.nih.gov/nuccore/21727885 > > We have run into this a couple of times on the bioperl list. In > this case, each feature is limited to specific locations on the > sequence and doesn't pertain to the entire sequence. NCBI only > notes the first source on the ORGANISM line; last time I checked, > EMBL used both. 
> > chris > > On Nov 18, 2009, at 6:08 AM, Richard Holland wrote: > >> BioJava's latest parsers do the following: >> >> On read: >> >> SOURCE and ORGANISM top-level tags are completely ignored >> For each tag in each feature, including source: >> If it's a dbxref >> If it's taxon, set the taxon ID in the BioEntry table (if no / >> taxon is specified in the source feature the taxonomy does not get >> stored at all) >> Otherwise set dbxref as a feature CrossRef table entry >> If it's organism >> Add the organism name to the taxon ID in the Taxon table using >> the scientific taxon name type (if no /organism tag is specified in >> the source feature, the taxon gets the default name from NCBI, but >> only if the NCBI taxonomy data is already present in BioSQL) (if >> no /taxon is specified in the source feature, then the taxonomy >> does not get stored at all) >> Otherwise >> All other tags get mapped as feature qualifier values, >> including the source feature >> >> On write: >> >> SOURCE and ORGANISM tags are generated from the BioEntry taxon ID >> entry for the sequence, >> All features get qualifier values output plus /db_xref tags for >> all entries from the CrossRef table for the feature, >> The source feature is output as per a normal feature, plus / >> organism and /db_xref="taxon:..." tags generated as per the SOURCE >> and ORGANISM tags >> >> The main reason why we still use the source feature and don't go to >> sequence level is because when converting between formats it's hard >> to tell which sequence-level qualifier_values are from the source >> feature and which are from other places. >> >> The main reason why we rely entirely on the source feature for >> organism and taxon ID info is because it's much easier to parse >> than the SOURCE and ORGANISM tags. 
>> >> cheers, >> Richard >> >> On 18 Nov 2009, at 11:06, Peter wrote: >> >>> Hello all, >>> >>> Something we've just been discussing on the Biopython mailing list >>> is a possible change to how we parse the source features in GenBank >>> (or EMBL) files. This could have knock on implications for how we >>> use >>> BioSQL. For anyone interested, the thread is here: >>> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html >>> >>> The basic observation is that GenBank files do not have any >>> extensible >>> annotation block for the whole sequence. There are a few fields like >>> the comment, organism and taxonomy - but nothing general and >>> structured. Instead, it seems the NCBI etc decided to use the >>> feature >>> table for this task by inventing the "source" feature. In every >>> single >>> GenBank file I have ever seen with a source feature, there is only >>> one feature of this type and it spans the full sequence. >>> >>> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001 >>> plasmid pPCP1, complete sequence: >>> >>> source 1..9609 >>> /organism="Yersinia pestis biovar Microtus str. 91001" >>> /mol_type="genomic DNA" >>> /strain="91001" >>> /db_xref="taxon:229193" >>> /plasmid="pPCP1" >>> /biovar="Microtus" >>> >>> (I reduced the white space for emailing). All of that information >>> makes sense as annotation for the whole sequence. In fact, the >>> "organism" entry is duplicated on the ORGANISM line in the >>> GenBank header (and the SOURCE line too). >>> >>> Currently we (Biopython, BioPerl etc) store this annotation in >>> BioSQL >>> using the seqfeature_qualifier_value and seqfeature_dbxref tables, >>> associated with a "source" feature in the seqfeature table. >>> >>> I am suggesting it could make more sense to store the "source" >>> feature annotation at the sequence level, using instead the >>> bioentry_qualifier_value and bioentry_dbxref tables.
>>> >>> This is a slight shift from the origins of BioSQL as a schema to >>> hold GenBank files - but to me at least it is more logical. >>> >>> What does everyone else think? Things work as they are... >>> and "if it ain't broken don't fix it"? >>> >>> Peter >>> >>> [Even if Biopython changes its internal object structure to treat >>> the "source" feature annotation as sequence level annotation, >>> we *could* continue to use a "source" feature when loading >>> GenBank files to/from BioSQL if required for compatibility with >>> the other Bio* projects. It would be more work though. In any >>> case, we'd also need to recreate a "source" feature when >>> writing GenBank output files.] >>> _______________________________________________ >>> BioSQL-l mailing list >>> BioSQL-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biosql-l >> >> -- >> Richard Holland, BSc MBCS >> Operations and Delivery Director, Eagle Genomics Ltd >> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com >> http://www.eaglegenomics.com/ >> >> >> _______________________________________________ >> BioSQL-l mailing list >> BioSQL-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biosql-l > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Wed Nov 18 11:50:04 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 18 Nov 2009 11:50:04 -0500 Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl In-Reply-To: References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net> <044213EE-E980-4E48-870D-1F2896E937B3@jays.net> <4B022E68.2080304@open-bio.org> Message-ID: <03730DF5-706B-47F6-A191-8352E38CA42C@gmx.net> Hi Jay - 
thanks much for the patch, highly appreciated! -hilmar On Nov 17, 2009, at 8:00 AM, Jay Hannah wrote: > On Nov 16, 2009, at 11:02 PM, Mauricio Herrera Cuadra wrote: >> I added you to the biosql group in the SVN server. You should be >> able to commit the patch now. > > Thanks! r317 committed. :) > > j > > > > ------------------------------------------------------------------------ > r317 | jhannah | 2009-11-17 06:58:07 -0600 (Tue, 17 Nov 2009) | 2 > lines > Changed paths: > M /biosql-schema/trunk/INSTALL > M /biosql-schema/trunk/doc/schema-overview.txt > > load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl. > ------------------------------------------------------------------------ > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From maruco at gmail.com Mon Nov 9 09:35:31 2009 From: maruco at gmail.com (Thiago Satake) Date: Mon, 9 Nov 2009 09:35:31 -0500 Subject: [BioSQL-l] Project participation In-Reply-To: <82611efb0911090447y2aa1ae16x1106a41b7a0e2a1@mail.gmail.com> References: <82611efb0911090447y2aa1ae16x1106a41b7a0e2a1@mail.gmail.com> Message-ID: Hi guys, It sounds like a very interesting project!! Where can I find more information about it? Thanks, 2009/11/9 »»»Jiji Kurup««« : > Hi Lapp, > > I am very much interested in being a part of the "JEE5 webservice > interface to > BioSQL" and > "BioSQL web interface and API on Google App Engine" projects. > But I am not a student now; I am working in a bioinformatics > company, so > I would like to know whether it is possible to > participate in any of these projects. > > Kindly let me know if there is any provision for it, and tell me the > procedure also.
> > > > > -- > Regards, > > Jiji Kurup > Application Scientist > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > -- Thiago Seito Satake Tel: +55(011) 6588-8045 From cjfields at illinois.edu Wed Nov 18 12:40:21 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 18 Nov 2009 11:40:21 -0600 Subject: [BioSQL-l] Treating GenBank source features as top level annotation In-Reply-To: <73839DF2-F9D2-441C-8AA3-649D619D2B1E@gmx.net> References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> <73839DF2-F9D2-441C-8AA3-649D619D2B1E@gmx.net> Message-ID: Yes; the location appears to specify regions of sequence originating from the indicated source. chris On Nov 18, 2009, at 8:28 AM, Hilmar Lapp wrote: > True - for chimeric sequences you can have multiple sources. That should be recognizable though from the length (and span) of the source feature location? > > -hilmar > > On Nov 18, 2009, at 8:10 AM, Chris Fields wrote: > >> Just to note, there are a few cases where there are two or more source features. This pops up mainly with chimeric sequences, for example: >> >> http://www.ncbi.nlm.nih.gov/nuccore/21727885 >> >> We have run into this a couple of times on the bioperl list. In this case, each feature is limited to specific locations on the sequence and doesn't pertain to the entire sequence. NCBI only notes the first source on the ORGANISM line; last time I checked, EMBL used both. 
>> >> chris >> >> On Nov 18, 2009, at 6:08 AM, Richard Holland wrote: >> >>> BioJava's latest parsers do the following: >>> >>> On read: >>> >>> SOURCE and ORGANISM top-level tags are completely ignored >>> For each tag in each feature, including source: >>> If it's a dbxref >>> If it's taxon, set the taxon ID in the BioEntry table (if no /taxon is specified in the source feature the taxonomy does not get stored at all) >>> Otherwise set dbxref as a feature CrossRef table entry >>> If it's organism >>> Add the organism name to the taxon ID in the Taxon table using the scientific taxon name type (if no /organism tag is specified in the source feature, the taxon gets the default name from NCBI, but only if the NCBI taxonomy data is already present in BioSQL) (if no /taxon is specified in the source feature, then the taxonomy does not get stored at all) >>> Otherwise >>> All other tags get mapped as feature qualifier values, including the source feature >>> >>> On write: >>> >>> SOURCE and ORGANISM tags are generated from the BioEntry taxon ID entry for the sequence, >>> All features get qualifier values output plus /db_xref tags for all entries from the CrossRef table for the feature, >>> The source feature is output as per a normal feature, plus /organism and /db_xref="taxon:..." tags generated as per the SOURCE and ORGANISM tags >>> >>> The main reason why we still use the source feature and don't go to sequence level is because when converting between formats it's hard to tell which sequence-level qualifier_values are from the source feature and which are from other places. >>> >>> The main reason why we rely entirely on the source feature for organism and taxon ID info is because it's much easier to parse than the SOURCE and ORGANISM tags. 
>>> >>> cheers, >>> Richard >>> >>> On 18 Nov 2009, at 11:06, Peter wrote: >>> >>>> Hello all, >>>> >>>> Something we've just been discussing on the Biopython mailing list >>>> is a possible change to how we parse the source features in GenBank >>>> (or EMBL) files. This could have knock on implications for how we use >>>> BioSQL. For anyone interested, the thread is here: >>>> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html >>>> >>>> The basic observation is that GenBank files do not have any extensible >>>> annotation block for the whole sequence. There are a few fields like >>>> the comment, organism and taxonomy - but nothing general and >>>> structured. Instead, it seems the NCBI etc decided to use the feature >>>> table for this task by inventing the "source" feature. In every single >>>> GenBank file I have ever seen with a source feature, there is only >>>> one feature of this type and it spans the full sequence. >>>> >>>> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001 >>>> plasmid pPCP1, complete sequence: >>>> >>>> source 1..9609 >>>> /organism="Yersinia pestis biovar Microtus str. 91001" >>>> /mol_type="genomic DNA" >>>> /strain="91001" >>>> /db_xref="taxon:229193" >>>> /plasmid="pPCP1" >>>> /biovar="Microtus" >>>> >>>> (I reduced the white space for emailing). All of that information >>>> makes sense as annotation for the whole sequence. In fact, the >>>> "organism" entry is duplicated on the ORGANISM line in the >>>> GenBank header (and the SOURCE line too). >>>> >>>> Currently we (Biopython, BioPerl etc) store this annotation in BioSQL >>>> using the seqfeature_qualifier_value and seqfeature_dbxref tables, >>>> associated with a "source" feature in the seqfeature table. >>>> >>>> I am suggesting it could make more sense to store the "source" >>>> feature annotation at the sequence level, using instead the >>>> bioentry_qualifier_value and bioentry_dbxref tables.
>>>> >>>> This is a slight shift from the origins of BioSQL as a schema to >>>> hold GenBank files - but to me at least it is more logical. >>>> >>>> What does everyone else think? Things work as they are... >>>> and "if it ain't broken don't fix it"? >>>> >>>> Peter >>>> >>>> [Even if Biopython changes its internal object structure to treat >>>> the "source" feature annotation as sequence level annotation, >>>> we *could* continue to use a "source" feature when loading >>>> GenBank files to/from BioSQL if required for compatibility with >>>> the other Bio* projects. It would be more work though. In any >>>> case, we'd also need to recreate a "source" feature when >>>> writing GenBank output files.] >>>> _______________________________________________ >>>> BioSQL-l mailing list >>>> BioSQL-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biosql-l >>> >>> -- >>> Richard Holland, BSc MBCS >>> Operations and Delivery Director, Eagle Genomics Ltd >>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com >>> http://www.eaglegenomics.com/ >>> >>> >>> _______________________________________________ >>> BioSQL-l mailing list >>> BioSQL-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biosql-l >> >> >> _______________________________________________ >> BioSQL-l mailing list >> BioSQL-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biosql-l >> > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From biopython at maubp.freeserve.co.uk Tue Nov 24 09:27:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Nov 2009 14:27:39 +0000 Subject: [BioSQL-l] SQLite support In-Reply-To: <320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com> References: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com> 
<320fb6e00812150243w4b0dc223g40abcf684af1ccf5@mail.gmail.com> <320fb6e00907050324i6d64d3abreb4d0c256bf1bdc4@mail.gmail.com> <320fb6e00907090529t61239952y1c86963f13c1db78@mail.gmail.com> <320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com> Message-ID: <320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com> On Tue, Jul 28, 2009 at 11:58 AM, Peter wrote: > On Thu, Jul 9, 2009 at 1:29 PM, Peter wrote: >> Hi Hilmar, >> >> I've filed a BioSQL enhancement bug 2870 for adding an SQLite >> schema to BioSQL, and Brad has attached his proposed schema >> (converted from that for MySQL) to the bug: >> >> http://bugzilla.open-bio.org/show_bug.cgi?id=2870 >> >> Could you take a look at this please? If you are happy with it, we'd like >> to have it included in BioSQL v1.0.2 (even if Biopython is initially the >> only Bio* project to support it). > > Have you had a chance to look at this yet Hilmar? Brad is keen to > include BioSQL support for SQLite in the next release of Biopython > (hopefully within the next week or two), but to do this I'd like your > blessing, and for the proposed SQLite BioSQL schema to be added > to the BioSQL SVN repository. Hi again Hilmar, Just a reminder about the BioSQL on SQLite proposals - we'd still like to ship this with the *next* Biopython release (having skipped it for Biopython 1.52 a couple of months back). 
Regards, Peter From cjfields at illinois.edu Tue Nov 24 11:36:33 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 24 Nov 2009 10:36:33 -0600 Subject: [BioSQL-l] SQLite support In-Reply-To: <320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com> References: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com> <320fb6e00812150243w4b0dc223g40abcf684af1ccf5@mail.gmail.com> <320fb6e00907050324i6d64d3abreb4d0c256bf1bdc4@mail.gmail.com> <320fb6e00907090529t61239952y1c86963f13c1db78@mail.gmail.com> <320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com> <320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com> Message-ID: <070E8BA8-B2C1-4E44-AA2D-9934B3742406@illinois.edu> On Nov 24, 2009, at 8:27 AM, Peter wrote: > On Tue, Jul 28, 2009 at 11:58 AM, Peter wrote: >> On Thu, Jul 9, 2009 at 1:29 PM, Peter wrote: >>> Hi Hilmar, >>> >>> I've filed a BioSQL enhancement bug 2870 for adding an SQLite >>> schema to BioSQL, and Brad has attached his proposed schema >>> (converted from that for MySQL) to the bug: >>> >>> http://bugzilla.open-bio.org/show_bug.cgi?id=2870 >>> >>> Could you take a look at this please? If you are happy with it, we'd like >>> to have it included in BioSQL v1.0.2 (even if Biopython is initially the >>> only Bio* project to support it). >> >> Have you had a chance to look at this yet Hilmar? Brad is keen to >> include BioSQL support for SQLite in the next release of Biopython >> (hopefully within the next week or two), but to do this I'd like your >> blessing, and for the proposed SQLite BioSQL schema to be added >> to the BioSQL SVN repository. > > Hi again Hilmar, > > Just a reminder about the BioSQL on SQLite proposals - we'd still > like to ship this with the *next* Biopython release (having skipped it > for Biopython 1.52 a couple of months back). > > Regards, > > Peter Just want to add that I would like to see SQLite support as well (I might even feel the need to implement the necessary bioperl-db bits). 
chris From biopython at maubp.freeserve.co.uk Tue Nov 24 12:07:19 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Nov 2009 17:07:19 +0000 Subject: [BioSQL-l] SQLite support In-Reply-To: <070E8BA8-B2C1-4E44-AA2D-9934B3742406@illinois.edu> References: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com> <320fb6e00812150243w4b0dc223g40abcf684af1ccf5@mail.gmail.com> <320fb6e00907050324i6d64d3abreb4d0c256bf1bdc4@mail.gmail.com> <320fb6e00907090529t61239952y1c86963f13c1db78@mail.gmail.com> <320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com> <320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com> <070E8BA8-B2C1-4E44-AA2D-9934B3742406@illinois.edu> Message-ID: <320fb6e00911240907u32dca751ldb488cbc38f0e035@mail.gmail.com> On Tue, Nov 24, 2009 at 4:36 PM, Chris Fields wrote: > > Just want to add that I would like to see SQLite support as well > (I might even feel the need to implement the necessary bioperl-db bits). Excellent :) Peter From desouza at ncbi.nlm.nih.gov Thu Nov 19 16:58:45 2009 From: desouza at ncbi.nlm.nih.gov (De souza, Robson (NIH/NLM/NCBI) [F]) Date: Thu, 19 Nov 2009 16:58:45 -0500 Subject: [BioSQL-l] update ontology Message-ID: <340BA68A9AA4F548881B3CCC5435061B45A6A2@NIHCESMLBX15.nih.gov> Hi guys, I'm trying to use a BioSQL database to store some protein annotation and need to know a few things: - First thing: I want to be able to update the ontologies we are working on automatically, but I failed to get bp_load_ontology.pl to do it. What I want is to replace any changed terms and their relationships with new ones without losing the association between unmodified terms and the annotated proteins. I still don't know what a scriptlet for --mergeobjs should look like or whether such a scriptlet is the way to go in this case. - How do I represent protein domains in BioSQL?
I was thinking of writing code to add domains as bioentries and then use bioentry_relationship to associate sequence and domain, but I would also like to store the coordinates of each domain in a protein, which would imply associating bioentries with seqfeatures. Does any of you have another suggestion in this direction? Thanks! Robson From biopython at maubp.freeserve.co.uk Wed Nov 25 16:39:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 25 Nov 2009 21:39:39 +0000 Subject: [BioSQL-l] update ontology In-Reply-To: <340BA68A9AA4F548881B3CCC5435061B45A6A2@NIHCESMLBX15.nih.gov> References: <340BA68A9AA4F548881B3CCC5435061B45A6A2@NIHCESMLBX15.nih.gov> Message-ID: <320fb6e00911251339j700517cckac3ddb0adc323f00@mail.gmail.com> On Thu, Nov 19, 2009 at 9:58 PM, De souza, Robson (NIH/NLM/NCBI) [F] wrote: > > > - How do I represent protein domains in BioSQL? > I was thinking of writing code to add domains as bioentries and then use > bioentry_relationship to associate sequence and domain, but I would also > like to store the coordinates of each domain in a protein, which would > imply associating bioentries with seqfeatures. Does any of you have > another suggestion in this direction? I would do what Biopython/BioPerl/Bio* would do on loading a GenPept file into BioSQL - have a bioentry for each protein, with its amino acid sequence, and for each domain a seqfeature entry (which records the location within the parent protein). Peter From jijibio at gmail.com Mon Nov 9 12:47:20 2009 From: jijibio at gmail.com (=?ISO-8859-1?Q?=BB=BB=BBJiji_Kurup=AB=AB=AB?=) Date: Mon, 9 Nov 2009 18:17:20 +0530 Subject: [BioSQL-l] Project participation Message-ID: <82611efb0911090447y2aa1ae16x1106a41b7a0e2a1@mail.gmail.com> Hi Lapp, I am very much interested in being a part of the "JEE5 webservice interface to BioSQL" and "BioSQL web interface and API on Google App Engine" projects.
But I am not a student now; I am working in a bioinformatics company, so I would like to know whether it is possible to participate in any of these projects. Kindly let me know if there is any provision for it, and tell me the procedure also. -- Regards, Jiji Kurup Application Scientist From jay at jays.net Mon Nov 16 22:20:33 2009 From: jay at jays.net (Jay Hannah) Date: Mon, 16 Nov 2009 16:20:33 -0600 Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl Message-ID: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net> Can someone activate my ('jhannah') commit bit for this biosql-schema? I can commit to bioperl-live, but not biosql-schema. Or apply the patch below for me? Thanks, j http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah jhannah at jaysnet-MacBook:~/src/biosql-schema$ svn info Path: . URL: svn://code.open-bio.org/biosql/biosql-schema/trunk Repository Root: svn://code.open-bio.org/biosql Repository UUID: 77cd3915-3943-0410-8c1d-86e8b156039c Revision: 316 Node Kind: directory Schedule: normal Last Changed Author: lapp Last Changed Rev: 316 Last Changed Date: 2009-07-18 13:24:56 -0500 (Sat, 18 Jul 2009) jhannah at jaysnet-MacBook:~/src/biosql-schema$ svn diff Index: doc/schema-overview.txt =================================================================== --- doc/schema-overview.txt (revision 316) +++ doc/schema-overview.txt (working copy) @@ -150,7 +150,7 @@ structure of NCBI's taxonomy database. Each bioentry can be associated with only one taxon, but many bioentries can be associated with the same taxon. In order to get the most value from these tables -it's recommended that you use the BioSQL script load_taxonomy.pl +it's recommended that you use the BioSQL script load_ncbi_taxonomy.pl to populate them. The taxon_name.taxon_id field is meant to store an NCBI @@ -165,7 +165,7 @@ parent_taxon_id contains the taxon id of the parent taxon, since there should only be one parent in the taxonomic tree.
The right_value and left_value fields store values that are calculated and entered by the -load_taxonomy.pl script. These arbitrary values are the upper and +load_ncbi_taxonomy.pl script. These arbitrary values are the upper and lower bounds of "nested sets", one set for each taxa, where the set of the child taxa is contained within the larger set of the parent taxon. An example would be the set for the species Procyon lotor, Index: INSTALL =================================================================== --- INSTALL (revision 316) +++ INSTALL (working copy) @@ -449,7 +449,7 @@ With bioperl and bioperl-db installed you are ready to load some data. It is advisable to pre-load the NCBI taxonomy database (use -scripts/load_taxonomy.pl in the biosql-schema package, the details are +scripts/load_ncbi_taxonomy.pl in the biosql-schema package, the details are in its documentation). Otherwise you'll see errors from misparsed organisms. jhannah at jaysnet-MacBook:~/src/biosql-schema$ svn commit svn: Commit failed (details follow): svn: Authorization failed From jay at jays.net Mon Nov 16 23:22:28 2009 From: jay at jays.net (Jay Hannah) Date: Mon, 16 Nov 2009 17:22:28 -0600 Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl In-Reply-To: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net> References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net> Message-ID: <044213EE-E980-4E48-870D-1F2896E937B3@jays.net> On Nov 16, 2009, at 4:20 PM, Jay Hannah wrote: > URL: svn://code.open-bio.org/biosql/biosql-schema/trunk Oh, oops. I think I was using the wrong repo address for committing. I think I'm using the right address now. Now getting the error below. Thanks, j http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah jhannah at jaysnet-MacBook:~/src/biosql-schema-committer$ svn info Path: . 
URL: svn+ssh://jhannah at dev.open-bio.org/home/svn-repositories/biosql/biosql-schema/trunk Repository Root: svn+ssh://jhannah at dev.open-bio.org/home/svn-repositories/biosql Repository UUID: 77cd3915-3943-0410-8c1d-86e8b156039c Revision: 316 Node Kind: directory Schedule: normal Last Changed Author: lapp Last Changed Rev: 316 Last Changed Date: 2009-07-18 13:24:56 -0500 (Sat, 18 Jul 2009) jhannah at jaysnet-MacBook:~/src/biosql-schema-committer$ svn commit =========================================== dev.open-bio.org - Authorized Access Only =========================================== Sending INSTALL Sending doc/schema-overview.txt Transmitting file data ..svn: Commit failed (details follow): svn: Can't create directory '/home/svn-repositories/biosql/db/transactions/316-1.txn': Permission denied svn: Your commit message was left in a temporary file: svn: '/Users/jhannah/src/biosql-schema-committer/svn-commit.tmp' From mauricio at open-bio.org Tue Nov 17 05:02:32 2009 From: mauricio at open-bio.org (Mauricio Herrera Cuadra) Date: Mon, 16 Nov 2009 23:02:32 -0600 Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl In-Reply-To: <044213EE-E980-4E48-870D-1F2896E937B3@jays.net> References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net> <044213EE-E980-4E48-870D-1F2896E937B3@jays.net> Message-ID: <4B022E68.2080304@open-bio.org> I added you to the biosql group in the SVN server. You should be able to commit the patch now. Cheers, Mauricio. Jay Hannah wrote: > On Nov 16, 2009, at 4:20 PM, Jay Hannah wrote: >> URL: svn://code.open-bio.org/biosql/biosql-schema/trunk > > Oh, oops. I think I was using the wrong repo address for committing. > > I think I'm using the right address now. Now getting the error below. > > Thanks, > > j > http://clab.ist.unomaha.edu/CLAB/index.php/User:Jhannah > > > > > jhannah at jaysnet-MacBook:~/src/biosql-schema-committer$ svn info > Path: . 
> URL: svn+ssh://jhannah at dev.open-bio.org/home/svn-repositories/biosql/biosql-schema/trunk > Repository Root: svn+ssh://jhannah at dev.open-bio.org/home/svn-repositories/biosql > Repository UUID: 77cd3915-3943-0410-8c1d-86e8b156039c > Revision: 316 > Node Kind: directory > Schedule: normal > Last Changed Author: lapp > Last Changed Rev: 316 > Last Changed Date: 2009-07-18 13:24:56 -0500 (Sat, 18 Jul 2009) > > > jhannah at jaysnet-MacBook:~/src/biosql-schema-committer$ svn commit > =========================================== > dev.open-bio.org - Authorized Access Only > =========================================== > Sending INSTALL > Sending doc/schema-overview.txt > Transmitting file data ..svn: Commit failed (details follow): > svn: Can't create directory '/home/svn-repositories/biosql/db/transactions/316-1.txn': Permission denied > svn: Your commit message was left in a temporary file: > svn: '/Users/jhannah/src/biosql-schema-committer/svn-commit.tmp' > > > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > From jay at jays.net Tue Nov 17 13:00:01 2009 From: jay at jays.net (Jay Hannah) Date: Tue, 17 Nov 2009 07:00:01 -0600 Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl In-Reply-To: <4B022E68.2080304@open-bio.org> References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net> <044213EE-E980-4E48-870D-1F2896E937B3@jays.net> <4B022E68.2080304@open-bio.org> Message-ID: On Nov 16, 2009, at 11:02 PM, Mauricio Herrera Cuadra wrote: > I added you to the biosql group in the SVN server. You should be able to commit the patch now. Thanks! r317 committed. 
:) j ------------------------------------------------------------------------ r317 | jhannah | 2009-11-17 06:58:07 -0600 (Tue, 17 Nov 2009) | 2 lines Changed paths: M /biosql-schema/trunk/INSTALL M /biosql-schema/trunk/doc/schema-overview.txt load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl. ------------------------------------------------------------------------ From biopython at maubp.freeserve.co.uk Wed Nov 18 11:06:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Nov 2009 11:06:51 +0000 Subject: [BioSQL-l] Treating GenBank source features as top level annotation Message-ID: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> Hello all, Something we've just been discussing on the Biopython mailing list is a possible change to how we parse the source features in GenBank (or EMBL) files. This could have knock on implications for how we use BioSQL. For anyone interested, the thread is here: http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html The basic observation is that GenBank files do not have any extensible annotation block for the whole sequence. There are a few fields like the comment, organism and taxonomy - but nothing general and structured. Instead, it seems the NCBI etc decided to use the feature table for this task by inventing the "source" feature. In every single GenBank file I have ever seen with a source feature, there is only one feature of this type and it spans the full sequence. For example, NC_005816, Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence: source 1..9609 /organism="Yersinia pestis biovar Microtus str. 91001" /mol_type="genomic DNA" /strain="91001" /db_xref="taxon:229193" /plasmid="pPCP1" /biovar="Microtus" (I reduced the white space for emailing). All of that information makes sense as annotation for the whole sequence. In fact, the "organism" entry is duplicated on the ORGANISM line in the GenBank header (and the SOURCE line too). 
Currently we (Biopython, BioPerl etc) store this annotation in BioSQL using the seqfeature_qualifier_value and seqfeature_dbxref tables, associated with a "source" feature in the seqfeature table. I am suggesting it could make more sense to store the "source" feature annotation at the sequence level, using instead the bioentry_qualifier_value and bioentry_dbxref tables. This is a slight shift from the origins of BioSQL as a schema to hold GenBank files - but to me at least it is more logical. What does everyone else think? Things work as they are... and "if it ain't broken don't fix it"? Peter [Even if Biopython changes its internal object structure to treat the "source" feature annotation as sequence level annotation, we *could* continue to use a "source" feature when loading GenBank files to/from BioSQL if required for compatibility with the other Bio* projects. It would be more work though. In any case, we'd also need to recreate a "source" feature when writing GenBank output files.] From biopython at maubp.freeserve.co.uk Wed Nov 18 12:27:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 18 Nov 2009 12:27:12 +0000 Subject: [BioSQL-l] Treating GenBank source features as top level annotation In-Reply-To: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> Message-ID: <320fb6e00911180427q79961f6ci5a43ebac9ff70f7a@mail.gmail.com> On Wed, Nov 18, 2009 at 12:08 PM, Richard Holland wrote: > > BioJava's latest parsers do the following: > ... Without checking all the details, that is broadly what Biopython does at the moment. > The main reason why we still use the source feature and don't go to sequence > level is because when converting between formats it's hard to tell which > sequence-level qualifier_values are from the source feature and which are > from other places. Makes sense.
> The main reason why we rely entirely on the source feature for organism > and taxon ID info is because it's much easier to parse than the SOURCE > and ORGANISM tags. From memory, Biopython also uses the taxon table here too. Peter From holland at eaglegenomics.com Wed Nov 18 12:08:48 2009 From: holland at eaglegenomics.com (Richard Holland) Date: Wed, 18 Nov 2009 12:08:48 +0000 Subject: [BioSQL-l] Treating GenBank source features as top level annotation In-Reply-To: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> Message-ID: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> BioJava's latest parsers do the following: On read: SOURCE and ORGANISM top-level tags are completely ignored For each tag in each feature, including source: If it's a dbxref If it's taxon, set the taxon ID in the BioEntry table (if no /taxon is specified in the source feature the taxonomy does not get stored at all) Otherwise set dbxref as a feature CrossRef table entry If it's organism Add the organism name to the taxon ID in the Taxon table using the scientific taxon name type (if no /organism tag is specified in the source feature, the taxon gets the default name from NCBI, but only if the NCBI taxonomy data is already present in BioSQL) (if no /taxon is specified in the source feature, then the taxonomy does not get stored at all) Otherwise All other tags get mapped as feature qualifier values, including the source feature On write: SOURCE and ORGANISM tags are generated from the BioEntry taxon ID entry for the sequence, All features get qualifier values output plus /db_xref tags for all entries from the CrossRef table for the feature, The source feature is output as per a normal feature, plus /organism and /db_xref="taxon:..."
tags generated as per the SOURCE and ORGANISM tags The main reason why we still use the source feature and don't go to sequence level is because when converting between formats it's hard to tell which sequence-level qualifier_values are from the source feature and which are from other places. The main reason why we rely entirely on the source feature for organism and taxon ID info is because it's much easier to parse than the SOURCE and ORGANISM tags. cheers, Richard On 18 Nov 2009, at 11:06, Peter wrote: > Hello all, > > Something we've just been discussing on the Biopython mailing list > is a possible change to how we parse the source features in GenBank > (or EMBL) files. This could have knock on implications for how we use > BioSQL. For anyone interested, the thread is here: > http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html > > The basic observation is that GenBank files do not have any extensible > annotation block for the whole sequence. There are a few fields like > the comment, organism and taxonomy - but nothing general and > structured. Instead, it seems the NCBI etc decided to use the feature > table for this task by inventing the "source" feature. In every single > GenBank file I have ever seen with a source feature, there is only > one feature of this type and it spans the full sequence. > > For example, NC_005816, Yersinia pestis biovar Microtus str. 91001 > plasmid pPCP1, complete sequence: > > source 1..9609 > /organism="Yersinia pestis biovar Microtus str. 91001" > /mol_type="genomic DNA" > /strain="91001" > /db_xref="taxon:229193" > /plasmid="pPCP1" > /biovar="Microtus" > > (I reduced the white space for emailing). All of that information > makes sense as annotation for the whole sequence. In fact, the > "organism" entry is duplicated on the ORGANISM line in the > GenBank header (and the SOURCE line too). 
> > Currently we (Biopython, BioPerl etc) store this annotation in BioSQL > using the seqfeature_qualifiter_value and seqfeature_dbxref tables, > associated with a "source" feature in the seqfeature table. > > I am suggesting it could make more sense to store the "source" > feature annotation at the sequence level, using instead the > bioentry_qualifier_value and bioentry_dbxref tables. > > This is a slight shift from the origins of BioSQL as a schema to > hold GenBank files - but to me at least it is more logical. > > What does everyone else think? Things work as they are... > and "if it ain't broken don't fix it"? > > Peter > > [Even if Biopython changes its internal object structure to treat > the "source" feature annotation as sequence level annotation, > we *could* continue to use a "source" feature when loading > GenBank files to/from BioSQL if required for compatibility with > the other Bio* projects. It would be more work though. In any > case, we'd also need to recreate a "source" feature when > writing GenBank output files.] 
> _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l -- Richard Holland, BSc MBCS Operations and Delivery Director, Eagle Genomics Ltd T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com http://www.eaglegenomics.com/ From hlapp at gmx.net Wed Nov 18 13:13:05 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 18 Nov 2009 08:13:05 -0500 Subject: [BioSQL-l] Treating GenBank source features as top level annotation In-Reply-To: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> Message-ID: <670ED558-7FBD-4219-A449-0D7E63BE0766@gmx.net> I agree completely with your interpretation of the "source" feature tag, and in fact what you outline below is what I implemented as a "SeqProcessor" module for use within the SymAtlas data integration project (BioPerl supports 'pipes' of I/O and processing modules, where the latter can modify the sequence objects coming out of the I/O module). I'm not sure I would want to hard-code this behavior into the BioPerl genbank parser. However, it would be easy enough to code it into a processing module that comes standard with the distribution to the extent that it can be enabled as simply as a format variant to SeqIO. It sounds useful enough that I guess I should post it to the BioPerl list ... -hilmar On Nov 18, 2009, at 6:06 AM, Peter wrote: > Hello all, > > Something we've just been discussing on the Biopython mailing list > is a possible change to how we parse the source features in GenBank > (or EMBL) files. This could have knock on implications for how we use > BioSQL. For anyone interested, the thread is here: > http://lists.open-bio.org/pipermail/biopython/2009-November/ > 005826.html > > The basic observation is that GenBank files do not have any extensible > annotation block for the whole sequence. 
There are a few fields like > the comment, organism and taxonomy - but nothing general and > structured. Instead, it seems the NCBI etc decided to use the feature > table for this task by inventing the "source" feature. In every single > GenBank file I have ever seen with a source feature, there is only > one feature of this type and it spans the full sequence. > > For example, NC_005816, Yersinia pestis biovar Microtus str. 91001 > plasmid pPCP1, complete sequence: > > source 1..9609 > /organism="Yersinia pestis biovar Microtus str. 91001" > /mol_type="genomic DNA" > /strain="91001" > /db_xref="taxon:229193" > /plasmid="pPCP1" > /biovar="Microtus" > > (I reduced the white space for emailing). All of that information > makes sense as annotation for the whole sequence. In fact, the > "organism" entry is duplicated on the ORGANISM line in the > GenBank header (and the SOURCE line too). > > Currently we (Biopython, BioPerl etc) store this annotation in BioSQL > using the seqfeature_qualifiter_value and seqfeature_dbxref tables, > associated with a "source" feature in the seqfeature table. > > I am suggesting it could make more sense to store the "source" > feature annotation at the sequence level, using instead the > bioentry_qualifier_value and bioentry_dbxref tables. > > This is a slight shift from the origins of BioSQL as a schema to > hold GenBank files - but to me at least it is more logical. > > What does everyone else think? Things work as they are... > and "if it ain't broken don't fix it"? > > Peter > > [Even if Biopython changes its internal object structure to treat > the "source" feature annotation as sequence level annotation, > we *could* continue to use a "source" feature when loading > GenBank files to/from BioSQL if required for compatibility with > the other Bio* projects. It would be more work though. In any > case, we'd also need to recreate a "source" feature when > writing GenBank output files.] 
> _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Wed Nov 18 13:14:35 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 18 Nov 2009 08:14:35 -0500 Subject: [BioSQL-l] Treating GenBank source features as top level annotation In-Reply-To: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> Message-ID: <618B39A2-1CA2-405F-A8D7-4947356BF7A5@gmx.net> On Nov 18, 2009, at 7:08 AM, Richard Holland wrote: > For each tag in each feature, including source: > If it's a dbxref > If it's taxon, set the taxon ID in the BioEntry table (if no / > taxon is specified in the source feature the taxonomy does not get > stored at all) That's what the BioPerl Genbank parser does too, though only if it's a "source" feature. I don't know of any other feature key that would have a taxon dbxref entry. -hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Wed Nov 18 13:16:40 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 18 Nov 2009 08:16:40 -0500 Subject: [BioSQL-l] Treating GenBank source features as top level annotation In-Reply-To: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> Message-ID: <72A7ED62-7213-4136-90F9-74F4691F3003@gmx.net> On Nov 18, 2009, at 6:06 AM, Peter wrote: > For example, NC_005816, Yersinia pestis biovar Microtus str. 
91001
> plasmid pPCP1, complete sequence:
>
> source     1..9609
>            /organism="Yersinia pestis biovar Microtus str. 91001"
>            /mol_type="genomic DNA"
>            /strain="91001"
>            /db_xref="taxon:229193"
>            /plasmid="pPCP1"
>            /biovar="Microtus"

Just FYI, the sequences coming out of the barcoding projects will have
the lat/long coordinates here, too. Those obviously pertain to the
specimen (and hence to the whole sequence).

	-hilmar
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From biopython at maubp.freeserve.co.uk Wed Nov 18 13:34:38 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 18 Nov 2009 13:34:38 +0000
Subject: [BioSQL-l] Treating GenBank source features as top level annotation
In-Reply-To:
References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com>
	<8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com>
Message-ID: <320fb6e00911180534o2cd0126fp62527db04e0c346f@mail.gmail.com>

On Wed, Nov 18, 2009 at 1:10 PM, Chris Fields wrote:
>
> Just to note, there are a few cases where there are two or more source features.
> This pops up mainly with chimeric sequences, for example:
>
> http://www.ncbi.nlm.nih.gov/nuccore/21727885
>
> We have run into this a couple of times on the bioperl list. In this case, each
> feature is limited to specific locations on the sequence and doesn't pertain to
> the entire sequence. NCBI only notes the first source on the ORGANISM line;
> last time I checked, EMBL used both.
>
> chris

Wow - cool example. It was worth starting this thread just to learn
about this interesting corner case. I wonder if this is a common
enough case to warrant leaving the source features as they are?
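One way to frame the corner case is to check whether a record has exactly one source feature covering the whole sequence before treating its qualifiers as sequence-level annotation. The following is only a minimal sketch of that check, not code from Biopython, BioPerl or BioJava: the function names are hypothetical, and spans are assumed to be plain (start, end) pairs, 1-based and inclusive as in the GenBank feature table.

```python
def classify_source_features(seq_length, source_spans):
    """Split source feature spans into whole-sequence and partial ones.

    seq_length   -- length of the sequence in bases
    source_spans -- list of (start, end) pairs, 1-based inclusive,
                    as written in a feature table (e.g. 1..9609)
    """
    whole, partial = [], []
    for start, end in source_spans:
        if start == 1 and end == seq_length:
            whole.append((start, end))
        else:
            partial.append((start, end))
    return whole, partial


def is_simple_source(seq_length, source_spans):
    """True only for exactly one source feature spanning the full sequence."""
    whole, partial = classify_source_features(seq_length, source_spans)
    return len(whole) == 1 and not partial
```

For the NC_005816 example, `is_simple_source(9609, [(1, 9609)])` is true. A chimeric record with, say, two made-up partial spans `[(1, 500), (501, 1200)]` on a 1200 base sequence would fail the check, so its source features would be left as ordinary features.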
Peter From cjfields at illinois.edu Wed Nov 18 13:10:36 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 18 Nov 2009 07:10:36 -0600 Subject: [BioSQL-l] Treating GenBank source features as top level annotation In-Reply-To: <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> Message-ID: Just to note, there are a few cases where there are two or more source features. This pops up mainly with chimeric sequences, for example: http://www.ncbi.nlm.nih.gov/nuccore/21727885 We have run into this a couple of times on the bioperl list. In this case, each feature is limited to specific locations on the sequence and doesn't pertain to the entire sequence. NCBI only notes the first source on the ORGANISM line; last time I checked, EMBL used both. chris On Nov 18, 2009, at 6:08 AM, Richard Holland wrote: > BioJava's latest parsers do the following: > > On read: > > SOURCE and ORGANISM top-level tags are completely ignored > For each tag in each feature, including source: > If it's a dbxref > If it's taxon, set the taxon ID in the BioEntry table (if no /taxon is specified in the source feature the taxonomy does not get stored at all) > Otherwise set dbxref as a feature CrossRef table entry > If it's organism > Add the organism name to the taxon ID in the Taxon table using the scientific taxon name type (if no /organism tag is specified in the source feature, the taxon gets the default name from NCBI, but only if the NCBI taxonomy data is already present in BioSQL) (if no /taxon is specified in the source feature, then the taxonomy does not get stored at all) > Otherwise > All other tags get mapped as feature qualifier values, including the source feature > > On write: > > SOURCE and ORGANISM tags are generated from the BioEntry taxon ID entry for the sequence, > All features get qualifier values output plus /db_xref tags for all 
entries from the CrossRef table for the feature, > The source feature is output as per a normal feature, plus /organism and /db_xref="taxon:..." tags generated as per the SOURCE and ORGANISM tags > > The main reason why we still use the source feature and don't go to sequence level is because when converting between formats it's hard to tell which sequence-level qualifier_values are from the source feature and which are from other places. > > The main reason why we rely entirely on the source feature for organism and taxon ID info is because it's much easier to parse than the SOURCE and ORGANISM tags. > > cheers, > Richard > > On 18 Nov 2009, at 11:06, Peter wrote: > >> Hello all, >> >> Something we've just been discussing on the Biopython mailing list >> is a possible change to how we parse the source features in GenBank >> (or EMBL) files. This could have knock on implications for how we use >> BioSQL. For anyone interested, the thread is here: >> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html >> >> The basic observation is that GenBank files do not have any extensible >> annotation block for the whole sequence. There are a few fields like >> the comment, organism and taxonomy - but nothing general and >> structured. Instead, it seems the NCBI etc decided to use the feature >> table for this task by inventing the "source" feature. In every single >> GenBank file I have ever seen with a source feature, there is only >> one feature of this type and it spans the full sequence. >> >> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001 >> plasmid pPCP1, complete sequence: >> >> source 1..9609 >> /organism="Yersinia pestis biovar Microtus str. 91001" >> /mol_type="genomic DNA" >> /strain="91001" >> /db_xref="taxon:229193" >> /plasmid="pPCP1" >> /biovar="Microtus" >> >> (I reduced the white space for emailing). All of that information >> makes sense as annotation for the whole sequence. 
In fact, the >> "organism" entry is duplicated on the ORGANISM line in the >> GenBank header (and the SOURCE line too). >> >> Currently we (Biopython, BioPerl etc) store this annotation in BioSQL >> using the seqfeature_qualifiter_value and seqfeature_dbxref tables, >> associated with a "source" feature in the seqfeature table. >> >> I am suggesting it could make more sense to store the "source" >> feature annotation at the sequence level, using instead the >> bioentry_qualifier_value and bioentry_dbxref tables. >> >> This is a slight shift from the origins of BioSQL as a schema to >> hold GenBank files - but to me at least it is more logical. >> >> What does everyone else think? Things work as they are... >> and "if it ain't broken don't fix it"? >> >> Peter >> >> [Even if Biopython changes its internal object structure to treat >> the "source" feature annotation as sequence level annotation, >> we *could* continue to use a "source" feature when loading >> GenBank files to/from BioSQL if required for compatibility with >> the other Bio* projects. It would be more work though. In any >> case, we'd also need to recreate a "source" feature when >> writing GenBank output files.] 
>> _______________________________________________ >> BioSQL-l mailing list >> BioSQL-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biosql-l > > -- > Richard Holland, BSc MBCS > Operations and Delivery Director, Eagle Genomics Ltd > T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com > http://www.eaglegenomics.com/ > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l From hlapp at gmx.net Wed Nov 18 14:28:01 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 18 Nov 2009 09:28:01 -0500 Subject: [BioSQL-l] Treating GenBank source features as top level annotation In-Reply-To: References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> Message-ID: <73839DF2-F9D2-441C-8AA3-649D619D2B1E@gmx.net> True - for chimeric sequences you can have multiple sources. That should be recognizable though from the length (and span) of the source feature location? -hilmar On Nov 18, 2009, at 8:10 AM, Chris Fields wrote: > Just to note, there are a few cases where there are two or more > source features. This pops up mainly with chimeric sequences, for > example: > > http://www.ncbi.nlm.nih.gov/nuccore/21727885 > > We have run into this a couple of times on the bioperl list. In > this case, each feature is limited to specific locations on the > sequence and doesn't pertain to the entire sequence. NCBI only > notes the first source on the ORGANISM line; last time I checked, > EMBL used both. 
> > chris > > On Nov 18, 2009, at 6:08 AM, Richard Holland wrote: > >> BioJava's latest parsers do the following: >> >> On read: >> >> SOURCE and ORGANISM top-level tags are completely ignored >> For each tag in each feature, including source: >> If it's a dbxref >> If it's taxon, set the taxon ID in the BioEntry table (if no / >> taxon is specified in the source feature the taxonomy does not get >> stored at all) >> Otherwise set dbxref as a feature CrossRef table entry >> If it's organism >> Add the organism name to the taxon ID in the Taxon table using >> the scientific taxon name type (if no /organism tag is specified in >> the source feature, the taxon gets the default name from NCBI, but >> only if the NCBI taxonomy data is already present in BioSQL) (if >> no /taxon is specified in the source feature, then the taxonomy >> does not get stored at all) >> Otherwise >> All other tags get mapped as feature qualifier values, >> including the source feature >> >> On write: >> >> SOURCE and ORGANISM tags are generated from the BioEntry taxon ID >> entry for the sequence, >> All features get qualifier values output plus /db_xref tags for >> all entries from the CrossRef table for the feature, >> The source feature is output as per a normal feature, plus / >> organism and /db_xref="taxon:..." tags generated as per the SOURCE >> and ORGANISM tags >> >> The main reason why we still use the source feature and don't go to >> sequence level is because when converting between formats it's hard >> to tell which sequence-level qualifier_values are from the source >> feature and which are from other places. >> >> The main reason why we rely entirely on the source feature for >> organism and taxon ID info is because it's much easier to parse >> than the SOURCE and ORGANISM tags. 
>> >> cheers, >> Richard >> >> On 18 Nov 2009, at 11:06, Peter wrote: >> >>> Hello all, >>> >>> Something we've just been discussing on the Biopython mailing list >>> is a possible change to how we parse the source features in GenBank >>> (or EMBL) files. This could have knock on implications for how we >>> use >>> BioSQL. For anyone interested, the thread is here: >>> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html >>> >>> The basic observation is that GenBank files do not have any >>> extensible >>> annotation block for the whole sequence. There are a few fields like >>> the comment, organism and taxonomy - but nothing general and >>> structured. Instead, it seems the NCBI etc decided to use the >>> feature >>> table for this task by inventing the "source" feature. In every >>> single >>> GenBank file I have ever seen with a source feature, there is only >>> one feature of this type and it spans the full sequence. >>> >>> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001 >>> plasmid pPCP1, complete sequence: >>> >>> source 1..9609 >>> /organism="Yersinia pestis biovar Microtus str. 91001" >>> /mol_type="genomic DNA" >>> /strain="91001" >>> /db_xref="taxon:229193" >>> /plasmid="pPCP1" >>> /biovar="Microtus" >>> >>> (I reduced the white space for emailing). All of that information >>> makes sense as annotation for the whole sequence. In fact, the >>> "organism" entry is duplicated on the ORGANISM line in the >>> GenBank header (and the SOURCE line too). >>> >>> Currently we (Biopython, BioPerl etc) store this annotation in >>> BioSQL >>> using the seqfeature_qualifiter_value and seqfeature_dbxref tables, >>> associated with a "source" feature in the seqfeature table. >>> >>> I am suggesting it could make more sense to store the "source" >>> feature annotation at the sequence level, using instead the >>> bioentry_qualifier_value and bioentry_dbxref tables. 
>>> >>> This is a slight shift from the origins of BioSQL as a schema to >>> hold GenBank files - but to me at least it is more logical. >>> >>> What does everyone else think? Things work as they are... >>> and "if it ain't broken don't fix it"? >>> >>> Peter >>> >>> [Even if Biopython changes its internal object structure to treat >>> the "source" feature annotation as sequence level annotation, >>> we *could* continue to use a "source" feature when loading >>> GenBank files to/from BioSQL if required for compatibility with >>> the other Bio* projects. It would be more work though. In any >>> case, we'd also need to recreate a "source" feature when >>> writing GenBank output files.] >>> _______________________________________________ >>> BioSQL-l mailing list >>> BioSQL-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biosql-l >> >> -- >> Richard Holland, BSc MBCS >> Operations and Delivery Director, Eagle Genomics Ltd >> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com >> http://www.eaglegenomics.com/ >> >> >> _______________________________________________ >> BioSQL-l mailing list >> BioSQL-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biosql-l > > > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From hlapp at gmx.net Wed Nov 18 16:50:04 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Wed, 18 Nov 2009 11:50:04 -0500 Subject: [BioSQL-l] [patch] load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl In-Reply-To: References: <625B7B90-85D5-4084-BC56-8EBB2CFF6EEE@jays.net> <044213EE-E980-4E48-870D-1F2896E937B3@jays.net> <4B022E68.2080304@open-bio.org> Message-ID: <03730DF5-706B-47F6-A191-8352E38CA42C@gmx.net> Hi Jay - 
thanks much for the patch, highly appreciated!

	-hilmar

On Nov 17, 2009, at 8:00 AM, Jay Hannah wrote:

> On Nov 16, 2009, at 11:02 PM, Mauricio Herrera Cuadra wrote:
>> I added you to the biosql group in the SVN server. You should be
>> able to commit the patch now.
>
> Thanks! r317 committed. :)
>
> j
>
>
>
> ------------------------------------------------------------------------
> r317 | jhannah | 2009-11-17 06:58:07 -0600 (Tue, 17 Nov 2009) | 2 lines
> Changed paths:
>    M /biosql-schema/trunk/INSTALL
>    M /biosql-schema/trunk/doc/schema-overview.txt
>
> load_taxonomy.pl script was renamed load_ncbi_taxonomy.pl.
> ------------------------------------------------------------------------
>
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biosql-l
>
--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net :
===========================================================

From maruco at gmail.com Mon Nov 9 14:35:31 2009
From: maruco at gmail.com (Thiago Satake)
Date: Mon, 9 Nov 2009 09:35:31 -0500
Subject: [BioSQL-l] Project participation
In-Reply-To: <82611efb0911090447y2aa1ae16x1106a41b7a0e2a1@mail.gmail.com>
References: <82611efb0911090447y2aa1ae16x1106a41b7a0e2a1@mail.gmail.com>
Message-ID:

Hi guys,

It sounds like a very interesting project! Where can I find more
information about it?

Thanks,

2009/11/9 »»»Jiji Kurup«««:
> Hi Lapp,
>
> I am very much interested to be a part of "JEE5 webservice
> interface to
> BioSQL" and
> "BioSQL web interface and API on Google App Engine" project.
> But i am not a student now, i am working in a bioinformatics
> company, so
> whether it is possible to
> do participate in any of this projects.
>
> Kindly let me known if there is any provision for it and tell me the
> procedure also.
> > > > > -- > Regards, > > Jiji Kurup > Application Scientist > _______________________________________________ > BioSQL-l mailing list > BioSQL-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biosql-l > -- Thiago Seito Satake Tel: +55(011) 6588-8045 From cjfields at illinois.edu Wed Nov 18 17:40:21 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 18 Nov 2009 11:40:21 -0600 Subject: [BioSQL-l] Treating GenBank source features as top level annotation In-Reply-To: <73839DF2-F9D2-441C-8AA3-649D619D2B1E@gmx.net> References: <320fb6e00911180306l54d75143w57a3a2c54317bbb7@mail.gmail.com> <8D7C2C0C-F7DD-4085-822B-B5495C3E98D2@eaglegenomics.com> <73839DF2-F9D2-441C-8AA3-649D619D2B1E@gmx.net> Message-ID: Yes; the location appears to specify regions of sequence originating from the indicated source. chris On Nov 18, 2009, at 8:28 AM, Hilmar Lapp wrote: > True - for chimeric sequences you can have multiple sources. That should be recognizable though from the length (and span) of the source feature location? > > -hilmar > > On Nov 18, 2009, at 8:10 AM, Chris Fields wrote: > >> Just to note, there are a few cases where there are two or more source features. This pops up mainly with chimeric sequences, for example: >> >> http://www.ncbi.nlm.nih.gov/nuccore/21727885 >> >> We have run into this a couple of times on the bioperl list. In this case, each feature is limited to specific locations on the sequence and doesn't pertain to the entire sequence. NCBI only notes the first source on the ORGANISM line; last time I checked, EMBL used both. 
>> >> chris >> >> On Nov 18, 2009, at 6:08 AM, Richard Holland wrote: >> >>> BioJava's latest parsers do the following: >>> >>> On read: >>> >>> SOURCE and ORGANISM top-level tags are completely ignored >>> For each tag in each feature, including source: >>> If it's a dbxref >>> If it's taxon, set the taxon ID in the BioEntry table (if no /taxon is specified in the source feature the taxonomy does not get stored at all) >>> Otherwise set dbxref as a feature CrossRef table entry >>> If it's organism >>> Add the organism name to the taxon ID in the Taxon table using the scientific taxon name type (if no /organism tag is specified in the source feature, the taxon gets the default name from NCBI, but only if the NCBI taxonomy data is already present in BioSQL) (if no /taxon is specified in the source feature, then the taxonomy does not get stored at all) >>> Otherwise >>> All other tags get mapped as feature qualifier values, including the source feature >>> >>> On write: >>> >>> SOURCE and ORGANISM tags are generated from the BioEntry taxon ID entry for the sequence, >>> All features get qualifier values output plus /db_xref tags for all entries from the CrossRef table for the feature, >>> The source feature is output as per a normal feature, plus /organism and /db_xref="taxon:..." tags generated as per the SOURCE and ORGANISM tags >>> >>> The main reason why we still use the source feature and don't go to sequence level is because when converting between formats it's hard to tell which sequence-level qualifier_values are from the source feature and which are from other places. >>> >>> The main reason why we rely entirely on the source feature for organism and taxon ID info is because it's much easier to parse than the SOURCE and ORGANISM tags. 
>>> >>> cheers, >>> Richard >>> >>> On 18 Nov 2009, at 11:06, Peter wrote: >>> >>>> Hello all, >>>> >>>> Something we've just been discussing on the Biopython mailing list >>>> is a possible change to how we parse the source features in GenBank >>>> (or EMBL) files. This could have knock on implications for how we use >>>> BioSQL. For anyone interested, the thread is here: >>>> http://lists.open-bio.org/pipermail/biopython/2009-November/005826.html >>>> >>>> The basic observation is that GenBank files do not have any extensible >>>> annotation block for the whole sequence. There are a few fields like >>>> the comment, organism and taxonomy - but nothing general and >>>> structured. Instead, it seems the NCBI etc decided to use the feature >>>> table for this task by inventing the "source" feature. In every single >>>> GenBank file I have ever seen with a source feature, there is only >>>> one feature of this type and it spans the full sequence. >>>> >>>> For example, NC_005816, Yersinia pestis biovar Microtus str. 91001 >>>> plasmid pPCP1, complete sequence: >>>> >>>> source 1..9609 >>>> /organism="Yersinia pestis biovar Microtus str. 91001" >>>> /mol_type="genomic DNA" >>>> /strain="91001" >>>> /db_xref="taxon:229193" >>>> /plasmid="pPCP1" >>>> /biovar="Microtus" >>>> >>>> (I reduced the white space for emailing). All of that information >>>> makes sense as annotation for the whole sequence. In fact, the >>>> "organism" entry is duplicated on the ORGANISM line in the >>>> GenBank header (and the SOURCE line too). >>>> >>>> Currently we (Biopython, BioPerl etc) store this annotation in BioSQL >>>> using the seqfeature_qualifiter_value and seqfeature_dbxref tables, >>>> associated with a "source" feature in the seqfeature table. >>>> >>>> I am suggesting it could make more sense to store the "source" >>>> feature annotation at the sequence level, using instead the >>>> bioentry_qualifier_value and bioentry_dbxref tables. 
>>>> >>>> This is a slight shift from the origins of BioSQL as a schema to >>>> hold GenBank files - but to me at least it is more logical. >>>> >>>> What does everyone else think? Things work as they are... >>>> and "if it ain't broken don't fix it"? >>>> >>>> Peter >>>> >>>> [Even if Biopython changes its internal object structure to treat >>>> the "source" feature annotation as sequence level annotation, >>>> we *could* continue to use a "source" feature when loading >>>> GenBank files to/from BioSQL if required for compatibility with >>>> the other Bio* projects. It would be more work though. In any >>>> case, we'd also need to recreate a "source" feature when >>>> writing GenBank output files.] >>>> _______________________________________________ >>>> BioSQL-l mailing list >>>> BioSQL-l at lists.open-bio.org >>>> http://lists.open-bio.org/mailman/listinfo/biosql-l >>> >>> -- >>> Richard Holland, BSc MBCS >>> Operations and Delivery Director, Eagle Genomics Ltd >>> T: +44 (0)1223 654481 ext 3 | E: holland at eaglegenomics.com >>> http://www.eaglegenomics.com/ >>> >>> >>> _______________________________________________ >>> BioSQL-l mailing list >>> BioSQL-l at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biosql-l >> >> >> _______________________________________________ >> BioSQL-l mailing list >> BioSQL-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biosql-l >> > > -- > =========================================================== > : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : > =========================================================== > > > From biopython at maubp.freeserve.co.uk Tue Nov 24 14:27:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 24 Nov 2009 14:27:39 +0000 Subject: [BioSQL-l] SQLite support In-Reply-To: <320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com> References: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com> 
 <320fb6e00812150243w4b0dc223g40abcf684af1ccf5@mail.gmail.com>
 <320fb6e00907050324i6d64d3abreb4d0c256bf1bdc4@mail.gmail.com>
 <320fb6e00907090529t61239952y1c86963f13c1db78@mail.gmail.com>
 <320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com>
Message-ID: <320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com>

On Tue, Jul 28, 2009 at 11:58 AM, Peter wrote:
> On Thu, Jul 9, 2009 at 1:29 PM, Peter wrote:
>> Hi Hilmar,
>>
>> I've filed a BioSQL enhancement bug 2870 for adding an SQLite
>> schema to BioSQL, and Brad has attached his proposed schema
>> (converted from that for MySQL) to the bug:
>>
>> http://bugzilla.open-bio.org/show_bug.cgi?id=2870
>>
>> Could you take a look at this please? If you are happy with it, we'd like
>> to have it included in BioSQL v1.0.2 (even if Biopython is initially the
>> only Bio* project to support it).
>
> Have you had a chance to look at this yet Hilmar? Brad is keen to
> include BioSQL support for SQLite in the next release of Biopython
> (hopefully within the next week or two), but to do this I'd like your
> blessing, and for the proposed SQLite BioSQL schema to be added
> to the BioSQL SVN repository.

Hi again Hilmar,

Just a reminder about the BioSQL on SQLite proposals - we'd still
like to ship this with the *next* Biopython release (having skipped it
for Biopython 1.52 a couple of months back).
Regards,

Peter

From cjfields at illinois.edu Tue Nov 24 16:36:33 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 24 Nov 2009 10:36:33 -0600
Subject: [BioSQL-l] SQLite support
In-Reply-To: <320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com>
References: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com>
 <320fb6e00812150243w4b0dc223g40abcf684af1ccf5@mail.gmail.com>
 <320fb6e00907050324i6d64d3abreb4d0c256bf1bdc4@mail.gmail.com>
 <320fb6e00907090529t61239952y1c86963f13c1db78@mail.gmail.com>
 <320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com>
 <320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com>
Message-ID: <070E8BA8-B2C1-4E44-AA2D-9934B3742406@illinois.edu>

On Nov 24, 2009, at 8:27 AM, Peter wrote:

> On Tue, Jul 28, 2009 at 11:58 AM, Peter wrote:
>> On Thu, Jul 9, 2009 at 1:29 PM, Peter wrote:
>>> Hi Hilmar,
>>>
>>> I've filed a BioSQL enhancement bug 2870 for adding an SQLite
>>> schema to BioSQL, and Brad has attached his proposed schema
>>> (converted from that for MySQL) to the bug:
>>>
>>> http://bugzilla.open-bio.org/show_bug.cgi?id=2870
>>>
>>> Could you take a look at this please? If you are happy with it, we'd like
>>> to have it included in BioSQL v1.0.2 (even if Biopython is initially the
>>> only Bio* project to support it).
>>
>> Have you had a chance to look at this yet Hilmar? Brad is keen to
>> include BioSQL support for SQLite in the next release of Biopython
>> (hopefully within the next week or two), but to do this I'd like your
>> blessing, and for the proposed SQLite BioSQL schema to be added
>> to the BioSQL SVN repository.
>
> Hi again Hilmar,
>
> Just a reminder about the BioSQL on SQLite proposals - we'd still
> like to ship this with the *next* Biopython release (having skipped it
> for Biopython 1.52 a couple of months back).
>
> Regards,
>
> Peter

Just want to add that I would like to see SQLite support as well (I
might even feel the need to implement the necessary bioperl-db bits).
chris

From biopython at maubp.freeserve.co.uk Tue Nov 24 17:07:19 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 24 Nov 2009 17:07:19 +0000
Subject: [BioSQL-l] SQLite support
In-Reply-To: <070E8BA8-B2C1-4E44-AA2D-9934B3742406@illinois.edu>
References: <1f864af10812150224y540f1ba6y6b30168102885fcd@mail.gmail.com>
 <320fb6e00812150243w4b0dc223g40abcf684af1ccf5@mail.gmail.com>
 <320fb6e00907050324i6d64d3abreb4d0c256bf1bdc4@mail.gmail.com>
 <320fb6e00907090529t61239952y1c86963f13c1db78@mail.gmail.com>
 <320fb6e00907280458q56f74ec6iefa420ac1caab8da@mail.gmail.com>
 <320fb6e00911240627o49bc1ec9nc0d26065ebc23423@mail.gmail.com>
 <070E8BA8-B2C1-4E44-AA2D-9934B3742406@illinois.edu>
Message-ID: <320fb6e00911240907u32dca751ldb488cbc38f0e035@mail.gmail.com>

On Tue, Nov 24, 2009 at 4:36 PM, Chris Fields wrote:
>
> Just want to add that I would like to see SQLite support as well
> (I might even feel the need to implement the necessary bioperl-db bits).

Excellent :)

Peter

From desouza at ncbi.nlm.nih.gov Thu Nov 19 21:58:45 2009
From: desouza at ncbi.nlm.nih.gov (De souza, Robson (NIH/NLM/NCBI) [F])
Date: Thu, 19 Nov 2009 16:58:45 -0500
Subject: [BioSQL-l] update ontology
Message-ID: <340BA68A9AA4F548881B3CCC5435061B45A6A2@NIHCESMLBX15.nih.gov>

Hi guys,

I'm trying to use a BioSQL database to store some protein annotation
and need to know a few things:

- First thing: I want to be able to update the ontologies we are
  working on automatically, but I could not get bp_load_ontology.pl to
  do it. What I want is to replace any changed terms and their
  relationships with new ones without losing the association between
  unmodified terms and the annotated proteins. I still don't know what
  a scriptlet for --mergeobjs should look like, or whether such a
  scriptlet is the way to go in this case.

- How do I represent protein domains in BioSQL?
  I was thinking of writing code to add domains as bioentries and then
  use bioentry_relationship to associate sequence and domain, but I
  would also like to store the coordinates of each domain in a protein,
  which would imply associating bioentries with seqfeatures. Does any
  of you have another suggestion in this direction?

Thanks!
Robson

From biopython at maubp.freeserve.co.uk Wed Nov 25 21:39:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 25 Nov 2009 21:39:39 +0000
Subject: [BioSQL-l] update ontology
In-Reply-To: <340BA68A9AA4F548881B3CCC5435061B45A6A2@NIHCESMLBX15.nih.gov>
References: <340BA68A9AA4F548881B3CCC5435061B45A6A2@NIHCESMLBX15.nih.gov>
Message-ID: <320fb6e00911251339j700517cckac3ddb0adc323f00@mail.gmail.com>

On Thu, Nov 19, 2009 at 9:58 PM, De souza, Robson (NIH/NLM/NCBI) [F] wrote:
>
>
> - How do I represent protein domains in BioSQL?
> I was thinking of writing code to add domains as bioentries and then use
> bioentry_relationship to associate sequence and domain, but I would also
> like to store the coordinates of each domain in a protein, which would
> imply associating bioentries with seqfeatures. Does any of you have
> another suggestion in this direction?

I would do what Biopython/BioPerl/Bio* would do on loading a GenPept
file into BioSQL - have a bioentry for each protein, with its amino
acid sequence, and for each domain a seqfeature entry (which records
the location within the parent protein).

Peter
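
The layout suggested in the reply above (one bioentry per protein, one seqfeature per domain, coordinates in a location row) can be sketched roughly as follows. This is only an illustration, not the real BioSQL DDL: the tables below are cut-down stand-ins that keep a handful of columns, and the helper names (add_protein, add_domain, domains_of), the accession P_EXAMPLE and the domain names are all hypothetical. It uses Python's standard sqlite3 module with an in-memory database.

```python
import sqlite3

# Simplified stand-ins for the BioSQL tables bioentry, biosequence, term,
# seqfeature and location. Only a few columns are kept; the real schema has
# more columns and constraints, so treat this purely as an illustration.
SCHEMA = """
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY,
    accession   TEXT NOT NULL
);
CREATE TABLE biosequence (
    bioentry_id INTEGER PRIMARY KEY REFERENCES bioentry(bioentry_id),
    alphabet    TEXT,
    seq         TEXT
);
CREATE TABLE term (
    term_id INTEGER PRIMARY KEY,
    name    TEXT NOT NULL
);
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY,
    bioentry_id   INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    type_term_id  INTEGER NOT NULL REFERENCES term(term_id),
    display_name  TEXT
);
CREATE TABLE location (
    location_id   INTEGER PRIMARY KEY,
    seqfeature_id INTEGER NOT NULL REFERENCES seqfeature(seqfeature_id),
    start_pos     INTEGER,
    end_pos       INTEGER
);
"""


def add_protein(conn, accession, seq):
    """Hypothetical helper: store one protein as a bioentry plus its sequence."""
    cur = conn.execute("INSERT INTO bioentry (accession) VALUES (?)", (accession,))
    bioentry_id = cur.lastrowid
    conn.execute(
        "INSERT INTO biosequence (bioentry_id, alphabet, seq) VALUES (?, 'protein', ?)",
        (bioentry_id, seq),
    )
    return bioentry_id


def add_domain(conn, bioentry_id, type_term_id, name, start, end):
    """Hypothetical helper: store one domain as a seqfeature with one location."""
    cur = conn.execute(
        "INSERT INTO seqfeature (bioentry_id, type_term_id, display_name) VALUES (?, ?, ?)",
        (bioentry_id, type_term_id, name),
    )
    seqfeature_id = cur.lastrowid
    conn.execute(
        "INSERT INTO location (seqfeature_id, start_pos, end_pos) VALUES (?, ?, ?)",
        (seqfeature_id, start, end),
    )
    return seqfeature_id


def domains_of(conn, accession):
    """Return (name, start, end) for every 'domain' feature of one protein."""
    return conn.execute(
        """
        SELECT sf.display_name, loc.start_pos, loc.end_pos
        FROM bioentry be
        JOIN seqfeature sf ON sf.bioentry_id = be.bioentry_id
        JOIN term t ON t.term_id = sf.type_term_id
        JOIN location loc ON loc.seqfeature_id = sf.seqfeature_id
        WHERE be.accession = ? AND t.name = 'domain'
        ORDER BY loc.start_pos
        """,
        (accession,),
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
domain_term = conn.execute("INSERT INTO term (name) VALUES ('domain')").lastrowid
protein = add_protein(conn, "P_EXAMPLE", "M" * 300)  # made-up accession and sequence
add_domain(conn, protein, domain_term, "domain_A", 20, 150)
add_domain(conn, protein, domain_term, "domain_B", 180, 260)
print(domains_of(conn, "P_EXAMPLE"))
```

With this arrangement the domain coordinates live in the location rows attached to each seqfeature, so no bioentry-to-seqfeature association beyond the seqfeature's own bioentry_id is needed. In practice the Bio* loaders would populate the real tables; the sketch only shows the shape of the data.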