From chapmanb at 50mail.com Tue Jun 1 07:34:20 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 1 Jun 2010 07:34:20 -0400
Subject: [Open-bio-l] Best practice for modelling data in GFF
In-Reply-To:
References: <4BFFF7FE.1030004@bioperl.org> <685C1029-0E3A-45DD-BBA9-FFA7747810D5@illinois.edu>
Message-ID: <20100601113420.GL1054@sobchak.mgh.harvard.edu>

Dan;
If what you are trying to do is represent your data in a way that
the most people can parse and reuse it, my suggestion would be to
use SAM/BAM to represent your alignments. You'll be using a
standardized and well-supported format specifically designed for
this type of data.

While you can do this with GFF, the parser support for correctly
dealing with match_part or part_of is likely to be less robust. As
data providers standardize on one way to represent nested features,
it should become easier to deal with them.

Brad

> Thanks all for replies.
>
> I'm aware of the GFF spec, and the SO ontology terms. The issue here
> (as I understand it) is that the feature isn't 'flat', but is a
> combination of two matching 'reads' that are grouped into a mate-pair
> depending on their proximity and orientation. As pointed out, not
> every pair is successfully mapped, specifically one read may be
> 'missing' from the pair, the pair may span two reference sequences, or
> the proximity or orientation of the pair may be incorrect.
>
> Strictly speaking this can be handled by match and match_part (or
> read_pair and part_of) terms, however, the question is, does this
> reflect the biology adequately? (And specifically which terms should
> be used?)
>
> There is a canonical way to model a gene, so I was wondering if it
> makes sense to describe similar 'biology' (or in this case molecular
> biology) in standard ways (when the feature isn't simply described by
> a single line of GFF)?
>
> Perhaps I've not understood SO properly, but I'm not sure how its
> structure is translated into GFF structure ... is there a 1 to 1
> mapping?
>
>
> Cheers,
> Dan.
>
> On 28 May 2010 18:49, Chris Fields wrote:
> > All,
> >
> > Appears that link isn't up to date. Current GFF3 spec (v. 1.16, updated May 25) here:
> >
> > http://www.sequenceontology.org/gff3.shtml
> >
> > chris
> >
> > On May 28, 2010, at 12:06 PM, Jason Stajich wrote:
> >
> >> It's covered in the GFF3 spec as match_part if that helps.
> >> http://song.sourceforge.net/gff3.shtml
> >>
> >> Dan Bolser wrote, On 5/28/10 9:29 AM:
> >>> Hi guys,
> >>>
> >>> Not sure if this is the right forum, but I just thought I'd ask...
> >>>
> >>> Where can I find information on 'best practices' for modelling
> >>> biological data in GFF?
> >>>
> >>> For example, I'd like to model paired-end sequence alignments in GFF.
> >>> One suggestion was to use match/match_part to link each end into a
> >>> pair. Another option is to use 'read_pair' with 'contig' for the
> >>> parent feature...
> >>>
> >>> Should I just be using SAM/BAM?
> >>>
> >>> Seems a shame not to have a standard way to do this in GFF...
> >>>
> >>>
> >>> Cheers,
> >>> Dan.
> >>> _______________________________________________
> >>> Open-Bio-l mailing list
> >>> Open-Bio-l at lists.open-bio.org
> >>> http://lists.open-bio.org/mailman/listinfo/open-bio-l

From asidhu at biomap.org Fri Jun 4 10:57:36 2010
From: asidhu at biomap.org (Amandeep Sidhu)
Date: Fri, 4 Jun 2010 22:57:36 +0800
Subject: [Open-bio-l] CFP: 23rd IEEE International Symposium on Computer-Based Medical Systems 2010 (IEEE CBMS 2010)
Message-ID:

IEEE CBMS 2010
23rd IEEE International Symposium on Computer-Based Medical Systems 2010
Perth, Australia, 12-15 October 2010
http://www.cbms2010.curtin.edu.au/

The 23rd IEEE International Symposium on Computer-Based Medical
Systems (CBMS 2010) is intended to provide an international forum for
discussing the latest results in the field of computational medicine.
The scientific program of CBMS 2010 will consist of invited keynote
talks given by leading scientists in the field, and regular and
special track sessions that cover a broad array of issues which relate
computing to medicine.

RELEVANT TOPICS

Network and Telemedicine Systems
Medical Databases & Information Systems
Computer-Aided Diagnosis
Medical Devices with Embedded Computers
Bioinformatics in Medicine
Software Systems in Medicine
Pervasive Health Systems and Services
Web-based Delivery of Medical Information
Medical Image Segmentation & Compression
Content Analysis of Biomedical Image Data
Knowledge-Based & Decision Support Systems
Hand-held Computing Applications in Medicine
Knowledge Discovery & Data Mining
Signal and Image Processing in Medicine
Multimedia Biomedical Databases

CBMS 2010 invites original, previously unpublished contributions that
are not submitted concurrently to a journal or another conference.
Many of the topics listed above are represented by corresponding
Special Tracks, while others are covered solely by the general CBMS
track. Prospective authors are expected to submit their contributions
to one of the corresponding Special Tracks, or to the general track if
none of the special tracks is relevant.

SPECIAL TRACKS

ST1: Computational Proteomics and Genomics
ST2: Knowledge Discovery and Decision Systems in Biomedicine
ST3: Ontologies for Biomedical Systems
ST4: HealthGrid & Cloud Computing
ST5: Technology Enhanced Learning in Medical Education
ST6: Intelligent Patient Management
ST7: Data Streams in Healthcare
ST8: Supporting Collaboration among Healthcare Workers
ST9: Telemedicine
ST10: Computer-Based Systems for Mental Health
ST11: Image Informatics in Biomedical Research and Clinical Medicine
ST12: e-Health

SUBMISSION GUIDELINES

Papers should be submitted electronically using the EasyChair online
submission system. The papers must be prepared following the IEEE
two-column format and should not exceed the length of 6 (six)
Letter-sized pages. LaTeX or Microsoft Word templates can be used when
preparing the papers. Please note that submissions are accepted in PDF
format only.

Submission web site:
http://www.easychair.org/conferences/?conf=cbms2010

All submissions will be peer-reviewed by at least three reviewers. The
proceedings will be published by the IEEE Computer Society Press. At
least one of the authors of accepted papers is required to register
and present the work at the conference; otherwise their papers will be
removed from the digital library after the conference.

IMPORTANT DATES

Submission deadline for regular papers: 24 June 2010
Deadline for tutorial submission: 24 June 2010
Notification of acceptance for papers and tutorials: 2 Aug 2010
Final camera ready due: 2 Sep 2010
Author registration: 2 Sep 2010

INTENDED AUDIENCE

Engineers, scientists, clinicians and managers involved in medical
computing projects are encouraged to submit papers to the symposium
and/or attend the symposium. The symposium provides its attendees with
an opportunity to experience state-of-the-art research and development
in a variety of topics directly and indirectly related to their own
work. In addition to research papers, keynote speakers and tutorial
sessions, it provides participants with an opportunity to come
up-to-date on important technological issues. The symposium encourages
the participation of students engaged in research/development in
computer-based medical systems.

Organizing Committee

GENERAL CHAIRS
Tharam Dillon, Curtin University of Technology, Australia
Daniel Rubin, National Center for Biomedical Ontologies, USA
William Gallagher, University College Dublin, Ireland

PROGRAM CHAIRS
Amandeep Sidhu, Curtin University of Technology, Australia
Alexey Tsymbal, Siemens, Germany

PUBLICATION CHAIRS
Mykola Pechenizkiy, Eindhoven University of Technology, Netherlands
Tony Hu, Drexel University, USA

SPECIAL TRACK CHAIRS
Maja Hadzic, Curtin University of Technology, Australia
Jake Chen, Indiana University, USA

TUTORIAL CHAIRS
Phoebe Chen, La Trobe University, Australia
Xiaofang Zhou, University of Queensland, Australia

PUBLICITY CHAIRS
Carolyn McGregor, University of Ontario Institute of Technology, Canada
Meifania Chen, Curtin University of Technology, Australia

From biopython at maubp.freeserve.co.uk Mon Jun 7 13:56:07 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 18:56:07 +0100
Subject: [Open-bio-l] Common SQLite3 schema for flat file indexing (a new OBDA standard)?
In-Reply-To:
References:
Message-ID:

On Tue, Apr 13, 2010 at 11:26 AM, Peter wrote:
> Hello all,
>
> Last year we had a brief discussion about the Open Biological
> Database Access (OBDA) indexing for "flat files" which BioPerl and
> BioRuby at least still support (despite some confusion over the spec):
> http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html
>
> There may still be life in the current Berkeley DB (BDB) based OBDA
> index, but with ever larger sequence files needed in next generation
> sequencing this could be a problem. Is anyone finding problems with
> the current BDB index scaling to larger files (with tens of millions of
> entries to index)?
>
> From the Biopython perspective, we have a small external incentive
> to favour SQLite3 over BDB: The python standard library has
> historically included a BerkeleyDB module (bsddb) but it has been
> deprecated in Python 3.
> On the other hand, all recent versions of
> Python include SQLite3 support.
>
> Those of you on the BioPerl or Biopython mailing lists will have heard
> me mention the idea of using SQLite to hold a flat file index, e.g.
> http://lists.open-bio.org/pipermail/bioperl-l/2010-April/032713.html
> http://lists.open-bio.org/pipermail/biopython/2009-December/005997.html
>
> From the BioPerl thread I know they are now looking at using SQLite to hold
> a lookup table of file offsets. Are any of the other Bio* projects interested
> in this approach? I'd ideally like us to agree something shared with all the
> Bio* libraries (a new OBDA standard using SQLite3 instead of BDB). I was
> thinking something along these lines if we want to support an index for
> multiple files - just three tables:
>
> * meta - table with string key/values (in particular to hold a schema version
> number, plus perhaps the tool which built the index)
>
> * offsets - table with entry accessions, file number (key to next table),
> file offset
>
> * files - table with filenames, file type (e.g. FASTA), datestamp
> (so we can spot if the index is older than the file and needs to be
> updated), perhaps other things like if the file is compressed (gzip,
> bz2, ...).
>
> Of course, there are complications. For instance, calculating the offsets
> when dealing with different file encodings and new lines. Mark Schreiber
> raised this as a concern with Java (see open-bio-l thread linked to above,
> email dated 2 Sept 2009). The new line issue could also affect Biopython,
> but this may not be a real issue in practice unless moving indexes between
> operating systems.
>
> Regards,
>
> Peter
> (@Biopython)

Hi all,

We've been discussing this again on the Biopython mailing list, and
the plan to store offsets in an SQLite3 database seems quite popular.
In the short term I'm just aiming for indexing single files, but it does
seem likely that many people would find multi-file indexing useful.
What do the other Bio* projects think? Should we try to co-ordinate
a specification for a common SQLite3 file indexing schema?

Regards,

Peter

From cjfields at illinois.edu Mon Jun 7 15:08:27 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 7 Jun 2010 14:08:27 -0500
Subject: [Open-bio-l] Common SQLite3 schema for flat file indexing (a new OBDA standard)?
In-Reply-To:
References:
Message-ID: <119CA958-B7B7-4ACA-B8C1-04616648B91A@illinois.edu>

On Jun 7, 2010, at 12:56 PM, Peter wrote:

> On Tue, Apr 13, 2010 at 11:26 AM, Peter wrote:
>> Hello all,
>>
>> Last year we had a brief discussion about the Open Biological
>> Database Access (OBDA) indexing for "flat files" which BioPerl and
>> BioRuby at least still support (despite some confusion over the spec):
>> http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html
>>
>> There may still be life in the current Berkeley DB (BDB) based OBDA
>> index, but with ever larger sequence files needed in next generation
>> sequencing this could be a problem. Is anyone finding problems with
>> the current BDB index scaling to larger files (with tens of millions of
>> entries to index)?
>>
>> From the Biopython perspective, we have a small external incentive
>> to favour SQLite3 over BDB: The python standard library has
>> historically included a BerkeleyDB module (bsddb) but it has been
>> deprecated in Python 3. On the other hand, all recent versions of
>> Python include SQLite3 support.
>>
>> Those of you on the BioPerl or Biopython mailing lists will have heard
>> me mention the idea of using SQLite to hold a flat file index, e.g.
>> http://lists.open-bio.org/pipermail/bioperl-l/2010-April/032713.html
>> http://lists.open-bio.org/pipermail/biopython/2009-December/005997.html
>>
>> From the BioPerl thread I know they are now looking at using SQLite to hold
>> a lookup table of file offsets. Are any of the other Bio* projects interested
>> in this approach?
>> I'd ideally like us to agree something shared with all the
>> Bio* libraries (a new OBDA standard using SQLite3 instead of BDB). I was
>> thinking something along these lines if we want to support an index for
>> multiple files - just three tables:
>>
>> * meta - table with string key/values (in particular to hold a schema version
>> number, plus perhaps the tool which built the index)
>>
>> * offsets - table with entry accessions, file number (key to next table),
>> file offset
>>
>> * files - table with filenames, file type (e.g. FASTA), datestamp
>> (so we can spot if the index is older than the file and needs to be
>> updated), perhaps other things like if the file is compressed (gzip,
>> bz2, ...).
>>
>> Of course, there are complications. For instance, calculating the offsets
>> when dealing with different file encodings and new lines. Mark Schreiber
>> raised this as a concern with Java (see open-bio-l thread linked to above,
>> email dated 2 Sept 2009). The new line issue could also affect Biopython,
>> but this may not be a real issue in practice unless moving indexes between
>> operating systems.
>>
>> Regards,
>>
>> Peter
>> (@Biopython)
>
> Hi all,
>
> We've been discussing this again on the Biopython mailing list, and
> the plan to store offsets in an SQLite3 database seems quite popular.
> In the short term I'm just aiming for indexing single files, but it does
> seem likely that many people would find multi-file indexing useful.
> What do the other Bio* projects think? Should we try to co-ordinate
> a specification for a common SQLite3 file indexing schema?
>
> Regards,
>
> Peter

We typically implement multifile indexing in bioperl (either via a
directory or a list of files). Not much point in limiting it to one
file.

Have you looked at the OBDA standard? It is a good start along these
lines, but I think it dwindled a bit. Might be worth reworking and
modernizing it to suit our needs.

chris

From biopython at maubp.freeserve.co.uk Mon Jun 7 16:11:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 21:11:52 +0100
Subject: [Open-bio-l] Common SQLite3 schema for flat file indexing (a new OBDA standard)?
In-Reply-To: <119CA958-B7B7-4ACA-B8C1-04616648B91A@illinois.edu>
References: <119CA958-B7B7-4ACA-B8C1-04616648B91A@illinois.edu>
Message-ID:

On Mon, Jun 7, 2010 at 8:08 PM, Chris Fields wrote:
>
> We typically implement multifile indexing in bioperl (either via a
> directory or a list of files). Not much point in limiting it to one file.
>

Well, I guess as always it depends on what you're doing - I've
found single file indexes very handy on several occasions.

> Have you looked at the OBDA standard? It is a good start along
> these lines, but I think it dwindled a bit. Might be worth reworking
> and modernizing it to suit our needs.

I have looked at the OBDA documentation (I might even try and
rewrite support for it in Biopython if anyone was interested), and
an update using SQLite3 seems a sensible approach to try.

Peter

From cjfields at illinois.edu Mon Jun 7 16:17:09 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 7 Jun 2010 15:17:09 -0500
Subject: [Open-bio-l] Common SQLite3 schema for flat file indexing (a new OBDA standard)?
In-Reply-To:
References: <119CA958-B7B7-4ACA-B8C1-04616648B91A@illinois.edu>
Message-ID: <7C15C5C7-B72A-4DE2-AF13-EAB657074A95@illinois.edu>

On Jun 7, 2010, at 3:11 PM, Peter wrote:

> On Mon, Jun 7, 2010 at 8:08 PM, Chris Fields wrote:
>>
>> We typically implement multifile indexing in bioperl (either via a
>> directory or a list of files). Not much point in limiting it to one file.
>>
>
> Well, I guess as always it depends on what you're doing - I've
> found single file indexes very handy on several occasions.

I think it depends on how complex the indexing is.

>> Have you looked at the OBDA standard? It is a good start along
>> these lines, but I think it dwindled a bit.
Might be worth reworking >> and modernizing it to suit our needs. > > I have looked at the OBDA documentation (I might even try and > rewrite support for it in Biopython if anyone was interested), and > an update using SQLite3 seems a sensible approach to try. > > Peter It would be nice to have a standard for this that works cross-Bio* (and possibly beyond). That was the original intent. Maybe OBDA v2 if needed? chris From biopython at maubp.freeserve.co.uk Tue Jun 8 05:43:46 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 8 Jun 2010 10:43:46 +0100 Subject: [Open-bio-l] Common SQLite3 schema for flat file indexing (a new OBDA standard)? In-Reply-To: <7C15C5C7-B72A-4DE2-AF13-EAB657074A95@illinois.edu> References: <119CA958-B7B7-4ACA-B8C1-04616648B91A@illinois.edu> <7C15C5C7-B72A-4DE2-AF13-EAB657074A95@illinois.edu> Message-ID: On Mon, Jun 7, 2010 at 9:17 PM, Chris Fields wrote: > > It would be nice to have a standard for this that works cross-Bio* > (and possibly beyond). ?That was the original intent. ?Maybe OBDA > v2 if needed? > Yes, that was what I was suggesting. With the Biopython code I'm almost in a position to start bench marking different on disk indexes (SQLite isn't necessarily the best option); supporting indexing multiple files at once (which would be needed to restore parity with OBDA v1) is more work. Peter From chapmanb at 50mail.com Tue Jun 1 11:34:20 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 1 Jun 2010 07:34:20 -0400 Subject: [Open-bio-l] Best practice for modelling data in GFF In-Reply-To: References: <4BFFF7FE.1030004@bioperl.org> <685C1029-0E3A-45DD-BBA9-FFA7747810D5@illinois.edu> Message-ID: <20100601113420.GL1054@sobchak.mgh.harvard.edu> Dan; If what you are trying to do is represent your data in a way that the most people can parse and reuse it, my suggestion would be to use SAM/BAM to represent your alignments. 
You'll be using a standardized and well-supported format specifically designed for this type of data. While you can do this with GFF, the parser support for correctly dealing with match_part or part_of is likely to be less robust. As data providers standardize on one way to represent nested features, it should become easier to deal with them. Brad > Thanks all for replies. > > I'm aware of the GFF spec, and the SO ontology terms. The issue here > (as I understand it) is that the feature isn't 'flat', but is a > combination of two matching 'reads' that are grouped into a mate-pair > depending on their proximity and orientation. As pointed out, not > every pair is successfully mapped, specifically one read may be > 'missing' from the pair, the pair may span two reference sequences, or > the proximity or orientation of the pair may be incorrect. > > Strictly speaking this can be handled by match and match_part (or > read_pair and part_of) terms, however, the question is, does this > reflect the biology adequately? (And specifically which terms should > be used?) > > There is a canonical way to model a gene, so I was wondering if it > makes sense to describe similar 'biology' (or in this case molecular > biology) in standard ways (when the feature isn't simply described by > a single line of GFF)? > > Perhaps I've not understood SO properly, but I'm not sure how its > structure is translated into GFF structure ... is there a 1 to 1 > mapping? > > > Cheers, > Dan. > > On 28 May 2010 18:49, Chris Fields wrote: > > All, > > > > Appears that link isn't up to date. ?Current GFF3 spec (v. 1.16, updated May 25) here: > > > > http://www.sequenceontology.org/gff3.shtml > > > > chris > > > > On May 28, 2010, at 12:06 PM, Jason Stajich wrote: > > > >> It's covered in the GFF3 spec as match_part if that helps. 
> >> http://song.sourceforge.net/gff3.shtml > >> > >> Dan Bolser wrote, On 5/28/10 9:29 AM: > >>> Hi guys, > >>> > >>> Not sure if this is the right forum, but I just thought I'd ask... > >>> > >>> Where can I find information on 'best practices' for modelling > >>> biological data in GFF? > >>> > >>> For example, I'd like to model paired-end sequence alignments in GFF. > >>> One suggestion was to use match/match_part to link each end into a > >>> pair. Another option is to use 'read_pair' with 'contig' for the > >>> parent feature... > >>> > >>> Should I just be using SAM/BAM? > >>> > >>> Seems a shame not to have a standard way to do this in GFF... > >>> > >>> > >>> Cheers, > >>> Dan. > >>> _______________________________________________ > >>> Open-Bio-l mailing list > >>> Open-Bio-l at lists.open-bio.org > >>> http://lists.open-bio.org/mailman/listinfo/open-bio-l > >>> > >> _______________________________________________ > >> Open-Bio-l mailing list > >> Open-Bio-l at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/open-bio-l > > > > > > _______________________________________________ > Open-Bio-l mailing list > Open-Bio-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/open-bio-l From asidhu at biomap.org Fri Jun 4 14:57:36 2010 From: asidhu at biomap.org (Amandeep Sidhu) Date: Fri, 4 Jun 2010 22:57:36 +0800 Subject: [Open-bio-l] CFP: 23rd IEEE International Symposium on Computer-Based Medical Systems 2010 (IEEE CBMS 2010) Message-ID: IEEE CBMS 2010 23rd IEEE International Symposium on Computer-Based Medical Systems 2010 Perth, Australia, 12-15 October 2010 http://www.cbms2010.curtin.edu.au/ The 23rd IEEE International Symposium on Computer-Based Medical Systems (CBMS 2010) is intended to provide an international forum for discussing the latest results in the field of computational medicine. 
The scientific program of CBMS 2010 will consist of invited keynote talks given by leading scientists in the field, and regular and special track sessions that cover a broad array of issues which relate computing to medicine. RELEVANT TOPICS Network and Telemedicine Systems Medical Databases & Information Systems Computer-Aided Diagnosis Medical Devices with Embedded Computers Bioinformatics in Medicine Software Systems in Medicine Pervasive Health Systems and Services Web-based Delivery of Medical Information Medical Image Segmentation & Compression Content Analysis of Biomedical Image Data Knowledge-Based & Decision Support Systems Hand-held Computing Applications in Medicine Knowledge Discovery & Data Mining Signal and Image Processing in Medicine Multimedia Biomedical Databases CBMS 2010 invites original previously unpublished contributions that are not submitted concurrently to a journal or another conference. Many of the above listed topics are represented by corresponding Special Tracks, while others are solely covered by the general CBMS track. Prospective authors are expected to submit their contributions to one of the corresponding Special Tracks or to the general track if none of the special tracks is relevant. SPECIAL TRACKS ST1: Computational Proteomics and Genomics ST2: Knowledge Discovery and Decision Systems in Biomedicine ST3: Ontologies for Biomedical Systems ST4: HealthGrid & Cloud Computing ST5: Technology Enhanced Learning in Medical Education ST6: Intelligent Patient Management ST7: Data Streams in Healthcare ST8: Supporting Collaboration among Healthcare Workers ST9: Telemedicine ST10: Computer-Based Systems for Mental Health ST11: Image Informatics in Biomedical Research and Clinical Medicine ST12: e-Health SUBMISSION GUIDELINES Papers should be submitted electronically using EasyChair online submission system. The papers must be prepared following the IEEE two-column format and should not exceed the length of 6 (six) Letter-sized pages. 
LaTeX or Microsoft Word templates can be used when preparing the papers. Please, note that only PDF format of submissions is allowed. Submission web site: http://www.easychair.org/conferences/?conf=cbms2010 All submissions will be peer-reviewed by at least three reviewers. The proceedings will be published by the IEEE Computer Society Press. At least one of the authors of accepted papers is required to register and present the work at the conference; otherwise their papers will be removed from the digital library after the conference. IMPORTANT DATES Submission deadline for regular papers: 24 June 2010 Deadline for tutorial submission: 24 June 2010 Notification of acceptation for papers and tutorials: 2 Aug 2010 Final camera ready due: 2 Sep 2010 Author registration: 2 Sep 2010 INTENDED AUDIENCE Engineers, scientists, clinicians and managers involved in medical computing projects are encouraged to submit papers to the symposium and/or attend the symposium. The symposium provides its attendees with an opportunity to experience state-of-the-art research and development in a variety of topics directly and indirectly related to their own work. In addition to research papers, keynote speakers and tutorial sessions it provides participants with an opportunity to come up-to-date on important technological issues. The symposium encourages the participation of students engaged in research/development in computer-based medical systems. 
Organizing Committee GENERAL CHAIRS Tharam Dillon, Curtin University of Technology, Australia Daniel Rubin, National Center for Biomedical Ontologies, USA William Gallagher, University College Dublin, Ireland PROGRAM CHAIRS Amandeep Sidhu, Curtin University of Technology, Australia Alexey Tsymbal, Siemens, Germany PUBLICATION CHAIRS Mykola Pechenizkiy, Eindhoven University of Technology, Netherlands Tony Hu, Drexel University, USA SPECIAL TRACK CHAIRS Maja Hadzic, Curtin University of Technology, Australia Jake Chen, Indiana University, USA TUTORIAL CHAIRS Phoebe Chen, La Trobe University, Australia Xiaofang Zhou, University of Queensland, Australia PUBLICITY CHAIRS Carolyn McGregor, University of Ontario Institute of Technology, Canada Meifania Chen, Curtin University of Technology, Australia From biopython at maubp.freeserve.co.uk Mon Jun 7 17:56:07 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 7 Jun 2010 18:56:07 +0100 Subject: [Open-bio-l] Common SQLite3 schema for flat file indexing (a new OBDA standard)? In-Reply-To: References: Message-ID: On Tue, Apr 13, 2010 at 11:26 AM, Peter wrote: > Hello all, > > Last year we had a brief disucssion about the Open Biological > Database Access (OBDA) indexing for "flat files" which BioPerl and > BioRuby at least still support (despite some confusion over the spec): > http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html > > There may still be life in the current Berkeley DB (DBD) based OBDA > index, but with ever larger sequences files needed in next generation > sequencing this could be a problem. Is anyone finding problems with > the current BDB index scaling to larger files (with tens of millions of > entries to index)? > > From the Biopython perspective, we have a small external incentive > to favour SQLite3 over BDB: The python standard library has > historically included a BerkleyDB module (bsddb) but it has been > deprecated in Python 3. 
On the other hand, all recent versions of > Python include SQLite3 support. > > Those of you on the BioPerl or Biopython mailing lists will have heard > me mention the idea of using SQLite to hold a flat file index, e.g. > http://lists.open-bio.org/pipermail/bioperl-l/2010-April/032713.html > http://lists.open-bio.org/pipermail/biopython/2009-December/005997.html > > From the BioPerl thread I know they are now looking at using SQLite to hold > a lookup table of file offsets. Are any of the other Bio* projects interested > in this approach? I'd idealy like us to agree something shared with all the > Bio* libraries (a new OBDA standard using SQLite3 instead of BDB). I was > thinking something along these lines if we want to support an index for > multiple files - just three tables: > > * meta - table with string key/values (in particular to hold a schema version > number, plus perhaps the tool which built the index) > > * offsets - table with entry accessions, file number (key to next table), > file offset > > * files - table with filenames, file type (e.g. FASTA), datestamp > (so we can spot if the index is older than the file and needs to be > updated), perhaps other things like if the file is compressed (gzip, > bz2, ...). > > Of course, there are complications. For instance, calculating the offsets > when dealing with different file encodings and new lines. Mark Schreiber > raised this as a concern with Java (see open-bio-l thread linked to above, > email dated 2 Sept 2009). The new line issue could also affect Biopython, > but this may not be a real issue in practise unless moving indexes between > operating systems. > > Regards, > > Peter > (@Biopython) Hi all, We've been discussing this again on the Biopython mailing list, and the plan to store offsets in an SQLite3 database seems quite popular. In the short term I'm just aiming for indexing single files, but it does seem likely that many people would find multi-file indexing useful. 
What do the other Bio* projects think? Should we try to co-ordinate a specification for a common SQLite3 file indexing schema? Regards, Peter From cjfields at illinois.edu Mon Jun 7 19:08:27 2010 From: cjfields at illinois.edu (Chris Fields) Date: Mon, 7 Jun 2010 14:08:27 -0500 Subject: [Open-bio-l] Common SQLite3 schema for flat file indexing (a new OBDA standard)? In-Reply-To: References: Message-ID: <119CA958-B7B7-4ACA-B8C1-04616648B91A@illinois.edu> On Jun 7, 2010, at 12:56 PM, Peter wrote: > On Tue, Apr 13, 2010 at 11:26 AM, Peter wrote: >> Hello all, >> >> Last year we had a brief disucssion about the Open Biological >> Database Access (OBDA) indexing for "flat files" which BioPerl and >> BioRuby at least still support (despite some confusion over the spec): >> http://lists.open-bio.org/pipermail/open-bio-l/2009-August/000561.html >> >> There may still be life in the current Berkeley DB (DBD) based OBDA >> index, but with ever larger sequences files needed in next generation >> sequencing this could be a problem. Is anyone finding problems with >> the current BDB index scaling to larger files (with tens of millions of >> entries to index)? >> >> From the Biopython perspective, we have a small external incentive >> to favour SQLite3 over BDB: The python standard library has >> historically included a BerkleyDB module (bsddb) but it has been >> deprecated in Python 3. On the other hand, all recent versions of >> Python include SQLite3 support. >> >> Those of you on the BioPerl or Biopython mailing lists will have heard >> me mention the idea of using SQLite to hold a flat file index, e.g. >> http://lists.open-bio.org/pipermail/bioperl-l/2010-April/032713.html >> http://lists.open-bio.org/pipermail/biopython/2009-December/005997.html >> >> From the BioPerl thread I know they are now looking at using SQLite to hold >> a lookup table of file offsets. Are any of the other Bio* projects interested >> in this approach? 
>> I'd ideally like us to agree on something shared with all the
>> Bio* libraries (a new OBDA standard using SQLite3 instead of BDB). I was
>> thinking something along these lines if we want to support an index for
>> multiple files - just three tables:
>>
>> * meta - table with string key/values (in particular to hold a schema version
>> number, plus perhaps the tool which built the index)
>>
>> * offsets - table with entry accessions, file number (key to next table),
>> file offset
>>
>> * files - table with filenames, file type (e.g. FASTA), datestamp
>> (so we can spot if the index is older than the file and needs to be
>> updated), perhaps other things like if the file is compressed (gzip,
>> bz2, ...).
>>
>> Of course, there are complications. For instance, calculating the offsets
>> when dealing with different file encodings and new lines. Mark Schreiber
>> raised this as a concern with Java (see open-bio-l thread linked to above,
>> email dated 2 Sept 2009). The new line issue could also affect Biopython,
>> but this may not be a real issue in practice unless moving indexes between
>> operating systems.
>>
>> Regards,
>>
>> Peter
>> (@Biopython)
>
> Hi all,
>
> We've been discussing this again on the Biopython mailing list, and
> the plan to store offsets in an SQLite3 database seems quite popular.
> In the short term I'm just aiming for indexing single files, but it does
> seem likely that many people would find multi-file indexing useful.
> What do the other Bio* projects think? Should we try to co-ordinate
> a specification for a common SQLite3 file indexing schema?
>
> Regards,
>
> Peter

We typically implement multifile indexing in bioperl (either via a
directory or a list of files). Not much point in limiting it to one file.

Have you looked at the OBDA standard? It is a good start along these
lines, but I think it dwindled a bit. Might be worth reworking and
modernizing it to suit our needs.
chris

From biopython at maubp.freeserve.co.uk Mon Jun 7 20:11:52 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 7 Jun 2010 21:11:52 +0100
Subject: [Open-bio-l] Common SQLite3 schema for flat file indexing (a new OBDA standard)?
In-Reply-To: <119CA958-B7B7-4ACA-B8C1-04616648B91A@illinois.edu>
References: <119CA958-B7B7-4ACA-B8C1-04616648B91A@illinois.edu>
Message-ID:

On Mon, Jun 7, 2010 at 8:08 PM, Chris Fields wrote:
>
> We typically implement multifile indexing in bioperl (either via a
> directory or a list of files). Not much point in limiting it to one file.
>

Well, I guess as always it depends on what you're doing - I've
found single file indexes very handy on several occasions.

> Have you looked at the OBDA standard? It is a good start along
> these lines, but I think it dwindled a bit. Might be worth reworking
> and modernizing it to suit our needs.

I have looked at the OBDA documentation (I might even try and
rewrite support for it in Biopython if anyone was interested), and
an update using SQLite3 seems a sensible approach to try.

Peter

From cjfields at illinois.edu Mon Jun 7 20:17:09 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 7 Jun 2010 15:17:09 -0500
Subject: [Open-bio-l] Common SQLite3 schema for flat file indexing (a new OBDA standard)?
In-Reply-To:
References: <119CA958-B7B7-4ACA-B8C1-04616648B91A@illinois.edu>
Message-ID: <7C15C5C7-B72A-4DE2-AF13-EAB657074A95@illinois.edu>

On Jun 7, 2010, at 3:11 PM, Peter wrote:

> On Mon, Jun 7, 2010 at 8:08 PM, Chris Fields wrote:
>>
>> We typically implement multifile indexing in bioperl (either via a
>> directory or a list of files). Not much point in limiting it to one file.
>>
>
> Well, I guess as always it depends on what you're doing - I've
> found single file indexes very handy on several occasions.

I think it depends on how complex the indexing is.

>> Have you looked at the OBDA standard? It is a good start along
>> these lines, but I think it dwindled a bit.
>> Might be worth reworking
>> and modernizing it to suit our needs.
>
> I have looked at the OBDA documentation (I might even try and
> rewrite support for it in Biopython if anyone was interested), and
> an update using SQLite3 seems a sensible approach to try.
>
> Peter

It would be nice to have a standard for this that works cross-Bio*
(and possibly beyond). That was the original intent. Maybe OBDA v2
if needed?

chris

From biopython at maubp.freeserve.co.uk Tue Jun 8 09:43:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 8 Jun 2010 10:43:46 +0100
Subject: [Open-bio-l] Common SQLite3 schema for flat file indexing (a new OBDA standard)?
In-Reply-To: <7C15C5C7-B72A-4DE2-AF13-EAB657074A95@illinois.edu>
References: <119CA958-B7B7-4ACA-B8C1-04616648B91A@illinois.edu> <7C15C5C7-B72A-4DE2-AF13-EAB657074A95@illinois.edu>
Message-ID:

On Mon, Jun 7, 2010 at 9:17 PM, Chris Fields wrote:
>
> It would be nice to have a standard for this that works cross-Bio*
> (and possibly beyond). That was the original intent. Maybe OBDA
> v2 if needed?
>

Yes, that was what I was suggesting. With the Biopython code I'm
almost in a position to start benchmarking different on-disk indexes
(SQLite isn't necessarily the best option); supporting indexing multiple
files at once (which would be needed to restore parity with OBDA v1)
is more work.

Peter
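
For illustration only, the three-table layout proposed in this thread
(meta, offsets, files) could be sketched with Python's standard sqlite3
module roughly as follows. Nothing here is an agreed OBDA schema: the
column names, the 'schema_version' value, and the helper names
create_index, index_fasta and fetch_record are all hypothetical. Files
are opened in binary mode so that the stored offsets are raw byte
positions, which is one possible way to sidestep the new line
translation concern raised above.

```python
import os
import sqlite3


def create_index(db_path):
    """Create the three tables; all column names are assumptions."""
    con = sqlite3.connect(db_path)
    con.executescript(
        """
        CREATE TABLE IF NOT EXISTS meta (
            meta_key TEXT PRIMARY KEY,
            meta_value TEXT
        );
        CREATE TABLE IF NOT EXISTS files (
            file_number INTEGER PRIMARY KEY,
            filename TEXT UNIQUE,
            file_type TEXT,
            mtime REAL
        );
        CREATE TABLE IF NOT EXISTS offsets (
            accession TEXT PRIMARY KEY,
            file_number INTEGER REFERENCES files(file_number),
            file_offset INTEGER
        );
        """
    )
    # Hypothetical version string so readers can detect schema changes.
    con.execute(
        "INSERT OR REPLACE INTO meta VALUES ('schema_version', '0.1')"
    )
    con.commit()
    return con


def index_fasta(con, filename):
    """Store the byte offset of every '>' header line in one FASTA file."""
    cur = con.execute(
        "INSERT INTO files (filename, file_type, mtime) VALUES (?, ?, ?)",
        (filename, "FASTA", os.path.getmtime(filename)),
    )
    file_number = cur.lastrowid
    rows = []
    with open(filename, "rb") as handle:  # binary: no newline translation
        offset = handle.tell()
        line = handle.readline()
        while line:
            if line.startswith(b">"):
                parts = line[1:].split(None, 1)
                if parts:  # skip headers with no identifier
                    rows.append((parts[0].decode("ascii"), file_number, offset))
            offset = handle.tell()
            line = handle.readline()
    con.executemany("INSERT INTO offsets VALUES (?, ?, ?)", rows)
    con.commit()


def fetch_record(con, accession):
    """Return the raw bytes of one record, or None if it is not indexed."""
    row = con.execute(
        "SELECT filename, file_offset FROM offsets "
        "JOIN files USING (file_number) WHERE accession = ?",
        (accession,),
    ).fetchone()
    if row is None:
        return None
    filename, offset = row
    with open(filename, "rb") as handle:
        handle.seek(offset)
        lines = [handle.readline()]  # the header line itself
        for line in iter(handle.readline, b""):
            if line.startswith(b">"):  # next record starts here
                break
            lines.append(line)
    return b"".join(lines)
```

Calling create_index once and then index_fasta for each file gives a
multi-file index; the stored mtime could be compared with
os.path.getmtime to spot an index that is older than its file. Whether
SQLite is the best on-disk format is left open, as in the thread.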