From p.j.a.cock at googlemail.com Thu Nov 3 14:52:50 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 3 Nov 2011 18:52:50 +0000 Subject: [Open-bio-l] OBDA redux? Message-ID: On Thu, Nov 3, 2011 at 6:28 PM, Fields, Christopher J wrote: > (side thread, so re-titling...) > And CC'ing open-bio-l, which is a better home for this than bioperl-l, where OBDA v2 talk came up again in discussion of a BioPerl indexing problem. Archive links for thread here: http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035807.html http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035808.html http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035811.html http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035812.html http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035813.html http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035822.html > On Nov 1, 2011, at 1:06 PM, Peter Cock wrote: >> >> Yes, we're using SQLite3 to store essentially a list of filenames >> and their format as one table, and then in the main table an >> entry for each sequence recording the ID (only one accession, >> unlike OBDA which had infrastructure for a secondary accession), >> file number, offset of the start of the record, and optionally the >> length of the record on disk. >> >> i.e. Basically what OBDA does, but using SQLite rather >> than BDB (not included in Python 3) or a flat file index >> (poor performance with large datasets). 
>> >> I find this design attractive on several levels: >> * File format neutral, covers FASTA, FASTQ, GenBank, etc >> * Preserves the original file untouched >> * Index is a small single file (thanks to SQLite) >> * Back end could be switched out >> * Could be applied to compressed file formats >> * Reuses existing parsing code to access entries >> >> This could easily form basis of OBDA v2, the main points >> of difference I anticipate between the Bio* projects would >> be naming conventions for the different file formats, and >> what we consider to be the default record ID of each read >> (e.g. which field in a GenBank file - although agreement >> here is not essential). Some of that was already settled in >> principle with OBDA v1. > > The primary/secondary IDs could be configurable with a sane > default, I think the bioperl implementations allowed this (and > it is certainly something that will be requested). One reason I went with a single ID only was to keep the Python dictionary based API simple (think hash in Perl). You don't get secondary keys in a Python dict or a hash ;) As a nod to flexibility, in Biopython's Bio.SeqIO indexing you can provide a call back function to map the suggested ID to something else. Obviously this doesn't give the full flexibility of extracting a field from the record's annotation because we don't parse the whole record during indexing (it would be too slow). However, I'm happy for there to be an *optional* secondary key in an OBDA v2 SQLite schema, but Biopython probably won't populate it. We could optionally use it rather than the primary ID on loading an existing index though. Personally I would stick with one key in the index - it should be faster and makes it simpler to switch the back end if we need to later. If anyone wants a second key, they can build a second index *grin*. >> On the other hand, you could try and store the parsed data >> itself, which is where NOSQL looks more interesting. 
That >> essentially requires the ability to serialise your annotated >> sequence object model to disk - which would be tricky to do >> cross project (much more ambitious than BioSQL is). It also >> means the "index" becomes very large because it now holds >> all the original data. >> >> Peter > > For a fully cross-Bio* compliant format, I don't think it's feasible > to use serialized data unless they are serialized in something > that is easily deserialized across HLLs (JSON, BSON, YAML, > XML, etc). Either that, or such data is stored concurrently with > the binary blob, along with meta data that indicates the source > of the blob, parser, version, etc, etc (unless there are tools out > there that reliably interconvert serialized complex data structures > between HLLs). Anyway you go about it, it seems like it could > be a major ball of hurt, unless implemented very carefully. You missed out RDF as a serialisation ;) But yes, going down the shared serialisation route is going to be messy - as you are well aware: > Aside: I think this was one of the problems with > Bio::DB::SeqFeature::Store, in that it at one point stored > Perl-specific Storable blobs. > > chris Peter From cjfields at illinois.edu Thu Nov 3 15:47:51 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 3 Nov 2011 19:47:51 +0000 Subject: [Open-bio-l] OBDA redux? In-Reply-To: References: Message-ID: On Nov 3, 2011, at 1:52 PM, Peter Cock wrote: > On Thu, Nov 3, 2011 at 6:28 PM, Fields, Christopher J > wrote: >> (side thread, so re-titling...) >> > And CC'ing open-bio-l, which is a better home for this than bioperl-l, > where OBDA v2 talk came up again in discussion of a BioPerl indexing > problem. 
Archive links for thread here: > > http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035807.html > http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035808.html > http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035811.html > http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035812.html > http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035813.html > http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035822.html yes, good idea... >> On Nov 1, 2011, at 1:06 PM, Peter Cock wrote: >>> >>> Yes, we're using SQLite3 to store essentially a list of filenames >>> and their format as one table, and then in the main table an >>> entry for each sequence recording the ID (only one accession, >>> unlike OBDA which had infrastructure for a secondary accession), >>> file number, offset of the start of the record, and optionally the >>> length of the record on disk. >>> >>> i.e. Basically what OBDA does, but using SQLite rather >>> than BDB (not included in Python 3) or a flat file index >>> (poor performance with large datasets). >>> >>> I find this design attractive on several levels: >>> * File format neutral, covers FASTA, FASTQ, GenBank, etc >>> * Preserves the original file untouched >>> * Index is a small single file (thanks to SQLite) >>> * Back end could be switched out >>> * Could be applied to compressed file formats >>> * Reuses existing parsing code to access entries >>> >>> This could easily form basis of OBDA v2, the main points >>> of difference I anticipate between the Bio* projects would >>> be naming conventions for the different file formats, and >>> what we consider to be the default record ID of each read >>> (e.g. which field in a GenBank file - although agreement >>> here is not essential). Some of that was already settled in >>> principle with OBDA v1. 
>> >> The primary/secondary IDs could be configurable with a sane >> default, I think the bioperl implementations allowed this (and >> it is certainly something that will be requested). > > One reason I went with a single ID only was to keep the > Python dictionary based API simple (think hash in Perl). > You don't get secondary keys in a Python dict or a hash ;) > > As a nod to flexibility, in Biopython's Bio.SeqIO indexing you > can provide a call back function to map the suggested ID to > something else. Obviously this doesn't give the full flexibility > of extracting a field from the record's annotation because we > don't parse the whole record during indexing (it would be too > slow). Same with bioperl. > However, I'm happy for there to be an *optional* secondary > key in an OBDA v2 SQLite schema, but Biopython probably > won't populate it. We could optionally use it rather than the > primary ID on loading an existing index though. Optional implementation of that is fine by me. > Personally I would stick with one key in the index - it should > be faster and makes it simpler to switch the back end if we > need to later. If anyone wants a second key, they can build > a second index *grin*. That's easy enough. >>> On the other hand, you could try and store the parsed data >>> itself, which is where NOSQL looks more interesting. That >>> essentially requires the ability to serialise your annotated >>> sequence object model to disk - which would be tricky to do >>> cross project (much more ambitious than BioSQL is). It also >>> means the "index" becomes very large because it now holds >>> all the original data. >>> >>> Peter >> >> For a fully cross-Bio* compliant format, I don't think it's feasible >> to use serialized data unless they are serialized in something >> that is easily deserialized across HLLs (JSON, BSON, YAML, >> XML, etc). 
Either that, or such data is stored concurrently with >> the binary blob, along with meta data that indicates the source >> of the blob, parser, version, etc, etc (unless there are tools out >> there that reliably interconvert serialized complex data structures >> between HLLs). Anyway you go about it, it seems like it could >> be a major ball of hurt, unless implemented very carefully. > > You missed out RDF as a serialisation ;) > > But yes, going down the shared serialisation route is going > to be messy - as you are well aware: > >> Aside: I think this was one of the problems with >> Bio::DB::SeqFeature::Store, in that it at one point stored >> Perl-specific Storable blobs. >> >> chris > > Peter yes, it's a problem w/o an easy solution. Anyway, I think an implementation of such at this point would be a premature optimization. chris From p.j.a.cock at googlemail.com Sun Nov 13 07:24:35 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 13 Nov 2011 12:24:35 +0000 Subject: [Open-bio-l] OBDA redux? In-Reply-To: References:

Message-ID: On Thu, Nov 3, 2011 at 7:47 PM, Fields, Christopher J wrote: > On Nov 3, 2011, at 1:52 PM, Peter Cock wrote: > >> On Thu, Nov 3, 2011 at 6:28 PM, Fields, Christopher J >> wrote: >>> (side thread, so re-titling...) >>> >> And CC'ing open-bio-l, which is a better home for this than bioperl-l, >> where OBDA v2 talk came up again in discussion of a BioPerl indexing >> problem. Archive links for thread here: >> >> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035807.html >> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035808.html >> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035811.html >> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035812.html >> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035813.html >> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035822.html > > yes, good idea... I've not CC'd the bioperl-l anymore. >>> On Nov 1, 2011, at 1:06 PM, Peter Cock wrote: >>>> >>>> Yes, we're using SQLite3 to store essentially a list of filenames >>>> and their format as one table, and then in the main table an >>>> entry for each sequence recording the ID (only one accession, >>>> unlike OBDA which had infrastructure for a secondary accession), >>>> file number, offset of the start of the record, and optionally the >>>> length of the record on disk. >>>> >>>> i.e. Basically what OBDA does, but using SQLite rather >>>> than BDB (not included in Python 3) or a flat file index >>>> (poor performance with large datasets). 
>>>> >>>> I find this design attractive on several levels: >>>> * File format neutral, covers FASTA, FASTQ, GenBank, etc >>>> * Preserves the original file untouched >>>> * Index is a small single file (thanks to SQLite) >>>> * Back end could be switched out >>>> * Could be applied to compressed file formats >>>> * Reuses existing parsing code to access entries >>>> >>>> This could easily form basis of OBDA v2, the main points >>>> of difference I anticipate between the Bio* projects would >>>> be naming conventions for the different file formats, and >>>> what we consider to be the default record ID of each read >>>> (e.g. which field in a GenBank file - although agreement >>>> here is not essential). Some of that was already settled in >>>> principle with OBDA v1. >>> >>> The primary/secondary IDs could be configurable with a sane >>> default, I think the bioperl implementations allowed this (and >>> it is certainly something that will be requested). >> >> One reason I went with a single ID only was to keep the >> Python dictionary based API simple (think hash in Perl). >> You don't get secondary keys in a Python dict or a hash ;) >> >> As a nod to flexibility, in Biopython's Bio.SeqIO indexing you >> can provide a call back function to map the suggested ID to >> something else. Obviously this doesn't give the full flexibility >> of extracting a field from the record's annotation because we >> don't parse the whole record during indexing (it would be too >> slow). > > Same with bioperl. > >> However, I'm happy for there to be an *optional* secondary >> key in an OBDA v2 SQLite schema, but Biopython probably >> won't populate it. We could optionally use it rather than the >> primary ID on loading an existing index though. > > Optional implementation of that is fine by me. > >> Personally I would stick with one key in the index - it should >> be faster and makes it simpler to switch the back end if we >> need to later. 
If anyone wants a second key, they can build
>> a second index *grin*.
>
> That's easy enough.
>
>>>> On the other hand, you could try and store the parsed data
>>>> itself, which is where NOSQL looks more interesting. That
>>>> essentially requires the ability to serialise your annotated
>>>> sequence object model to disk - which would be tricky to do
>>>> cross project (much more ambitious than BioSQL is). It also
>>>> means the "index" becomes very large because it now holds
>>>> all the original data.
>>>>
>>>> Peter
>>>
>>> For a fully cross-Bio* compliant format, I don't think it's feasible
>>> to use serialized data unless they are serialized in something
>>> that is easily deserialized across HLLs (JSON, BSON, YAML,
>>> XML, etc). Either that, or such data is stored concurrently with
>>> the binary blob, along with meta data that indicates the source
>>> of the blob, parser, version, etc, etc (unless there are tools out
>>> there that reliably interconvert serialized complex data structures
>>> between HLLs). Anyway you go about it, it seems like it could
>>> be a major ball of hurt, unless implemented very carefully.
>>
>> You missed out RDF as a serialisation ;)
>>
>> But yes, going down the shared serialisation route is going
>> to be messy - as you are well aware:
>>
>>> Aside: I think this was one of the problems with
>>> Bio::DB::SeqFeature::Store, in that it at one point stored
>>> Perl-specific Storable blobs.
>>>
>>> chris
>>
>> Peter
>
> yes, it's a problem w/o an easy solution. Anyway, I think an
> implementation of such at this point would be a premature
> optimization.
>
> chris

So, Chris and I seem in general agreement that an OBDA v2 using SQLite but based on essentially the same approach as the BDB or flat file based OBDA v1 is a good idea, i.e. tables mapping record identifiers to file offsets in the original sequence files.

I hope to get BioRuby on board; they already have OBDA v1 support, so that shouldn't be too hard.
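For illustration only, the two-table layout described in this thread (a list of files plus one row per record holding the ID, file number, start offset and optional length) might look something like the sketch below. The table names, column names and the helper functions `index_fasta` and `fetch_raw` are made up for this example; this is not the actual Biopython schema or any agreed OBDA v2 layout, and it assumes well-formed FASTA input.

```python
import os
import sqlite3
import tempfile

# Hypothetical schema: names are illustrative, not the Biopython/OBDA ones.
SCHEMA = """
CREATE TABLE file_data (
    file_number INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    format TEXT NOT NULL
);
CREATE TABLE offset_data (
    record_id TEXT PRIMARY KEY,
    file_number INTEGER NOT NULL,
    start_offset INTEGER NOT NULL,
    record_length INTEGER
);
"""


def index_fasta(con, path, file_number=0):
    """Record the byte offset and length of each FASTA record in path."""
    con.execute(
        "INSERT INTO file_data VALUES (?, ?, ?)", (file_number, path, "fasta")
    )
    rows = []
    record_id = None
    start = 0
    offset = 0
    with open(path, "rb") as handle:
        for line in handle:
            if line.startswith(b">"):
                if record_id is not None:
                    rows.append((record_id, file_number, start, offset - start))
                # Default ID: first word after ">" (assumes it is present).
                record_id = line[1:].split(None, 1)[0].decode("ascii")
                start = offset
            offset += len(line)
    if record_id is not None:
        rows.append((record_id, file_number, start, offset - start))
    con.executemany("INSERT INTO offset_data VALUES (?, ?, ?, ?)", rows)
    con.commit()


def fetch_raw(con, record_id):
    """Return the raw bytes of one record; the original file is untouched."""
    row = con.execute(
        "SELECT name, start_offset, record_length FROM offset_data "
        "JOIN file_data USING (file_number) WHERE record_id = ?",
        (record_id,),
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    name, start, length = row
    with open(name, "rb") as handle:
        handle.seek(start)
        return handle.read(length)


with tempfile.TemporaryDirectory() as tmp:
    fasta = os.path.join(tmp, "example.fasta")
    with open(fasta, "wb") as out:
        out.write(b">alpha first\nACGT\nAC\n>beta\nGGCC\n")
    con = sqlite3.connect(":memory:")
    con.executescript(SCHEMA)
    index_fasta(con, fasta)
    raw_alpha = fetch_raw(con, "alpha")
    raw_beta = fetch_raw(con, "beta")
    con.close()

print(raw_beta)
```

The raw bytes returned by `fetch_raw` would then be handed to whatever parser each project already has, which is the "reuses existing parsing code" point made earlier in the thread.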
Right now I don't recall if BioJava has/had OBDA v1 support, and if they did, whether it was affected by their recent move to BioJava v3 (I understand from their mailing list that some older lower priority functionality has not all been ported yet).

Also EMBOSS are likely to be interested; certainly Peter Rice was interested in the SQLite indexing we're already using in Biopython for sequence files (i.e. what is effectively the prototype for OBDA v2).

Note that in addition to simple indexing of text files, we are already using the same simple offset + length approach for indexing binary files (e.g. SFF).

On the immediate practical side, I think I can edit the current OBDA website of http://obda.open-bio.org/ via /home/websites/obda.open-bio.org/html on the server.

We need to work out where the current OBDA indexing specification lives (CVS or SVN?) and perhaps move that to GitHub. We may need a general OBF organisation account on GitHub for this and any other cross-project repositories.

I see there is already an OBDA project on RedMine (Chris, can you add me to that please?)
https://redmine.open-bio.org/projects/obda

Peter

From p.j.a.cock at googlemail.com Sun Nov 13 07:30:37 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 13 Nov 2011 12:30:37 +0000
Subject: [Open-bio-l] OBDA redux? Compressed files
Message-ID:

Hi again,

I've retitled this as it is a little off topic from the main OBDA redux thread,
http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000819.html
http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000820.html
http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000821.html

As far as I recall, the original flat file and BDB based OBDA specification for indexing sequencing files didn't cover compressed files. That might be something to consider (although we should sort out uncompressed text/binary files first).

I've recently been experimenting with using compressed files - in particular simple GZIP files (ignoring any block structure) and BGZF (the specialised gzipped blocking used in BAM), see:

http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
http://seqanswers.com/forums/showthread.php?t=15347

The virtual offset approach used in BGZF squeezes a 16 bit within block offset (thus limiting you to 64KB blocks) and a 48 bit block start offset (thus limiting you to a 256TB file) into a single 64 bit "virtual" offset. That makes sense if you are keeping a lookup table of many offsets in memory, and can be used as is with code expecting a single offset (like the current Biopython SQLite index schema).

There is also bzip2, which is block based too, with the block size ranging from 100KB to 900KB.

http://bzip.org/
http://bzip.org/1.0.5/bzip2-manual-1.0.5.html

I haven't tried any performance tests yet, which would be interesting as I believe compression/decompression with bzip2 is more costly in CPU terms than gzip (although both will be block size dependent).

If we wanted to imitate the BGZF virtual offset scheme for arbitrary BZIP2 files, an alternative 64 bit virtual offset scheme could use 20 bits to cover bz2 blocks of up to 900KB, leaving 64 - 20 = 44 bits for the start offset, thus limiting you to just 2^44 bytes or 16TB, which sounds OK only in the medium term. On the bright side this could be used to index any BZIP2 file (under 16TB), whereas the BGZF scheme cannot be applied to an arbitrary GZIP file.

On the other hand, storing the block start and within block offset separately is truly generic and could be used on any blocked GZIP file (including BGZF) and BZIP2 etc. It would make the SQLite schema a bit more complicated though.

Maybe something to consider for the next revision to OBDA, and focus on the non-compressed case for now?
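
The bit arithmetic above can be checked with a few lines of Python. This is only a sketch of the packing idea: `pack_offset` and `unpack_offset` are hypothetical helper names, and the 20 + 44 bit split is the speculative bzip2 variant floated above, not an existing format.

```python
def pack_offset(block_start, within_block, within_bits=16):
    """Pack two offsets into one 64 bit integer.

    within_bits=16 gives the BGZF layout (48 bit block start, 16 bit
    within-block offset); other values are hypothetical variants.
    """
    start_bits = 64 - within_bits
    if not 0 <= within_block < (1 << within_bits):
        raise ValueError("within-block offset needs more than %i bits" % within_bits)
    if not 0 <= block_start < (1 << start_bits):
        raise ValueError("block start needs more than %i bits" % start_bits)
    return (block_start << within_bits) | within_block


def unpack_offset(virtual_offset, within_bits=16):
    """Split a packed offset back into (block_start, within_block)."""
    return virtual_offset >> within_bits, virtual_offset & ((1 << within_bits) - 1)


# BGZF style: 48 bit block start + 16 bit within-block offset.
voffset = pack_offset(100000, 10)
print(voffset, unpack_offset(voffset))

# Limits implied by each split (sizes in powers of two).
print("16 bits cover blocks of", 2**16, "bytes")
print("48 bits cover files of", 2**48 // 2**40, "TiB")
print("20 bits cover a 900KB block:", 2**20 >= 900 * 1000)
print("44 bits cover files of", 2**44 // 2**40, "TiB")
```

One thing to watch if such values are stored in SQLite: its INTEGER type is a signed 64 bit value, so packed offsets with the top bit set (block starts of 2^47 bytes or more in the 16 + 48 layout) would need special handling.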

Regards,

Peter

From p.j.a.cock at googlemail.com Sun Nov 13 07:32:12 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 13 Nov 2011 12:32:12 +0000
Subject: [Open-bio-l] OBDA redux? Compressed files
In-Reply-To:
References:
Message-ID:

On Sun, Nov 13, 2011 at 12:30 PM, Peter Cock wrote:
> Hi again,
>
> I've retitled this as it is a little off topic from the main OBDA redux thread,
> http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000819.html
> http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000820.html
> http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000821.html
>
> As far as I recall, the original flat file and BDB based OBDA
> specification for indexing sequencing files didn't cover
> compressed files. That might be something to consider
> (although we should sort of uncompressed text/binary
> files first).

Sorry - didn't mean to include bioperl-l on that, although it may be of interest to you guys anyway.

Peter

From cjfields at illinois.edu Mon Nov 14 12:59:35 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Mon, 14 Nov 2011 17:59:35 +0000
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
References:

Message-ID: <12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu>

On Nov 13, 2011, at 6:24 AM, Peter Cock wrote:

> So, Chris and I seem in general agreement that an OBDA v2
> using SQLite but based on essentially the same approach as
> the BDB or flat file based OBDA v1 is a good idea. i.e. Tables
> mapping record identifiers to file offsets in the original sequence
> files.

The worry I have is adhering to a specific backend (e.g. SQLite). The reason I say this is b/c BDB in its time was the go-to way of storing simple index data, but that is no longer feasible for very large data sets. Who's to say something similar won't happen to SQLite, or that it is the best option available?

Maybe we should focus on the data storage schema, as simple as it may be, then indicate the default backend must be SQLite but others are allowed (maybe with a mention that SQLite can be replaced by alternatives in the future if needed).

> I hope to get BioRuby on board, they already have an OBDA
> v1 support so that shouldn't be too hard.
>
> Right now I don't recall if BioJava has/had OBDA v1 support,
> and if they did if it was affected in their recent move to BioJava
> v3 (I understand from their mailing list that some older lower
> priority functionality has not all been ported yet).

I wouldn't be surprised at that; OBDA kind of lingered for a while, and I'm not sure how widely adopted it became (maybe others can shed light on that?)

> Also EMBOSS are likely to be interested, certainly Peter Rice
> was interested in the SQLite indexing we're already using in
> Biopython for sequence files (i.e. what is effectively the
> prototype for OBDA v2).
>
> Note that in addition to simple indexing of text files, we are
> already using the same simple offset + length approach for
> indexing binary files (e.g. SFF).

I think that's the general idea; that is how all bioperl data was indexed, previously with the Bio::Index modules and with the OBDA implementations as well.
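
On the back end worry raised above, one way to picture "specify the stored data, default to SQLite, allow alternatives" is a tiny interface with interchangeable implementations. Everything in this sketch (`Entry`, `IndexBackend`, `DictBackend`, `SqliteBackend`, the `entries` table) is hypothetical and only meant to show the shape of the idea, not a proposed API for any Bio* project.

```python
import sqlite3
from typing import NamedTuple, Optional, Protocol


class Entry(NamedTuple):
    """What the index stores per record, whatever the back end."""
    file_number: int
    offset: int
    length: Optional[int]


class IndexBackend(Protocol):
    """Minimal contract a back end would need to satisfy (illustrative)."""
    def put(self, record_id: str, entry: Entry) -> None: ...
    def get(self, record_id: str) -> Entry: ...


class DictBackend:
    """In-memory stand-in for a non-SQL key/value store."""
    def __init__(self):
        self._data = {}

    def put(self, record_id, entry):
        self._data[record_id] = entry

    def get(self, record_id):
        return self._data[record_id]


class SqliteBackend:
    """The same contract on top of SQLite."""
    def __init__(self, path=":memory:"):
        self._con = sqlite3.connect(path)
        self._con.execute(
            "CREATE TABLE IF NOT EXISTS entries ("
            "record_id TEXT PRIMARY KEY, file_number INTEGER, "
            "start_offset INTEGER, record_length INTEGER)"
        )

    def put(self, record_id, entry):
        self._con.execute(
            "INSERT INTO entries VALUES (?, ?, ?, ?)", (record_id, *entry)
        )
        self._con.commit()

    def get(self, record_id):
        row = self._con.execute(
            "SELECT file_number, start_offset, record_length "
            "FROM entries WHERE record_id = ?",
            (record_id,),
        ).fetchone()
        if row is None:
            raise KeyError(record_id)
        return Entry(*row)


# Calling code only sees put/get, so the back end can be swapped.
results = []
for backend in (DictBackend(), SqliteBackend()):
    backend.put("alpha", Entry(0, 0, 21))
    backend.put("beta", Entry(0, 21, None))
    results.append((backend.get("alpha"), backend.get("beta")))
print(results[0] == results[1])
```

The design point is that a specification could fix the `Entry` fields and their meaning, and treat the SQL table layout as just the default encoding of them.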
> On the immediate practical side, I think I can edit the > current OBDA website of http://obda.open-bio.org/ > via /home/websites/obda.open-bio.org/html on the > server. See below w/ regards to my thoughts on the wiki. > We need to work out where the current OBDA indexing > specification lives (CVS or SVN?) and perhaps move > that to github. We may need a general OBF organisation > account on git hub for this and any other cross-project > repositories. +1 to a move to github, but maybe this belongs in an OBF-specific organization. And maybe we should take advantage of the simple wiki or project homepage that GitHub offers and move everything (docs) there. > I see there is already an OBDA project on RedMine, > (Chris can you add me to that please?) > https://redmine.open-bio.org/projects/obda > > Peter Done (last night actually, but I didn't have time to respond immediately). chris From p.j.a.cock at googlemail.com Mon Nov 14 13:14:18 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 14 Nov 2011 18:14:18 +0000 Subject: [Open-bio-l] OBDA redux? In-Reply-To: <12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> References:

<12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu>
Message-ID:

Hi Chris,

[Did you mean to CC BioPerl-l? Should I have?]

On Mon, Nov 14, 2011 at 5:59 PM, Fields, Christopher J wrote:
> On Nov 13, 2011, at 6:24 AM, Peter Cock wrote:
>
>> So, Chris and I seem in general agreement that an OBDA v2
>> using SQLite but based on essentially the same approach as
>> the BDB or flat file based OBDA v1 is a good idea. i.e. Tables
>> mapping record identifiers to file offsets in the original sequence
>> files.
>
> The worry I have is adhering to a specific backend (e.g. SQLite).
> The reason I say this is b/c BDB in it's time was the go-to way
> of storing simple index data, but that is no longer feasible for
> very large data sets. Who's to say something similar won't
> happen to SQLite, or that it is the best option available?

Right now I would think SQLite is one of the best options (if not the best). If supporting the old back ends is important for cross-project compatibility, I'm willing to have another go at using BDB in Biopython, but I had limited success last time I tried.

> Maybe we should focus on the data storage schema, as
> simple as it may be, then indicate the default backend
> must be SQLite but others are allowed (maybe with a
> mention that SQLite can be replaced by alternatives in
> the future if needed).

It would make sense to talk about an SQL schema if the "other options" would also be SQL based. But they might not be... Certainly we should keep potential alternative back ends in mind.

>> I hope to get BioRuby on board, they already have an OBDA
>> v1 support so that shouldn't be too hard.
>>
>> Right now I don't recall if BioJava has/had OBDA v1 support,
>> and if they did if it was affected in their recent move to BioJava
>> v3 (I understand from their mailing list that some older lower
>> priority functionality has not all been ported yet).
>
> I wouldn't be surprised at that, OBDA kind of lingered for a
> while, and I'm not sure how widely adopted it became
> (maybe others can shed light on that?)

Well, OBDA went beyond just indexing flat files - it also tried to standardise things like remote database access. I don't think we ever really had that side working in Biopython, so I am less familiar with it. I know EMBOSS has something fairly extensive for online databases, but have not checked if it uses the OBDA style or their own.

For now I was only planning to tackle indexing sequence files in this "OBDA redux".

>> Also EMBOSS are likely to be interested, certainly Peter Rice
>> was interested in the SQLite indexing we're already using in
>> Biopython for sequence files (i.e. what is effectively the
>> prototype for OBDA v2).
>>
>> Note that in addition to simple indexing of text files, we are
>> already using the same simple offset + length approach for
>> indexing binary files (e.g. SFF).
>
> I think that's the general idea, that is how all bioperl data
> was indexed, before with the Bio::Index modules and with
> the OBDA implementations as well.

Good.

>> On the immediate practical side, I think I can edit the
>> current OBDA website of http://obda.open-bio.org/
>> via /home/websites/obda.open-bio.org/html on the
>> server.
>
> See below w/ regards to my thoughts on the wiki.
>
>> We need to work out where the current OBDA indexing
>> specification lives (CVS or SVN?) and perhaps move
>> that to github. We may need a general OBF organisation
>> account on git hub for this and any other cross-project
>> repositories.
>
> +1 to a move to github, but maybe this belongs in an
> OBF-specific organization.

Yes, definitely under an OBF GitHub account (not under Biopython, BioPerl, etc).

> And maybe we should take advantage of the simple
> wiki or project homepage that GitHub offers and move
> everything (docs) there.

That could work.
We'd have to go through all the old documentation and relocate it, then we could make the obda.open-bio.org domain point at the github pages. >> I see there is already an OBDA project on RedMine, >> (Chris can you add me to that please?) >> https://redmine.open-bio.org/projects/obda >> >> Peter > > Done (last night actually, but I didn't have time to respond > immediately). > > chris Thanks, Peter From cjfields at illinois.edu Mon Nov 14 13:47:10 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 14 Nov 2011 18:47:10 +0000 Subject: [Open-bio-l] OBDA redux? In-Reply-To: References:

<12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> Message-ID: <5BE2A95E-103D-4FEE-B8C4-3754CCF67507@illinois.edu> On Nov 14, 2011, at 12:14 PM, Peter Cock wrote: > Hi Chris, > > [Did you mean to CC BioPerl-l? Should I have?] > > On Mon, Nov 14, 2011 at 5:59 PM, Fields, Christopher J > wrote: >> On Nov 13, 2011, at 6:24 AM, Peter Cock wrote: >> >>> So, Chris and I seem in general agreement that an OBDA v2 >>> using SQLite but based on essentially the same approach as >>> the BDB or flat file based OBDA v1 is a good idea. i.e. Tables >>> mapping record identifiers to file offsets in the original sequence >>> files. >> >> The worry I have is adhering to a specific backend (e.g. SQLite). >> The reason I say this is b/c BDB in it's time was the go-to way >> of storing simple index data, but that is no longer feasible for >> very large data sets. Who's to say something similar won't >> happen to SQLite, or that it is the best option available? > > Right now I would think SQLite is one of the best (if not the > best) option. If supporting the old back ends is important for > cross-project compatibility, I'm willing to have another go > at using BDB in Biopython, but had limited success last > time I tried. No, I agree re: SQLite at the moment, it's probably the best option (fast, widely adopted, etc), though Jason mentioned (Tokyo|Kyoto)Cabinet also worked very well. I would rather not paint ourselves into a corner if the 'nice-and-shiny' next thing down the road performs better and gains wide adoption. >> Maybe we should focus on the data storage schema, as >> simple as it may be, then indicate the default backend >> must be SQLite but others are allowed (maybe with a >> mention that SQLite can be replaced by alternatives in >> the future if needed). > > It would make sense to talk about an SQL schema if > the "other options" would also be SQL based. But they > might not be... but certainly we should keep potential > alternative back ends in mind. 
It's probably necessary to allow for both possibilities (SQL and other). For instance, a move to SQLite will necessitate describing the table data with SQL anyway. >>> I hope to get BioRuby on board, they already have an OBDA >>> v1 support so that shouldn't be too hard. >>> >>> Right now I don't recall if BioJava has/had OBDA v1 support, >>> and if they did if it was affected in their recent move to BioJava >>> v3 (I understand from their mailing list that some older lower >>> priority functionality has not all been ported yet). >> >> I wouldn't be surprised at that, OBDA kind of lingered for a >> while, and I'm not sure how widely adopted it became >> (maybe others can shed light on that?) > > Well, OBDA went beyond just indexing flat files - it also > tried to standard things like remote database access. > I don't think we every really had that side working in > Biopython, so I am less familiar with it. I know EMBOSS > has something fairly extensive for online databases, > but have not checked if it uses the OBDA style or their > own. Right, but I wonder if that may have been one problem with the original OBDA specification, that it was perhaps overly ambitious out-the-gate. > For now I was only planning to tackle indexing sequence > files in this "OBDA redux". That's a good and simpler start; the rest (remote access) fall in naturally once that is in place. >>> Also EMBOSS are likely to be interested, certainly Peter Rice >>> was interested in the SQLite indexing we're already using in >>> Biopython for sequence files (i.e. what is effectively the >>> prototype for OBDA v2). >>> >>> Note that in addition to simple indexing of text files, we are >>> already using the same simple offset + length approach for >>> indexing binary files (e.g. SFF). >> >> I think that's the general idea, that is how all bioperl data >> was indexed, before with the Bio::Index modules and with >> the OBDA implementations as well. > > Good. 
> >>> On the immediate practical side, I think I can edit the >>> current OBDA website of http://obda.open-bio.org/ >>> via /home/websites/obda.open-bio.org/html on the >>> server. >> >> See below w/ regards to my thoughts on the wiki. >> >>> We need to work out where the current OBDA indexing >>> specification lives (CVS or SVN?) and perhaps move >>> that to github. We may need a general OBF organisation >>> account on git hub for this and any other cross-project >>> repositories. >> >> +1 to a move to github, but maybe this belongs in an >> OBF-specific organization. > > Yes, definitely under an OBF github account (not under > Biopython, BioPerl, etc). > >> And maybe we should take advantage of the simple >> wiki or project homepage that GitHub offers and move >> everything (docs) there. > > That could work. We'd have to go through all the old > documentation and relocate it, then we could make the > obda.open-bio.org domain point at the github pages. Yes, I think that's the idea. >>> I see there is already an OBDA project on RedMine, >>> (Chris can you add me to that please?) >>> https://redmine.open-bio.org/projects/obda >>> >>> Peter >> >> Done (last night actually, but I didn't have time to respond >> immediately). >> >> chris > > Thanks, > > Peter np. -c From p.j.a.cock at googlemail.com Mon Nov 14 18:01:03 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 14 Nov 2011 23:01:03 +0000 Subject: [Open-bio-l] OBDA redux? 
Compressed files
In-Reply-To:
References:
Message-ID:

On Sun, Nov 13, 2011 at 12:30 PM, Peter Cock wrote:
>
> I've recently been experimenting with using compressed
> files - in particular simple GZIP files (ignoring any block structure)
> and BGZF (the specialised gzipped blocking used in BAM), see:
>
> http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
> http://seqanswers.com/forums/showthread.php?t=15347
>
> The virtual offset approach used in BGZF squeezes a 16 bit
> within block offset (thus limiting you to 64KB blocks) and a
> 48 bit block start offset (thus limiting you to a 256TB file) into
> a single 64 bit "virtual" offset. That makes sense if you are
> keeping the lookup table or many offsets in memory, and
> can be used as is with code expecting a single offset (like
> the current Biopython SQLite index schema).
>
> Also bzip2 ... is block based, with the block size ranging
> from 100KB to 900KB.
>
> http://bzip.org/
> http://bzip.org/1.0.5/bzip2-manual-1.0.5.html
>

A point of clarification, having found the Wikipedia page
http://en.wikipedia.org/wiki/Bzip2 to be very informative: those block
sizes (100KB to 900KB) are measured after bzip2's initial run-length
encoding step rather than on the raw input, which means that after
decompression a 900KB block can in some cases reach about 46MB.
Clearly that means the BGZF virtual offset approach cannot be applied
to an arbitrary bzip2 file (much like it can't be applied to an
arbitrary gzip file) without imposing some a priori limit on the
decompressed size of each block.

> On the other hand, storing the block start and within block
> offsets separately is truly generic and could be used on any blocked
> GZIP file (including BGZF) and BZIP2 etc. It would make
> the SQLite schema a bit more complicated though.

So on reflection, if we want to index any blocked compressed file
format, such as GZIP (including BGZF) and BZIP2, then two offsets do
seem to be required (the block offset, and the data offset within the
block after decompression).
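As a rough illustration of the BGZF-style packing described above, here is
a minimal sketch of combining a 48 bit block start offset and a 16 bit
within-block offset into a single 64 bit "virtual" offset. The function
and constant names are hypothetical, chosen for this example only, and are
not taken from Biopython or samtools.

```python
# Hypothetical helper names; only the 48/16 bit split comes from the text above.
WITHIN_BLOCK_BITS = 16  # within-block offset, so at most 64KB per block
BLOCK_START_BITS = 48   # block start offset, so at most a 256TB file


def make_virtual_offset(block_start, within_block):
    """Pack a block start and a within-block offset into one integer."""
    if not 0 <= block_start < (1 << BLOCK_START_BITS):
        raise ValueError("block start does not fit in 48 bits")
    if not 0 <= within_block < (1 << WITHIN_BLOCK_BITS):
        raise ValueError("within-block offset does not fit in 16 bits")
    # High 48 bits hold the block start, low 16 bits the within-block offset.
    return (block_start << WITHIN_BLOCK_BITS) | within_block


def split_virtual_offset(virtual_offset):
    """Recover (block_start, within_block) from a packed virtual offset."""
    block_start = virtual_offset >> WITHIN_BLOCK_BITS
    within_block = virtual_offset & ((1 << WITHIN_BLOCK_BITS) - 1)
    return block_start, within_block
```

The range check on the within-block offset is exactly where an arbitrary
bzip2 or gzip block would fail, which is why the fully generic alternative
is to store the two offsets as separate columns.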
Peter

From jason.stajich at gmail.com Wed Nov 16 15:19:14 2011
From: jason.stajich at gmail.com (Jason Stajich)
Date: Wed, 16 Nov 2011 15:19:14 -0500
Subject: [Open-bio-l] OBDA redux?
In-Reply-To: <5BE2A95E-103D-4FEE-B8C4-3754CCF67507@illinois.edu>
References:

<12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> <5BE2A95E-103D-4FEE-B8C4-3754CCF67507@illinois.edu>
Message-ID:

Not to overly advocate for the NOSQL as I think for our purposes the jury
is still out. So I think it is worth benchmarking - NOSQL and SQL-based
systems will have different overheads.

I know when I have tried to store 100M -> 500M records in SQLite the
performance degrades whereas I was able to store that range of keys in
NOSQL db without problem.

I don't know if there is a generic API for the NOSQL systems which would
help for standardization.

Jason Stajich
jason at bioperl.org

From k at bioruby.org Wed Nov 16 19:00:50 2011
From: k at bioruby.org (Toshiaki Katayama)
Date: Thu, 17 Nov 2011 09:00:50 +0900
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
References:

<12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> <5BE2A95E-103D-4FEE-B8C4-3754CCF67507@illinois.edu>
Message-ID: <175D6437-4DD6-42FA-A6D6-84BDCFB74E83@bioruby.org>

Hi Jason,

I was not actively following this thread but have one comment:

> I don't know if there is a generic API for the NOSQL systems which would
> help for standardization.

To my knowledge, RDF/SPARQL is the only standardized format/protocol among
the NoSQL stores. Unfortunately, its performance and scalability are not
yet comparable to the widely used key-value stores (e.g. Tokyo Cabinet).
However, the Semantic Web may have the potential to become a standard for
storing heterogeneous data sets as an integrated biological DB without
designing any schema (we need ontologies instead).

Cheers,
Toshiaki Katayama

From cjfields at illinois.edu Thu Nov 17 09:13:40 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 17 Nov 2011 14:13:40 +0000
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
References:

<12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> <5BE2A95E-103D-4FEE-B8C4-3754CCF67507@illinois.edu>
Message-ID: <735E98B4-B8EB-4735-9C26-7A6E077F606C@illinois.edu>

On Nov 16, 2011, at 2:19 PM, Jason Stajich wrote:

> Not to overly advocate for the NOSQL as I think for our purposes the jury is still out. So I think it is worth benchmarking - NOSQL and SQL-based systems will have different overheads.
>
> I know when I have tried to store 100M -> 500M records in SQLite the performance degrades whereas I was able to store that range of keys in NOSQL db without problem.

+1. This will only get worse: with the projections for upcoming HiSeq
upgrades, it is possible that 1-2 channel runs would hit that limit.

> I don't know if there is a generic API for the NOSQL systems which would help for standardization.
>
> Jason Stajich
> jason at bioperl.org

Not that I know of, though one could probably come up with an AnyDBM
wrapper for those, if the modules aren't already compliant with that API.
Tokyo/KyotoCabinet have Perl, Python, Ruby, and Java interfaces with the
distribution.

chris

From p.j.a.cock at googlemail.com Thu Nov 17 09:39:49 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 17 Nov 2011 14:39:49 +0000
Subject: [Open-bio-l] OBDA redux?
In-Reply-To: <735E98B4-B8EB-4735-9C26-7A6E077F606C@illinois.edu>
References:

<12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> <5BE2A95E-103D-4FEE-B8C4-3754CCF67507@illinois.edu> <735E98B4-B8EB-4735-9C26-7A6E077F606C@illinois.edu>
Message-ID:

On Thu, Nov 17, 2011 at 2:13 PM, Fields, Christopher J wrote:
> On Nov 16, 2011, at 2:19 PM, Jason Stajich wrote:
>
>> Not to overly advocate for the NOSQL as I think for our purposes the jury
>> is still out. So I think it is worth benchmarking - NOSQL and SQL-based
>> systems will have different overheads.
>>
>> I know when I have tried to store 100M -> 500M records in SQLite the
>> performance degrades whereas I was able to store that range of keys
>> in NOSQL db without problem.
>
> +1. This will only get worse, with the projections for upcoming HiSeq
> upgrades, it is possible 1-2 channel runs would hit that limit.

That's a useful scale to aim to cover in profiling then, 100M to 500M
records. Jason, do you have any more details about the slowdown
you found with SQLite? For this use case we want to write the index
once, and read it many times. I found it is quicker to populate the
offset table before creating the index - perhaps you were seeing the
index being updated while adding records?

Peter

From pjotr.public41 at thebird.nl Thu Nov 17 12:11:44 2011
From: pjotr.public41 at thebird.nl (Pjotr Prins)
Date: Thu, 17 Nov 2011 18:11:44 +0100
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
References:

<12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> <5BE2A95E-103D-4FEE-B8C4-3754CCF67507@illinois.edu> <735E98B4-B8EB-4735-9C26-7A6E077F606C@illinois.edu>
Message-ID: <20111117171144.GA17301@thebird.nl>

On Thu, Nov 17, 2011 at 02:39:49PM +0000, Peter Cock wrote:
> > +1. This will only get worse, with the projections for upcoming HiSeq
> > upgrades, it is possible 1-2 channel runs would hit that limit.
>
> That's a useful scale to aim to cover in profiling then, 100M to 500M
> records. Jason, do you have any more details about the slowdown
> you found with SQLite? For this use case we want to write the index
> once, and read it many times. I found it is quicker to populate the
> offset table before creating the index - perhaps you were seeing the
> index being updated while adding records?

I have also found that hammering SQLite quickly deteriorates
performance. Rather too quickly in fact. Don't forget that SQL is
inherently slower than 'simple' indexers. Also SQLite is a convenience
library, rather than a library designed for optimized performance. We
used to run sleepycat/bdb for that reason, now it is Tokyo/Kyoto
cabinet.

In the (rather) near future we will be looking at parallel feeds from
multiple machines, to keep it somewhat interesting. Hadoop has
indexing support. In fact, Hadoop should be ideal for indexed sequence
information, though I have not used it. Still, when the time comes, I
am kinda interested in parallelized NoSQL solutions for scaling up.
Hadoop kills me because of its complexity. I hate complexity (one
reason I have tried to avoid SQL servers).

BTW 500M records takes significant RAM for an in-memory index. Quite a
number of solutions, to retain their performance, have to have the
indexes in memory. 500M records now, will grow to 500G records soon.
Just a thing to keep in mind. I would opt for a non-RAM solution.

Pj.
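The write-once, read-many pattern discussed in this thread (populate the
offset table first, and only then create the index on the record ID) can
be sketched with Python's standard sqlite3 module. This is only an
illustration under stated assumptions: the table, column and function
names below are hypothetical and are not claimed to match the schema
Biopython actually uses.

```python
import sqlite3


def build_index(db_path, records):
    """Bulk load (key, file_number, offset, length) rows, then index them.

    The table and column names are hypothetical, for illustration only.
    """
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE offset_data "
        "(key TEXT, file_number INTEGER, offset INTEGER, length INTEGER)"
    )
    # Load every row in one transaction while the table has no index yet,
    # so SQLite does not have to update an index on each insert.
    with con:
        con.executemany("INSERT INTO offset_data VALUES (?, ?, ?, ?)", records)
    # Build the index once, after all the rows are in place.
    with con:
        con.execute("CREATE UNIQUE INDEX key_index ON offset_data (key)")
    return con


def lookup(con, key):
    """Return (file_number, offset, length) for one record ID."""
    row = con.execute(
        "SELECT file_number, offset, length FROM offset_data WHERE key = ?",
        (key,),
    ).fetchone()
    if row is None:
        raise KeyError(key)
    return row
```

Whether this ordering explains the slowdown reported at 100M to 500M
records would still need benchmarking on real data; the sketch only shows
the ordering being described.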
From bonnal at ingm.org Fri Nov 18 04:35:56 2011
From: bonnal at ingm.org (Raoul Bonnal)
Date: Fri, 18 Nov 2011 10:35:56 +0100
Subject: [Open-bio-l] OBDA redux?
In-Reply-To: <20111117171144.GA17301@thebird.nl>
Message-ID:

Dear all,
Would it be possible to have a test dataset and clear requirements and
functionalities? Not a huge doc, just a few points for benchmarking.

From p.j.a.cock at googlemail.com Fri Nov 18 05:20:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 18 Nov 2011 10:20:54 +0000
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
References: <20111117171144.GA17301@thebird.nl>
Message-ID:

On Fri, Nov 18, 2011 at 9:35 AM, Raoul Bonnal wrote:
> Dear all,
> Would it be possible to have a test dataset and clear requirements and
> functionalities? Not a huge doc, just a few points for benchmarking.

I was thinking of using the UniProt SProt and TrEMBL datasets
as test cases (FASTA, plain text "swiss", and UniProt-XML format).
These have 532,792 and 17,651,715 records respectively (in the version
I have on disk - they've just released an update), which is a good
size, but not at the scale where we might start to worry about
SQLite scaling.
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/

So, we'd also want something else like some big FASTQ files with
100M -> 500M records (or more). Perhaps we'll have to combine a
couple of SRA data files together for that, which is fine.

Also a full GenBank download would be good, e.g. the EST dataset
files gbest1.seq.gz to gbest209.seq.gz would make a good test of
indexing multiple files together as a single database:
ftp://ftp.ncbi.nih.gov/genbank/

Peter

From bonnal at ingm.org Fri Nov 18 05:55:48 2011
From: bonnal at ingm.org (Raoul Bonnal)
Date: Fri, 18 Nov 2011 11:55:48 +0100
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
Message-ID:

It's a starting point.

And what information do you want to extract once you have your index?

--
Ra

From p.j.a.cock at googlemail.com Fri Nov 18 06:21:04 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 18 Nov 2011 11:21:04 +0000
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
References:
Message-ID:

On Fri, Nov 18, 2011 at 10:55 AM, Raoul Bonnal wrote:
> It's a starting point.
>
> And what information do you want to extract once you
> have your index?

Biopython and BioPerl have their SeqIO parsers hooked up
to indexing code. This means you can access a record via its
ID, and it is parsed for you on demand - just like if you'd
iterated over the file in order parsing the records one by one.

Biopython (not sure about BioPerl) can also just fetch the raw
text of that record.

I presume BioRuby has something similar using the OBDA
flatfile / BDB indexes?

Peter

From cjfields at illinois.edu Fri Nov 18 08:45:14 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Fri, 18 Nov 2011 13:45:14 +0000
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
References:

Message-ID: On Nov 18, 2011, at 5:21 AM, Peter Cock wrote: > On Fri, Nov 18, 2011 at 10:55 AM, Raoul Bonnal wrote: >> ... >> And which are the information you want to extract once you >> have your index ? >> > > Biopython and BioPerl have their SeqIO parsers hooked up > to indexing code. This means you can access a record via its > ID, and it is parsed for you on demand - just like if you'd > iterated over the file in order parsing the records one by one. > > Biopython (not sure about BioPerl) can also just fetch the raw > text of that record. Re: BioPerl, I'm not sure about the OBDA implementations, but I know the older Bio::Index modules allow this. I would be surprised if the OBDA-specific code didn't, but adding this should be easy. chris From p.j.a.cock at googlemail.com Wed Nov 30 05:41:37 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 30 Nov 2011 10:41:37 +0000 Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) Message-ID: On Wed, Nov 30, 2011 at 10:30 AM, Peter Rice wrote: > On 11/29/2011 07:09 PM, Fields, Christopher J wrote: >> >> On Nov 29, 2011, at 12:35 PM, Peter Cock wrote: >>> >>> Doesn't BioPerl just use the Staden libraries for this internally? >> >> Yes, and it uses an old version as well (via bioperl-ext). Much of this >> effort was to go into the biolib initiative for creating cross-lang bindings >> using swig, but that seems to be silent at the moment. I'm surprised Python >> doesn't have io_lib bindings. > > BioLib is just swig wrappers around the existing Bio* interfaces and > code, so it will not help in this case if the projects are too divergent. > > Could we set up a Bio* collection of data formats with examples and > note which projects can handle each one? > > We do not need any one project to cover everything - we can reasonably > expect users to use some other project to interconvert formats if there are > gaps. > > regards, > > Peter Rice > EMBOSS Team Good plan.
I suggest we make a repository on github, perhaps bio-data or something like that, under the recently created OBF account, https://github.com/OBF Peter R - do you have a GitHub account yet? If so we (me, Chris Field, etc) can give you access to the OBF org account. For licensing, where we are free to choose the licence, I would like to go with something as liberal as possible to allow the files to be used by any OSS project (or closed source project), (e.g. Public Domain, CC0, MIT/BSD) rather than something more principled but restricted like CC-BY or CC-BY-ND. However, as we know from recent Debian packaging discussion about test cases taken from UniProt, licensing and copyright of samples from a database is complicated. Here we must at least keep careful records about where data came from. Peter From pmr at ebi.ac.uk Wed Nov 30 06:04:49 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 30 Nov 2011 11:04:49 +0000 Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: References: Message-ID: <4ED60DD1.8080005@ebi.ac.uk> On 11/30/2011 10:41 AM, Peter Cock wrote: > On Wed, Nov 30, 2011 at 10:30 AM, Peter Rice wrote: >> >> BioLib is just swig wrappers around the existing Bio* interfaces and >> code, so it will not help in this case if the projects are too divergent. >> >> Could we set up a Bio* collection of data formats with examples and >> note which projects can handle each one? >> >> We do not need any one project to cover everything - we can reasonably >> expect users to use some other project to interconvert formats if there are >> gaps. > > Good plan. I suggest we make a repository on github, perhaps > bio-data or something like that, under the recently created OBF > account, https://github.com/OBF > > Peter R - do you have a GitHub account yet? If so we (me, > Chris Field, etc) can give you access to the OBF org account. No ... rather a pain that EMBOSS got used. 
I've register under some other name: EMBOSSTEAM and created an EMBOSS project under it. Looks like git import requires subversion for any automation. Preumably I need a fresh EMBOSS checkout from CVS and then commit everything by hand ... best done after the release 6.5.0 code freeze. > For licensing, where we are free to choose the licence, I would > like to go with something as liberal as possible to allow the > files to be used by any OSS project (or closed source project), > (e.g. Public Domain, CC0, MIT/BSD) rather than something > more principled but restricted like CC-BY or CC-BY-ND. Public domain would be my choice - we don't want to cause conflicts if any data is imported into other projects (e.g. as test cases) > However, as we know from recent Debian packaging > discussion about test cases taken from UniProt, licensing > and copyright of samples from a database is complicated. > Here we must at least keep careful records about where > data came from. For that reason we probably should fake all the files for the public database formats. regards, Peter Rice EMBOSS team From p.j.a.cock at googlemail.com Wed Nov 30 06:14:44 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 30 Nov 2011 11:14:44 +0000 Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: <4ED60DD1.8080005@ebi.ac.uk> References: <4ED60DD1.8080005@ebi.ac.uk> Message-ID: On Wed, Nov 30, 2011 at 11:04 AM, Peter Rice wrote: > On 11/30/2011 10:41 AM, Peter Cock wrote: >> >> On Wed, Nov 30, 2011 at 10:30 AM, Peter Rice ?wrote: >>> >>> BioLib is just swig wrappers around the existing Bio* interfaces and >>> code, so it will not help in this case if the projects are too divergent. >>> >>> Could we set up a Bio* collection of data formats with examples and >>> note which projects can handle each one? 
>>> >>> We do not need any one project to cover everything - we can reasonably >>> expect users to use some other project to interconvert formats if there >>> are >>> gaps. >> >> Good plan. I suggest we make a repository on github, perhaps >> bio-data or something like that, under the recently created OBF >> account, https://github.com/OBF >> >> Peter R - do you have a GitHub account yet? If so we (me, >> Chris Field, etc) can give you access to the OBF org account. > > No ... rather a pain that EMBOSS got used. I've register under some other > name: EMBOSSTEAM and created an EMBOSS project under it. > > Looks like git import requires subversion for any automation. > Preumably I need a fresh EMBOSS checkout from CVS and > then commit everything by hand ... best done after the release > 6.5.0 code freeze. If you are talking about converting the EMBOSS CVS into git, we can help with that having done it for Biopython. As part of this it is possible to map CVS user names to github users. I meant do you personally have a github account? >> For licensing, where we are free to choose the licence, I would >> like to go with something as liberal as possible to allow the >> files to be used by any OSS project (or closed source project), >> (e.g. Public Domain, CC0, MIT/BSD) rather than something >> more principled but restricted like CC-BY or CC-BY-ND. > > Public domain would be my choice - we don't want to cause > conflicts if any data is imported into other projects (e.g. as > test cases) Yes, public domain would be simplest where possible. >> However, as we know from recent Debian packaging >> discussion about test cases taken from UniProt, licensing >> and copyright of samples from a database is complicated. >> Here we must at least keep careful records about where >> data came from. > > For that reason we probably should fake all the files for the > public database formats. Yes, a practical solution - although it has downsides of course. 
Peter From pmr at ebi.ac.uk Wed Nov 30 06:38:30 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 30 Nov 2011 11:38:30 +0000 Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: <20111130113250.GA32452@thebird.nl> References: <20111130113250.GA32452@thebird.nl> Message-ID: <4ED615B6.1080002@ebi.ac.uk> On 11/30/2011 11:32 AM, Pjotr Prins wrote: > Git is not very good for storing large data files, which we would want > to fetch partially. My suggestion would be to have a plain old file > repo, e.g. on S3, which can be mirrored by others. We had issues with large files in the EMBOSS release, and make those available via rsync to add to the developers CVS checkout. They include the NCBI taxonomy source and index files and the ontology source and index files. The next EMBOSS release will include http and ftp URLs as valid inputs for any data type, so EMBOSS could use remote files for format tests. I' look into how other repositories could be added. I had to add some extra qualifiers to allow queries and offsets to be specified, and rewrote the query language parsing to merge very similar code segments. regards, Peter Rice EMBOSS Team From pjotr.public41 at thebird.nl Wed Nov 30 06:32:50 2011 From: pjotr.public41 at thebird.nl (Pjotr Prins) Date: Wed, 30 Nov 2011 12:32:50 +0100 Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: References: Message-ID: <20111130113250.GA32452@thebird.nl> On Wed, Nov 30, 2011 at 10:41:37AM +0000, Peter Cock wrote: > > BioLib is just swig wrappers around the existing Bio* interfaces and > > code, so it will not help in this case if the projects are too divergent. It is a bit more than that. Mostly biolib is a multi-platform build system. Code-wise, most libraries are not immediately suitable for wrapping (SWIG of FFI), including EMBOSS, so adapters are required. 
I wrote an example for EMBOSS/transeq, which outperforms all other Bio* implementations (published in upcoming Springer book). BioLib also does automated document generation (parsing SWIG XML) and testing. The current BioLib went into maintenance mode, after my visit to Chris Fields. I see BioLib v1 as a proof-of-concept mostly, at this point, though I use it, and I know of others. A new high performance library is in the works - but these things move slowly. > Good plan. I suggest we make a repository on github, perhaps > bio-data or something like that, under the recently created OBF > account, https://github.com/OBF Git is not very good for storing large data files, which we would want to fetch partially. My suggestion would be to have a plain old file repo, e.g. on S3, which can be mirrored by others. Pj. From p.j.a.cock at googlemail.com Wed Nov 30 06:42:22 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 30 Nov 2011 11:42:22 +0000 Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: <4ED615B6.1080002@ebi.ac.uk> References: <20111130113250.GA32452@thebird.nl> <4ED615B6.1080002@ebi.ac.uk> Message-ID: On Wed, Nov 30, 2011 at 11:38 AM, Peter Rice wrote: > On 11/30/2011 11:32 AM, Pjotr Prins wrote: > >> Git is not very good for storing large data files, which we would want >> to fetch partially. My suggestion would be to have a plain old file >> repo, e.g. on S3, which can be mirrored by others. > > We had issues with large files in the EMBOSS release, and make those > available via rsync to add to the developers CVS checkout. They include the > NCBI taxonomy source and index files and the ontology source and index > files. > > The next EMBOSS release will include http and ftp URLs as valid inputs for > any data type, so EMBOSS could use remote files for format tests. I' look > into how other repositories could be added. 
> > I had to add some extra qualifiers to allow queries and offsets to be > specified, and rewrote the query language parsing to merge very similar code > segments. > > regards, > > Peter Rice > EMBOSS Team How about an OBF hosted FTP site then if we want big data? I guess we'd mostly be adding files, and changes/deletions should be rare, so a full version tracking repository isn't essential if we are disciplined about updating README files or more formal meta data. Peter From pjotr.public41 at thebird.nl Wed Nov 30 06:45:04 2011 From: pjotr.public41 at thebird.nl (Pjotr Prins) Date: Wed, 30 Nov 2011 12:45:04 +0100 Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: References: <20111130113250.GA32452@thebird.nl> <4ED615B6.1080002@ebi.ac.uk> Message-ID: <20111130114504.GA1542@thebird.nl> On Wed, Nov 30, 2011 at 11:42:22AM +0000, Peter Cock wrote: > How about an OBF hosted FTP site then if we want big data? Yes :) > I guess we'd mostly be adding files, and changes/deletions > should be rare, so a full version tracking repository isn't > essential if we are disciplined about updating README files > or more formal meta data. We can still have the readme's and MD5s mirrored in a small repo. That would track changes/moving/renaming. Pj. From p.j.a.cock at googlemail.com Wed Nov 30 06:58:06 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 30 Nov 2011 11:58:06 +0000 Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: <20111130114504.GA1542@thebird.nl> References: <20111130113250.GA32452@thebird.nl> <4ED615B6.1080002@ebi.ac.uk> <20111130114504.GA1542@thebird.nl> Message-ID: On Wed, Nov 30, 2011 at 11:45 AM, Pjotr Prins wrote: > On Wed, Nov 30, 2011 at 11:42:22AM +0000, Peter Cock wrote: >> How about an OBF hosted FTP site then if we want big data? 
> > Yes :) > >> I guess we'd mostly be adding files, and changes/deletions >> should be rare, so a full version tracking repository isn't >> essential if we are disciplined about updating README files >> or more formal meta data. > > We can still have the readme's and MD5s mirrored in a small repo. That > would track changes/moving/renaming. > > Pj. True, or even a hybrid where small files also live in a git repo, but for larger files we just store the URL and MD5? Peter From p.j.a.cock at googlemail.com Wed Nov 30 09:49:35 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 30 Nov 2011 14:49:35 +0000 Subject: [Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk> References: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk> Message-ID: I just checked with Jon and he was happy to forward this back to the list, and also added a couple of URLs that I'd asked about: http://bioportal.bioontology.org/ontologies/44600 http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM Peter On Wed, Nov 30, 2011 at 11:14 AM, Jon Ison wrote: > Hi Peter (and Peter) > > Just a quick note to say that all (well, nearly all) common bioinformatics data formats are > catalogued in the EDAM ontology: > > http://sourceforge.net/projects/edamontology/files > http://edamontology.sourceforge.net/ > > OK - there's bound to be some we've missed :) > > Anyhow, I thought it might help to structure any effort to document data formats (an effort which > I wholeheartedly approve of by the way). One thing I'd like to add to the EDAM "format" > definitions is a link to the format specification, or failing that, an example.
> > Cheers both > > Jon > From cjfields at illinois.edu Wed Nov 30 11:53:41 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 30 Nov 2011 16:53:41 +0000 Subject: [Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: References: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk> Message-ID: That might be the best source to pull from. Does it archive old file examples (such as older SwissProt/GenBank/EMBL)? chris On Nov 30, 2011, at 8:49 AM, Peter Cock wrote: > I just checked with Jon and he was happy to forward this back to > the list, and also added a couple of URLs that I'd asked about: > > http://bioportal.bioontology.org/ontologies/44600 > http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM > > Peter > > On Wed, Nov 30, 2011 at 11:14 AM, Jon Ison wrote: >> Hi Peter (and Peter) >> >> Just a quick note to say that all (well, nearly all) common bioinformatics data formats are >> catalogued in the EDAM ontology: >> >> http://sourceforge.net/projects/edamontology/files >> http://edamontology.sourceforge.net/ >> >> OK - there's bound to be some we've missed :) >> >> Anyhow, I thought it might help to structure any effort to document data formats (an effort which >> I wholeheartedly approve of by the way). One thing I'd like to add to the EDAM "format" >> definitions is a link to the format specification, or failing that, an example. 
>> >> Cheers both >> >> Jon >> > > _______________________________________________ > Open-Bio-l mailing list > Open-Bio-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/open-bio-l From cjfields at illinois.edu Wed Nov 30 11:54:52 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 30 Nov 2011 16:54:52 +0000 Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: References: <20111130113250.GA32452@thebird.nl> <4ED615B6.1080002@ebi.ac.uk> <20111130114504.GA1542@thebird.nl> Message-ID: <5C116AE6-F2B1-4635-9673-15F35AC9C71D@illinois.edu> On Nov 30, 2011, at 5:58 AM, Peter Cock wrote: > On Wed, Nov 30, 2011 at 11:45 AM, Pjotr Prins wrote: >> On Wed, Nov 30, 2011 at 11:42:22AM +0000, Peter Cock wrote: >>> How about an OBF hosted FTP site then if we want big data? >> >> Yes :) >> >>> I guess we'd mostly be adding files, and changes/deletions >>> should be rare, so a full version tracking repository isn't >>> essential if we are disciplined about updating README files >>> or more formal meta data. >> >> We can still have the readme's and MD5s mirrored in a small repo. That >> would track changes/moving/renaming. >> >> Pj. > > True, or even a hybrid where small files also live in a git > repo, but for larger files we just store the URL and MD5? > > Peter There was an initial push for this years ago IIRC, with the biodata repository, but it never took off. Not sure if the dev.open-bio.org CVS repo is even browsable anymore (I believe this was all synced to portal for browsing), but the old biodata CVS repo is still in /home/repositories/biodata (very little there, might as well start from scratch). chris From cjfields at illinois.edu Wed Nov 30 22:50:03 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 1 Dec 2011 03:50:03 +0000 Subject: [Open-bio-l] OBDA redux? In-Reply-To: References:

Message-ID: <6A5077BE-11D6-4E00-8E04-BF3D790B02CB@illinois.edu> Just a quick update on this: the old OBDA specs were still in CVS in the obda-specs module (the old obda site had the module wrong). I ran git cvsimport on that after I copied the CVS repo to my laptop, so it's now on github: https://github.com/OBF/OBDA We could probably work on updates from there. chris On Nov 18, 2011, at 7:45 AM, Fields, Christopher J wrote: > On Nov 18, 2011, at 5:21 AM, Peter Cock wrote: > >> On Fri, Nov 18, 2011 at 10:55 AM, Raoul Bonnal wrote: >>> ... >>> And which are the information you want to extract once you >>> have your index ? >>> >> >> Biopython and BioPerl have their SeqIO parsers hooked up >> to indexing code. This means you can access a record via its >> ID, and it is parsed for you on demand - just like if you'd >> iterated over the file in order parsing the records one by one. >> >> Biopython (not sure about BioPerl) can also just fetch the raw >> text of that record. > > Re: BioPerl, I'm not sure about the OBDA implementations, but I know the older Bio::Index modules allow this. I would be surprised if the OBDA-specific code didn't, but adding this should be easy. > > chris > > > _______________________________________________ > Open-Bio-l mailing list > Open-Bio-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/open-bio-l From p.j.a.cock at googlemail.com Thu Nov 3 18:52:50 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 3 Nov 2011 18:52:50 +0000 Subject: [Open-bio-l] OBDA redux? Message-ID: On Thu, Nov 3, 2011 at 6:28 PM, Fields, Christopher J wrote: > (side thread, so re-titling...) > And CC'ing open-bio-l, which is a better home for this than bioperl-l, where OBDA v2 talk came up again in discussion of a BioPerl indexing problem. 
Archive links for thread here: http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035807.html http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035808.html http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035811.html http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035812.html http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035813.html http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035822.html > On Nov 1, 2011, at 1:06 PM, Peter Cock wrote: >> >> Yes, we're using SQLite3 to store essentially a list of filenames >> and their format as one table, and then in the main table an >> entry for each sequence recording the ID (only one accession, >> unlike OBDA which had infrastructure for a secondary accession), >> file number, offset of the start of the record, and optionally the >> length of the record on disk. >> >> i.e. Basically what OBDA does, but using SQLite rather >> than BDB (not included in Python 3) or a flat file index >> (poor performance with large datasets). >> >> I find this design attractive on several levels: >> * File format neutral, covers FASTA, FASTQ, GenBank, etc >> * Preserves the original file untouched >> * Index is a small single file (thanks to SQLite) >> * Back end could be switched out >> * Could be applied to compressed file formats >> * Reuses existing parsing code to access entries >> >> This could easily form basis of OBDA v2, the main points >> of difference I anticipate between the Bio* projects would >> be naming conventions for the different file formats, and >> what we consider to be the default record ID of each read >> (e.g. which field in a GenBank file - although agreement >> here is not essential). Some of that was already settled in >> principle with OBDA v1. > > The primary/secondary IDs could be configurable with a sane > default, I think the bioperl implementations allowed this (and > it is certainly something that will be requested). 
One reason I went with a single ID only was to keep the Python dictionary based API simple (think hash in Perl). You don't get secondary keys in a Python dict or a hash ;) As a nod to flexibility, in Biopython's Bio.SeqIO indexing you can provide a call back function to map the suggested ID to something else. Obviously this doesn't give the full flexibility of extracting a field from the record's annotation because we don't parse the whole record during indexing (it would be too slow). However, I'm happy for there to be an *optional* secondary key in an OBDA v2 SQLite schema, but Biopython probably won't populate it. We could optionally use it rather than the primary ID on loading an existing index though. Personally I would stick with one key in the index - it should be faster and makes it simpler to switch the back end if we need to later. If anyone wants a second key, they can build a second index *grin*. >> On the other hand, you could try and store the parsed data >> itself, which is where NOSQL looks more interesting. That >> essentially requires the ability to serialise your annotated >> sequence object model to disk - which would be tricky to do >> cross project (much more ambitious than BioSQL is). It also >> means the "index" becomes very large because it now holds >> all the original data. >> >> Peter > > For a fully cross-Bio* compliant format, I don't think it's feasible > to use serialized data unless they are serialized in something > that is easily deserialized across HLLs (JSON, BSON, YAML, > XML, etc). Either that, or such data is stored concurrently with > the binary blob, along with meta data that indicates the source > of the blob, parser, version, etc, etc (unless there are tools out > there that reliably interconvert serialized complex data structures > between HLLs). Anyway you go about it, it seems like it could > be a major ball of hurt, unless implemented very carefully. 
You missed out RDF as a serialisation ;) But yes, going down the shared serialisation route is going to be messy - as you are well aware: > Aside: I think this was one of the problems with > Bio::DB::SeqFeature::Store, in that it at one point stored > Perl-specific Storable blobs. > > chris Peter From cjfields at illinois.edu Thu Nov 3 19:47:51 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 3 Nov 2011 19:47:51 +0000 Subject: [Open-bio-l] OBDA redux? In-Reply-To: References: Message-ID: On Nov 3, 2011, at 1:52 PM, Peter Cock wrote: > On Thu, Nov 3, 2011 at 6:28 PM, Fields, Christopher J > wrote: >> (side thread, so re-titling...) >> > And CC'ing open-bio-l, which is a better home for this than bioperl-l, > where OBDA v2 talk came up again in discussion of a BioPerl indexing > problem. Archive links for thread here: > > http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035807.html > http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035808.html > http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035811.html > http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035812.html > http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035813.html > http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035822.html yes, good idea... >> On Nov 1, 2011, at 1:06 PM, Peter Cock wrote: >>> >>> Yes, we're using SQLite3 to store essentially a list of filenames >>> and their format as one table, and then in the main table an >>> entry for each sequence recording the ID (only one accession, >>> unlike OBDA which had infrastructure for a secondary accession), >>> file number, offset of the start of the record, and optionally the >>> length of the record on disk. >>> >>> i.e. Basically what OBDA does, but using SQLite rather >>> than BDB (not included in Python 3) or a flat file index >>> (poor performance with large datasets). 
>>> >>> I find this design attractive on several levels: >>> * File format neutral, covers FASTA, FASTQ, GenBank, etc >>> * Preserves the original file untouched >>> * Index is a small single file (thanks to SQLite) >>> * Back end could be switched out >>> * Could be applied to compressed file formats >>> * Reuses existing parsing code to access entries >>> >>> This could easily form basis of OBDA v2, the main points >>> of difference I anticipate between the Bio* projects would >>> be naming conventions for the different file formats, and >>> what we consider to be the default record ID of each read >>> (e.g. which field in a GenBank file - although agreement >>> here is not essential). Some of that was already settled in >>> principle with OBDA v1. >> >> The primary/secondary IDs could be configurable with a sane >> default, I think the bioperl implementations allowed this (and >> it is certainly something that will be requested). > > One reason I went with a single ID only was to keep the > Python dictionary based API simple (think hash in Perl). > You don't get secondary keys in a Python dict or a hash ;) > > As a nod to flexibility, in Biopython's Bio.SeqIO indexing you > can provide a call back function to map the suggested ID to > something else. Obviously this doesn't give the full flexibility > of extracting a field from the record's annotation because we > don't parse the whole record during indexing (it would be too > slow). Same with bioperl. > However, I'm happy for there to be an *optional* secondary > key in an OBDA v2 SQLite schema, but Biopython probably > won't populate it. We could optionally use it rather than the > primary ID on loading an existing index though. Optional implementation of that is fine by me. > Personally I would stick with one key in the index - it should > be faster and makes it simpler to switch the back end if we > need to later. If anyone wants a second key, they can build > a second index *grin*. That's easy enough. 
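For illustration, the "single key plus call back, and a second index if you want a second key" idea from the exchange above might look roughly like the sketch below. This is not Biopython's or BioPerl's actual code: index_fasta and its key_function argument are hypothetical names, the index is a plain in-memory dict rather than SQLite, and the UniProt-style "sp|ACCESSION|NAME" identifier is only an assumed example.

```python
def index_fasta(path, key_function=None):
    """Map each record key to the byte offset of its '>' line.

    Only the header line is inspected (the record is not parsed), so the
    callback can only remap the suggested ID, as described in the thread.
    """
    index = {}
    pos = 0
    with open(path, "rb") as handle:  # binary mode so offsets are bytes
        for line in handle:
            if line.startswith(b">"):
                fields = line[1:].split()
                suggested = fields[0].decode() if fields else ""
                key = key_function(suggested) if key_function else suggested
                if key in index:
                    raise ValueError("Duplicate key %r" % key)
                index[key] = pos
            pos += len(line)
    return index


def accession_only(suggested_id):
    """Hypothetical callback: 'sp|P12345|NAME' -> 'P12345'."""
    parts = suggested_id.split("|")
    return parts[1] if len(parts) >= 3 else suggested_id
```

Calling index_fasta(path) and index_fasta(path, key_function=accession_only) on the same file gives two independent single-key indexes, which is the "build a second index" route rather than a secondary key column.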
>>> On the other hand, you could try and store the parsed data >>> itself, which is where NOSQL looks more interesting. That >>> essentially requires the ability to serialise your annotated >>> sequence object model to disk - which would be tricky to do >>> cross project (much more ambitious than BioSQL is). It also >>> means the "index" becomes very large because it now holds >>> all the original data. >>> >>> Peter >> >> For a fully cross-Bio* compliant format, I don't think it's feasible >> to use serialized data unless they are serialized in something >> that is easily deserialized across HLLs (JSON, BSON, YAML, >> XML, etc). Either that, or such data is stored concurrently with >> the binary blob, along with meta data that indicates the source >> of the blob, parser, version, etc, etc (unless there are tools out >> there that reliably interconvert serialized complex data structures >> between HLLs). Anyway you go about it, it seems like it could >> be a major ball of hurt, unless implemented very carefully. > > You missed out RDF as a serialisation ;) > > But yes, going down the shared serialisation route is going > to be messy - as you are well aware: > >> Aside: I think this was one of the problems with >> Bio::DB::SeqFeature::Store, in that it at one point stored >> Perl-specific Storable blobs. >> >> chris > > Peter yes, it's a problem w/o an easy solution. Anyway, I think an implementation of such at this point would be a premature optimization. chris From p.j.a.cock at googlemail.com Sun Nov 13 12:24:35 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 13 Nov 2011 12:24:35 +0000 Subject: [Open-bio-l] OBDA redux? In-Reply-To: References:

Message-ID: On Thu, Nov 3, 2011 at 7:47 PM, Fields, Christopher J wrote: > On Nov 3, 2011, at 1:52 PM, Peter Cock wrote: > >> On Thu, Nov 3, 2011 at 6:28 PM, Fields, Christopher J >> wrote: >>> (side thread, so re-titling...) >>> >> And CC'ing open-bio-l, which is a better home for this than bioperl-l, >> where OBDA v2 talk came up again in discussion of a BioPerl indexing >> problem. Archive links for thread here: >> >> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035807.html >> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035808.html >> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035811.html >> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035812.html >> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035813.html >> http://lists.open-bio.org/pipermail/bioperl-l/2011-November/035822.html > > yes, good idea... I've not CC'd the bioperl-l anymore. >>> On Nov 1, 2011, at 1:06 PM, Peter Cock wrote: >>>> >>>> Yes, we're using SQLite3 to store essentially a list of filenames >>>> and their format as one table, and then in the main table an >>>> entry for each sequence recording the ID (only one accession, >>>> unlike OBDA which had infrastructure for a secondary accession), >>>> file number, offset of the start of the record, and optionally the >>>> length of the record on disk. >>>> >>>> i.e. Basically what OBDA does, but using SQLite rather >>>> than BDB (not included in Python 3) or a flat file index >>>> (poor performance with large datasets). 
>>>> >>>> I find this design attractive on several levels: >>>> * File format neutral, covers FASTA, FASTQ, GenBank, etc >>>> * Preserves the original file untouched >>>> * Index is a small single file (thanks to SQLite) >>>> * Back end could be switched out >>>> * Could be applied to compressed file formats >>>> * Reuses existing parsing code to access entries >>>> >>>> This could easily form basis of OBDA v2, the main points >>>> of difference I anticipate between the Bio* projects would >>>> be naming conventions for the different file formats, and >>>> what we consider to be the default record ID of each read >>>> (e.g. which field in a GenBank file - although agreement >>>> here is not essential). Some of that was already settled in >>>> principle with OBDA v1. >>> >>> The primary/secondary IDs could be configurable with a sane >>> default, I think the bioperl implementations allowed this (and >>> it is certainly something that will be requested). >> >> One reason I went with a single ID only was to keep the >> Python dictionary based API simple (think hash in Perl). >> You don't get secondary keys in a Python dict or a hash ;) >> >> As a nod to flexibility, in Biopython's Bio.SeqIO indexing you >> can provide a call back function to map the suggested ID to >> something else. Obviously this doesn't give the full flexibility >> of extracting a field from the record's annotation because we >> don't parse the whole record during indexing (it would be too >> slow). > > Same with bioperl. > >> However, I'm happy for there to be an *optional* secondary >> key in an OBDA v2 SQLite schema, but Biopython probably >> won't populate it. We could optionally use it rather than the >> primary ID on loading an existing index though. > > Optional implementation of that is fine by me. > >> Personally I would stick with one key in the index - it should >> be faster and makes it simpler to switch the back end if we >> need to later. 
If anyone wants a second key, they can build >> a second index *grin*. > > That's easy enough. > >>>> On the other hand, you could try and store the parsed data >>>> itself, which is where NOSQL looks more interesting. That >>>> essentially requires the ability to serialise your annotated >>>> sequence object model to disk - which would be tricky to do >>>> cross project (much more ambitious than BioSQL is). It also >>>> means the "index" becomes very large because it now holds >>>> all the original data. >>>> >>>> Peter >>> >>> For a fully cross-Bio* compliant format, I don't think it's feasible >>> to use serialized data unless they are serialized in something >>> that is easily deserialized across HLLs (JSON, BSON, YAML, >>> XML, etc). Either that, or such data is stored concurrently with >>> the binary blob, along with meta data that indicates the source >>> of the blob, parser, version, etc, etc (unless there are tools out >>> there that reliably interconvert serialized complex data structures >>> between HLLs). Anyway you go about it, it seems like it could >>> be a major ball of hurt, unless implemented very carefully. >> >> You missed out RDF as a serialisation ;) >> >> But yes, going down the shared serialisation route is going >> to be messy - as you are well aware: >> >>> Aside: I think this was one of the problems with >>> Bio::DB::SeqFeature::Store, in that it at one point stored >>> Perl-specific Storable blobs. >>> >>> chris >> >> Peter > > yes, it's a problem w/o an easy solution. Anyway, I think an > implementation of such at this point would be a premature > optimization. > > chris So, Chris and I seem to be in general agreement that an OBDA v2 using SQLite, but based on essentially the same approach as the BDB or flat file based OBDA v1, is a good idea, i.e. tables mapping record identifiers to file offsets in the original sequence files. I hope to get BioRuby on board; they already have OBDA v1 support, so that shouldn't be too hard.
Right now I don't recall if BioJava has/had OBDA v1 support and, if they did, whether it was affected by their recent move to BioJava v3 (I understand from their mailing list that some older, lower priority functionality has not all been ported yet). Also EMBOSS are likely to be interested; certainly Peter Rice was interested in the SQLite indexing we're already using in Biopython for sequence files (i.e. what is effectively the prototype for OBDA v2). Note that in addition to simple indexing of text files, we are already using the same simple offset + length approach for indexing binary files (e.g. SFF). On the immediate practical side, I think I can edit the current OBDA website of http://obda.open-bio.org/ via /home/websites/obda.open-bio.org/html on the server. We need to work out where the current OBDA indexing specification lives (CVS or SVN?) and perhaps move that to GitHub. We may need a general OBF organisation account on GitHub for this and any other cross-project repositories. I see there is already an OBDA project on RedMine (Chris, can you add me to that please?): https://redmine.open-bio.org/projects/obda Peter From p.j.a.cock at googlemail.com Sun Nov 13 12:30:37 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 13 Nov 2011 12:30:37 +0000 Subject: [Open-bio-l] OBDA redux? Compressed files Message-ID: Hi again, I've retitled this as it is a little off topic from the main OBDA redux thread, http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000819.html http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000820.html http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000821.html As far as I recall, the original flat file and BDB based OBDA specification for indexing sequence files didn't cover compressed files. That might be something to consider (although we should sort out uncompressed text/binary files first).
I've recently been experimenting with using compressed files - in particular simple GZIP files (ignoring any block structure) and BGZF (the specialised gzipped blocking used in BAM), see:

http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html
http://seqanswers.com/forums/showthread.php?t=15347

The virtual offset approach used in BGZF squeezes a 16 bit within block offset (thus limiting you to 64KB blocks) and a 48 bit block start offset (thus limiting you to a 256TB file) into a single 64 bit "virtual" offset. That makes sense if you are keeping the lookup table of many offsets in memory, and it can be used as is with code expecting a single offset (like the current Biopython SQLite index schema).

I have also looked at bzip2, which is block based too, with the block size ranging from 100KB to 900KB.

http://bzip.org/
http://bzip.org/1.0.5/bzip2-manual-1.0.5.html

I haven't tried any performance tests yet, which would be interesting as I believe compression/decompression of bzip2 is more costly in CPU terms than gzip (although both will be block size dependent).

If we wanted to imitate the BGZF virtual offset scheme for arbitrary BZIP2 files, an alternative 64 bit virtual offset scheme could use 20 bits to cover bz2 blocks of up to 900KB, leaving 64 - 20 = 44 bits for the start offset, thus limiting you to just 2^44 bytes, or 16TB, which sounds OK only in the medium term. On the bright side this could be used to index any BZIP2 file (under 16TB), whereas the BGZF scheme cannot be applied to arbitrary GZIP files.

On the other hand, storing the block start and the within block offset separately is truly generic and could be used on any blocked GZIP file (including BGZF), BZIP2, etc. It would make the SQLite schema a bit more complicated though. Maybe that is something to consider for the next revision of OBDA, and we should focus on the non-compressed case for now?
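To make the virtual offset arithmetic above concrete, here is a rough Python sketch. It only shows the bit packing; the function names and the within_bits parameter are made up for this illustration and are not taken from the BGZF reference code or from Biopython. The default of 16 bits matches the BGZF split described above, and 20 bits gives the hypothetical bzip2 variant.

```python
def make_virtual_offset(block_start, within_block, within_bits=16):
    """Pack a block start and a within-block offset into one 64 bit integer.

    Hypothetical helper: within_bits=16 mirrors the BGZF split (48 + 16),
    within_bits=20 is the alternative (44 + 20) suggested for bzip2.
    """
    start_bits = 64 - within_bits
    if not 0 <= within_block < (1 << within_bits):
        raise ValueError("within-block offset does not fit in %i bits" % within_bits)
    if not 0 <= block_start < (1 << start_bits):
        raise ValueError("block start does not fit in %i bits" % start_bits)
    return (block_start << within_bits) | within_block


def split_virtual_offset(virtual_offset, within_bits=16):
    """Reverse of make_virtual_offset: return (block_start, within_block)."""
    return virtual_offset >> within_bits, virtual_offset & ((1 << within_bits) - 1)


# A record starting 10 bytes into the block which begins at byte 100000:
voffset = make_virtual_offset(100000, 10)
print(voffset)                        # 100000 * 65536 + 10
print(split_virtual_offset(voffset))  # back to the (block start, within block) pair

# Largest addressable file for the two splits, in units of 2**40 bytes:
print((1 << 48) // (1 << 40), (1 << 44) // (1 << 40))
```

Because the packed value is an ordinary integer, it can go into a single offset column unchanged, which is why this scheme works as is with an index that expects one offset per record.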
Regards, Peter From p.j.a.cock at googlemail.com Sun Nov 13 12:32:12 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 13 Nov 2011 12:32:12 +0000 Subject: [Open-bio-l] OBDA redux? Compressed files In-Reply-To: References: Message-ID: On Sun, Nov 13, 2011 at 12:30 PM, Peter Cock wrote: > Hi again, > > I've retitled this as it is a little off topic from the main OBDA redux thread, > http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000819.html > http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000820.html > http://lists.open-bio.org/pipermail/open-bio-l/2011-November/000821.html > > As far as I recall, the original flat file and BDB based OBDA > specification for indexing sequencing files didn't cover > compressed files. That might be something to consider > (although we should sort of uncompressed text/binary > files first). Sorry - I didn't mean to include bioperl-l on that, although it may be of interest to you guys anyway. Peter From cjfields at illinois.edu Mon Nov 14 17:59:35 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 14 Nov 2011 17:59:35 +0000 Subject: [Open-bio-l] OBDA redux? In-Reply-To: References:

Message-ID: <12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> On Nov 13, 2011, at 6:24 AM, Peter Cock wrote: > So, Chris and I seem in general agreement that an OBDA v2 > using SQLite but based on essentially the same approach as > the BDB or flat file based OBDA v1 is a good idea. i.e. Tables > mapping record identifiers to file offsets in the original sequence > files. The worry I have is adhering to a specific backend (e.g. SQLite). The reason I say this is b/c BDB in its time was the go-to way of storing simple index data, but that is no longer feasible for very large data sets. Who's to say something similar won't happen to SQLite, or that it is the best option available? Maybe we should focus on the data storage schema, as simple as it may be, then indicate the default backend must be SQLite but others are allowed (maybe with a mention that SQLite can be replaced by alternatives in the future if needed). > I hope to get BioRuby on board, they already have an OBDA > v1 support so that shouldn't be too hard. > > Right now I don't recall if BioJava has/had OBDA v1 support, > and if they did if it was affected in their recent move to BioJava > v3 (I understand from their mailing list that some older lower > priority functionality has not all been ported yet). I wouldn't be surprised at that; OBDA kind of lingered for a while, and I'm not sure how widely adopted it became (maybe others can shed light on that?). > Also EMBOSS are likely to be interested, certainly Peter Rice > was interested in the SQLite indexing we're already using in > Biopython for sequence files (i.e. what is effectively the > prototype for OBDA v2). > > Note that in addition to simple indexing of text files, we are > already using the same simple offset + length approach for > indexing binary files (e.g. SFF). I think that's the general idea; that is how all bioperl data was indexed, before with the Bio::Index modules and with the OBDA implementations as well.
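For anyone new to the thread, the offset + length approach can be sketched in a few lines of Python with the standard sqlite3 module. This is only an illustration under stated assumptions: the table and column names (file_data, offset_data, key, file_number, offset, length) and the two helper functions are invented for this example and are not the actual Biopython schema or any agreed OBDA v2 schema, and the FASTA scanning assumes every header line carries an identifier.

```python
import os
import sqlite3
import tempfile


def build_index(fasta_path, con):
    """Store the start offset and on-disk length of each FASTA record."""
    con.execute("CREATE TABLE file_data (file_number INTEGER PRIMARY KEY, "
                "name TEXT, format TEXT)")
    con.execute("CREATE TABLE offset_data (key TEXT PRIMARY KEY, "
                "file_number INTEGER, offset INTEGER, length INTEGER)")
    con.execute("INSERT INTO file_data VALUES (0, ?, 'fasta')", (fasta_path,))
    with open(fasta_path, "rb") as handle:
        key, start = None, 0
        while True:
            pos = handle.tell()
            line = handle.readline()
            if line.startswith(b">") or not line:
                # A new header (or end of file) closes the previous record.
                if key is not None:
                    con.execute("INSERT INTO offset_data VALUES (?, 0, ?, ?)",
                                (key, start, pos - start))
                if not line:
                    break
                key, start = line[1:].split()[0].decode(), pos
    con.commit()


def fetch_raw(con, key):
    """Return the raw bytes of one record, read from the untouched original file."""
    row = con.execute("SELECT name, offset, length FROM offset_data "
                      "JOIN file_data USING (file_number) WHERE key = ?",
                      (key,)).fetchone()
    if row is None:
        raise KeyError(key)
    name, offset, length = row
    with open(name, "rb") as handle:
        handle.seek(offset)
        return handle.read(length)


# Small demonstration with a temporary two-record FASTA file.
with tempfile.NamedTemporaryFile(suffix=".fasta", delete=False) as tmp:
    tmp.write(b">seq1 first\nACGT\nACGT\n>seq2\nGGCC\n")
con = sqlite3.connect(":memory:")
build_index(tmp.name, con)
print(fetch_raw(con, "seq2"))
os.remove(tmp.name)
```

The index never stores sequence data, only where each record lives, so the same two tables could be re-expressed in another back end (BDB, a key-value store) without touching the original files.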
> On the immediate practical side, I think I can edit the > current OBDA website of http://obda.open-bio.org/ > via /home/websites/obda.open-bio.org/html on the > server. See below w/ regards to my thoughts on the wiki. > We need to work out where the current OBDA indexing > specification lives (CVS or SVN?) and perhaps move > that to github. We may need a general OBF organisation > account on git hub for this and any other cross-project > repositories. +1 to a move to github, but maybe this belongs in an OBF-specific organization. And maybe we should take advantage of the simple wiki or project homepage that GitHub offers and move everything (docs) there. > I see there is already an OBDA project on RedMine, > (Chris can you add me to that please?) > https://redmine.open-bio.org/projects/obda > > Peter Done (last night actually, but I didn't have time to respond immediately). chris From p.j.a.cock at googlemail.com Mon Nov 14 18:14:18 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 14 Nov 2011 18:14:18 +0000 Subject: [Open-bio-l] OBDA redux? In-Reply-To: <12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> References:

<12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> Message-ID: Hi Chris, [Did you mean to CC BioPerl-l? Should I have?] On Mon, Nov 14, 2011 at 5:59 PM, Fields, Christopher J wrote: > On Nov 13, 2011, at 6:24 AM, Peter Cock wrote: > >> So, Chris and I seem in general agreement that an OBDA v2 >> using SQLite but based on essentially the same approach as >> the BDB or flat file based OBDA v1 is a good idea. i.e. Tables >> mapping record identifiers to file offsets in the original sequence >> files. > > The worry I have is adhering to a specific backend (e.g. SQLite). > The reason I say this is b/c BDB in it's time was the go-to way > of storing simple index data, but that is no longer feasible for > very large data sets. Who's to say something similar won't > happen to SQLite, or that it is the best option available? Right now I would think SQLite is one of the best options (if not the best). If supporting the old back ends is important for cross-project compatibility, I'm willing to have another go at using BDB in Biopython, but I had limited success last time I tried. > Maybe we should focus on the data storage schema, as > simple as it may be, then indicate the default backend > must be SQLite but others are allowed (maybe with a > mention that SQLite can be replaced by alternatives in > the future if needed). It would make sense to talk about an SQL schema if the "other options" would also be SQL based. They might not be, but certainly we should keep potential alternative back ends in mind. >> I hope to get BioRuby on board, they already have an OBDA >> v1 support so that shouldn't be too hard. >> >> Right now I don't recall if BioJava has/had OBDA v1 support, >> and if they did if it was affected in their recent move to BioJava >> v3 (I understand from their mailing list that some older lower >> priority functionality has not all been ported yet).
> > I wouldn't be surprised at that, OBDA kind of lingered for a > while, and I'm not sure how widely adopted it became > (maybe others can shed light on that?) Well, OBDA went beyond just indexing flat files - it also tried to standardise things like remote database access. I don't think we ever really had that side working in Biopython, so I am less familiar with it. I know EMBOSS has something fairly extensive for online databases, but have not checked if it uses the OBDA style or their own. For now I was only planning to tackle indexing sequence files in this "OBDA redux". >> Also EMBOSS are likely to be interested, certainly Peter Rice >> was interested in the SQLite indexing we're already using in >> Biopython for sequence files (i.e. what is effectively the >> prototype for OBDA v2). >> >> Note that in addition to simple indexing of text files, we are >> already using the same simple offset + length approach for >> indexing binary files (e.g. SFF). > > I think that's the general idea, that is how all bioperl data > was indexed, before with the Bio::Index modules and with > the OBDA implementations as well. Good. >> On the immediate practical side, I think I can edit the >> current OBDA website of http://obda.open-bio.org/ >> via /home/websites/obda.open-bio.org/html on the >> server. > > See below w/ regards to my thoughts on the wiki. > >> We need to work out where the current OBDA indexing >> specification lives (CVS or SVN?) and perhaps move >> that to github. We may need a general OBF organisation >> account on git hub for this and any other cross-project >> repositories. > > +1 to a move to github, but maybe this belongs in an > OBF-specific organization. Yes, definitely under an OBF github account (not under Biopython, BioPerl, etc). > And maybe we should take advantage of the simple > wiki or project homepage that GitHub offers and move > everything (docs) there. That could work.
We'd have to go through all the old documentation and relocate it, then we could make the obda.open-bio.org domain point at the github pages. >> I see there is already an OBDA project on RedMine, >> (Chris can you add me to that please?) >> https://redmine.open-bio.org/projects/obda >> >> Peter > > Done (last night actually, but I didn't have time to respond > immediately). > > chris Thanks, Peter From cjfields at illinois.edu Mon Nov 14 18:47:10 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Mon, 14 Nov 2011 18:47:10 +0000 Subject: [Open-bio-l] OBDA redux? In-Reply-To: References:

<12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> Message-ID: <5BE2A95E-103D-4FEE-B8C4-3754CCF67507@illinois.edu> On Nov 14, 2011, at 12:14 PM, Peter Cock wrote: > Hi Chris, > > [Did you mean to CC BioPerl-l? Should I have?] > > On Mon, Nov 14, 2011 at 5:59 PM, Fields, Christopher J > wrote: >> On Nov 13, 2011, at 6:24 AM, Peter Cock wrote: >> >>> So, Chris and I seem in general agreement that an OBDA v2 >>> using SQLite but based on essentially the same approach as >>> the BDB or flat file based OBDA v1 is a good idea. i.e. Tables >>> mapping record identifiers to file offsets in the original sequence >>> files. >> >> The worry I have is adhering to a specific backend (e.g. SQLite). >> The reason I say this is b/c BDB in it's time was the go-to way >> of storing simple index data, but that is no longer feasible for >> very large data sets. Who's to say something similar won't >> happen to SQLite, or that it is the best option available? > > Right now I would think SQLite is one of the best (if not the > best) option. If supporting the old back ends is important for > cross-project compatibility, I'm willing to have another go > at using BDB in Biopython, but had limited success last > time I tried. No, I agree re: SQLite at the moment, it's probably the best option (fast, widely adopted, etc), though Jason mentioned (Tokyo|Kyoto)Cabinet also worked very well. I would rather not paint ourselves into a corner if the 'nice-and-shiny' next thing down the road performs better and gains wide adoption. >> Maybe we should focus on the data storage schema, as >> simple as it may be, then indicate the default backend >> must be SQLite but others are allowed (maybe with a >> mention that SQLite can be replaced by alternatives in >> the future if needed). > > It would make sense to talk about an SQL schema if > the "other options" would also be SQL based. But they > might not be... but certainly we should keep potential > alternative back ends in mind. 
It's probably necessary to allow for both possibilities (SQL and other). For instance, a move to SQLite will necessitate describing the table data with SQL anyway. >>> I hope to get BioRuby on board, they already have an OBDA >>> v1 support so that shouldn't be too hard. >>> >>> Right now I don't recall if BioJava has/had OBDA v1 support, >>> and if they did if it was affected in their recent move to BioJava >>> v3 (I understand from their mailing list that some older lower >>> priority functionality has not all been ported yet). >> >> I wouldn't be surprised at that, OBDA kind of lingered for a >> while, and I'm not sure how widely adopted it became >> (maybe others can shed light on that?) > > Well, OBDA went beyond just indexing flat files - it also > tried to standard things like remote database access. > I don't think we every really had that side working in > Biopython, so I am less familiar with it. I know EMBOSS > has something fairly extensive for online databases, > but have not checked if it uses the OBDA style or their > own. Right, but I wonder if that may have been one problem with the original OBDA specification, that it was perhaps overly ambitious out-the-gate. > For now I was only planning to tackle indexing sequence > files in this "OBDA redux". That's a good and simpler start; the rest (remote access) fall in naturally once that is in place. >>> Also EMBOSS are likely to be interested, certainly Peter Rice >>> was interested in the SQLite indexing we're already using in >>> Biopython for sequence files (i.e. what is effectively the >>> prototype for OBDA v2). >>> >>> Note that in addition to simple indexing of text files, we are >>> already using the same simple offset + length approach for >>> indexing binary files (e.g. SFF). >> >> I think that's the general idea, that is how all bioperl data >> was indexed, before with the Bio::Index modules and with >> the OBDA implementations as well. > > Good. 
> >>> On the immediate practical side, I think I can edit the >>> current OBDA website of http://obda.open-bio.org/ >>> via /home/websites/obda.open-bio.org/html on the >>> server. >> >> See below w/ regards to my thoughts on the wiki. >> >>> We need to work out where the current OBDA indexing >>> specification lives (CVS or SVN?) and perhaps move >>> that to github. We may need a general OBF organisation >>> account on git hub for this and any other cross-project >>> repositories. >> >> +1 to a move to github, but maybe this belongs in an >> OBF-specific organization. > > Yes, definitely under an OBF github account (not under > Biopython, BioPerl, etc). > >> And maybe we should take advantage of the simple >> wiki or project homepage that GitHub offers and move >> everything (docs) there. > > That could work. We'd have to go through all the old > documentation and relocate it, then we could make the > obda.open-bio.org domain point at the github pages. Yes, I think that's the idea. >>> I see there is already an OBDA project on RedMine, >>> (Chris can you add me to that please?) >>> https://redmine.open-bio.org/projects/obda >>> >>> Peter >> >> Done (last night actually, but I didn't have time to respond >> immediately). >> >> chris > > Thanks, > > Peter np. -c From p.j.a.cock at googlemail.com Mon Nov 14 23:01:03 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 14 Nov 2011 23:01:03 +0000 Subject: [Open-bio-l] OBDA redux? 
Compressed files In-Reply-To: References: Message-ID: On Sun, Nov 13, 2011 at 12:30 PM, Peter Cock wrote: > > I've recently been experimenting with using compressed > files - in particular simple GZIP files (ignoring any block structure) > and BGZF (the specialised gzipped blocking used in BAM), see: > > http://blastedbio.blogspot.com/2011/11/bgzf-blocked-bigger-better-gzip.html > http://seqanswers.com/forums/showthread.php?t=15347 > > The virtual offset approach used in BGZF squeezes a 16 bit > within block offset (thus limiting you to 64kb blocks) and at > 48 bit block start offset (thus limiting you to a 256TB file) into > a single 64bit "virtual" offset. That makes sense if you are > keeping the lookup table or many offsets in memory, and > can be used as is with code expecting a single offset (like > the current Biopython SQLite index schema). > > Also bzip2 ... is block based, with the block size ranging > from 100KB to 900KB. > > http://bzip.org/ > http://bzip.org/1.0.5/bzip2-manual-1.0.5.html > A point of clarification, having found the Wikipedia page http://en.wikipedia.org/wiki/Bzip2 very informative: those are the compressed block sizes (100KB to 900KB), and this means that after decompression a 900KB block can in some cases reach about 46MB. Clearly that means the BGZF virtual offset approach cannot be applied to arbitrary bzip2 files (much like it can't be applied to arbitrary gzip files) without imposing some a priori limit on the decompressed size of each block. > On the other hand, storing the block start and within block > separately is truly generic and could be used on any blocked > GZIP file (including BGZF) and BZIP2 etc. It would make > the SQLite schema a bit more complicated though. So on reflection, if we want to index any blocked compressed file format, such as GZIP (including BGZF) and BZIP2, then two offsets do seem to be required (the block offset, and the data offset within the block after decompression).
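A minimal Python sketch of the two offset idea, for illustration only. It uses plain concatenated gzip members as the "blocks" (real BGZF blocks carry extra header fields, and bzip2 would need different decompression code), keeps the index in a dict rather than SQLite, and the function names are hypothetical. Each record is located by the block start in the compressed file plus the offset within the decompressed block, with the record length stored alongside.

```python
import gzip
import io
import zlib


def write_blocks(handle, records, block_size=2):
    """Write records in separately gzipped blocks.

    Returns {key: (block_start, within_block, length)}, where block_start is
    a position in the compressed file and within_block is a position in that
    block's decompressed data.
    """
    index = {}
    for i in range(0, len(records), block_size):
        block_start = handle.tell()
        payload = b""
        for key, data in records[i:i + block_size]:
            index[key] = (block_start, len(payload), len(data))
            payload += data
        handle.write(gzip.compress(payload))  # one gzip member per block
    return index


def read_record(handle, block_start, within_block, length):
    """Decompress only the block holding the record, then slice it out."""
    handle.seek(block_start)
    decomp = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)  # expect a gzip header
    data = b""
    while not decomp.eof:
        chunk = handle.read(4096)
        if not chunk:
            break
        data += decomp.decompress(chunk)
    return data[within_block:within_block + length]


records = [("a", b">a\nAC\n"), ("b", b">b\nGG\n"), ("c", b">c\nTT\n")]
handle = io.BytesIO()
index = write_blocks(handle, records)
print(read_record(handle, *index["b"]))  # second record of the first block
print(read_record(handle, *index["c"]))  # first record of the second block
```

In an SQLite schema this would mean two integer columns in place of the single offset column, which is the extra complication mentioned above; the benefit is that no limit on the decompressed block size has to be assumed.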
Peter From jason.stajich at gmail.com Wed Nov 16 20:19:14 2011 From: jason.stajich at gmail.com (Jason Stajich) Date: Wed, 16 Nov 2011 15:19:14 -0500 Subject: [Open-bio-l] OBDA redux? In-Reply-To: <5BE2A95E-103D-4FEE-B8C4-3754CCF67507@illinois.edu> References:

<12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> <5BE2A95E-103D-4FEE-B8C4-3754CCF67507@illinois.edu> Message-ID: Not to overly advocate for NOSQL, as I think for our purposes the jury is still out. So I think it is worth benchmarking - NOSQL and SQL-based systems will have different overheads. I know that when I have tried to store 100M -> 500M records in SQLite the performance degrades, whereas I was able to store that range of keys in a NOSQL db without problem. I don't know if there is a generic API for the NOSQL systems which would help with standardization. Jason Stajich jason at bioperl.org On Mon, Nov 14, 2011 at 1:47 PM, Fields, Christopher J < cjfields at illinois.edu> wrote: > On Nov 14, 2011, at 12:14 PM, Peter Cock wrote: > > > Hi Chris, > > > > [Did you mean to CC BioPerl-l? Should I have?] > > > > On Mon, Nov 14, 2011 at 5:59 PM, Fields, Christopher J > > wrote: > >> On Nov 13, 2011, at 6:24 AM, Peter Cock wrote: > >> > >>> So, Chris and I seem in general agreement that an OBDA v2 > >>> using SQLite but based on essentially the same approach as > >>> the BDB or flat file based OBDA v1 is a good idea. i.e. Tables > >>> mapping record identifiers to file offsets in the original sequence > >>> files. > >> > >> The worry I have is adhering to a specific backend (e.g. SQLite). > >> The reason I say this is b/c BDB in it's time was the go-to way > >> of storing simple index data, but that is no longer feasible for > >> very large data sets. Who's to say something similar won't > >> happen to SQLite, or that it is the best option available? > > > > Right now I would think SQLite is one of the best (if not the > > best) option. If supporting the old back ends is important for > > cross-project compatibility, I'm willing to have another go > > at using BDB in Biopython, but had limited success last > > time I tried.
> > No, I agree re: SQLite at the moment, it's probably the best option (fast, > widely adopted, etc), though Jason mentioned (Tokyo|Kyoto)Cabinet also > worked very well. I would rather not paint ourselves into a corner if the > 'nice-and-shiny' next thing down the road performs better and gains wide > adoption. > > >> Maybe we should focus on the data storage schema, as > >> simple as it may be, then indicate the default backend > >> must be SQLite but others are allowed (maybe with a > >> mention that SQLite can be replaced by alternatives in > >> the future if needed). > > > > It would make sense to talk about an SQL schema if > > the "other options" would also be SQL based. But they > > might not be... but certainly we should keep potential > > alternative back ends in mind. > > It's probably necessary to allow for both possibilities (SQL and other). > For instance, a move to SQLite will necessitate describing the table data > with SQL anyway. > > >>> I hope to get BioRuby on board, they already have an OBDA > >>> v1 support so that shouldn't be too hard. > >>> > >>> Right now I don't recall if BioJava has/had OBDA v1 support, > >>> and if they did if it was affected in their recent move to BioJava > >>> v3 (I understand from their mailing list that some older lower > >>> priority functionality has not all been ported yet). > >> > >> I wouldn't be surprised at that, OBDA kind of lingered for a > >> while, and I'm not sure how widely adopted it became > >> (maybe others can shed light on that?) > > > > Well, OBDA went beyond just indexing flat files - it also > > tried to standard things like remote database access. > > I don't think we every really had that side working in > > Biopython, so I am less familiar with it. I know EMBOSS > > has something fairly extensive for online databases, > > but have not checked if it uses the OBDA style or their > > own. 
> > Right, but I wonder if that may have been one problem with the original > OBDA specification, that it was perhaps overly ambitious out-the-gate. > > > For now I was only planning to tackle indexing sequence > > files in this "OBDA redux". > > That's a good and simpler start; the rest (remote access) fall in > naturally once that is in place. > > >>> Also EMBOSS are likely to be interested, certainly Peter Rice > >>> was interested in the SQLite indexing we're already using in > >>> Biopython for sequence files (i.e. what is effectively the > >>> prototype for OBDA v2). > >>> > >>> Note that in addition to simple indexing of text files, we are > >>> already using the same simple offset + length approach for > >>> indexing binary files (e.g. SFF). > >> > >> I think that's the general idea, that is how all bioperl data > >> was indexed, before with the Bio::Index modules and with > >> the OBDA implementations as well. > > > > Good. > > > >>> On the immediate practical side, I think I can edit the > >>> current OBDA website of http://obda.open-bio.org/ > >>> via /home/websites/obda.open-bio.org/html on the > >>> server. > >> > >> See below w/ regards to my thoughts on the wiki. > >> > >>> We need to work out where the current OBDA indexing > >>> specification lives (CVS or SVN?) and perhaps move > >>> that to github. We may need a general OBF organisation > >>> account on git hub for this and any other cross-project > >>> repositories. > >> > >> +1 to a move to github, but maybe this belongs in an > >> OBF-specific organization. > > > > Yes, definitely under an OBF github account (not under > > Biopython, BioPerl, etc). > > > >> And maybe we should take advantage of the simple > >> wiki or project homepage that GitHub offers and move > >> everything (docs) there. > > > > That could work. We'd have to go through all the old > > documentation and relocate it, then we could make the > > obda.open-bio.org domain point at the github pages. 
> > Yes, I think that's the idea. > > >>> I see there is already an OBDA project on RedMine, > >>> (Chris can you add me to that please?) > >>> https://redmine.open-bio.org/projects/obda > >>> > >>> Peter > >> > >> Done (last night actually, but I didn't have time to respond > >> immediately). > >> > >> chris > > > > Thanks, > > > > Peter > > np. > > -c > > From k at bioruby.org Thu Nov 17 00:00:50 2011 From: k at bioruby.org (Toshiaki Katayama) Date: Thu, 17 Nov 2011 09:00:50 +0900 Subject: [Open-bio-l] OBDA redux? In-Reply-To: References:

<12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> <5BE2A95E-103D-4FEE-B8C4-3754CCF67507@illinois.edu>
Message-ID: <175D6437-4DD6-42FA-A6D6-84BDCFB74E83@bioruby.org>

Hi Jason,

I was not actively following this thread but have one comment:

> I don't know if there is a generic API for the NOSQL systems which would
> help for standardization.

To my knowledge, RDF/SPARQL is the only standardized format/protocol
among the NoSQL stores. Unfortunately, its performance and scalability
are not yet comparable to the widely used key-value stores (e.g. Tokyo
Cabinet). However, the Semantic Web may have the potential to become a
standard for storing heterogeneous data sets as an integrated biological
DB without designing any schema (we need ontologies instead).

Cheers,
Toshiaki Katayama

On 2011/11/17, at 5:19, Jason Stajich wrote:

> Not to overly advocate for the NOSQL as I think for our purposes the jury
> is still out. So I think it is worth benchmarking - NOSQL and SQL-based
> systems will have different overheads.
>
> I know when I have tried to store 100M -> 500M records in SQLite the
> performance degrades whereas I was able to store that range of keys in
> NOSQL db without problem.
>
> I don't know if there is a generic API for the NOSQL systems which would
> help for standardization.
>
> Jason Stajich
> jason at bioperl.org
>
>
> On Mon, Nov 14, 2011 at 1:47 PM, Fields, Christopher J <
> cjfields at illinois.edu> wrote:
>
>> On Nov 14, 2011, at 12:14 PM, Peter Cock wrote:
>>
>>> Hi Chris,
>>>
>>> [Did you mean to CC BioPerl-l? Should I have?]
>>>
>>> On Mon, Nov 14, 2011 at 5:59 PM, Fields, Christopher J
>>> wrote:
>>>> On Nov 13, 2011, at 6:24 AM, Peter Cock wrote:
>>>>
>>>>> So, Chris and I seem in general agreement that an OBDA v2
>>>>> using SQLite but based on essentially the same approach as
>>>>> the BDB or flat file based OBDA v1 is a good idea. i.e. Tables
>>>>> mapping record identifiers to file offsets in the original sequence
>>>>> files.
>>>>
>>>> The worry I have is adhering to a specific backend (e.g. SQLite).
>>>> The reason I say this is b/c BDB in its time was the go-to way
>>>> of storing simple index data, but that is no longer feasible for
>>>> very large data sets. Who's to say something similar won't
>>>> happen to SQLite, or that it is the best option available?
>>>
>>> Right now I would think SQLite is one of the best (if not the
>>> best) option. If supporting the old back ends is important for
>>> cross-project compatibility, I'm willing to have another go
>>> at using BDB in Biopython, but had limited success last
>>> time I tried.
>>
>> No, I agree re: SQLite at the moment, it's probably the best option (fast,
>> widely adopted, etc), though Jason mentioned (Tokyo|Kyoto)Cabinet also
>> worked very well. I would rather not paint ourselves into a corner if the
>> 'nice-and-shiny' next thing down the road performs better and gains wide
>> adoption.
>>
>>>> Maybe we should focus on the data storage schema, as
>>>> simple as it may be, then indicate the default backend
>>>> must be SQLite but others are allowed (maybe with a
>>>> mention that SQLite can be replaced by alternatives in
>>>> the future if needed).
>>>
>>> It would make sense to talk about an SQL schema if
>>> the "other options" would also be SQL based. But they
>>> might not be... but certainly we should keep potential
>>> alternative back ends in mind.
>>
>> It's probably necessary to allow for both possibilities (SQL and other).
>> For instance, a move to SQLite will necessitate describing the table data
>> with SQL anyway.
>>
>>>>> I hope to get BioRuby on board, they already have OBDA
>>>>> v1 support so that shouldn't be too hard.
>>>>>
>>>>> Right now I don't recall if BioJava has/had OBDA v1 support,
>>>>> and if they did if it was affected in their recent move to BioJava
>>>>> v3 (I understand from their mailing list that some older lower
>>>>> priority functionality has not all been ported yet).
>>>>
>>>> I wouldn't be surprised at that, OBDA kind of lingered for a
>>>> while, and I'm not sure how widely adopted it became
>>>> (maybe others can shed light on that?)
>>>
>>> Well, OBDA went beyond just indexing flat files - it also
>>> tried to standardise things like remote database access.
>>> I don't think we ever really had that side working in
>>> Biopython, so I am less familiar with it. I know EMBOSS
>>> has something fairly extensive for online databases,
>>> but have not checked if it uses the OBDA style or their
>>> own.
>>
>> Right, but I wonder if that may have been one problem with the original
>> OBDA specification, that it was perhaps overly ambitious out of the gate.
>>
>>> For now I was only planning to tackle indexing sequence
>>> files in this "OBDA redux".
>>
>> That's a good and simpler start; the rest (remote access) falls in
>> naturally once that is in place.
>>
>>>>> Also EMBOSS are likely to be interested, certainly Peter Rice
>>>>> was interested in the SQLite indexing we're already using in
>>>>> Biopython for sequence files (i.e. what is effectively the
>>>>> prototype for OBDA v2).
>>>>>
>>>>> Note that in addition to simple indexing of text files, we are
>>>>> already using the same simple offset + length approach for
>>>>> indexing binary files (e.g. SFF).
>>>>
>>>> I think that's the general idea, that is how all bioperl data
>>>> was indexed, before with the Bio::Index modules and with
>>>> the OBDA implementations as well.
>>>
>>> Good.
>>>
>>>>> On the immediate practical side, I think I can edit the
>>>>> current OBDA website of http://obda.open-bio.org/
>>>>> via /home/websites/obda.open-bio.org/html on the
>>>>> server.
>>>>
>>>> See below w/ regards to my thoughts on the wiki.
>>>>
>>>>> We need to work out where the current OBDA indexing
>>>>> specification lives (CVS or SVN?) and perhaps move
>>>>> that to github. We may need a general OBF organisation
>>>>> account on github for this and any other cross-project
>>>>> repositories.
>>>>
>>>> +1 to a move to github, but maybe this belongs in an
>>>> OBF-specific organization.
>>>
>>> Yes, definitely under an OBF github account (not under
>>> Biopython, BioPerl, etc).
>>>
>>>> And maybe we should take advantage of the simple
>>>> wiki or project homepage that GitHub offers and move
>>>> everything (docs) there.
>>>
>>> That could work. We'd have to go through all the old
>>> documentation and relocate it, then we could make the
>>> obda.open-bio.org domain point at the github pages.
>>
>> Yes, I think that's the idea.
>>
>>>>> I see there is already an OBDA project on RedMine,
>>>>> (Chris can you add me to that please?)
>>>>> https://redmine.open-bio.org/projects/obda
>>>>>
>>>>> Peter
>>>>
>>>> Done (last night actually, but I didn't have time to respond
>>>> immediately).
>>>>
>>>> chris
>>>
>>> Thanks,
>>>
>>> Peter
>>
>> np.
>>
>> -c
>>
>>
> _______________________________________________
> Open-Bio-l mailing list
> Open-Bio-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/open-bio-l

From cjfields at illinois.edu Thu Nov 17 14:13:40 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 17 Nov 2011 14:13:40 +0000
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
References:

<12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> <5BE2A95E-103D-4FEE-B8C4-3754CCF67507@illinois.edu>
Message-ID: <735E98B4-B8EB-4735-9C26-7A6E077F606C@illinois.edu>

On Nov 16, 2011, at 2:19 PM, Jason Stajich wrote:

> Not to overly advocate for the NOSQL as I think for our purposes the jury
> is still out. So I think it is worth benchmarking - NOSQL and SQL-based
> systems will have different overheads.
>
> I know when I have tried to store 100M -> 500M records in SQLite the
> performance degrades whereas I was able to store that range of keys in
> NOSQL db without problem.

+1. This will only get worse: with the projections for upcoming HiSeq
upgrades, it is possible 1-2 channel runs would hit that limit.

> I don't know if there is a generic API for the NOSQL systems which would
> help for standardization.
>
> Jason Stajich
> jason at bioperl.org

Not that I know of, though one could probably come up with an AnyDBM
wrapper for those, if the modules aren't already compliant with that API.
Tokyo/KyotoCabinet have Perl, Python, Ruby, and Java interfaces with the
distribution.

chris

From p.j.a.cock at googlemail.com Thu Nov 17 14:39:49 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 17 Nov 2011 14:39:49 +0000
Subject: [Open-bio-l] OBDA redux?
In-Reply-To: <735E98B4-B8EB-4735-9C26-7A6E077F606C@illinois.edu>
References:

<12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> <5BE2A95E-103D-4FEE-B8C4-3754CCF67507@illinois.edu> <735E98B4-B8EB-4735-9C26-7A6E077F606C@illinois.edu>
Message-ID:

On Thu, Nov 17, 2011 at 2:13 PM, Fields, Christopher J wrote:
> On Nov 16, 2011, at 2:19 PM, Jason Stajich wrote:
>
>> Not to overly advocate for the NOSQL as I think for our purposes the jury
>> is still out. So I think it is worth benchmarking - NOSQL and SQL-based
>> systems will have different overheads.
>>
>> I know when I have tried to store 100M -> 500M records in SQLite the
>> performance degrades whereas I was able to store that range of keys
>> in NOSQL db without problem.
>
> +1. This will only get worse, with the projections for upcoming HiSeq
> upgrades, it is possible 1-2 channel runs would hit that limit.

That's a useful scale to aim to cover in profiling then, 100M to 500M
records. Jason, do you have any more details about the slowdown
you found with SQLite? For this use case we want to write the index
once, and read it many times. I found it is quicker to populate the
offset table before creating the index - perhaps you were seeing the
index being updated while adding records?

Peter

From pjotr.public41 at thebird.nl Thu Nov 17 17:11:44 2011
From: pjotr.public41 at thebird.nl (Pjotr Prins)
Date: Thu, 17 Nov 2011 18:11:44 +0100
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
References:

<12E3B71D-6E61-41AD-A956-A1FC2076AF24@illinois.edu> <5BE2A95E-103D-4FEE-B8C4-3754CCF67507@illinois.edu> <735E98B4-B8EB-4735-9C26-7A6E077F606C@illinois.edu>
Message-ID: <20111117171144.GA17301@thebird.nl>

On Thu, Nov 17, 2011 at 02:39:49PM +0000, Peter Cock wrote:
> > +1. This will only get worse, with the projections for upcoming HiSeq
> > upgrades, it is possible 1-2 channel runs would hit that limit.
>
> That's a useful scale to aim to cover in profiling then, 100M to 500M
> records. Jason, do you have any more details about the slowdown
> you found with SQLite? For this use case we want to write the index
> once, and read it many times. I found it is quicker to populate the
> offset table before creating the index - perhaps you were seeing the
> index being updated while adding records?

I have also found that performance deteriorates quickly when hammering
SQLite. Rather too quickly in fact. Don't forget that SQL is
inherently slower than 'simple' indexers. Also SQLite is a convenience
library, rather than a library designed for optimized performance. We
used to run sleepycat/bdb for that reason, now it is Tokyo/Kyoto
cabinet.

In the (rather) near future we will be looking at parallel feeds from
multiple machines, to keep it somewhat interesting. Hadoop has
indexing support. In fact, Hadoop should be ideal for indexed sequence
information, though I have not used it. Still, when the time comes, I
am kinda interested in parallelized NoSQL solutions for scaling up.
Hadoop kills me because of its complexity. I hate complexity (one
reason I have tried to avoid SQL servers).

BTW 500M records takes significant RAM for an in-memory index. Quite a
number of solutions, to retain their performance, have to have the
indexes in memory. 500M records now will grow to 500G records soon.
Just a thing to keep in mind. I would opt for a non-RAM solution.

Pj.
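For anyone wanting to benchmark the point raised above, here is a minimal sketch of the "populate the offset table first, create the index afterwards" approach, using only Python's standard sqlite3 module. The table and column names (file_data, offset_data, record_id, record_start, record_length) and the helper functions are hypothetical, chosen for illustration; this is not the Biopython schema and not an agreed OBDA v2 layout. It handles plain FASTA only and assumes record IDs are unique.

```python
import sqlite3


def scan_fasta(path, file_number):
    """Yield (record_id, file_number, start, length) for each FASTA record.

    Offsets are byte offsets, so the file is read in binary mode.
    """
    with open(path, "rb") as handle:
        record_id = None
        start = 0
        pos = 0
        for line in handle:
            if line.startswith(b">"):
                if record_id is not None:
                    yield record_id, file_number, start, pos - start
                # Hypothetical ID rule: first word after the ">".
                record_id = line[1:].split(None, 1)[0].decode("ascii")
                start = pos
            pos += len(line)
        if record_id is not None:
            yield record_id, file_number, start, pos - start


def build_index(db_path, files):
    """Build an index over files, a list of (filename, format_name) pairs."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE file_data "
        "(file_number INTEGER PRIMARY KEY, name TEXT, format TEXT)"
    )
    con.execute(
        "CREATE TABLE offset_data (record_id TEXT, file_number INTEGER, "
        "record_start INTEGER, record_length INTEGER)"
    )
    for number, (name, fmt) in enumerate(files):
        con.execute("INSERT INTO file_data VALUES (?, ?, ?)", (number, name, fmt))
        con.executemany(
            "INSERT INTO offset_data VALUES (?, ?, ?, ?)",
            scan_fasta(name, number),
        )
    # The index is created only after the bulk load, so it is built once
    # rather than updated on every insert. Duplicate IDs fail at this step.
    con.execute("CREATE UNIQUE INDEX record_id_index ON offset_data (record_id)")
    con.commit()
    return con


def fetch_raw(con, record_id):
    """Return the raw bytes of one record, read from the original file."""
    row = con.execute(
        "SELECT f.name, o.record_start, o.record_length "
        "FROM offset_data AS o JOIN file_data AS f "
        "ON f.file_number = o.file_number WHERE o.record_id = ?",
        (record_id,),
    ).fetchone()
    if row is None:
        raise KeyError(record_id)
    name, start, length = row
    with open(name, "rb") as handle:
        handle.seek(start)
        return handle.read(length)
```

Moving the CREATE INDEX statement above the inserts gives the other ordering to time against this one. The original files are left untouched, and the raw record text can be passed to whatever parser a project already has.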
From bonnal at ingm.org Fri Nov 18 09:35:56 2011
From: bonnal at ingm.org (Raoul Bonnal)
Date: Fri, 18 Nov 2011 10:35:56 +0100
Subject: [Open-bio-l] OBDA redux?
In-Reply-To: <20111117171144.GA17301@thebird.nl>
Message-ID:

Dear all,
Would it be possible to have a test dataset and clear requirements and
functionalities? Not a huge doc, just a few points for benchmarking.

From p.j.a.cock at googlemail.com Fri Nov 18 10:20:54 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 18 Nov 2011 10:20:54 +0000
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
References: <20111117171144.GA17301@thebird.nl>
Message-ID:

On Fri, Nov 18, 2011 at 9:35 AM, Raoul Bonnal wrote:
> Dear all,
> Would it be possible to have a test dataset and clear requirements and
> functionalities? Not a huge doc, just a few points for benchmarking.

I was thinking of using the UniProt SProt and TrEMBL datasets
as test cases (FASTA, plain text "swiss", and UniProt-XML format).
These have 532,792 and 17,651,715 records respectively (in the version
I have on disk - they've just released an update), which is a good
size, but not at the scale where we might start to worry about
SQLite scaling.
ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/

So, we'd also want something else like some big FASTQ files with
100M -> 500M records (or more). Perhaps we'll have to combine a
couple of SRA data files together for that, which is fine.

Also a full GenBank download would be good, e.g. the EST dataset
files gbest1.seq.gz to gbest209.seq.gz would make a good test of
indexing multiple files together as a single database:
ftp://ftp.ncbi.nih.gov/genbank/

Peter

From bonnal at ingm.org Fri Nov 18 10:55:48 2011
From: bonnal at ingm.org (Raoul Bonnal)
Date: Fri, 18 Nov 2011 11:55:48 +0100
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
Message-ID:

On 18/11/11 11.20, "Peter Cock" wrote:
> On Fri, Nov 18, 2011 at 9:35 AM, Raoul Bonnal wrote:
>> Dear all,
>> Would it be possible to have a test dataset and clear requirements and
>> functionalities? Not a huge doc, just a few points for benchmarking.
>
> I was thinking of using the UniProt SProt and TrEMBL datasets
> as test cases (FASTA, plain text "swiss", and UniProt-XML format).
> These have 532,792 and 17,651,715 records respectively (in the version
> I have on disk - they've just released an update), which is a good
> size, but not at the scale where we might start to worry about
> SQLite scaling.
> ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/
>
> So, we'd also want something else like some big FASTQ files with
> 100M -> 500M records (or more). Perhaps we'll have to combine a
> couple of SRA data files together for that, which is fine.
>
> Also a full GenBank download would be good, e.g. the EST dataset
> files gbest1.seq.gz to gbest209.seq.gz would make a good test of
> indexing multiple files together as a single database:
> ftp://ftp.ncbi.nih.gov/genbank/
>
It's a starting point.

And what information do you want to extract once you have your index?

--
Ra

From p.j.a.cock at googlemail.com Fri Nov 18 11:21:04 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 18 Nov 2011 11:21:04 +0000
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
References:
Message-ID:

On Fri, Nov 18, 2011 at 10:55 AM, Raoul Bonnal wrote:
> On 18/11/11 11.20, "Peter Cock" wrote:
>> On Fri, Nov 18, 2011 at 9:35 AM, Raoul Bonnal wrote:
>>> Dear all,
>>> Would it be possible to have a test dataset and clear requirements and
>>> functionalities? Not a huge doc, just a few points for benchmarking.
>>
>> I was thinking of using the UniProt SProt and TrEMBL datasets
>> as test cases (FASTA, plain text "swiss", and UniProt-XML format).
>> These have 532,792 and 17,651,715 records respectively (in the version
>> I have on disk - they've just released an update), which is a good
>> size, but not at the scale where we might start to worry about
>> SQLite scaling.
>> ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/
>>
>> So, we'd also want something else like some big FASTQ files with
>> 100M -> 500M records (or more). Perhaps we'll have to combine a
>> couple of SRA data files together for that, which is fine.
>>
>> Also a full GenBank download would be good, e.g. the EST dataset
>> files gbest1.seq.gz to gbest209.seq.gz would make a good test of
>> indexing multiple files together as a single database:
>> ftp://ftp.ncbi.nih.gov/genbank/
>>
> It's a starting point.
>
> And what information do you want to extract once you
> have your index?
>

Biopython and BioPerl have their SeqIO parsers hooked up
to indexing code. This means you can access a record via its
ID, and it is parsed for you on demand - just as if you'd
iterated over the file in order parsing the records one by one.

Biopython (not sure about BioPerl) can also just fetch the raw
text of that record.

I presume BioRuby has something similar using the OBDA
flatfile / BDB indexes?

Peter

From cjfields at illinois.edu Fri Nov 18 13:45:14 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Fri, 18 Nov 2011 13:45:14 +0000
Subject: [Open-bio-l] OBDA redux?
In-Reply-To:
References:

Message-ID:

On Nov 18, 2011, at 5:21 AM, Peter Cock wrote:

> On Fri, Nov 18, 2011 at 10:55 AM, Raoul Bonnal wrote:
>> ...
>> And what information do you want to extract once you
>> have your index?
>>
>
> Biopython and BioPerl have their SeqIO parsers hooked up
> to indexing code. This means you can access a record via its
> ID, and it is parsed for you on demand - just as if you'd
> iterated over the file in order parsing the records one by one.
>
> Biopython (not sure about BioPerl) can also just fetch the raw
> text of that record.

Re: BioPerl, I'm not sure about the OBDA implementations, but I know the
older Bio::Index modules allow this. I would be surprised if the
OBDA-specific code didn't, but adding this should be easy.

chris

From p.j.a.cock at googlemail.com Wed Nov 30 10:41:37 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 30 Nov 2011 10:41:37 +0000
Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)
Message-ID:

On Wed, Nov 30, 2011 at 10:30 AM, Peter Rice wrote:
> On 11/29/2011 07:09 PM, Fields, Christopher J wrote:
>>
>> On Nov 29, 2011, at 12:35 PM, Peter Cock wrote:
>>>
>>> Doesn't BioPerl just use the Staden libraries for this internally?
>>
>> Yes, and it uses an old version as well (via bioperl-ext). Much of this
>> effort was to go into the biolib initiative for creating cross-lang bindings
>> using swig, but that seems to be silent at the moment. I'm surprised Python
>> doesn't have io_lib bindings.
>
> BioLib is just swig wrappers around the existing Bio* interfaces and
> code, so it will not help in this case if the projects are too divergent.
>
> Could we set up a Bio* collection of data formats with examples and
> note which projects can handle each one?
>
> We do not need any one project to cover everything - we can reasonably
> expect users to use some other project to interconvert formats if there are
> gaps.
>
> regards,
>
> Peter Rice
> EMBOSS Team

Good plan. I suggest we make a repository on github, perhaps
bio-data or something like that, under the recently created OBF
account, https://github.com/OBF

Peter R - do you have a GitHub account yet? If so we (me,
Chris Fields, etc) can give you access to the OBF org account.

For licensing, where we are free to choose the licence, I would
like to go with something as liberal as possible to allow the
files to be used by any OSS project (or closed source project),
(e.g. Public Domain, CC0, MIT/BSD) rather than something
more principled but restricted like CC-BY or CC-BY-ND.

However, as we know from recent Debian packaging
discussion about test cases taken from UniProt, licensing
and copyright of samples from a database is complicated.
Here we must at least keep careful records about where
data came from.

Peter

From pmr at ebi.ac.uk Wed Nov 30 11:04:49 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 30 Nov 2011 11:04:49 +0000
Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To:
References:
Message-ID: <4ED60DD1.8080005@ebi.ac.uk>

On 11/30/2011 10:41 AM, Peter Cock wrote:
> On Wed, Nov 30, 2011 at 10:30 AM, Peter Rice wrote:
>>
>> BioLib is just swig wrappers around the existing Bio* interfaces and
>> code, so it will not help in this case if the projects are too divergent.
>>
>> Could we set up a Bio* collection of data formats with examples and
>> note which projects can handle each one?
>>
>> We do not need any one project to cover everything - we can reasonably
>> expect users to use some other project to interconvert formats if there are
>> gaps.
>
> Good plan. I suggest we make a repository on github, perhaps
> bio-data or something like that, under the recently created OBF
> account, https://github.com/OBF
>
> Peter R - do you have a GitHub account yet? If so we (me,
> Chris Fields, etc) can give you access to the OBF org account.

No ... rather a pain that EMBOSS got used. I've registered under some
other name: EMBOSSTEAM and created an EMBOSS project under it.

Looks like git import requires subversion for any automation.
Presumably I need a fresh EMBOSS checkout from CVS and then commit
everything by hand ... best done after the release 6.5.0 code freeze.

> For licensing, where we are free to choose the licence, I would
> like to go with something as liberal as possible to allow the
> files to be used by any OSS project (or closed source project),
> (e.g. Public Domain, CC0, MIT/BSD) rather than something
> more principled but restricted like CC-BY or CC-BY-ND.

Public domain would be my choice - we don't want to cause conflicts if
any data is imported into other projects (e.g. as test cases)

> However, as we know from recent Debian packaging
> discussion about test cases taken from UniProt, licensing
> and copyright of samples from a database is complicated.
> Here we must at least keep careful records about where
> data came from.

For that reason we probably should fake all the files for the public
database formats.

regards,

Peter Rice
EMBOSS team

From p.j.a.cock at googlemail.com Wed Nov 30 11:14:44 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 30 Nov 2011 11:14:44 +0000
Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To: <4ED60DD1.8080005@ebi.ac.uk>
References: <4ED60DD1.8080005@ebi.ac.uk>
Message-ID:

On Wed, Nov 30, 2011 at 11:04 AM, Peter Rice wrote:
> On 11/30/2011 10:41 AM, Peter Cock wrote:
>>
>> On Wed, Nov 30, 2011 at 10:30 AM, Peter Rice wrote:
>>>
>>> BioLib is just swig wrappers around the existing Bio* interfaces and
>>> code, so it will not help in this case if the projects are too divergent.
>>>
>>> Could we set up a Bio* collection of data formats with examples and
>>> note which projects can handle each one?
>>>
>>> We do not need any one project to cover everything - we can reasonably
>>> expect users to use some other project to interconvert formats if there
>>> are gaps.
>>
>> Good plan. I suggest we make a repository on github, perhaps
>> bio-data or something like that, under the recently created OBF
>> account, https://github.com/OBF
>>
>> Peter R - do you have a GitHub account yet? If so we (me,
>> Chris Fields, etc) can give you access to the OBF org account.
>
> No ... rather a pain that EMBOSS got used. I've registered under some other
> name: EMBOSSTEAM and created an EMBOSS project under it.
>
> Looks like git import requires subversion for any automation.
> Presumably I need a fresh EMBOSS checkout from CVS and
> then commit everything by hand ... best done after the release
> 6.5.0 code freeze.

If you are talking about converting the EMBOSS CVS into git,
we can help with that having done it for Biopython. As part
of this it is possible to map CVS user names to github users.

I meant do you personally have a github account?

>> For licensing, where we are free to choose the licence, I would
>> like to go with something as liberal as possible to allow the
>> files to be used by any OSS project (or closed source project),
>> (e.g. Public Domain, CC0, MIT/BSD) rather than something
>> more principled but restricted like CC-BY or CC-BY-ND.
>
> Public domain would be my choice - we don't want to cause
> conflicts if any data is imported into other projects (e.g. as
> test cases)

Yes, public domain would be simplest where possible.

>> However, as we know from recent Debian packaging
>> discussion about test cases taken from UniProt, licensing
>> and copyright of samples from a database is complicated.
>> Here we must at least keep careful records about where
>> data came from.
>
> For that reason we probably should fake all the files for the
> public database formats.

Yes, a practical solution - although it has downsides of course.
Peter From pmr at ebi.ac.uk Wed Nov 30 11:38:30 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 30 Nov 2011 11:38:30 +0000 Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: <20111130113250.GA32452@thebird.nl> References: <20111130113250.GA32452@thebird.nl> Message-ID: <4ED615B6.1080002@ebi.ac.uk> On 11/30/2011 11:32 AM, Pjotr Prins wrote: > Git is not very good for storing large data files, which we would want > to fetch partially. My suggestion would be to have a plain old file > repo, e.g. on S3, which can be mirrored by others. We had issues with large files in the EMBOSS release, and make those available via rsync to add to the developers CVS checkout. They include the NCBI taxonomy source and index files and the ontology source and index files. The next EMBOSS release will include http and ftp URLs as valid inputs for any data type, so EMBOSS could use remote files for format tests. I' look into how other repositories could be added. I had to add some extra qualifiers to allow queries and offsets to be specified, and rewrote the query language parsing to merge very similar code segments. regards, Peter Rice EMBOSS Team From pjotr.public41 at thebird.nl Wed Nov 30 11:32:50 2011 From: pjotr.public41 at thebird.nl (Pjotr Prins) Date: Wed, 30 Nov 2011 12:32:50 +0100 Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: References: Message-ID: <20111130113250.GA32452@thebird.nl> On Wed, Nov 30, 2011 at 10:41:37AM +0000, Peter Cock wrote: > > BioLib is just swig wrappers around the existing Bio* interfaces and > > code, so it will not help in this case if the projects are too divergent. It is a bit more than that. Mostly biolib is a multi-platform build system. Code-wise, most libraries are not immediately suitable for wrapping (SWIG of FFI), including EMBOSS, so adapters are required. 
I wrote an example for EMBOSS/transeq, which outperforms all other Bio* implementations (published in upcoming Springer book). BioLib also does automated document generation (parsing SWIG XML) and testing. The current BioLib went into maintenance mode, after my visit to Chris Fields. I see BioLib v1 as a proof-of-concept mostly, at this point, though I use it, and I know of others. A new high performance library is in the works - but these things move slowly. > Good plan. I suggest we make a repository on github, perhaps > bio-data or something like that, under the recently created OBF > account, https://github.com/OBF Git is not very good for storing large data files, which we would want to fetch partially. My suggestion would be to have a plain old file repo, e.g. on S3, which can be mirrored by others. Pj. From p.j.a.cock at googlemail.com Wed Nov 30 11:42:22 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 30 Nov 2011 11:42:22 +0000 Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: <4ED615B6.1080002@ebi.ac.uk> References: <20111130113250.GA32452@thebird.nl> <4ED615B6.1080002@ebi.ac.uk> Message-ID: On Wed, Nov 30, 2011 at 11:38 AM, Peter Rice wrote: > On 11/30/2011 11:32 AM, Pjotr Prins wrote: > >> Git is not very good for storing large data files, which we would want >> to fetch partially. My suggestion would be to have a plain old file >> repo, e.g. on S3, which can be mirrored by others. > > We had issues with large files in the EMBOSS release, and make those > available via rsync to add to the developers CVS checkout. They include the > NCBI taxonomy source and index files and the ontology source and index > files. > > The next EMBOSS release will include http and ftp URLs as valid inputs for > any data type, so EMBOSS could use remote files for format tests. I' look > into how other repositories could be added. 
> > I had to add some extra qualifiers to allow queries and offsets to be > specified, and rewrote the query language parsing to merge very similar code > segments. > > regards, > > Peter Rice > EMBOSS Team How about an OBF hosted FTP site then if we want big data? I guess we'd mostly be adding files, and changes/deletions should be rare, so a full version tracking repository isn't essential if we are disciplined about updating README files or more formal meta data. Peter From pjotr.public41 at thebird.nl Wed Nov 30 11:45:04 2011 From: pjotr.public41 at thebird.nl (Pjotr Prins) Date: Wed, 30 Nov 2011 12:45:04 +0100 Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: References: <20111130113250.GA32452@thebird.nl> <4ED615B6.1080002@ebi.ac.uk> Message-ID: <20111130114504.GA1542@thebird.nl> On Wed, Nov 30, 2011 at 11:42:22AM +0000, Peter Cock wrote: > How about an OBF hosted FTP site then if we want big data? Yes :) > I guess we'd mostly be adding files, and changes/deletions > should be rare, so a full version tracking repository isn't > essential if we are disciplined about updating README files > or more formal meta data. We can still have the readme's and MD5s mirrored in a small repo. That would track changes/moving/renaming. Pj. From p.j.a.cock at googlemail.com Wed Nov 30 11:58:06 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 30 Nov 2011 11:58:06 +0000 Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: <20111130114504.GA1542@thebird.nl> References: <20111130113250.GA32452@thebird.nl> <4ED615B6.1080002@ebi.ac.uk> <20111130114504.GA1542@thebird.nl> Message-ID: On Wed, Nov 30, 2011 at 11:45 AM, Pjotr Prins wrote: > On Wed, Nov 30, 2011 at 11:42:22AM +0000, Peter Cock wrote: >> How about an OBF hosted FTP site then if we want big data? 
>
> Yes :)
>
>> I guess we'd mostly be adding files, and changes/deletions
>> should be rare, so a full version tracking repository isn't
>> essential if we are disciplined about updating README files
>> or more formal meta data.
>
> We can still have the readme's and MD5s mirrored in a small repo. That
> would track changes/moving/renaming.
>
> Pj.

True, or even a hybrid where small files also live in a git
repo, but for larger files we just store the URL and MD5?

Peter

From p.j.a.cock at googlemail.com Wed Nov 30 14:49:35 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 30 Nov 2011 14:49:35 +0000
Subject: [Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk>
References: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk>
Message-ID:

I just checked with Jon and he was happy to forward this back to
the list, and also added a couple of URLs that I'd asked about:

http://bioportal.bioontology.org/ontologies/44600
http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM

Peter

On Wed, Nov 30, 2011 at 11:14 AM, Jon Ison wrote:
> Hi Peter (and Peter)
>
> Just a quick note to say that all (well, nearly all) common bioinformatics data formats are
> catalogued in the EDAM ontology:
>
> http://sourceforge.net/projects/edamontology/files
> http://edamontology.sourceforge.net/
>
> OK - there's bound to be some we've missed :)
>
> Anyhow, I thought it might help to structure any effort to document data formats (an effort which
> I wholeheartedly approve of by the way). One thing I'd like to add to the EDAM "format"
> definitions is a link to the format specification, or failing that, an example.
>
> Cheers both
>
> Jon
>

From cjfields at illinois.edu Wed Nov 30 16:53:41 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Wed, 30 Nov 2011 16:53:41 +0000
Subject: [Open-bio-l] [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To:
References: <39471.84.92.187.247.1322651650.squirrel@webmail.ebi.ac.uk>
Message-ID:

That might be the best source to pull from. Does it archive old file
examples (such as older SwissProt/GenBank/EMBL)?

chris

On Nov 30, 2011, at 8:49 AM, Peter Cock wrote:

> I just checked with Jon and he was happy to forward this back to
> the list, and also added a couple of URLs that I'd asked about:
>
> http://bioportal.bioontology.org/ontologies/44600
> http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=EDAM
>
> Peter
>
> On Wed, Nov 30, 2011 at 11:14 AM, Jon Ison wrote:
>> Hi Peter (and Peter)
>>
>> Just a quick note to say that all (well, nearly all) common bioinformatics data formats are
>> catalogued in the EDAM ontology:
>>
>> http://sourceforge.net/projects/edamontology/files
>> http://edamontology.sourceforge.net/
>>
>> OK - there's bound to be some we've missed :)
>>
>> Anyhow, I thought it might help to structure any effort to document data formats (an effort which
>> I wholeheartedly approve of by the way). One thing I'd like to add to the EDAM "format"
>> definitions is a link to the format specification, or failing that, an example.
>>
>> Cheers both
>>
>> Jon
>>

From cjfields at illinois.edu Wed Nov 30 16:54:52 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Wed, 30 Nov 2011 16:54:52 +0000
Subject: [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To:
References: <20111130113250.GA32452@thebird.nl> <4ED615B6.1080002@ebi.ac.uk> <20111130114504.GA1542@thebird.nl>
Message-ID: <5C116AE6-F2B1-4635-9673-15F35AC9C71D@illinois.edu>

On Nov 30, 2011, at 5:58 AM, Peter Cock wrote:

> On Wed, Nov 30, 2011 at 11:45 AM, Pjotr Prins wrote:
>> On Wed, Nov 30, 2011 at 11:42:22AM +0000, Peter Cock wrote:
>>> How about an OBF hosted FTP site then if we want big data?
>>
>> Yes :)
>>
>>> I guess we'd mostly be adding files, and changes/deletions
>>> should be rare, so a full version tracking repository isn't
>>> essential if we are disciplined about updating README files
>>> or more formal meta data.
>>
>> We can still have the readme's and MD5s mirrored in a small repo. That
>> would track changes/moving/renaming.
>>
>> Pj.
>
> True, or even a hybrid where small files also live in a git
> repo, but for larger files we just store the URL and MD5?
>
> Peter

There was an initial push for this years ago IIRC, with the biodata
repository, but it never took off. Not sure if the dev.open-bio.org
CVS repo is even browsable anymore (I believe this was all synced to
portal for browsing), but the old biodata CVS repo is still in
/home/repositories/biodata (very little there, might as well start
from scratch).

chris
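
The hybrid scheme quoted above (small files kept in a git repo, large
files recorded only as a URL plus MD5) could be checked on the client
side roughly as follows. This is only a sketch, not something the thread
agreed on: the manifest layout (one "<md5> <url>" pair per line, with
"#" comments) and the function names md5_of_file, parse_manifest and
verify_download are assumptions made for illustration.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks
    so that large data files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text):
    """Parse a hypothetical manifest of '<md5> <url>' lines into
    a {url: md5} dictionary; blank lines and '#' comments are skipped."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        md5, url = line.split(None, 1)
        entries[url] = md5.lower()
    return entries


def verify_download(path, url, manifest):
    """Return True if the local copy at `path` has the MD5 that the
    manifest records for `url`."""
    return md5_of_file(path) == manifest[url]
```

Only the small manifest would need version tracking; a mirror or a test
suite could fetch a large file from whatever host carries it and call
verify_download on the local copy before using it.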