From biopython at maubp.freeserve.co.uk Tue Sep 1 06:23:13 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 11:23:13 +0100
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To:
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
 <8D2B0A79-0B34-4DD7-AC66-EEA0DBA30219@illinois.edu>
 <320fb6e00908310845n3322d2c6hc992f78b81c0e226@mail.gmail.com>
Message-ID: <320fb6e00909010323t3168f0f0h8a9f369cfb38c5d9@mail.gmail.com>

On Mon, Aug 31, 2009 at 7:22 PM, Chris Fields wrote:
>> I didn't know if Bio::Index was using OBDA "under the hood" or not.
>> Does this mean BioPerl has multiple indexing systems available?
>
> Yes. We have Bio::Index::*, Bio::DB::Flat (which I think is OBDA). There
> is also the older Bio::DB::Fasta, which is actually still in wide use. Note
> with Bio::Index::* we allow streaming of any report type (sequence,
> alignment, analysis like BLAST, etc).
>
> We have talked about switching many of the Bio::Index::* sequence-based
> ones to OBDA but I haven't seen anyone take that up.
>
>> As I noted on Bug 2337 earlier today, Biopython used to have some
>> sort of OBDA compliant indexing, but for unrelated reasons we have
>> deprecated and removed that code. We're now revisiting this topic
>> due in part to having to deal with ever larger data files - and I wanted
>> to see if OBDA was still "alive" as a standard, and furthermore how
>> well it had scaled for the other OBF projects.
>>
>> Peter
>
> I think it's still alive and being used, just not sure what the compliance
> level is amongst the different Bio* projects.

Thanks Chris - so at least BioPerl and BioRuby are still using it.
That's good to know.

So right now (according to Bug 2337) there may be a couple of
compliance issues (or ambiguities in the spec)? I don't think I have
the time to look at it right now, but an SQLite OBDA variant might
be worth thinking about in future...
Peter

From biopython at maubp.freeserve.co.uk Tue Sep 1 11:28:06 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 16:28:06 +0100
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <320fb6e00908261504l18c3ee5em4c90b0818fc2c844@mail.gmail.com>
References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
 <320fb6e00908261504l18c3ee5em4c90b0818fc2c844@mail.gmail.com>
Message-ID: <320fb6e00909010828l585c11cbya0f3a00c67de70f5@mail.gmail.com>

On Wed, Aug 26, 2009 at 11:04 PM, Peter wrote:
>
> I didn't want to clog up the mailing list with attachments, but just
> for the record, I've sent my first attempt at this to Peter (EMBOSS)
> and Chris (BioPerl) for comment (and checking).

I've emailed the latest test cases (off the mailing list) to Peter
(EMBOSS), Chris (BioPerl), Michael (BioJava) and Naohisa
(BioRuby). These files are also in Biopython's repository.

I've just run these against bioperl-live SVN, and most of them
work as I would expect. Note that the output of Solexa FASTQ
files where the scores must be converted from PHRED values
isn't working yet (Chris knows about this):
http://lists.open-bio.org/pipermail/bioperl-l/2009-August/031064.html

All the error_*.fastq files are correctly rejected by BioPerl, except
those with invalid characters in the quality string (e.g. a delete)
which are treated as a warning condition (rather than aborting
with an exception):

error_qual_del.fastq
error_qual_escape.fastq
error_qual_null.fastq
error_qual_space.fastq
error_qual_tab.fastq
error_qual_unit_sep.fastq
error_qual_vtab.fastq

Presumably this is in line with (Bio)Perl norms? i.e. Make a best guess
at what the file is trying to say, issue a warning, but continue?

In Biopython (in line with Python norms), we don't try to guess. Giving
an error and aborting is the only clear and unambiguous action.
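To illustrate the strict behaviour described above, here is a minimal
Python sketch of a quality string check. It is not Biopython's actual
parser code: the function name check_quality_string and the record_id
argument are made up for this example, and it assumes the valid quality
characters are the printable ASCII range 33 ('!') to 126 ('~'), which
would exclude the delete, escape, null, space, tab, unit separator and
vertical tab characters named in the test files.

```python
def check_quality_string(quality, record_id="<unknown>"):
    """Return the quality string unchanged, or raise ValueError.

    Hypothetical helper for illustration only. Assumes valid FASTQ
    quality characters are ASCII 33 ('!') to 126 ('~') inclusive.
    """
    for position, char in enumerate(quality):
        code = ord(char)
        if code < 33 or code > 126:
            # Do not guess what the file meant: report and abort.
            raise ValueError(
                "Invalid character (ASCII %d) at position %d in the "
                "quality string of record %s" % (code, position, record_id)
            )
    return quality
```

A project that prefers the warning approach could swap the raise for a
call to its own warning mechanism, but then has to decide what score to
assign to the offending position.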
Would it suffice to agree that all the OBF projects will read these
error_*.fastq files and either raise an exception (abort), or at least
issue a warning?

Peter

From cjfields at illinois.edu Tue Sep 1 12:03:21 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 1 Sep 2009 11:03:21 -0500
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <320fb6e00909010828l585c11cbya0f3a00c67de70f5@mail.gmail.com>
References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
 <320fb6e00908261504l18c3ee5em4c90b0818fc2c844@mail.gmail.com>
 <320fb6e00909010828l585c11cbya0f3a00c67de70f5@mail.gmail.com>
Message-ID: <08FA4BF3-30AD-487A-A369-174FD272E3C0@illinois.edu>

On Sep 1, 2009, at 10:28 AM, Peter wrote:

> On Wed, Aug 26, 2009 at 11:04 PM, Peter wrote:
>>
>> I didn't want to clog up the mailing list with attachments, but just
>> for the record, I've sent my first attempt at this to Peter (EMBOSS)
>> and Chris (BioPerl) for comment (and checking).
>
> I've emailed the latest test cases (off the mailing list) to Peter
> (EMBOSS), Chris (BioPerl), Michael (BioJava) and Naohisa
> (BioRuby). These files are also in Biopython's repository.
>
> I've just run these against bioperl-live SVN, and most of them
> work as I would expect. Note that the output of Solexa FASTQ
> files where the scores must be converted from PHRED values
> isn't working yet (Chris knows about this):
> http://lists.open-bio.org/pipermail/bioperl-l/2009-August/031064.html
>
> All the error_*.fastq files are correctly rejected by BioPerl, except
> those with invalid characters in the quality string (e.g. a delete)
> which are treated as a warning condition (rather than aborting
> with an exception):
>
> error_qual_del.fastq
> error_qual_escape.fastq
> error_qual_null.fastq
> error_qual_space.fastq
> error_qual_tab.fastq
> error_qual_unit_sep.fastq
> error_qual_vtab.fastq
>
> Presumably this is in line with (Bio)Perl norms? i.e.
> Make a best guess at what the file is trying to say, issue a warning,
> but continue?
>
> In Biopython (in line with Python norms), we don't try to guess. Giving
> an error and aborting is the only clear and unambiguous action.
>
> Would it suffice to agree that all the OBF projects will read these
> error_*.fastq files and either raise an exception (abort), or at least
> issue a warning?
>
> Peter

I would rather throw on those; I can easily change that behavior to do
whatever the consensus is.

chris

From biopython at maubp.freeserve.co.uk Tue Sep 1 12:15:18 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Sep 2009 17:15:18 +0100
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <08FA4BF3-30AD-487A-A369-174FD272E3C0@illinois.edu>
References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
 <320fb6e00908261504l18c3ee5em4c90b0818fc2c844@mail.gmail.com>
 <320fb6e00909010828l585c11cbya0f3a00c67de70f5@mail.gmail.com>
 <08FA4BF3-30AD-487A-A369-174FD272E3C0@illinois.edu>
Message-ID: <320fb6e00909010915r7d605774w243ea5fce84feb81@mail.gmail.com>

On Tue, Sep 1, 2009 at 5:03 PM, Chris Fields wrote:
>
> On Sep 1, 2009, at 10:28 AM, Peter wrote:
>
>> All the error_*.fastq files are correctly rejected by BioPerl, except
>> those with invalid characters in the quality string (e.g. a delete)
>> which are treated as a warning condition (rather than aborting
>> with an exception):
>>
>> error_qual_del.fastq
>> error_qual_escape.fastq
>> error_qual_null.fastq
>> error_qual_space.fastq
>> error_qual_tab.fastq
>> error_qual_unit_sep.fastq
>> error_qual_vtab.fastq
>>
>> Presumably this is in line with (Bio)Perl norms? i.e. Make a best guess
>> at what the file is trying to say, issue a warning, but continue?
>>
>> In Biopython (in line with Python norms), we don't try to guess. Giving
>> an error and aborting is the only clear and unambiguous action.
>>
>> Would it suffice to agree that all the OBF projects will read these
>> error_*.fastq files and either raise an exception (abort), or at least
>> issue a warning?
>
> I would rather throw on those; I can easily change that behavior to do
> whatever the consensus is.
>
> chris

If you (Chris) would prefer BioPerl to throw an exception and abort on
these error cases, I would support that 100%. :)

Peter

From cjfields at illinois.edu Tue Sep 1 12:19:10 2009
From: cjfields at illinois.edu (Chris Fields)
Date: Tue, 1 Sep 2009 11:19:10 -0500
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <320fb6e00909010915r7d605774w243ea5fce84feb81@mail.gmail.com>
References: <320fb6e00908250424g30ccc8e3m326cbee66332d92@mail.gmail.com>
 <320fb6e00908261504l18c3ee5em4c90b0818fc2c844@mail.gmail.com>
 <320fb6e00909010828l585c11cbya0f3a00c67de70f5@mail.gmail.com>
 <08FA4BF3-30AD-487A-A369-174FD272E3C0@illinois.edu>
 <320fb6e00909010915r7d605774w243ea5fce84feb81@mail.gmail.com>
Message-ID:

On Sep 1, 2009, at 11:15 AM, Peter wrote:

> On Tue, Sep 1, 2009 at 5:03 PM, Chris Fields wrote:
>>
>> On Sep 1, 2009, at 10:28 AM, Peter wrote:
>>
>>> ...
>>> Would it suffice to agree that all the OBF projects will read these
>>> error_*.fastq files and either raise an exception (abort), or at
>>> least issue a warning?
>>
>> I would rather throw on those; I can easily change that behavior to
>> do whatever the consensus is.
>>
>> chris
>
> If you (Chris) would prefer BioPerl to throw an exception and abort on
> these error cases, I would support that 100%. :)
>
> Peter

Well, if we're going through the trouble of detecting bad data, might
as well let the user know in a meaningful way ;>

chris

From heuermh at acm.org Tue Sep 1 14:37:41 2009
From: heuermh at acm.org (Michael Heuer)
Date: Tue, 1 Sep 2009 14:37:41 -0400 (EDT)
Subject: [Open-bio-l] More FASTQ examples for cross project testing
In-Reply-To: <320fb6e00909010915r7d605774w243ea5fce84feb81@mail.gmail.com>
Message-ID:

On Tue, 1 Sep 2009, Peter wrote:

> On Tue, Sep 1, 2009 at 5:03 PM, Chris Fields wrote:
> >
> > On Sep 1, 2009, at 10:28 AM, Peter wrote:
> >
> >> All the error_*.fastq files are correctly rejected by BioPerl, except
> >> those with invalid characters in the quality string (e.g. a delete)
> >> which are treated as a warning condition (rather than aborting
> >> with an exception):
> >>
> >> error_qual_del.fastq
> >> error_qual_escape.fastq
> >> error_qual_null.fastq
> >> error_qual_space.fastq
> >> error_qual_tab.fastq
> >> error_qual_unit_sep.fastq
> >> error_qual_vtab.fastq
> >>
> >> Presumably this is in line with (Bio)Perl norms? i.e. Make a best guess
> >> at what the file is trying to say, issue a warning, but continue?
> >>
> >> In Biopython (in line with Python norms), we don't try to guess. Giving
> >> an error and aborting is the only clear and unambiguous action.
> >>
> >> Would it suffice to agree that all the OBF projects will read these
> >> error_*.fastq files and either raise an exception (abort), or at least
> >> issue a warning?
> >
> > I would rather throw on those; I can easily change that behavior to do
> > whatever the consensus is.
> >
> > chris
>
> If you (Chris) would prefer BioPerl to throw an exception and abort on
> these error cases, I would support that 100%. :)

I haven't got quite that far yet (this evening perhaps), but BioJava
will behave the same.
michael

From ngoto at gen-info.osaka-u.ac.jp Wed Sep 2 02:45:08 2009
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 2 Sep 2009 15:45:08 +0900
Subject: [Open-bio-l] Status of OBDA and indexed flatfiles?
In-Reply-To: <320fb6e00908310807j2fbc9710x3621a16a54ad5e45@mail.gmail.com>
References: <320fb6e00908310507x7cfa51dav8afc560bffe2b3eb@mail.gmail.com>
 <20090831140146.EAF851CBC594@idnmail.gen-info.osaka-u.ac.jp>
 <320fb6e00908310807j2fbc9710x3621a16a54ad5e45@mail.gmail.com>
Message-ID: <20090902064510.13AF91CBC3C9@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Mon, 31 Aug 2009 16:07:28 +0100
Peter wrote:

> On Mon, Aug 31, 2009 at 3:01 PM, Naohisa GOTO wrote:
> > Hi Peter,
> >
> >> Presumably BioPerl still uses these index files? What about the
> >> other projects? I know EMBOSS has some indexing system for
> >> example but I have no idea how it works internally.
> >
> > BioRuby still uses them. To gain performance, names and offsets are
> > written to temporary files and sorted using an external sort program
> > (default /usr/bin/sort).
>
> That makes sense. Have you tried this on very large files? e.g.
> FASTA with 10 million short reads?

Using BioRuby's br_bioflat.rb on a Linux server (CPU: Pentium D 3.4GHz,
memory: 4GB, HDD: SATA 300GB), it takes about 43 minutes to create
a flat-file index of 10,000,000 randomly generated FASTA sequences
(each sequence length is 100-500 bp, total file size about 3 GB).
To retrieve 10,000 sequences from the index takes 133 seconds
on the same server.

Naohisa Goto
ng at bioruby.org / ngoto at gen-info.osaka-u.ac.jp
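
The "names and offsets" idea discussed in this thread can be sketched in
a few lines of Python. This is only an illustration under stated
assumptions, not BioRuby's br_bioflat.rb and not the OBDA on-disk
layout: the function names build_offset_index and fetch_record are
invented here, the index is kept in memory and sorted with Python's own
sort rather than written to temporary files and passed to /usr/bin/sort,
and the identifier is assumed to be the first word after ">".

```python
import bisect


def build_offset_index(fasta_path):
    """Return a sorted list of (identifier, byte offset) pairs.

    Hypothetical helper: scans the file once and records where each
    FASTA header line starts.
    """
    entries = []
    with open(fasta_path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                fields = line[1:].split()
                identifier = fields[0].decode("ascii") if fields else ""
                entries.append((identifier, offset))
    # Sorting by name is what makes binary search possible later.
    entries.sort()
    return entries


def fetch_record(fasta_path, index, identifier):
    """Look up one record by binary search, then seek straight to it."""
    pos = bisect.bisect_left(index, (identifier, -1))
    if pos == len(index) or index[pos][0] != identifier:
        raise KeyError(identifier)
    lines = []
    with open(fasta_path, "rb") as handle:
        handle.seek(index[pos][1])
        lines.append(handle.readline())  # the header line itself
        for line in handle:
            if line.startswith(b">"):
                break  # reached the next record
            lines.append(line)
    return b"".join(lines).decode("ascii")
```

For files with tens of millions of reads the in-memory list would itself
become large, which is presumably why an on-disk sorted index (or an
SQLite table, as floated earlier in the thread) is attractive.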