From bernd.web at gmail.com Wed Nov 2 11:12:11 2011
From: bernd.web at gmail.com (Bernd Web)
Date: Wed, 2 Nov 2011 16:12:11 +0100
Subject: [EMBOSS] fuzznuc pattern expansion
In-Reply-To: <4EAC3285.7080501@ebi.ac.uk>
References: <4EAC3285.7080501@ebi.ac.uk>
Message-ID:

Dear Peter,

Thanks! It would indeed be great to have the option to search on the
ambiguity codes directly. I'd probably prefer the escape option, but do
you mean to implement both escaping and expansion to subsets? That
might actually be good in case a user does not know the contents of the
DNA file (i.e. which ambiguity codes are present).

It might be good to report the pattern that was used in the matching.
Would the (very high) speed of fuzznuc be affected by always exploding
the codes to the subsets? For example, "N" would become
"ACTGUMRWSYKVHDB". This could mean that searches with highly degenerate
patterns would include a lot of ambiguity codes.

Kind regards,
Bernd

On Sat, Oct 29, 2011 at 7:06 PM, Peter Rice wrote:
> On 28/10/2011 18:03, Bernd Web wrote:
>>
>> Hi
>>
>> Using fuzznuc I get illegal pattern warnings. I realize what is going on:
>>
>> "You can use ambiguity codes for nucleic acid searches but not within
>> [] or {} as they expand to bracketed counterparts. For example, "s" is
>> expanded to "[GC]" therefore [S] would be expanded to [[GC]] which is
>> illegal."
>>
>> However, what I cannot find is how to suppress this expansion. Is this
>> possible? We actually need to have these ambiguity codes remain as they
>> are within [], as the input sequences can themselves contain R, Y, B, N,
>> for example. Thus, [GCS] is a pattern we actually want to be able to use.
>
> That looks a reasonable suggestion.
>
> We can replace S with [GCS] directly. For the wider ambiguity codes, we can
> replace them with the subsets:
>
> B [TGCBSYK]
> D [TGADWRK]
> H [TCAHWYM]
> V [GCAVSRM]
>
> We can also allow 'C\S' to explicitly match CS in the input sequence by
> escaping the S to skip the automatic expansion.
>
> These changes can be added to the next release.
>
> Thanks for the idea.
>
> Peter Rice
> EMBOSS Team

From pmr at ebi.ac.uk Wed Nov 2 13:37:58 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 02 Nov 2011 17:37:58 +0000
Subject: [EMBOSS] fuzznuc pattern expansion
In-Reply-To:
References: <4EAC3285.7080501@ebi.ac.uk>
Message-ID: <4EB17FF6.9080504@ebi.ac.uk>

Dear Bernd,

On 02/11/2011 15:12, Bernd Web wrote:
> Thanks! It would indeed be great to have the option to search on the
> ambiguity codes directly. I'd probably prefer the escape option, but do
> you mean to implement both escaping and expansion to subsets?

Yes, we will implement both. Escaping is needed to find any ambiguity
codes in a sequence. Expansion allows S to find G, C and S.

> It might be good to report the pattern that was used in the matching.
> Would the (very high) speed of fuzznuc be affected by always exploding
> the codes to the subsets? For example, "N" would become "ACTGUMRWSYKVHDB".

N is not a problem - it matches anything. The 2-letter ambiguity codes
only expand to one extra letter, and 3-letter codes (B, D, H, V) are
only very rarely used.

regards,

Peter Rice
EMBOSS Team

From bernd.web at gmail.com Mon Nov 7 07:24:36 2011
From: bernd.web at gmail.com (Bernd Web)
Date: Mon, 7 Nov 2011 13:24:36 +0100
Subject: [EMBOSS] fuzznuc pattern expansion
In-Reply-To: <4EB17FF6.9080504@ebi.ac.uk>
References: <4EAC3285.7080501@ebi.ac.uk> <4EB17FF6.9080504@ebi.ac.uk>
Message-ID:

Dear all,

Is it, or would it be, possible to see the (numeric) positions of the
mismatches in the fuzznuc output file? For example, the example output
file shows mismatches, but not where they are located:
http://emboss.sourceforge.net/apps/release/6.4/emboss/apps/fuzznuc.html#output.4

# pat2 1 cg(2)c(3)taaccctagc(3)ta
605 624 + pat2: cg(2)c(3)taaccctagc(3)ta 1 cggccctaaccctaacccta

Clearly, we can find the positions of the mismatches by matching the
supplied pattern against the reported match, but that would not be our
preferred approach.
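
[Editorial sketch, not part of the original thread.] The workaround
Bernd mentions, and the subset table Peter Rice gave earlier, can both
be illustrated in a few lines of Python. This is only a sketch under
stated assumptions: it is not how fuzznuc works internally, the function
names (subset_codes, allowed_bases, mismatch_positions) are hypothetical,
only fixed repeat counts such as (3) are handled (not ranges or gaps),
ambiguity codes inside [] and {} are simply expanded to their bases, and
the hit is assumed to be reported on the forward strand.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

# One pattern element: a letter, [set] or {excluded set}, optionally
# followed by a fixed repeat count such as (3). Ranges are not handled.
ELEMENT = re.compile(r"(\[[A-Za-z]+\]|\{[A-Za-z]+\}|[A-Za-z])(?:\((\d+)\))?")


def subset_codes(code):
    """Symbols whose base set lies within that of `code` (U left out)."""
    bases = set(IUPAC[code.upper()])
    return sorted(
        sym for sym, b in IUPAC.items() if sym != "U" and set(b) <= bases
    )


def allowed_bases(pattern):
    """Expand a pattern into one set of allowed bases per position."""
    positions, pos = [], 0
    while pos < len(pattern):
        m = ELEMENT.match(pattern, pos)
        if m is None:
            raise ValueError(f"unsupported pattern syntax at offset {pos}")
        element, count = m.group(1).upper(), int(m.group(2) or 1)
        if element[0] in "[{":
            bases = set().union(*(IUPAC[c] for c in element[1:-1]))
            if element[0] == "{":  # {..} means "anything except these"
                bases = set("ACGT") - bases
        else:
            bases = set(IUPAC[element])
        positions.extend([bases] * count)
        pos = m.end()
    return positions


def mismatch_positions(pattern, hit, start=1):
    """Sequence coordinates at which `hit` disagrees with `pattern`."""
    allowed = allowed_bases(pattern)
    hit = hit.upper().replace("U", "T")
    if len(allowed) != len(hit):
        raise ValueError("pattern and hit differ in length")
    return [
        start + i
        for i, (ok, base) in enumerate(zip(allowed, hit))
        if base not in ok
    ]


# The pat2 hit quoted above: one mismatch, at sequence position 619.
print(mismatch_positions("cg(2)c(3)taaccctagc(3)ta",
                         "cggccctaaccctaacccta", start=605))
```

With these definitions, subset_codes("B") gives the same seven letters
as the [TGCBSYK] entry in the table, and subset_codes("S") gives C, G
and S.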

Kind regards,
Bernd

From aengus.stewart at cancer.org.uk Thu Nov 17 13:11:31 2011
From: aengus.stewart at cancer.org.uk (Aengus Stewart)
Date: Thu, 17 Nov 2011 18:11:31 +0000
Subject: [EMBOSS] Slight change in fuzznuc output
Message-ID: <4EC54E53.2000004@cancer.org.uk>

Errr... a bit surprised by this: the "Pattern_name" field has changed.
Previously it was a reference; now it is the reference and the actual
pattern. Intentional?

I know the results are different - it's just to illustrate the format
change.

pre 6.4.0.0 output

#=======================================
#
# Sequence: chr1_145549491_145550484     from: 1   to: 200
# HitCount: 4
#
# Pattern_name Mismatch Pattern
# pattern1            3 CC[AT](6)GG
#
# Complement: Yes
#
#=======================================

  Start     End  Strand Pattern_name Mismatch Sequence
    168     177       + pattern1            2 GCTATATAAG
    169     178       + pattern1            1 CTATATAAGG
    168     177       - pattern1            2 CTTATATAGC
    169     178       - pattern1            1 CCTTATATAG

#---------------------------------------
#---------------------------------------

post 6.4.0.0 output

#=======================================
#
# Sequence: chr1_145549491_145550484     from: 1   to: 200
# HitCount: 2
#
# Pattern_name Mismatch Pattern
# pattern1            3 CC[AT](6)GG
#
# Complement: No
#
#=======================================

  Start     End  Strand Pattern_name          Mismatch Sequence
    101     110       + pattern1: CC[AT](6)GG        2 GCTATATAAG
    102     111       + pattern1: CC[AT](6)GG        1 CTATATAAGG

#---------------------------------------
#---------------------------------------

Cheers
Aengus

--
Aengus Stewart                        Tel: +44 (0)20 7269 3679
Head of Bioinformatics and BioStatistics
CRUK London Research Institute
Lincoln's Inn Fields, Holborn, London, WC2A 3LY, UK

From p.j.a.cock at googlemail.com Tue Nov 29 12:13:46 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 29 Nov 2011 17:13:46 +0000
Subject: [EMBOSS] SCF files (Staden)
Message-ID:

Dear all,

I am hoping to convert some ABI "Sanger" capillary files
into SCF (i.e. *.ab1 to *.scf), and currently I've installed
the Staden io_lib just to get their convert_trace tool.

Does EMBOSS support SCF files as defined here?
http://staden.sourceforge.net/formats.html

I presume these are different from the "Staden" format mentioned here:
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

Thanks,

Peter

From pmr at ebi.ac.uk Tue Nov 29 12:36:04 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 29 Nov 2011 17:36:04 +0000
Subject: [EMBOSS] SCF files (Staden)
In-Reply-To:
References:
Message-ID: <4ED51804.3000804@ebi.ac.uk>

On 29/11/2011 17:13, Peter Cock wrote:
> Dear all,
>
> I am hoping to convert some ABI "Sanger" capillary files
> into SCF (i.e. *.ab1 to *.scf), and currently I've installed
> the Staden io_lib just to get their convert_trace tool.
>
> Does EMBOSS support SCF files as defined here?
> http://staden.sourceforge.net/formats.html
>
> I presume these are different from the "Staden" format mentioned here:
> http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

"Staden" is GCG's text-only Staden format with an id in angle brackets.
We also support experiment format.

I'm looking into formats now - I'll add SCF if I can do it before Xmas.

regards,

Peter Rice

From kellert at ohsu.edu Tue Nov 29 12:53:10 2011
From: kellert at ohsu.edu (Tom Keller)
Date: Tue, 29 Nov 2011 09:53:10 -0800
Subject: [EMBOSS] SCF files (Staden)
In-Reply-To: <4ED51804.3000804@ebi.ac.uk>
References: <4ED51804.3000804@ebi.ac.uk>
Message-ID: <0717AB0F-8DA7-4D0F-BC8E-AD747C2F22CB@ohsu.edu>

Greetings,
You can use BioPerl for this. EMBOSS and Perl work well together.
Here is a link: http://www.bioperl.org/wiki/HOWTO:SeqIO

cheers,
Tom

MMI DNA Services Core Facility
503-494-2442
kellert at ohsu.edu
Office: 6588 RJH (CROET/BasicScience)

OHSU Shared Resources

From p.j.a.cock at googlemail.com Tue Nov 29 13:35:07 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 29 Nov 2011 18:35:07 +0000
Subject: [EMBOSS] SCF files (Staden)
In-Reply-To: <0717AB0F-8DA7-4D0F-BC8E-AD747C2F22CB@ohsu.edu>
References: <4ED51804.3000804@ebi.ac.uk> <0717AB0F-8DA7-4D0F-BC8E-AD747C2F22CB@ohsu.edu>
Message-ID:

On Tue, Nov 29, 2011 at 5:53 PM, Tom Keller wrote:
> Greetings,
> You can use BioPerl for this. EMBOSS and Perl work well together.
> Here is a link: http://www.bioperl.org/wiki/HOWTO:SeqIO
>
> cheers,
> Tom

Doesn't BioPerl just use the Staden libraries for this internally?

Either way, Perl and I don't work so well together, so I'm happier
using Staden's convert_trace directly ;)

Peter

From cjfields at illinois.edu Tue Nov 29 14:09:42 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 29 Nov 2011 19:09:42 +0000
Subject: [EMBOSS] SCF files (Staden)
In-Reply-To:
References: <4ED51804.3000804@ebi.ac.uk> <0717AB0F-8DA7-4D0F-BC8E-AD747C2F22CB@ohsu.edu>
Message-ID:

On Nov 29, 2011, at 12:35 PM, Peter Cock wrote:
> On Tue, Nov 29, 2011 at 5:53 PM, Tom Keller wrote:
>> Greetings,
>> You can use BioPerl for this. EMBOSS and Perl work well together.
>> Here is a link: http://www.bioperl.org/wiki/HOWTO:SeqIO
>>
>> cheers,
>> Tom
>
> Doesn't BioPerl just use the Staden libraries for this internally?

Yes, and it uses an old version as well (via bioperl-ext). Much of this
effort was to go into the biolib initiative for creating cross-language
bindings using SWIG, but that seems to be silent at the moment. I'm
surprised Python doesn't have io_lib bindings.

> Either way, Perl and I don't work so well together, so I'm happier
> using Staden's convert_trace directly ;)
>
> Peter

Heh.

chris

From pmr at ebi.ac.uk Wed Nov 30 05:30:38 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 30 Nov 2011 10:30:38 +0000
Subject: [EMBOSS] SCF files (Staden)
In-Reply-To:
References: <4ED51804.3000804@ebi.ac.uk> <0717AB0F-8DA7-4D0F-BC8E-AD747C2F22CB@ohsu.edu>
Message-ID: <4ED605CE.4080903@ebi.ac.uk>

On 11/29/2011 07:09 PM, Fields, Christopher J wrote:
> On Nov 29, 2011, at 12:35 PM, Peter Cock wrote:
>>
>> Doesn't BioPerl just use the Staden libraries for this internally?
>
> Yes, and it uses an old version as well (via bioperl-ext). Much of this
> effort was to go into the biolib initiative for creating cross-language
> bindings using SWIG, but that seems to be silent at the moment.
> I'm surprised Python doesn't have io_lib bindings.

BioLib is just SWIG wrappers around the existing Bio* interfaces and
code, so it will not help in this case if the projects are too divergent.

Could we set up a Bio* collection of data formats with examples and
note which projects can handle each one?

We do not need any one project to cover everything - we can reasonably
expect users to use some other project to interconvert formats if there
are gaps.

regards,

Peter Rice
EMBOSS Team

From p.j.a.cock at googlemail.com Wed Nov 30 05:41:37 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 30 Nov 2011 10:41:37 +0000
Subject: [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)
Message-ID:

On Wed, Nov 30, 2011 at 10:30 AM, Peter Rice wrote:
> On 11/29/2011 07:09 PM, Fields, Christopher J wrote:
>>
>> On Nov 29, 2011, at 12:35 PM, Peter Cock wrote:
>>>
>>> Doesn't BioPerl just use the Staden libraries for this internally?
>>
>> Yes, and it uses an old version as well (via bioperl-ext). Much of this
>> effort was to go into the biolib initiative for creating cross-language
>> bindings using SWIG, but that seems to be silent at the moment. I'm
>> surprised Python doesn't have io_lib bindings.
>
> BioLib is just SWIG wrappers around the existing Bio* interfaces and
> code, so it will not help in this case if the projects are too divergent.
>
> Could we set up a Bio* collection of data formats with examples and
> note which projects can handle each one?
>
> We do not need any one project to cover everything - we can reasonably
> expect users to use some other project to interconvert formats if there
> are gaps.
>
> regards,
>
> Peter Rice
> EMBOSS Team

Good plan. I suggest we make a repository on GitHub, perhaps
bio-data or something like that, under the recently created OBF
account, https://github.com/OBF

Peter R - do you have a GitHub account yet? If so we (me,
Chris Fields, etc.) can give you access to the OBF org account.

For licensing, where we are free to choose the licence, I would
like to go with something as liberal as possible to allow the
files to be used by any OSS project (or closed source project),
e.g. Public Domain, CC0, MIT/BSD, rather than something
more principled but restricted like CC-BY or CC-BY-ND.

However, as we know from the recent Debian packaging
discussion about test cases taken from UniProt, licensing
and copyright of samples from a database is complicated.
Here we must at least keep careful records about where
data came from.

Peter

From pmr at ebi.ac.uk Wed Nov 30 06:04:49 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 30 Nov 2011 11:04:49 +0000
Subject: [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To:
References:
Message-ID: <4ED60DD1.8080005@ebi.ac.uk>

On 11/30/2011 10:41 AM, Peter Cock wrote:
> On Wed, Nov 30, 2011 at 10:30 AM, Peter Rice wrote:
>>
>> BioLib is just SWIG wrappers around the existing Bio* interfaces and
>> code, so it will not help in this case if the projects are too divergent.
>>
>> Could we set up a Bio* collection of data formats with examples and
>> note which projects can handle each one?
>>
>> We do not need any one project to cover everything - we can reasonably
>> expect users to use some other project to interconvert formats if there
>> are gaps.
>
> Good plan. I suggest we make a repository on GitHub, perhaps
> bio-data or something like that, under the recently created OBF
> account, https://github.com/OBF
>
> Peter R - do you have a GitHub account yet? If so we (me,
> Chris Fields, etc.) can give you access to the OBF org account.

No ... rather a pain that the name EMBOSS was already used. I've
registered under another name, EMBOSSTEAM, and created an EMBOSS
project under it.

Looks like git import requires subversion for any automation.
Presumably I need a fresh EMBOSS checkout from CVS and then commit
everything by hand ... best done after the release 6.5.0 code freeze.

> For licensing, where we are free to choose the licence, I would
> like to go with something as liberal as possible to allow the
> files to be used by any OSS project (or closed source project),
> e.g. Public Domain, CC0, MIT/BSD, rather than something
> more principled but restricted like CC-BY or CC-BY-ND.

Public domain would be my choice - we don't want to cause
conflicts if any data is imported into other projects (e.g. as
test cases).

> However, as we know from the recent Debian packaging
> discussion about test cases taken from UniProt, licensing
> and copyright of samples from a database is complicated.
> Here we must at least keep careful records about where
> data came from.

For that reason we probably should fake all the files for the
public database formats.

regards,

Peter Rice
EMBOSS team

From p.j.a.cock at googlemail.com Wed Nov 30 06:14:44 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 30 Nov 2011 11:14:44 +0000
Subject: [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To: <4ED60DD1.8080005@ebi.ac.uk>
References: <4ED60DD1.8080005@ebi.ac.uk>
Message-ID:

On Wed, Nov 30, 2011 at 11:04 AM, Peter Rice wrote:
> On 11/30/2011 10:41 AM, Peter Cock wrote:
>>
>> On Wed, Nov 30, 2011 at 10:30 AM, Peter Rice wrote:
>>>
>>> BioLib is just SWIG wrappers around the existing Bio* interfaces and
>>> code, so it will not help in this case if the projects are too divergent.
>>>
>>> Could we set up a Bio* collection of data formats with examples and
>>> note which projects can handle each one?
>>>
>>> We do not need any one project to cover everything - we can reasonably
>>> expect users to use some other project to interconvert formats if there
>>> are gaps.
>>
>> Good plan. I suggest we make a repository on GitHub, perhaps
>> bio-data or something like that, under the recently created OBF
>> account, https://github.com/OBF
>>
>> Peter R - do you have a GitHub account yet?
>> If so we (me, Chris Fields, etc.) can give you access to the OBF
>> org account.
>
> No ... rather a pain that the name EMBOSS was already used. I've
> registered under another name, EMBOSSTEAM, and created an EMBOSS
> project under it.
>
> Looks like git import requires subversion for any automation.
> Presumably I need a fresh EMBOSS checkout from CVS and
> then commit everything by hand ... best done after the release
> 6.5.0 code freeze.

If you are talking about converting the EMBOSS CVS into git, we
can help with that, having done it for Biopython. As part of this
it is possible to map CVS user names to GitHub users.

I meant: do you personally have a GitHub account?

>> For licensing, where we are free to choose the licence, I would
>> like to go with something as liberal as possible to allow the
>> files to be used by any OSS project (or closed source project),
>> e.g. Public Domain, CC0, MIT/BSD, rather than something
>> more principled but restricted like CC-BY or CC-BY-ND.
>
> Public domain would be my choice - we don't want to cause
> conflicts if any data is imported into other projects (e.g. as
> test cases).

Yes, public domain would be simplest where possible.

>> However, as we know from the recent Debian packaging
>> discussion about test cases taken from UniProt, licensing
>> and copyright of samples from a database is complicated.
>> Here we must at least keep careful records about where
>> data came from.
>
> For that reason we probably should fake all the files for the
> public database formats.

Yes, a practical solution - although it has downsides of course.

Peter

From pmr at ebi.ac.uk Wed Nov 30 06:38:30 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 30 Nov 2011 11:38:30 +0000
Subject: [EMBOSS] [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To: <20111130113250.GA32452@thebird.nl>
References: <20111130113250.GA32452@thebird.nl>
Message-ID: <4ED615B6.1080002@ebi.ac.uk>

On 11/30/2011 11:32 AM, Pjotr Prins wrote:
> Git is not very good for storing large data files, which we would want
> to fetch partially. My suggestion would be to have a plain old file
> repo, e.g. on S3, which can be mirrored by others.

We had issues with large files in the EMBOSS release, and make those
available via rsync to add to the developers' CVS checkout. They include
the NCBI taxonomy source and index files and the ontology source and
index files.

The next EMBOSS release will include http and ftp URLs as valid inputs
for any data type, so EMBOSS could use remote files for format tests.
I'll look into how other repositories could be added.

I had to add some extra qualifiers to allow queries and offsets to be
specified, and rewrote the query language parsing to merge very similar
code segments.

regards,

Peter Rice
EMBOSS Team

From p.j.a.cock at googlemail.com Wed Nov 30 06:42:22 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 30 Nov 2011 11:42:22 +0000
Subject: [EMBOSS] [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To: <4ED615B6.1080002@ebi.ac.uk>
References: <20111130113250.GA32452@thebird.nl> <4ED615B6.1080002@ebi.ac.uk>
Message-ID:

On Wed, Nov 30, 2011 at 11:38 AM, Peter Rice wrote:
> On 11/30/2011 11:32 AM, Pjotr Prins wrote:
>
>> Git is not very good for storing large data files, which we would want
>> to fetch partially. My suggestion would be to have a plain old file
>> repo, e.g. on S3, which can be mirrored by others.
>
> We had issues with large files in the EMBOSS release, and make those
> available via rsync to add to the developers' CVS checkout.
> They include the NCBI taxonomy source and index files and the ontology
> source and index files.
>
> The next EMBOSS release will include http and ftp URLs as valid inputs
> for any data type, so EMBOSS could use remote files for format tests.
> I'll look into how other repositories could be added.
>
> I had to add some extra qualifiers to allow queries and offsets to be
> specified, and rewrote the query language parsing to merge very similar
> code segments.
>
> regards,
>
> Peter Rice
> EMBOSS Team

How about an OBF hosted FTP site then if we want big data?

I guess we'd mostly be adding files, and changes/deletions
should be rare, so a full version tracking repository isn't
essential if we are disciplined about updating README files
or more formal metadata.

Peter

From p.j.a.cock at googlemail.com Wed Nov 30 06:58:06 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 30 Nov 2011 11:58:06 +0000
Subject: [EMBOSS] [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To: <20111130114504.GA1542@thebird.nl>
References: <20111130113250.GA32452@thebird.nl> <4ED615B6.1080002@ebi.ac.uk> <20111130114504.GA1542@thebird.nl>
Message-ID:

On Wed, Nov 30, 2011 at 11:45 AM, Pjotr Prins wrote:
> On Wed, Nov 30, 2011 at 11:42:22AM +0000, Peter Cock wrote:
>> How about an OBF hosted FTP site then if we want big data?
>
> Yes :)
>
>> I guess we'd mostly be adding files, and changes/deletions
>> should be rare, so a full version tracking repository isn't
>> essential if we are disciplined about updating README files
>> or more formal metadata.
>
> We can still have the READMEs and MD5s mirrored in a small repo. That
> would track changes/moving/renaming.
>
> Pj.

True, or even a hybrid where small files also live in a git
repo, but for larger files we just store the URL and MD5?
Peter From cjfields at illinois.edu Wed Nov 30 11:54:52 2011 From: cjfields at illinois.edu (Fields, Christopher J) Date: Wed, 30 Nov 2011 16:54:52 +0000 Subject: [EMBOSS] [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden) In-Reply-To: References: <20111130113250.GA32452@thebird.nl> <4ED615B6.1080002@ebi.ac.uk> <20111130114504.GA1542@thebird.nl> Message-ID: <5C116AE6-F2B1-4635-9673-15F35AC9C71D@illinois.edu> On Nov 30, 2011, at 5:58 AM, Peter Cock wrote: > On Wed, Nov 30, 2011 at 11:45 AM, Pjotr Prins wrote: >> On Wed, Nov 30, 2011 at 11:42:22AM +0000, Peter Cock wrote: >>> How about an OBF hosted FTP site then if we want big data? >> >> Yes :) >> >>> I guess we'd mostly be adding files, and changes/deletions >>> should be rare, so a full version tracking repository isn't >>> essential if we are disciplined about updating README files >>> or more formal meta data. >> >> We can still have the readme's and MD5s mirrored in a small repo. That >> would track changes/moving/renaming. >> >> Pj. > > True, or even a hybrid where small files also live in a git > repo, but for larger files we just store the URL and MD5? > > Peter There was an initial push for this years ago IIRC, with the biodata repository, but it never took off. Not sure if the dev.open-bio.org CVS repo is even browsable anymore (I believe this was all synced to portal for browsing), but the old biodata CVS repo is still in /home/repositories/biodata (very little there, might as well start from scratch). chris From bernd.web at gmail.com Wed Nov 2 15:12:11 2011 From: bernd.web at gmail.com (Bernd Web) Date: Wed, 2 Nov 2011 16:12:11 +0100 Subject: [EMBOSS] fuzznuc pattern expansion In-Reply-To: <4EAC3285.7080501@ebi.ac.uk> References: <4EAC3285.7080501@ebi.ac.uk> Message-ID: Dear Peter, Thanks! It would indeed be great to have the option to seach on the ambiguity codes directly. Probably, I'd prefer the escape option, but you mean to implement both escaping and expansion to subsets? 
This actually might be good in case a user does not know the contents of the DNA file (ie which ambiguity codes are present). It might be good to report the pattern that was used in the matching. Would the (very high) speed of fuzznuc be affected by always exploding the to the subsets? For example, "N" would become "ACTGUMRWSYKVHDB". This could mean searches of patterns with high degeneracy would include a lot of ambiguity codes. Kind regards, Bernd On Sat, Oct 29, 2011 at 7:06 PM, Peter Rice wrote: > On 28/10/2011 18:03, Bernd Web wrote: >> >> Hi >> >> Using fuzznuc I get illegal pattern warnings. I realize what is going on: >> >> "You can use ambiguity codes for nucleic acid searches but not within >> [] or {} as they expand to bracketed counterparts. For example, "s" is >> expanded to "[GC]" therefore [S] would be expanded to [[GC]] which is >> illegal." >> >> However, what I cannot find it how to suppress this expansion. Is this >> possible? We actually need to have these ambiguity remain as they are >> within [] as the input sequences can contain R, Y, B, N themselves for >> example. Thus, [GCS] is a pattern we actually want to be able to use. > > That looks a reasonable suggestion. > > We can replace S with [GCS] directly. For the wider ambiguity codes, we can > replace them with the subsets: > > B [TGCBSYK] > D [TGADWRK] > H [TCAHWYM] > V [GCAVSRM] > > We can also allow 'C\S' to explicitly match CS in the input sequence by > escaping the S to skip the automatic expansion. > > These changes can be added to the next release. > > Thanks for the idea. > > Peter Rice > EMBOSS Team > > From pmr at ebi.ac.uk Wed Nov 2 17:37:58 2011 From: pmr at ebi.ac.uk (Peter Rice) Date: Wed, 02 Nov 2011 17:37:58 +0000 Subject: [EMBOSS] fuzznuc pattern expansion In-Reply-To: References: <4EAC3285.7080501@ebi.ac.uk> Message-ID: <4EB17FF6.9080504@ebi.ac.uk> Dear Bernd, On 02/11/2011 15:12, Bernd Web wrote: > Thanks! 
It would indeed be great to have the option to seach on the > ambiguity codes directly. Probably, I'd prefer the escape option, but > you mean to implement both escaping and expansion to subsets? Yes, we will implement both. Escaping is needed to find any ambiguity codes in a sequence. Expansion allows S to find G, C and S. > It might be good to report the pattern that was used in the matching. > Would the (very high) speed of fuzznuc be affected by always exploding > the to the subsets? For example, "N" would become "ACTGUMRWSYKVHDB". N is not a problem - it matches anything. The 2-letter ambiguity codes only expand to one extra letter, and 3-letter codes (B, D, H, V) are only very rarely used. regards, Peter Rice EMBOSS Team From bernd.web at gmail.com Mon Nov 7 12:24:36 2011 From: bernd.web at gmail.com (Bernd Web) Date: Mon, 7 Nov 2011 13:24:36 +0100 Subject: [EMBOSS] fuzznuc pattern expansion In-Reply-To: <4EB17FF6.9080504@ebi.ac.uk> References: <4EAC3285.7080501@ebi.ac.uk> <4EB17FF6.9080504@ebi.ac.uk> Message-ID: Dear all, Is or would it be possible to see the (numeric) position of the mismatches in the fuzznuc output file. E.g. the example output file shows mismatches, but not where there are located: http://emboss.sourceforge.net/apps/release/6.4/emboss/apps/fuzznuc.html#output.4 # pat2 1 cg(2)c(3)taaccctagc(3)ta 605 624 + pat2: cg(2)c(3)taaccctagc(3)ta 1 cggccctaaccctaacccta Clearly, we can find the position of mismatched by matching the supplied pattern with the reported match, but would not be preferred. Kind regards, Bernd On Wed, Nov 2, 2011 at 6:37 PM, Peter Rice wrote: > Dear Bernd, > > On 02/11/2011 15:12, Bernd Web wrote: > >> Thanks! It would indeed be great to have the option to seach on the >> ambiguity codes directly. Probably, I'd prefer the escape option, but >> you mean to implement both escaping and expansion to subsets? > > Yes, we will implement both. Escaping is needed to find any ambiguity codes > in a sequence. 
Expansion allows S to find G, C and S. > >> It might be good to report the pattern that was used in the matching. >> Would the (very high) speed of fuzznuc be affected by always exploding >> the to the subsets? For example, "N" would become "ACTGUMRWSYKVHDB". > > N is not a problem - it matches anything. The 2-letter ambiguity codes only > expand to one extra letter, and 3-letter codes (B, D, H, V) are only very > rarely used. > > regards, > > Peter Rice > EMBOSS Team > > From aengus.stewart at cancer.org.uk Thu Nov 17 18:11:31 2011 From: aengus.stewart at cancer.org.uk (Aengus Stewart) Date: Thu, 17 Nov 2011 18:11:31 +0000 Subject: [EMBOSS] Slight change in fuzznuc output Message-ID: <4EC54E53.2000004@cancer.org.uk> Errr.............bit surprised by this. the "Pattern_name" field has changed. Previously it was a reference - now it is the reference and the actual pattern............ intentional? I know the results are different - its just to illustrate the format change. pre 6.4.0.0 output #======================================= # # Sequence: chr1_145549491_145550484 from: 1 to: 200 # HitCount: 4 # # Pattern_name Mismatch Pattern # pattern1 3 CC[AT](6)GG # # Complement: Yes # #======================================= Start End Strand Pattern_name Mismatch Sequence 168 177 + pattern1 2 GCTATATAAG 169 178 + pattern1 1 CTATATAAGG 168 177 - pattern1 2 CTTATATAGC 169 178 - pattern1 1 CCTTATATAG #--------------------------------------- #--------------------------------------- post 6.4.0.0 output #======================================= # # Sequence: chr1_145549491_145550484 from: 1 to: 200 # HitCount: 2 # # Pattern_name Mismatch Pattern # pattern1 3 CC[AT](6)GG # # Complement: No # #======================================= Start End Strand Pattern_name Mismatch Sequence 101 110 + pattern1: CC[AT](6)GG 2 GCTATATAAG 102 111 + pattern1: CC[AT](6)GG 1 CTATATAAGG #--------------------------------------- #--------------------------------------- Cheers Aengus -- 
-----------------------------------------------------------------------
Aengus Stewart                              Tel: +44 (0)20 7269 3679
Head of Bioinformatics and BioStatistics
CRUK London Research Institute
Lincoln's Inn Fields, Holborn, London, WC2A 3LY, UK
-----------------------------------------------------------------------

From p.j.a.cock at googlemail.com Tue Nov 29 17:13:46 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 29 Nov 2011 17:13:46 +0000
Subject: [EMBOSS] SCF files (Staden)
Message-ID:

Dear all,

I am hoping to convert some ABI "Sanger" capillary files
into SCF (i.e. *.ab1 to *.scf), and currently I've installed
the Staden io_lib just to get their convert_trace tool.
Does EMBOSS support SCF files as defined here:
http://staden.sourceforge.net/formats.html

I presume these are different from the "Staden" format mentioned here:
http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

Thanks,

Peter

From pmr at ebi.ac.uk Tue Nov 29 17:36:04 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Tue, 29 Nov 2011 17:36:04 +0000
Subject: [EMBOSS] SCF files (Staden)
In-Reply-To:
References:
Message-ID: <4ED51804.3000804@ebi.ac.uk>

On 29/11/2011 17:13, Peter Cock wrote:
> Dear all,
>
> I am hoping to convert some ABI "Sanger" capillary files
> into SCF (i.e. *.ab1 to *.scf), and currently I've installed
> the Staden io_lib just to get their convert_trace tool.
>
> Does EMBOSS support SCF files as defined here:
> http://staden.sourceforge.net/formats.html
>
> I presume these are different from the "Staden" format mentioned here:
> http://emboss.sourceforge.net/docs/themes/SequenceFormats.html

"Staden" is GCG's text only staden format with an id in angle brackets.
We also support experiment format.

I'm looking into formats now - I'll add SCF if I can do it before Xmas.

regards,

Peter Rice

From kellert at ohsu.edu Tue Nov 29 17:53:10 2011
From: kellert at ohsu.edu (Tom Keller)
Date: Tue, 29 Nov 2011 09:53:10 -0800
Subject: [EMBOSS] SCF files (Staden)
In-Reply-To: <4ED51804.3000804@ebi.ac.uk>
References: <4ED51804.3000804@ebi.ac.uk>
Message-ID: <0717AB0F-8DA7-4D0F-BC8E-AD747C2F22CB@ohsu.edu>

Greetings,
You can use BioPerl for this. EMBOSS and perl work well together.
Here is a link: http://www.bioperl.org/wiki/HOWTO:SeqIO

cheers,
Tom

MMI DNA Services Core Facility
503-494-2442
kellert at ohsu.edu
Office: 6588 RJH (CROET/BasicScience)

OHSU Shared Resources

On Nov 29, 2011, at 9:36 AM, Peter Rice wrote:

> On 29/11/2011 17:13, Peter Cock wrote:
>> Dear all,
>>
>> I am hoping to convert some ABI "Sanger" capillary files
>> into SCF (i.e. *.ab1 to *.scf), and currently I've installed
>> the Staden io_lib just to get their convert_trace tool.
>>
>> Does EMBOSS support SCF files as defined here:
>> http://staden.sourceforge.net/formats.html
>>
>> I presume these are different from the "Staden" format mentioned here:
>> http://emboss.sourceforge.net/docs/themes/SequenceFormats.html
>
> "Staden" is GCG's text only staden format with an id in angle brackets.
> We also support experiment format.
>
> I'm looking into formats now - I'll add SCF if I can do it before Xmas.
>
> regards,
>
> Peter Rice
> _______________________________________________
> EMBOSS mailing list
> EMBOSS at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/emboss

From p.j.a.cock at googlemail.com Tue Nov 29 18:35:07 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 29 Nov 2011 18:35:07 +0000
Subject: [EMBOSS] SCF files (Staden)
In-Reply-To: <0717AB0F-8DA7-4D0F-BC8E-AD747C2F22CB@ohsu.edu>
References: <4ED51804.3000804@ebi.ac.uk> <0717AB0F-8DA7-4D0F-BC8E-AD747C2F22CB@ohsu.edu>
Message-ID:

On Tue, Nov 29, 2011 at 5:53 PM, Tom Keller wrote:
> Greetings,
> You can use BioPerl for this. EMBOSS and perl work well together.
> Here is a link: http://www.bioperl.org/wiki/HOWTO:SeqIO
>
> cheers,
> Tom

Doesn't BioPerl just use the Staden libraries for this internally?

Either way, Perl and I don't work so well together, so I'm happier
using Staden's convert_trace directly ;)

Peter

From cjfields at illinois.edu Tue Nov 29 19:09:42 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Tue, 29 Nov 2011 19:09:42 +0000
Subject: [EMBOSS] SCF files (Staden)
In-Reply-To:
References: <4ED51804.3000804@ebi.ac.uk> <0717AB0F-8DA7-4D0F-BC8E-AD747C2F22CB@ohsu.edu>
Message-ID:

On Nov 29, 2011, at 12:35 PM, Peter Cock wrote:

> On Tue, Nov 29, 2011 at 5:53 PM, Tom Keller wrote:
>> Greetings,
>> You can use BioPerl for this. EMBOSS and perl work well together.
>> Here is a link: http://www.bioperl.org/wiki/HOWTO:SeqIO
>>
>> cheers,
>> Tom
>
> Doesn't BioPerl just use the Staden libraries for this internally?
Yes, and it uses an old version as well (via bioperl-ext). Much of this effort was to go into the biolib initiative for creating cross-lang bindings using swig, but that seems to be silent at the moment. I'm surprised Python doesn't have io_lib bindings.

> Either way, Perl and I don't work so well together, so I'm happier
> using Staden's convert_trace directly ;)
>
> Peter

Heh.

chris

From pmr at ebi.ac.uk Wed Nov 30 10:30:38 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 30 Nov 2011 10:30:38 +0000
Subject: [EMBOSS] SCF files (Staden)
In-Reply-To:
References: <4ED51804.3000804@ebi.ac.uk> <0717AB0F-8DA7-4D0F-BC8E-AD747C2F22CB@ohsu.edu>
Message-ID: <4ED605CE.4080903@ebi.ac.uk>

On 11/29/2011 07:09 PM, Fields, Christopher J wrote:
> On Nov 29, 2011, at 12:35 PM, Peter Cock wrote:
>>
>> Doesn't BioPerl just use the Staden libraries for this internally?
>
> Yes, and it uses an old version as well (via bioperl-ext). Much of this effort was to go into the biolib initiative for creating cross-lang bindings using swig, but that seems to be silent at the moment. I'm surprised Python doesn't have io_lib bindings.

BioLib is just swig wrappers around the existing Bio* interfaces and
code, so it will not help in this case if the projects are too divergent.

Could we set up a Bio* collection of data formats with examples and
note which projects can handle each one?

We do not need any one project to cover everything - we can reasonably
expect users to use some other project to interconvert formats if there
are gaps.
regards,

Peter Rice
EMBOSS Team

From p.j.a.cock at googlemail.com Wed Nov 30 10:41:37 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 30 Nov 2011 10:41:37 +0000
Subject: [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)
Message-ID:

On Wed, Nov 30, 2011 at 10:30 AM, Peter Rice wrote:
> On 11/29/2011 07:09 PM, Fields, Christopher J wrote:
>>
>> On Nov 29, 2011, at 12:35 PM, Peter Cock wrote:
>>>
>>> Doesn't BioPerl just use the Staden libraries for this internally?
>>
>> Yes, and it uses an old version as well (via bioperl-ext). Much of this
>> effort was to go into the biolib initiative for creating cross-lang bindings
>> using swig, but that seems to be silent at the moment. I'm surprised Python
>> doesn't have io_lib bindings.
>
> BioLib is just swig wrappers around the existing Bio* interfaces and
> code, so it will not help in this case if the projects are too divergent.
>
> Could we set up a Bio* collection of data formats with examples and
> note which projects can handle each one?
>
> We do not need any one project to cover everything - we can reasonably
> expect users to use some other project to interconvert formats if there are
> gaps.
>
> regards,
>
> Peter Rice
> EMBOSS Team

Good plan. I suggest we make a repository on github, perhaps
bio-data or something like that, under the recently created OBF
account, https://github.com/OBF

Peter R - do you have a GitHub account yet? If so we (me,
Chris Fields, etc) can give you access to the OBF org account.

For licensing, where we are free to choose the licence, I would
like to go with something as liberal as possible to allow the
files to be used by any OSS project (or closed source project),
(e.g. Public Domain, CC0, MIT/BSD) rather than something
more principled but restricted like CC-BY or CC-BY-ND.

However, as we know from recent Debian packaging
discussion about test cases taken from UniProt, licensing
and copyright of samples from a database is complicated.
Here we must at least keep careful records about where
data came from.

Peter

From pmr at ebi.ac.uk Wed Nov 30 11:04:49 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 30 Nov 2011 11:04:49 +0000
Subject: [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To:
References:
Message-ID: <4ED60DD1.8080005@ebi.ac.uk>

On 11/30/2011 10:41 AM, Peter Cock wrote:
> On Wed, Nov 30, 2011 at 10:30 AM, Peter Rice wrote:
>>
>> BioLib is just swig wrappers around the existing Bio* interfaces and
>> code, so it will not help in this case if the projects are too divergent.
>>
>> Could we set up a Bio* collection of data formats with examples and
>> note which projects can handle each one?
>>
>> We do not need any one project to cover everything - we can reasonably
>> expect users to use some other project to interconvert formats if there are
>> gaps.
>
> Good plan. I suggest we make a repository on github, perhaps
> bio-data or something like that, under the recently created OBF
> account, https://github.com/OBF
>
> Peter R - do you have a GitHub account yet? If so we (me,
> Chris Fields, etc) can give you access to the OBF org account.

No ... rather a pain that EMBOSS got used. I've registered under some
other name: EMBOSSTEAM and created an EMBOSS project under it.

Looks like git import requires subversion for any automation.
Presumably I need a fresh EMBOSS checkout from CVS and
then commit everything by hand ... best done after the release
6.5.0 code freeze.

> For licensing, where we are free to choose the licence, I would
> like to go with something as liberal as possible to allow the
> files to be used by any OSS project (or closed source project),
> (e.g. Public Domain, CC0, MIT/BSD) rather than something
> more principled but restricted like CC-BY or CC-BY-ND.

Public domain would be my choice - we don't want to cause
conflicts if any data is imported into other projects (e.g.
as test cases)

> However, as we know from recent Debian packaging
> discussion about test cases taken from UniProt, licensing
> and copyright of samples from a database is complicated.
> Here we must at least keep careful records about where
> data came from.

For that reason we probably should fake all the files for the
public database formats.

regards,

Peter Rice
EMBOSS team

From p.j.a.cock at googlemail.com Wed Nov 30 11:14:44 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 30 Nov 2011 11:14:44 +0000
Subject: [EMBOSS] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To: <4ED60DD1.8080005@ebi.ac.uk>
References: <4ED60DD1.8080005@ebi.ac.uk>
Message-ID:

On Wed, Nov 30, 2011 at 11:04 AM, Peter Rice wrote:
> On 11/30/2011 10:41 AM, Peter Cock wrote:
>>
>> On Wed, Nov 30, 2011 at 10:30 AM, Peter Rice wrote:
>>>
>>> BioLib is just swig wrappers around the existing Bio* interfaces and
>>> code, so it will not help in this case if the projects are too divergent.
>>>
>>> Could we set up a Bio* collection of data formats with examples and
>>> note which projects can handle each one?
>>>
>>> We do not need any one project to cover everything - we can reasonably
>>> expect users to use some other project to interconvert formats if there are
>>> gaps.
>>
>> Good plan. I suggest we make a repository on github, perhaps
>> bio-data or something like that, under the recently created OBF
>> account, https://github.com/OBF
>>
>> Peter R - do you have a GitHub account yet? If so we (me,
>> Chris Fields, etc) can give you access to the OBF org account.
>
> No ... rather a pain that EMBOSS got used. I've registered under some other
> name: EMBOSSTEAM and created an EMBOSS project under it.
>
> Looks like git import requires subversion for any automation.
> Presumably I need a fresh EMBOSS checkout from CVS and
> then commit everything by hand ... best done after the release
> 6.5.0 code freeze.
If you are talking about converting the EMBOSS CVS into git, we
can help with that having done it for Biopython. As part of this it
is possible to map CVS user names to github users.

I meant do you personally have a github account?

>> For licensing, where we are free to choose the licence, I would
>> like to go with something as liberal as possible to allow the
>> files to be used by any OSS project (or closed source project),
>> (e.g. Public Domain, CC0, MIT/BSD) rather than something
>> more principled but restricted like CC-BY or CC-BY-ND.
>
> Public domain would be my choice - we don't want to cause
> conflicts if any data is imported into other projects (e.g. as
> test cases)

Yes, public domain would be simplest where possible.

>> However, as we know from recent Debian packaging
>> discussion about test cases taken from UniProt, licensing
>> and copyright of samples from a database is complicated.
>> Here we must at least keep careful records about where
>> data came from.
>
> For that reason we probably should fake all the files for the
> public database formats.

Yes, a practical solution - although it has downsides of course.

Peter

From pmr at ebi.ac.uk Wed Nov 30 11:38:30 2011
From: pmr at ebi.ac.uk (Peter Rice)
Date: Wed, 30 Nov 2011 11:38:30 +0000
Subject: [EMBOSS] [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To: <20111130113250.GA32452@thebird.nl>
References: <20111130113250.GA32452@thebird.nl>
Message-ID: <4ED615B6.1080002@ebi.ac.uk>

On 11/30/2011 11:32 AM, Pjotr Prins wrote:
> Git is not very good for storing large data files, which we would want
> to fetch partially. My suggestion would be to have a plain old file
> repo, e.g. on S3, which can be mirrored by others.

We had issues with large files in the EMBOSS release, and make those
available via rsync to add to the developers CVS checkout. They include
the NCBI taxonomy source and index files and the ontology source and
index files.
The next EMBOSS release will include http and ftp URLs as valid inputs
for any data type, so EMBOSS could use remote files for format tests.
I'll look into how other repositories could be added.

I had to add some extra qualifiers to allow queries and offsets to be
specified, and rewrote the query language parsing to merge very similar
code segments.

regards,

Peter Rice
EMBOSS Team

From p.j.a.cock at googlemail.com Wed Nov 30 11:42:22 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 30 Nov 2011 11:42:22 +0000
Subject: [EMBOSS] [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To: <4ED615B6.1080002@ebi.ac.uk>
References: <20111130113250.GA32452@thebird.nl> <4ED615B6.1080002@ebi.ac.uk>
Message-ID:

On Wed, Nov 30, 2011 at 11:38 AM, Peter Rice wrote:
> On 11/30/2011 11:32 AM, Pjotr Prins wrote:
>
>> Git is not very good for storing large data files, which we would want
>> to fetch partially. My suggestion would be to have a plain old file
>> repo, e.g. on S3, which can be mirrored by others.
>
> We had issues with large files in the EMBOSS release, and make those
> available via rsync to add to the developers CVS checkout. They include the
> NCBI taxonomy source and index files and the ontology source and index
> files.
>
> The next EMBOSS release will include http and ftp URLs as valid inputs for
> any data type, so EMBOSS could use remote files for format tests. I'll look
> into how other repositories could be added.
>
> I had to add some extra qualifiers to allow queries and offsets to be
> specified, and rewrote the query language parsing to merge very similar code
> segments.
>
> regards,
>
> Peter Rice
> EMBOSS Team

How about an OBF hosted FTP site then if we want big data?

I guess we'd mostly be adding files, and changes/deletions
should be rare, so a full version tracking repository isn't
essential if we are disciplined about updating README files
or more formal meta data.
Peter

From p.j.a.cock at googlemail.com Wed Nov 30 11:58:06 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 30 Nov 2011 11:58:06 +0000
Subject: [EMBOSS] [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To: <20111130114504.GA1542@thebird.nl>
References: <20111130113250.GA32452@thebird.nl> <4ED615B6.1080002@ebi.ac.uk> <20111130114504.GA1542@thebird.nl>
Message-ID:

On Wed, Nov 30, 2011 at 11:45 AM, Pjotr Prins wrote:
> On Wed, Nov 30, 2011 at 11:42:22AM +0000, Peter Cock wrote:
>> How about an OBF hosted FTP site then if we want big data?
>
> Yes :)
>
>> I guess we'd mostly be adding files, and changes/deletions
>> should be rare, so a full version tracking repository isn't
>> essential if we are disciplined about updating README files
>> or more formal meta data.
>
> We can still have the readme's and MD5s mirrored in a small repo. That
> would track changes/moving/renaming.
>
> Pj.

True, or even a hybrid where small files also live in a git
repo, but for larger files we just store the URL and MD5?

Peter

From cjfields at illinois.edu Wed Nov 30 16:54:52 2011
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Wed, 30 Nov 2011 16:54:52 +0000
Subject: [EMBOSS] [Open-bio-l] Common Sample Data Collection, was: SCF files (Staden)
In-Reply-To:
References: <20111130113250.GA32452@thebird.nl> <4ED615B6.1080002@ebi.ac.uk> <20111130114504.GA1542@thebird.nl>
Message-ID: <5C116AE6-F2B1-4635-9673-15F35AC9C71D@illinois.edu>

On Nov 30, 2011, at 5:58 AM, Peter Cock wrote:

> On Wed, Nov 30, 2011 at 11:45 AM, Pjotr Prins wrote:
>> On Wed, Nov 30, 2011 at 11:42:22AM +0000, Peter Cock wrote:
>>> How about an OBF hosted FTP site then if we want big data?
>>
>> Yes :)
>>
>>> I guess we'd mostly be adding files, and changes/deletions
>>> should be rare, so a full version tracking repository isn't
>>> essential if we are disciplined about updating README files
>>> or more formal meta data.
>>
>> We can still have the readme's and MD5s mirrored in a small repo. That
>> would track changes/moving/renaming.
>>
>> Pj.
>
> True, or even a hybrid where small files also live in a git
> repo, but for larger files we just store the URL and MD5?
>
> Peter

There was an initial push for this years ago IIRC, with the biodata repository, but it never took off. Not sure if the dev.open-bio.org CVS repo is even browsable anymore (I believe this was all synced to portal for browsing), but the old biodata CVS repo is still in /home/repositories/biodata (very little there, might as well start from scratch).

chris
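The hybrid idea discussed in this thread (a small repository holding only the URL and MD5 of each large file) could be checked with something like the sketch below. It is only an illustration under stated assumptions: the manifest layout of one `<md5> <filename> <url>` entry per line is hypothetical, as are the names `md5_of` and `check_manifest`, and nothing here reflects an agreed OBF format.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_manifest(manifest_text, data_dir):
    """Compare local copies against a manifest.

    Assumed (hypothetical) manifest layout, one entry per line:
        <md5> <filename> <url>
    Blank lines and lines starting with '#' are ignored.
    Returns a list of (filename, problem) tuples; empty means all good.
    """
    problems = []
    for line in manifest_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        expected, name, _url = line.split(None, 2)
        local = Path(data_dir) / name
        if not local.exists():
            problems.append((name, "missing"))
        elif md5_of(local) != expected:
            problems.append((name, "md5 mismatch"))
    return problems
```

The URL is carried along but not fetched here, so the check works offline against an existing mirror; downloading missing files would be a separate step.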