From ngoto at gen-info.osaka-u.ac.jp Mon Feb 1 06:28:40 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 1 Feb 2010 20:28:40 +0900
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To:
References: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp>
	<20100126150004.8B4AA1CBC3EC@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100201112840.CB33F1CBC4F9@idnmail.gen-info.osaka-u.ac.jp>

Hi Andrew Grimm,

This is not a Tempfile bug, but a problem of fork with threads.

Currently, to call a command safely without escaping the command-line
string, BioRuby internally uses fork on UNIX. When fork is executed,
the whole data of the process is copied to the child process, including
Ruby threads and the finalizers of Tempfile objects. In the child
process, when the running thread is switched, unexpected behavior may
occur. For example, another ClustalW may be executed with the same
output file name as in the parent process, or temporary files may be
removed by finalizers of the child process when ClustalW is regarded
as finished in the child process but not in the parent process.

The patch below can fix, or at least reduce, the problem.

-------------------------------------------------------------------
diff --git a/lib/bio/command.rb b/lib/bio/command.rb
index 4f3ac94..ebd9cc5 100644
--- a/lib/bio/command.rb
+++ b/lib/bio/command.rb
@@ -196,12 +196,15 @@ module Command
   def call_command_fork(cmd, options = {})
     dir = options[:chdir]
     cmd = safe_command_line_array(cmd)
+    tc, Thread.critical = Thread.critical, true
     IO.popen("-", "r+") do |io|
       if io then
         # parent
+        Thread.critical = tc
         yield io
       else
         # child
+        GC.disable
         # chdir to options[:chdir] if available
         begin
           Dir.chdir(dir) if dir
-------------------------------------------------------------------

Note that the patch does not work with Ruby 1.9 because
Thread.critical is removed in Ruby 1.9.
In Ruby 1.9.1, IO.popen is improved to take the command line as an
array without calling a shell, and the problem of string escaping is
completely resolved. Changes supporting Ruby 1.9.1 will soon be
available in my GitHub repository.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Wed, 27 Jan 2010 14:07:18 +1100
Andrew Grimm wrote:

> Hi Naohisa Goto,
>
> On Wed, Jan 27, 2010 at 2:00 AM, Naohisa GOTO
> wrote:
> > Hi Andrew,
> >
> > On Tue, 26 Jan 2010 23:12:35 +1100
> > Andrew Grimm wrote:
> >
> >> Hi Naohisa Goto,
> >>
> >> I tried creating a new factory in each thread, but I sometimes (but
> >> not always) have errors.
> >
> > Please show ruby version and BioRuby version.
> >  % ruby -v
> >  % ruby -rbio -e 'puts Bio::BIORUBY_VERSION_ID'
> > (If you are using BioRuby 1.2.1 or earlier,
> >  % ruby -rbio -e 'p Bio::BIORUBY_VERSION'
> > )
> >
>
> I'm running ruby 1.8.7 (2008-08-11 patchlevel 72) and bioruby 1.4.0.
>
> >> Is the code in http://github.com/agrimm/bioruby-alignment-threading-replication/blob/master/test/test_multithreaded_alignment.rb
> >> correct? Does it cause problems for anyone else?
> >
> > The "rescue RuntimeError" in line 15 may hide problems.
> > In my environment, it seems that the RuntimeError is raised
> > in lib/bio/alignment.rb. The error message I observed
> > without the rescue was
> > "alignment result is inconsistent with input data",
> > and output file created by Clustalw was unexpectedly empty.
> > It might be a bug of Tempfile in Ruby, but not sure.
> >
> > With Ruby 1.8.7, errors are observed in some times.
> > % ruby -v
> > ruby 1.8.7 (2010-01-10 patchlevel 249) [i686-linux]
> > ruby 1.8.7 (2009-04-08 patchlevel 160) [i686-linux]
> > ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux]
> >
> > With Ruby 1.9.1-p378, no errors when I executed several times.
> > % ruby -v
> > ruby 1.9.1p378 (2010-01-10 revision 26273) [i686-linux]
> >
> > I suspect errors may occur on earlier versions of ruby 1.9.1.
> >
> >> Some of the errors I get include the ones seen at http://gist.github.com/286775
> >
> > The message "ERROR: Multiple sequences found with same name
> > (found 0 at least twice)!" is reported by ClustalW, and
> > it indicates incorrect input file sequence names. Maybe
> > two file contents are unexpectedly concatenated or mixed
> > possibly due to a bug of Tempfile, but not sure.
> >
> >> It's possible that the issues are caused by problems in tempfile
> >> itself (which may have been fixed in August 2009 according to the
> >> changelog).
> >
> > Another possibility is resource limits of the machine:
> > the number of child processes, total memory size, etc.
> > If exceeding limits, new child clustalw process could
> > not be started, or running clustalw processes might be
> > killed. This also causes void or truncated result files,
> > and leads to ruby-level errors.
> >
>
> Thanks for that suggestion. I re-ran the test using only 5 threads in
> the new gist http://gist.github.com/287499
>
> > Thanks,
> >
> > Naohisa Goto
> > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
> >
> >>
> >> Thanks,
> >>
> >> Andrew
> >>
> >> On Thu, Jan 21, 2010 at 12:50 AM, Naohisa GOTO
> >> wrote:
> >> > Hi,
> >> >
> >> > On Wed, 20 Jan 2010 23:09:19 +1100
> >> > Andrew Grimm wrote:
> >> >
> >> >> Is alignment intended to be thread-safe in bioruby? If so, should I
> >> >> use the same alignment factory between threads, or a separate one in
> >> >> each thread?
> >> >
> >> > It is not confirmed to be thread-safe, so it is safe to use
> >> > separate one in each thread.
> >> >
> >> > Currently, in BioRuby, manipulating the same object from different
> >> > threads is not intended. When manipulating the same object from
> >> > different threads is needed, using mutex is recommended.
> >> >
> >> > For library developers, it is encouraged to write thread-safe
> >> > code if possible, but not mandatory.
> >> >
> >> > Naohisa Goto
> >> > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
> >> >
> >> >>
> >> >> Andrew

From mitlox at op.pl Fri Feb 5 06:19:43 2010
From: mitlox at op.pl (xyz)
Date: Fri, 05 Feb 2010 21:19:43 +1000
Subject: [BioRuby] Bioruby and jrubyc problem
Message-ID: <4B6BFECF.3090809@op.pl>

Hello,

I installed BioRuby with "jruby setup.rb". Then I ran this script with
jruby

--------------
#!/usr/bin/env jruby

require 'bio'

# creating a Bio::Sequence::NA object containing ambiguous alphabets
ambiguous_seq = Bio::Sequence::NA.new("atgcyrwskmbdhvn")

# show the contents and class of the DNA sequence object
p ambiguous_seq         # => "atgcyrwskmbdhvn"
p ambiguous_seq.class   # => Bio::Sequence::NA

# convert the sequence to a Regexp object
p ambiguous_seq.to_re   # => /atgc[tc][ag][at][gc][tg][ac][tgc][atg][atc][agc][atgc]/
p ambiguous_seq.to_re.class   # => Regexp

# example to match an ambiguous sequence to the rigid sequence
att_or_atc = Bio::Sequence::NA.new("aty").to_re
puts "match" if att_or_atc.match(Bio::Sequence::NA.new("att"))
if Bio::Sequence::NA.new("atc") =~ att_or_atc
  puts "also match"
end
--------------

without any problems. After this I ran it with Java and got the
following problem:

jrubyc s01.rb

and then

java -cp /home/mitlox/jruby-1.4.0/lib/jruby.jar:. s01
Exception in thread "main" s01.rb:3:in `require': no such file to load -- bio (LoadError)
	from s01.rb:3
	...internal jruby stack elided...
	from Kernel.require(s01.rb:3)
	from (unknown).(unknown)(:1)

What did I do wrong?

Best regards,

From mauricio at open-bio.org Fri Feb 5 10:48:30 2010
From: mauricio at open-bio.org (Mauricio Herrera Cuadra)
Date: Fri, 05 Feb 2010 09:48:30 -0600
Subject: [BioRuby] Fwd: Changes to NCBI BLAST and E-utilities.
Message-ID: <4B6C3DCE.2070808@open-bio.org>

Forwarding to the proper lists...

-------- Original Message --------
Subject: [O|B|F Helpdesk #889] Changes to NCBI BLAST and E-utilities.
Date: Fri, 5 Feb 2010 10:08:51 -0500
From: mcginnis via RT
Reply-To: support at helpdesk.open-bio.org
To: chris at bioteam.net, heikki at sanbi.ac.za, hlapp at gmx.net,
    jason at bioperl.org, mauricio at open-bio.org

Fri Feb 05 10:08:51 2010: Request 889 was acted upon.
Transaction: Ticket created by mcginnis at ncbi.nlm.nih.gov
       Queue: support at open-bio.org
     Subject: Changes to NCBI BLAST and E-utilities.
       Owner: Nobody
  Requestors: mcginnis at ncbi.nlm.nih.gov
      Status: new
 Ticket

Dear Colleague:

There are two changes I'd like to make you aware of.

As you may or may not have noticed, we have been working on a new C++
version of the BLAST binaries. In the coming months we will be moving
the C++ binaries into prominence and (slowly) phasing out the C toolkit
binaries. There are many changes, not least of which is a move to
individual binaries for each program (blastn, blastp, etc).

We are not sure how many of your users use BioPerl with the BLAST
binaries; my understanding is that many use BioPerl to do remote BLAST.
However, there is a change to the BLAST results in Text and presumably
HTML. This could have an effect on any parsers which scrape these
formats and do not use XML. For obvious reasons, we want to support
only the XML format for parsing, but we thought we should give you a
heads up on this.

blast 2.2.22
Query: 3307 ------------------------------------------------------------ 3307
Sbjct: 390  GSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGP 449

blast 2.2.22+
Query       ------------------------------------------------------------
Sbjct  390  GSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGP 449

A single line of gaps lacks the Query numbering in the blast+ output.
The C version of blast has numbering in this case. Sample alignments
are shown above. According to users, the blast+ output without the
numbering breaks BioPerl parsers. We have heard from a few, but I think
they may be older parsers?

The second issue is a policy concerning E-utilities.
This was announced on the utilities-announce at ncbi.nlm.nih.gov
mail-list but you may not have seen it.

As part of an ongoing effort to ensure efficient access to the Entrez
Utilities (E-utilities) by all users, NCBI has decided to change the
usage policy for the E-utilities effective June 1, 2010. Effective on
June 1, 2010, all E-utility requests, either using standard URLs or
SOAP, must contain non-null values for both the &tool and &email
parameters. Any E-utility request made after June 1, 2010 that does not
contain values for both parameters will return an error explaining that
these parameters must be included in E-utility requests.

The value of the &tool parameter should be a URI-safe string that is
the name of the software package, script or web page producing the
E-utility request. The value of the &email parameter should be a valid
e-mail address for the appropriate contact person or group responsible
for maintaining the tool producing the E-utility request.

NCBI uses these parameters to contact users whose use of the
E-utilities violates the standard usage policies described at
http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements.
These usage policies are designed to prevent excessive requests from a
small group of users from reducing or eliminating the wider community's
access to the E-utilities. NCBI will attempt to contact a user at the
e-mail address provided in the &email parameter prior to blocking
access to the E-utilities.

NCBI realizes that this policy change will require many of our users to
change their code. Based on past experience, we anticipate that most of
our users should be able to make the necessary changes before the
June 1, 2010 deadline. If you have any concerns about making these
changes by that date, or if you have any questions about these
policies, please contact eutilities at ncbi.nlm.nih.gov.
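
As a rough illustration of the policy described above, the sketch below
builds an E-utility request URL that always carries non-empty tool and
email values. The base URL, the choice of the esearch utility, and the
db/term parameters are assumptions made for the example and are not
taken from the NCBI message; the helper name eutils_url is hypothetical.
No request is sent.

```ruby
require "uri"

# Assumed base URL for the E-utilities; adjust to the current NCBI docs.
EUTILS_BASE = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/"

# Hypothetical helper: builds a request URL and refuses to do so when
# the tool or email value is missing, mirroring the announced policy.
def eutils_url(utility, params, tool:, email:)
  raise ArgumentError, "tool must not be empty" if tool.to_s.strip.empty?
  raise ArgumentError, "email must not be empty" if email.to_s.strip.empty?

  # encode_www_form escapes values, so the result is URI-safe.
  query = URI.encode_www_form(params.merge("tool" => tool, "email" => email))
  "#{EUTILS_BASE}#{utility}.fcgi?#{query}"
end

puts eutils_url("esearch",
                { "db" => "pubmed", "term" => "bioruby" },
                tool: "my_script", email: "me@example.org")
```

Centralising the two parameters in one helper means a script cannot
forget them on individual requests.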
Thank you for your understanding and cooperation in helping us continue
to deliver a reliable and efficient web service.

I think you already adhere to this policy, but should a user's script
not meet these requirements, then the script will fail and requests
will be turned away with an error message.

Scott D. McGinnis M.A.
NCBI/NLM/NIH
45 Center Drive, MSC 6511
Bldg 45, Room 4AN.44C
Bethesda, MD 20892
mcginnis at ncbi.nlm.nih.gov

From k at bioruby.org Mon Feb 8 20:03:55 2010
From: k at bioruby.org (Toshiaki Katayama)
Date: Tue, 9 Feb 2010 10:03:55 +0900
Subject: [BioRuby] Fwd: Pass the word: All open-bio sites/servers may be unavailable for a few hours this week
References: <4B701DD4.60801@sonsorol.org>
Message-ID: <27E7CE0A-487A-43CC-B95A-F48F801702C9@bioruby.org>

FYI

Begin forwarded message:

> From: Chris Dagdigian
> Date: February 8, 2010 23:21:08 JST
> To: OBF Board, Bioroot, Kam Dahlquist, Peter, Martin Senger, ajb at ebi.ac.uk, Toshiaki Katayama, Mark Wilkinson
> Subject: Pass the word: All open-bio sites/servers may be unavailable for a few hours this week
>
> Hi folks,
>
> Long overdue server/system/IT housekeeping note here. I decided not to
> do a mass OBF list email so feel free to pass this message along to
> the people and lists that need to know. I'm hoping the transition
> might only be noticed by a few people and we'll be back up before the
> majority see anything different.
>
>
> ###
> The short story is that we need to rip the existing open-bio servers
> out of their current datacenter and drive them over to a new
> colocation facility a few miles away. The servers should be down for
> no more than an hour or two but DNS changes may take much longer to
> propagate throughout the internet.
>
> We won't be able to give much notice for the downtime, it could be as
> early as tomorrow afternoon (Tuesday the 9th). The server transplant
> needs to be coordinated around some other work that I can't talk
> about.
> ####
>
>
> The longer story is below for those that are interested.
>
> My employer (www.bioteam.net) has long been donating the physical
> costs of hosting the Open Bio servers.
>
> We had been doing this in a Boston area datacenter where we rented a
> 6x8 foot private cage. The price of all this space and associated
> electricity, bandwidth etc. costs thousands of dollars per month (for
> the entire cage, not just OBF stuff...)
>
> For several reasons, mostly business related, BioTeam is switching to
> a colocation provider. We've already migrated the majority of the
> corporate systems, which means that the OBF servers are sitting in a
> mostly empty cage that still costs thousands of $USD per month to
> maintain.
>
> I had been hoping to coordinate this migration with the purchase of
> new server hardware for OBF but time has run out - the OBF servers
> need to move sooner rather than later.
>
> The only visible change for the OBF community will be new IP addresses
> on all our servers and sites. That (and a few hours of downtime)
> should be the only symptoms of the hosting transplant.
>
>
> **EMBOSS, MOBY AND BIORUBY**
> I believe we control DNS for all of our domains except for
> ftp.emboss.org and possibly some of the bioruby sites. There are also
> some moby service DNS records that we need to be careful with. I've
> CC'd Alan, Mark and Toshiaki on this email. I can let you know what
> the new IP addresses will be and can coordinate on switching DNS over.
> **
>
>
> There is a chance that things could not go totally smoothly - we may
> have website or other configuration files with old embedded IP
> addresses etc. that we'll have to find and fix as needed.
>
> Please email me directly or send email to helpdesk at open-bio.org to
> report any problems.
>
> -Chris

--
Toshiaki Katayama
tel://+81-03-5449-5614 fax://+81-03-5449-5434
http://kanehisa.hgc.jp/ (Kanehisa Laboratory)
http://www.hgc.jp/ (Human Genome Center)
http://bioruby.org/ (BioRuby Project)
http://open-bio.jp/ (Open Bio Japan)
http://kumamushi.org/ (Tardigrada Genome Project)
http://kumamushi.net/ (Kumamushi Info)
http://togodb.dbcls.jp/ (TogoDB)
http://togows.dbcls.jp/ (TogoWS)
http://das.hgc.jp/ (KEGG DAS)
http://www.genome.jp/kegg/soap/ (KEGG API)

From georgkam at gmail.com Fri Feb 12 03:35:32 2010
From: georgkam at gmail.com (George Githinji)
Date: Fri, 12 Feb 2010 11:35:32 +0300
Subject: [BioRuby] removing primers and corresponding quality data from sequences
Message-ID: <55915f821002120035r7a4f5810o209d28e5a366a7d7@mail.gmail.com>

Hi All,
I have a list of sequences and corresponding quality files for the
same data. I would like to remove the primers as well as the
corresponding quality information.
The approach that I am using is proving to be dirty and buggy.

For example, given:
1. A list of sequences in FASTA file format
2. A list of 4 possible primer patterns (no idea which sequence might
   contain which primer)
3. A list of quality data in phred format for each sequence

The task is to remove the possible primers from the sequences and
anything before or after the primer.
Each sequence has a combination of at least 2 primers, one on the 5'
and the other on the 3' end.

Return a list of sequences with primer ends removed and the
corresponding quality data for the primers removed.

What would be a nice way to approach this problem?
--
---------------
Sincerely
George
PhD Student
KEMRI/Wellcome-Trust Research Program
Skype: george_g2
Blog: http://biorelated.wordpress.com/

From georgkam at gmail.com Fri Feb 12 03:57:54 2010
From: georgkam at gmail.com (George Githinji)
Date: Fri, 12 Feb 2010 11:57:54 +0300
Subject: [BioRuby] removing primers and corresponding quality data from sequences
In-Reply-To: <55915f821002120056u46827792v838059ef110fa945@mail.gmail.com>
References: <55915f821002120035r7a4f5810o209d28e5a366a7d7@mail.gmail.com>
	<55915f821002120056u46827792v838059ef110fa945@mail.gmail.com>
Message-ID: <55915f821002120057n1d7ab14dnc41aee20dea3da4f@mail.gmail.com>

Hi

I would like to remove both the primer and the portion before the 5'
end and one after the 3' end

def primers
  ['G*CACG[A|C]AGTTT[C|T]GC','GC[G|A]AAACT[T|G]CGTGC','G*CCCATTC[G|C]TCGAACCA','TGGTTCGA[C|G]GAATGGGC']
  #primers.collect! { |primer| create_regexp(primer) }
end

def bioentries(reads_file)
  Bio::FlatFile.auto(reads_file){ |f| f.map {|entry| entry} }
end

def remove_primers(file_name)
  reg1 = Regexp.new(primers[0])
  bioentries(file_name).map do |entry|
    # puts ">#{entry.definition}"
    #puts entry.seq

    puts entry.seq.gsub(reg1,'')

  end
end

would remove the primers but not the portion before the 5' end

Secondly, it does not give me the corresponding co-ordinates so that i
can remove the associated quality data for the removed file

third the approach seems 'dirty'

On Fri, Feb 12, 2010 at 11:56 AM, George Githinji wrote:
> On Fri, Feb 12, 2010 at 11:46 AM, Andrew Grimm wrote:
>> I can't really help, but is it primers that you want removed, or the
>> portion of sequence that's before the 5' primer or after the 3'
>> primer?
>>
>> Andrew

--
---------------
Sincerely
George
PhD Student
KEMRI/Wellcome-Trust Research Program
Skype: george_g2
Blog: http://biorelated.wordpress.com/

From ngoto at gen-info.osaka-u.ac.jp Wed Feb 17 09:37:49 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 17 Feb 2010 23:37:49 +0900
Subject: [BioRuby] removing primers and corresponding quality data from sequences
In-Reply-To: <55915f821002120057n1d7ab14dnc41aee20dea3da4f@mail.gmail.com>
References: <55915f821002120035r7a4f5810o209d28e5a366a7d7@mail.gmail.com>
	<55915f821002120056u46827792v838059ef110fa945@mail.gmail.com>
	<55915f821002120057n1d7ab14dnc41aee20dea3da4f@mail.gmail.com>
Message-ID: <20100217143751.05A651CBC63A@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Fri, 12 Feb 2010 11:57:54 +0300
George Githinji wrote:

> Hi
>
> I would like to remove both the primer and the portion before the 5'
> end and one after the 3' end
>
> def primers
>   ['G*CACG[A|C]AGTTT[C|T]GC','GC[G|A]AAACT[T|G]CGTGC','G*CCCATTC[G|C]TCGAACCA','TGGTTCGA[C|G]GAATGGGC']
>   #primers.collect! { |primer| create_regexp(primer) }
> end

The above regular expressions might be different from what
you really want. For example, /G*C/ matches "C", "GC",
"GGC", "GGGC", "GGGGC", ..., and /[C|T]/ matches "C", "|",
or "T". Please check the syntax of regular expressions in Ruby.
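
The two points above can be checked directly in plain Ruby. The sketch
below is only an illustration: CLEAN_PRIMER is a hypothetical rewrite
of the first primer pattern, assuming that "[A|C]" was meant as "A or C"
and that "G*" was meant as "an optional single G"; the original poster's
real intent is not known.

```ruby
# Inside [...], "|" is an ordinary character, not alternation.
LOOSE_CLASS  = /\A[C|T]\z/   # matches "C", "|", or "T"
STRICT_CLASS = /\A[CT]\z/    # matches "C" or "T" only

# "G*" means zero or more G, so the leading G may be absent or repeated.
STAR_PATTERN = /\AG*C\z/

# Hypothetical cleaned-up form of the first primer pattern (assumption).
# The /i flag makes it match both upper-case and lower-case bases.
CLEAN_PRIMER = /G?CACG[AC]AGTTT[CT]GC/i

# Small helper returning true or false instead of a match position.
def matches?(regexp, text)
  !(regexp =~ text).nil?
end

p matches?(LOOSE_CLASS, "|")
p matches?(STRICT_CLASS, "|")
p matches?(STAR_PATTERN, "GGGGC")
p matches?(CLEAN_PRIMER, "gcacgaagtttcgc")
```

BioRuby's Bio::Sequence::NA#to_re, used in the JRuby message earlier in
this archive, builds such character classes from IUPAC ambiguity codes,
which may be less error-prone than writing them by hand.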
>
> def bioentries(reads_file)
>   Bio::FlatFile.auto(reads_file){ |f| f.map {|entry| entry} }
> end
>
> def remove_primers(file_name)
>   reg1 = Regexp.new(primers[0])
>   bioentries(file_name).map do |entry|
>     # puts ">#{entry.definition}"
>     #puts entry.seq
>
>     puts entry.seq.gsub(reg1,'')
>
>   end
> end
>
> would remove the primers but not the portion before the 5' end
>
> Secondly, it does not give me the corresponding co-ordinates so that i
> can remove the associated quality data for the removed file
>
> third the approach seems 'dirty'

One of the simplest approaches is to mask the primer sequences
with "X" (or any special character you want) without changing
the original sequence length. I suppose much software for
cutting vector sequences also does so.

      #puts entry.seq.gsub(reg1,'')

      seq = Bio::Sequence::NA.new(entry.seq)

      # regs contains regular expressions in an array,
      # for example: regs = [ /ACGTACGT/i, /ATATATAT/i ]
      # (Bio::Sequence::NA stores bases in lower case, so the
      # patterns should be case-insensitive or lower case.)
      # Note that primer sequences are expected to be
      # completely different from each other.
      #
      regs.each do |reg|
        seq.gsub!(reg) { |x| "X" * x.length }
      end

      # After that, all 5' bases before "X" are replaced
      # with "X".

      seq.sub!(/\A[^X]+X/) { |x| "X" * x.length }

      # All 3' bases after "X" are also replaced with "X".

      seq.sub!(/X[^X]+\z/) { |x| "X" * x.length }

      # Then, start and end positions of the unmasked region
      # can be obtained.

      start_pos = seq.index(/[^X]/)
      end_pos = seq.rindex(/[^X]/)

Be careful: the code does not do any error checks.
If only one of the 5' or 3' primers is detected in a sequence,
the whole sequence will be filled with "X". If neither the 5' nor
the 3' primer is found, the sequence will be kept unchanged.

In addition, the above code ignores partial primer sequences
at the 3' end (and sometimes at the 5' end). Sequencing errors
are also ignored.
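
A minimal, self-contained variant of the masking idea above, extended
to cut the quality scores with the same coordinates. This is a sketch
under assumptions, not BioRuby API: it works on a plain String and an
Array of per-base quality values, the helper name trim_between_primers
is hypothetical, and it assumes the read itself never contains "X". It
returns nil unless at least two separate primer hits are found, so a
read with only one primer is not masked away entirely.

```ruby
MASK = "X"

# Returns a Hash with the trimmed sequence, the matching quality scores
# and the 0-based start/end positions, or nil when fewer than two
# separate primer hits are present in the read.
def trim_between_primers(seq, quals, primer_regexps)
  unless seq.length == quals.length
    raise ArgumentError, "sequence and quality lengths differ"
  end

  # Mask every primer hit in a copy, keeping the length unchanged so
  # that positions still refer to the original read.
  masked = seq.dup
  primer_regexps.each do |re|
    masked.gsub!(re) { |hit| MASK * hit.length }
  end

  # Collect [start, end) offsets of each masked run.
  runs = []
  pos = 0
  while (m = /X+/.match(masked, pos))
    runs << [m.begin(0), m.end(0)]
    pos = m.end(0)
  end
  return nil if runs.length < 2

  start_pos = runs.first[1]      # first base after the 5' primer
  end_pos   = runs.last[0] - 1   # last base before the 3' primer
  { :seq   => seq[start_pos..end_pos],
    :quals => quals[start_pos..end_pos],
    :start => start_pos,
    :end   => end_pos }
end

regs  = [/ACGTACGT/i, /ATATATAT/i]
read  = "ttACGTACGTggccaaATATATATcc"
quals = (0...read.length).to_a   # stand-in quality values
p trim_between_primers(read, quals, regs)
```

Partial primers and sequencing errors are still not handled; a caller
could treat a nil result as "keep the read as it is" or "discard it",
depending on the data.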

Sincerely,

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From georgkam at gmail.com Fri Feb 19 00:41:25 2010
From: georgkam at gmail.com (George Githinji)
Date: Fri, 19 Feb 2010 08:41:25 +0300
Subject: [BioRuby] removing primers and corresponding quality data from sequences
In-Reply-To: <20100217143751.05A651CBC63A@idnmail.gen-info.osaka-u.ac.jp>
References: <55915f821002120035r7a4f5810o209d28e5a366a7d7@mail.gmail.com>
	<55915f821002120056u46827792v838059ef110fa945@mail.gmail.com>
	<55915f821002120057n1d7ab14dnc41aee20dea3da4f@mail.gmail.com>
	<20100217143751.05A651CBC63A@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <55915f821002182141y1e9735b0qed0c832bcdd643c0@mail.gmail.com>

Thank you so much Naohisa! Found the approach quite useful.
It would be good not to mask the whole sequence in only one primer is present though. Very grateful! On Wed, Feb 17, 2010 at 5:37 PM, Naohisa GOTO wrote: > Hi, > > On Fri, 12 Feb 2010 11:57:54 +0300 > George Githinji wrote: > >> Hi >> >> I would like to remove both the primer and the portion before the 5' >> end and one after the 3' end >> >> def primers >> ? ['G*CACG[A|C]AGTTT[C|T]GC','GC[G|A]AAACT[T|G]CGTGC','G*CCCATTC[G|C]TCGAACCA','TGGTTCGA[C|G]GAATGGGC'] >> ? #primers.collect! { |primer| create_regexp(primer) } >> ?end > > The above regular expressions might be different from what > you really want. For example, /G*C/ matches with "C", "GC", > "GGC", "GGGC", "GGGGC", ..., and /[C|T]/ matches with "C", "|", > or "T". Please check the syntax of regular expression in Ruby. > >> >> ?def bioentries(reads_file) >> ? ?Bio::FlatFile.auto(reads_file){ |f| f.map {|entry| entry} } >> ?end >> >> def remove_primers(file_name) >> ? reg1 = Regexp.new(primers[0]) >> ? ?bioentries(file_name).map do |entry| >> ? ? # puts ">#{entry.definition}" >> ? ? ?#puts entry.seq >> >> ? ? puts ?entry.seq.gsub(reg1,'') >> >> ?end >> end >> >> would remove the primers but not the portion before the 5' ?end >> >> Secondly, it does not give me the corresponding co-ordinates so that i >> can remove the associated quality data for the removed file >> >> third the approach seems ?'dirty' > > One of the simplest approach is to mask the primer sequences > with "X" (or any special character you want) without changing > the original sequence length. I suppose many software for > cutting vector sequences would also do so. > > ? ? ?#puts ?entry.seq.gsub(reg1,'') > > ? ? ?seq = Bio::Sequence::NA.new(entry.seq) > > ? ? ?# regs contains regular expressions in an array, > ? ? ?# for example: regs = [ /ACGTACGT/, /ATATATAT/ ] > ? ? ?# Note that primer sequences are expected to be > ? ? ?# completely different from each others. > ? ? ?# > ? ? ?regs.each do |reg| > ? ? ? 
?seq.gsub!(reg) { |x| "X" * x.length } > ? ? ?end > > ? ? ?# After that, all 5' bases before "X" are replaced > ? ? ?# with "X". > > ? ? ?seq.sub!(/\A[^X]+X/) { |x| "X" * x.length } > > ? ? ?# All 3' bases after "X" are also replaced with "X". > > ? ? ?seq.sub!(/X[^X]+\z/) { |x| "X" * x.length } > > ? ? ?# Then, start and end positions of the unmasked region > ? ? ?# can be obtained. > > ? ? ?start_pos = seq.index(/[^X]/) > ? ? ?end_pos = seq.rindex(/[^X]/) > > Be careful that the code ignores any error checks. > If one of the 5' or 3' primers are not detected in a sequence, > whole of the sequence will be filled with "X". If both 5' and 3' > primers are not found, the sequence will be kept unchanged. > > In addition, the above code ignores partial primer sequences > in the 3' end (and sometimes in the 5' end). Sequencing errors > are also ignored. > > Sincerely, > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > >> >> On Fri, Feb 12, 2010 at 11:56 AM, George Githinji wrote: >> > Hi would like to remove both the primer and the portion before the 5' >> > end and one after the 3' end >> > >> > def primers >> > ? ['G*CACG[A|C]AGTTT[C|T]GC','GC[G|A]AAACT[T|G]CGTGC','G*CCCATTC[G|C]TCGAACCA','TGGTTCGA[C|G]GAATGGGC'] >> > ? #primers.collect! { |primer| create_regexp(primer) } >> > ?end >> > >> > ?def bioentries(reads_file) >> > ? ?Bio::FlatFile.auto(reads_file){ |f| f.map {|entry| entry} } >> > ?end >> > >> > def remove_primers(file_name) >> > ? reg1 = Regexp.new(primers[0]) >> > ? ?bioentries(file_name).map do |entry| >> > ? ? # puts ">#{entry.definition}" >> > ? ? ?#puts entry.seq >> > >> > ? ? 
puts ?entry.seq.gsub(reg1,'') >> > >> > ?end >> > end >> > >> > would remove the primers but not the portion before the 5' ?end >> > >> > Secondly, it does not give me the corresponding co-ordinates so that i >> > can remove the associated quality data for the removed file >> > >> > third the approach seems ?'dirty' >> > >> > >> > >> > On Fri, Feb 12, 2010 at 11:46 AM, Andrew Grimm wrote: >> >> I can't really help, but is it primers that you want removed, or the >> >> portion of sequence that's before the 5' primer or after the 3' >> >> primer? >> >> >> >> Andrew >> >> >> >> On Fri, Feb 12, 2010 at 7:35 PM, George Githinji wrote: >> >>> Hi All, >> >>> I have a list of sequences and corresponding quality files for the >> >>> same data. I would like to remove the primers as well as the >> >>> corresponding quality information. >> >>> The approach that i am using is proving to be dirty and buggy, >> >>> >> >>> For example given: >> >>> 1.A list of sequences in fasta file format >> >>> 2.A list of 4 possible primer patterns. (no idea which sequence might >> >>> contain which primer) >> >>> 3.A list of quality data in phred format for each sequence, >> >>> >> >>> The task is to remove the possible primers from the sequences and >> >>> anything before or after the primer. >> >>> Each sequence has at least 2 combination of primes. one on the 5' and >> >>> the other on the 3' end. >> >>> >> >>> Return a list of sequences with primer ends removed and the >> >>> corresponding quality data for the primers removed. >> >>> >> >>> What would be a nice way to approach this problem. 
--
---------------
Sincerely
George
KEMRI/Wellcome-Trust Research Program
Skype: george_g2
Blog: http://biorelated.wordpress.com/

From ngoto at gen-info.osaka-u.ac.jp Mon Feb 1 11:28:40 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Mon, 1 Feb 2010 20:28:40 +0900
Subject: [BioRuby] Thread-safety of alignment
In-Reply-To:
References: <20100120135046.158711CBC511@idnmail.gen-info.osaka-u.ac.jp> <20100126150004.8B4AA1CBC3EC@idnmail.gen-info.osaka-u.ac.jp>
Message-ID: <20100201112840.CB33F1CBC4F9@idnmail.gen-info.osaka-u.ac.jp>

Hi Andrew Grimm,

This is not a Tempfile bug, but a problem of fork with threads.

Currently, to call a command safely without escaping the command-line string, BioRuby internally uses fork on UNIX. When fork is executed, the whole data of the process is copied to the child process, including Ruby threads and finalizers of Tempfile objects.
In the child process, when the running thread is switched, unexpected behavior may occur. For example, another ClustalW may be executed with the same output file name as in the parent process, or temporary files may be removed by finalizers in the child process when ClustalW is regarded as finished in the child process but not in the parent process.

The patch below can fix, or at least reduce, the problem.

-------------------------------------------------------------------
diff --git a/lib/bio/command.rb b/lib/bio/command.rb
index 4f3ac94..ebd9cc5 100644
--- a/lib/bio/command.rb
+++ b/lib/bio/command.rb
@@ -196,12 +196,15 @@ module Command
     def call_command_fork(cmd, options = {})
       dir = options[:chdir]
       cmd = safe_command_line_array(cmd)
+      tc, Thread.critical = Thread.critical, true
       IO.popen("-", "r+") do |io|
         if io then
           # parent
+          Thread.critical = tc
           yield io
         else
           # child
+          GC.disable
           # chdir to options[:chdir] if available
           begin
             Dir.chdir(dir) if dir
-------------------------------------------------------------------

Note that the patch does not work with Ruby 1.9 because Thread.critical was removed in Ruby 1.9. In Ruby 1.9.1, IO.popen was improved to accept the command line as an array without calling a shell, and the problem of string escaping is completely resolved. Changes supporting Ruby 1.9.1 will soon be available in my GitHub repository.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

On Wed, 27 Jan 2010 14:07:18 +1100
Andrew Grimm wrote:

> Hi Naohisa Goto,
>
> On Wed, Jan 27, 2010 at 2:00 AM, Naohisa GOTO
> wrote:
> > Hi Andrew,
> >
> > On Tue, 26 Jan 2010 23:12:35 +1100
> > Andrew Grimm wrote:
> >
> >> Hi Naohisa Goto,
> >>
> >> I tried creating a new factory in each thread, but I sometimes (but
> >> not always) have errors.
> >
> > Please show ruby version and BioRuby version.
> > % ruby -v > > % ruby -rbio -e 'puts Bio::BIORUBY_VERSION_ID' > > (If you are using BioRuby 1.2.1 or earlier, > > % ruby -rbio -e 'p Bio::BIORUBY_VERSION' > > ) > > > > I'm running ruby 1.8.7 (2008-08-11 patchlevel 72) and bioruby 1.4.0. > > >> Is the code in http://github.com/agrimm/bioruby-alignment-threading-replication/blob/master/test/test_multithreaded_alignment.rb > >> correct? Does it cause problems for anyone else? > > > > The "rescue RuntimeError" in line 15 may hide problems. > > In my environment, it seems that the RuntimeError is raised > > in lib/bio/alignment.rb. The error message I observed > > without the rescue was > > "alignment result is inconsistent with input data", > > and output file created by Clustalw was unexpectedly empty. > > It might be a bug of Tempfile in Ruby, but not sure. > > > > With Ruby 1.8.7, errors are observed in some times. > > % ruby -v > > ruby 1.8.7 (2010-01-10 patchlevel 249) [i686-linux] > > ruby 1.8.7 (2009-04-08 patchlevel 160) [i686-linux] > > ruby 1.8.7 (2008-08-11 patchlevel 72) [i486-linux] > > > > With Ruby 1.9.1-p378, no errors when I executed several times. > > % ruby -v > > ruby 1.9.1p378 (2010-01-10 revision 26273) [i686-linux] > > > > I suspect errors may occur on earlier versions of ruby 1.9.1. > > >> Some of the errors I get include the ones seen at http://gist.github.com/286775 > > > > The message "ERROR: Multiple sequences found with same name > > (found 0 at least twice)!" is reported by ClustalW, and > > it indicates incorrect input file sequence names. Maybe > > two file contents are unexpectedly concatenated or mixed > > possibly due to a bug of Tempfile, but not sure. > > > >> It's possible that the issues are caused by problems in tempfile > >> itself (which may have been fixed in August 2009 according to the > >> changelog). > > > > Another possibility is resource limits of the machine: > > the number of child processes, total memory size, etc. 
> > If exceeding limits, new child clustalw process could > > not be started, or running clustalw processes might be > > killed. This also causes void or truncated result files, > > and leads to ruby-level errors. > > > > Thanks for that suggestion. I re-ran the test using only 5 threads in > the new gist http://gist.github.com/287499 > > > Thanks, > > > > Naohisa Goto > > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > > >> > >> Thanks, > >> > >> Andrew > >> > >> On Thu, Jan 21, 2010 at 12:50 AM, Naohisa GOTO > >> wrote: > >> > Hi, > >> > > >> > On Wed, 20 Jan 2010 23:09:19 +1100 > >> > Andrew Grimm wrote: > >> > > >> >> Is alignment intended to be thread-safe in bioruby? If so, should I > >> >> use the same alignment factory between threads, or a separate one in > >> >> each thread? > >> > > >> > It is not confirmed to be thread-safe, so it is safe to use > >> > separate one in each thread. > >> > > >> > Currently, in BioRuby, manipulating the same object from different > >> > threads is not intended. When manipulating the same object from > >> > different threads is needed, using mutex is recommended. > >> > > >> > For library developers, it is encouraged to write thread-safe > >> > code if possible, but not mandatory. > >> > > >> > Naohisa Goto > >> > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > >> > > >> >> > >> >> Andrew From mitlox at op.pl Fri Feb 5 11:19:43 2010 From: mitlox at op.pl (xyz) Date: Fri, 05 Feb 2010 21:19:43 +1000 Subject: [BioRuby] Bioruby and jrubyc problem Message-ID: <4B6BFECF.3090809@op.pl> Hallo, I installed Bioruby with jruby setup.rb. 
then I ran this script with jruby

--------------
#!/usr/bin/env jruby

require 'bio'

# creating a Bio::Sequence::NA object containing ambiguous alphabets
ambiguous_seq = Bio::Sequence::NA.new("atgcyrwskmbdhvn")

# show the contents and class of the DNA sequence object
p ambiguous_seq # => "atgcyrwskmbdhvn"
p ambiguous_seq.class # => Bio::Sequence::NA

# convert the sequence to a Regexp object
p ambiguous_seq.to_re # => /atgc[tc][ag][at][gc][tg][ac][tgc][atg][atc][agc][atgc]/
p ambiguous_seq.to_re.class # => Regexp

# example to match an ambiguous sequence to the rigid sequence
att_or_atc = Bio::Sequence::NA.new("aty").to_re
puts "match" if att_or_atc.match(Bio::Sequence::NA.new("att"))
if Bio::Sequence::NA.new("atc") =~ att_or_atc
  puts "also match"
end
--------------

without any problems. After this I ran it with Java and got the following problem:

jrubyc s01.rb

and then

java -cp /home/mitlox/jruby-1.4.0/lib/jruby.jar:. s01
Exception in thread "main" s01.rb:3:in `require': no such file to load -- bio (LoadError)
        from s01.rb:3
        ...internal jruby stack elided...
        from Kernel.require(s01.rb:3)
        from (unknown).(unknown)(:1)

What did I do wrong?

Best regards,

From mauricio at open-bio.org Fri Feb 5 15:48:30 2010
From: mauricio at open-bio.org (Mauricio Herrera Cuadra)
Date: Fri, 05 Feb 2010 09:48:30 -0600
Subject: [BioRuby] Fwd: Changes to NCBI BLAST and E-utilities.
Message-ID: <4B6C3DCE.2070808@open-bio.org>

Forwarding to the proper lists...

-------- Original Message --------
Subject: [O|B|F Helpdesk #889] Changes to NCBI BLAST and E-utilities.
Date: Fri, 5 Feb 2010 10:08:51 -0500
From: mcginnis via RT
Reply-To: support at helpdesk.open-bio.org
To: chris at bioteam.net, heikki at sanbi.ac.za, hlapp at gmx.net, jason at bioperl.org, mauricio at open-bio.org

Fri Feb 05 10:08:51 2010: Request 889 was acted upon.
Transaction: Ticket created by mcginnis at ncbi.nlm.nih.gov
Queue: support at open-bio.org
Subject: Changes to NCBI BLAST and E-utilities.
Owner: Nobody
Requestors: mcginnis at ncbi.nlm.nih.gov
Status: new
Ticket

Dear Colleague:

There are two changes I'd like to make you aware of.

As you may or may not have noticed, we have been working on a new C++ version of the BLAST binaries. In the coming months we will be moving the C++ binaries into prominence and (slowly) phasing out the C toolkit binaries. There are many changes, not least of which is a move to individual binaries for each program (blastn, blastp, etc).

We are not sure how many of your users use BioPerl with the BLAST binaries; my understanding is that many use BioPerl to do remote BLAST. However, there is a change to the BLAST results in Text and presumably HTML. This could have an effect on any parsers which scrape these formats and do not use XML. For obvious reasons, we want to support only the XML format for parsing, but we thought we should give you a heads-up on this.

A single line of gaps lacks the Query numbering in the blast+ output. The C version of blast has numbering in this case. Sample alignment shown below.

blast 2.2.22

Query: 3307 ------------------------------------------------------------ 3307
Sbjct: 390  GSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGP 449

blast 2.2.22+

Query       ------------------------------------------------------------
Sbjct  390  GSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGP 449

According to users, the blast+ output without the numbering breaks bioperl parsers. We have heard from a few, but I think they may be older parsers?

The second issue is a policy concerning E-utilities. This was announced on the utilities-announce at ncbi.nlm.nih.gov mail-list but you may not have seen it.

As part of an ongoing effort to ensure efficient access to the Entrez Utilities (E-utilities) by all users, NCBI has decided to change the usage policy for the E-utilities effective June 1, 2010.
Effective on June 1, 2010, all E-utility requests, either using standard URLs or SOAP, must contain non-null values for both the &tool and &email parameters. Any E-utility request made after June 1, 2010 that does not contain values for both parameters will return an error explaining that these parameters must be included in E-utility requests.

The value of the &tool parameter should be a URI-safe string that is the name of the software package, script or web page producing the E-utility request.

The value of the &email parameter should be a valid e-mail address for the appropriate contact person or group responsible for maintaining the tool producing the E-utility request.

NCBI uses these parameters to contact users whose use of the E-utilities violates the standard usage policies described at http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements. These usage policies are designed to prevent excessive requests from a small group of users from reducing or eliminating the wider community's access to the E-utilities. NCBI will attempt to contact a user at the e-mail address provided in the &email parameter prior to blocking access to the E-utilities.

NCBI realizes that this policy change will require many of our users to change their code. Based on past experience, we anticipate that most of our users should be able to make the necessary changes before the June 1, 2010 deadline. If you have any concerns about making these changes by that date, or if you have any questions about these policies, please contact eutilities at ncbi.nlm.nih.gov.

Thank you for your understanding and cooperation in helping us continue to deliver a reliable and efficient web service.

I think you already adhere to this policy, but should a user's script not meet these requirements, then the script will fail and requests will be turned away with an error message.

Scott D. McGinnis M.A.
NCBI/NLM/NIH
45 Center Drive, MSC 6511
Bldg 45, Room 4AN.44C
Bethesda, MD 20892
mcginnis at ncbi.nlm.nih.gov

From k at bioruby.org Tue Feb 9 01:03:55 2010
From: k at bioruby.org (Toshiaki Katayama)
Date: Tue, 9 Feb 2010 10:03:55 +0900
Subject: [BioRuby] Fwd: Pass the word: All open-bio sites/servers may be unavailable for a few hours this week
References: <4B701DD4.60801@sonsorol.org>
Message-ID: <27E7CE0A-487A-43CC-B95A-F48F801702C9@bioruby.org>

FYI

Begin forwarded message:

> From: Chris Dagdigian
> Date: February 8, 2010 23:21:08 JST
> To: OBF Board, Bioroot, Kam Dahlquist, Peter, Martin Senger, ajb at ebi.ac.uk, Toshiaki Katayama, Mark Wilkinson
> Subject: Pass the word: All open-bio sites/servers may be unavailable for a few hours this week
>
> Hi folks,
>
> Long overdue server/system/IT housekeeping note here. I decided not to do a mass OBF list email so feel free to pass this message along to the people and lists that need to know. I'm hoping the transition might only be noticed by a few people and we'll be back up before the majority see anything different.
>
>
> ###
> The short story is that we need to rip the existing open-bio servers out of their current datacenter and drive them over to a new colocation facility a few miles away. The servers should be down for no more than an hour or two but DNS changes may take much longer to propagate throughout the internet.
>
> We won't be able to give much notice for the downtime, it could be as early as tomorrow afternoon (Tuesday the 9th). The server transplant needs to be coordinated around some other work that I can't talk about.
> ####
>
>
> The longer story is below for those that are interested.
>
> My employer (www.bioteam.net) has long been donating the physical costs of hosting the Open Bio servers.
>
> We had been doing this in a Boston area datacenter where we rented a 6x8 foot private cage. The price of all this space and associated electricity, bandwidth etc.
costs thousands of dollars per month (for the entire cage, not just OBF stuff...)
>
> For several reasons, mostly business related, BioTeam is switching to a colocation provider. We have already migrated the majority of the corporate systems, which means that the OBF servers are sitting in a mostly empty cage that still costs thousands of $USD per month to maintain.
>
> I had been hoping to coordinate this migration with the purchase of new server hardware for OBF but time has run out - the OBF servers need to move sooner rather than later.
>
> The only visible change for the OBF community will be new IP addresses on all our servers and sites. That (and a few hours of downtime) should be the only symptoms of the hosting transplant.
>
>
> **EMBOSS, MOBY AND BIORUBY**
> I believe we control DNS for all of our domains except for ftp.emboss.org and possibly some of the bioruby sites. There are also some moby service DNS records that we need to be careful with. I've CC'd Alan, Mark and Toshiaki on this email. I can let you know what the new IP addresses will be and can coordinate on switching DNS over.
> **
>
>
> There is a chance that things might not go totally smoothly - we may have website or other configuration files with old embedded IP addresses etc. that we'll have to find and fix as needed.
>
> Please email me directly or send email to helpdesk at open-bio.org to report any problems.
>
> -Chris

--
tel://+81-03-5449-5614 fax://+81-03-5449-5434
http://kanehisa.hgc.jp/ (Kanehisa Laboratory)
http://www.hgc.jp/ (Human Genome Center)
http://bioruby.org/ (BioRuby Project)
http://open-bio.jp/ (Open Bio Japan)
http://kumamushi.org/ (Tardigrada Genome Project)
http://kumamushi.net/ (Kumamushi Info)
http://togodb.dbcls.jp/ (TogoDB)
http://togows.dbcls.jp/ (TogoWS)
http://das.hgc.jp/ (KEGG DAS)
http://www.genome.jp/kegg/soap/ (KEGG API)

From georgkam at gmail.com Fri Feb 12 08:35:32 2010
From: georgkam at gmail.com (George Githinji)
Date: Fri, 12 Feb 2010 11:35:32 +0300
Subject: [BioRuby] removing primers and corresponding quality data from sequences
Message-ID: <55915f821002120035r7a4f5810o209d28e5a366a7d7@mail.gmail.com>

Hi All,
I have a list of sequences and corresponding quality files for the same data. I would like to remove the primers as well as the corresponding quality information.
The approach that I am using is proving to be dirty and buggy.

For example, given:
1. A list of sequences in fasta file format
2. A list of 4 possible primer patterns (no idea which sequence might contain which primer)
3. A list of quality data in phred format for each sequence

The task is to remove the possible primers from the sequences and anything before or after the primer.
Each sequence has a combination of at least 2 primers, one on the 5' end and the other on the 3' end.

Return a list of sequences with primer ends removed and the corresponding quality data for the primers removed.

What would be a nice way to approach this problem?
--
---------------
Sincerely
George
PhD Student
KEMRI/Wellcome-Trust Research Program
Skype: george_g2
Blog: http://biorelated.wordpress.com/

From georgkam at gmail.com Fri Feb 12 08:57:54 2010
From: georgkam at gmail.com (George Githinji)
Date: Fri, 12 Feb 2010 11:57:54 +0300
Subject: [BioRuby] removing primers and corresponding quality data from sequences
In-Reply-To: <55915f821002120056u46827792v838059ef110fa945@mail.gmail.com>
References: <55915f821002120035r7a4f5810o209d28e5a366a7d7@mail.gmail.com> <55915f821002120056u46827792v838059ef110fa945@mail.gmail.com>
Message-ID: <55915f821002120057n1d7ab14dnc41aee20dea3da4f@mail.gmail.com>

Hi

I would like to remove both the primer and the portion before the 5' end and one after the 3' end

def primers
  ['G*CACG[A|C]AGTTT[C|T]GC','GC[G|A]AAACT[T|G]CGTGC','G*CCCATTC[G|C]TCGAACCA','TGGTTCGA[C|G]GAATGGGC']
  #primers.collect! { |primer| create_regexp(primer) }
end

def bioentries(reads_file)
  Bio::FlatFile.auto(reads_file){ |f| f.map {|entry| entry} }
end

def remove_primers(file_name)
  reg1 = Regexp.new(primers[0])
  bioentries(file_name).map do |entry|
    # puts ">#{entry.definition}"
    #puts entry.seq

    puts entry.seq.gsub(reg1,'')

  end
end

would remove the primers but not the portion before the 5' end.

Secondly, it does not give me the corresponding co-ordinates so that I can remove the associated quality data for the removed file.

Third, the approach seems 'dirty'.

On Fri, Feb 12, 2010 at 11:46 AM, Andrew Grimm wrote:
> I can't really help, but is it primers that you want removed, or the
> portion of sequence that's before the 5' primer or after the 3'
> primer?
>
> Andrew

--
---------------
Sincerely
George
PhD Student
KEMRI/Wellcome-Trust Research Program
Skype: george_g2
Blog: http://biorelated.wordpress.com/

From ngoto at gen-info.osaka-u.ac.jp Wed Feb 17 14:37:49 2010
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Wed, 17 Feb 2010 23:37:49 +0900
Subject: [BioRuby] removing primers and corresponding quality data from sequences
In-Reply-To: <55915f821002120057n1d7ab14dnc41aee20dea3da4f@mail.gmail.com>
References: <55915f821002120035r7a4f5810o209d28e5a366a7d7@mail.gmail.com> <55915f821002120056u46827792v838059ef110fa945@mail.gmail.com> <55915f821002120057n1d7ab14dnc41aee20dea3da4f@mail.gmail.com>
Message-ID: <20100217143751.05A651CBC63A@idnmail.gen-info.osaka-u.ac.jp>

Hi,

On Fri, 12 Feb 2010 11:57:54 +0300
George Githinji wrote:

> Hi
>
> I would like to remove both the primer and the portion before the 5'
> end and one after the 3' end
>
> def primers
>   ['G*CACG[A|C]AGTTT[C|T]GC','GC[G|A]AAACT[T|G]CGTGC','G*CCCATTC[G|C]TCGAACCA','TGGTTCGA[C|G]GAATGGGC']
>   #primers.collect! { |primer| create_regexp(primer) }
> end

The above regular expressions might be different from what you really want. For example, /G*C/ matches "C", "GC", "GGC", "GGGC", "GGGGC", ..., and /[C|T]/ matches "C", "|", or "T". Please check the syntax of regular expressions in Ruby.
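The point about character classes can be checked with a few lines of plain Ruby. This is only an illustrative sketch: the `primer` pattern at the end is an assumption about what the first primer was meant to express (an optional leading G and two-base ambiguity positions), not something stated in the thread.

```ruby
loose  = /[C|T]/   # "|" inside [...] is a literal character, not alternation
strict = /[CT]/    # C or T only

p("|" =~ loose)    # => 0
p("|" =~ strict)   # => nil
p("T" =~ strict)   # => 0

# "G*" means zero or more G, so /G*C/ also matches a bare "C".
p("C"[/G*C/])      # => "C"
p("GGGC"[/G*C/])   # => "GGGC"

# Assumed intent of the first primer: optional leading G,
# then one A-or-C position and one C-or-T position.
primer = /G?CACG[AC]AGTTT[CT]GC/
p("TTGCACGAAGTTTCGCAA" =~ primer)  # => 2
```

If a variable number of leading G bases really is intended, `G*` can stay; the character classes still need the `|` removed.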
>
> def bioentries(reads_file)
>   Bio::FlatFile.auto(reads_file){ |f| f.map {|entry| entry} }
> end
>
> def remove_primers(file_name)
>   reg1 = Regexp.new(primers[0])
>   bioentries(file_name).map do |entry|
>     # puts ">#{entry.definition}"
>     #puts entry.seq
>
>     puts entry.seq.gsub(reg1,'')
>
>   end
> end
>
> would remove the primers but not the portion before the 5' end
>
> Secondly, it does not give me the corresponding co-ordinates so that i
> can remove the associated quality data for the removed file
>
> third the approach seems 'dirty'

One of the simplest approaches is to mask the primer sequences with "X" (or any special character you want) without changing the original sequence length. I suppose much of the software for cutting vector sequences does the same.

     #puts entry.seq.gsub(reg1,'')

     seq = Bio::Sequence::NA.new(entry.seq)

     # regs contains regular expressions in an array,
     # for example: regs = [ /ACGTACGT/, /ATATATAT/ ]
     # Note that primer sequences are expected to be
     # completely different from each other.
     #
     regs.each do |reg|
       seq.gsub!(reg) { |x| "X" * x.length }
     end

     # After that, all 5' bases before "X" are replaced
     # with "X".

     seq.sub!(/\A[^X]+X/) { |x| "X" * x.length }

     # All 3' bases after "X" are also replaced with "X".

     seq.sub!(/X[^X]+\z/) { |x| "X" * x.length }

     # Then, start and end positions of the unmasked region
     # can be obtained.

     start_pos = seq.index(/[^X]/)
     end_pos = seq.rindex(/[^X]/)

Be careful: the code omits any error checks. If either the 5' or the 3' primer is not detected in a sequence, the whole sequence will be filled with "X". If neither the 5' nor the 3' primer is found, the sequence will be kept unchanged.

In addition, the above code ignores partial primer sequences at the 3' end (and sometimes at the 5' end). Sequencing errors are also ignored.
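The masking steps above can be sketched as one helper that also slices the quality values, which was the second part of the original question. This is a minimal sketch under assumptions, not BioRuby API: `unmasked_range` is a hypothetical name, sequences are plain uppercase Strings that contain no "X" of their own, and quality scores are a plain Array with one value per base.

```ruby
# Hypothetical helper: returns the Range of bases left unmasked,
# or nil when the whole sequence ends up masked.
def unmasked_range(seq, regs)
  masked = seq.dup
  # Mask every primer hit with "X" so that positions stay unchanged.
  regs.each { |reg| masked.gsub!(reg) { |hit| "X" * hit.length } }
  # Extend the mask over the 5' flank, then over the 3' flank.
  masked.sub!(/\A[^X]+X/) { |hit| "X" * hit.length }
  masked.sub!(/X[^X]+\z/) { |hit| "X" * hit.length }
  first = masked.index(/[^X]/)
  last  = masked.rindex(/[^X]/)
  first ? (first..last) : nil
end

regs = [/ACGTACGT/, /ATATATAT/]   # example primers from the message
seq  = "TT" + "ACGTACGT" + "GGCCGGCC" + "ATATATAT" + "CC"
qual = (1..seq.length).to_a       # made-up scores, one per base

range = unmasked_range(seq, regs)
p range        # => 10..17
p seq[range]   # => "GGCCGGCC"
p qual[range]  # => [11, 12, 13, 14, 15, 16, 17, 18]

# Only one primer present: everything is masked, so nil is returned.
p unmasked_range("AAAACGTACGTGGGG", regs)  # => nil
```

Because masking keeps the sequence length, the same range applies to the sequence and to its quality array. The `nil` result makes the single-primer case explicit, so a caller can decide what to do with such reads.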
Sincerely, Naohisa Goto ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > On Fri, Feb 12, 2010 at 11:56 AM, George Githinji wrote: > > Hi would like to remove both the primer and the portion before the 5' > > end and one after the 3' end > > > > def primers > > ['G*CACG[A|C]AGTTT[C|T]GC','GC[G|A]AAACT[T|G]CGTGC','G*CCCATTC[G|C]TCGAACCA','TGGTTCGA[C|G]GAATGGGC'] > > #primers.collect! { |primer| create_regexp(primer) } > > end > > > > def bioentries(reads_file) > > Bio::FlatFile.auto(reads_file){ |f| f.map {|entry| entry} } > > end > > > > def remove_primers(file_name) > > reg1 = Regexp.new(primers[0]) > > bioentries(file_name).map do |entry| > > # puts ">#{entry.definition}" > > #puts entry.seq > > > > puts entry.seq.gsub(reg1,'') > > > > end > > end > > > > would remove the primers but not the portion before the 5' end > > > > Secondly, it does not give me the corresponding co-ordinates so that i > > can remove the associated quality data for the removed file > > > > third the approach seems 'dirty' > > > > > > > > On Fri, Feb 12, 2010 at 11:46 AM, Andrew Grimm wrote: > >> I can't really help, but is it primers that you want removed, or the > >> portion of sequence that's before the 5' primer or after the 3' > >> primer? > >> > >> Andrew > >> > >> On Fri, Feb 12, 2010 at 7:35 PM, George Githinji wrote: > >>> Hi All, > >>> I have a list of sequences and corresponding quality files for the > >>> same data. I would like to remove the primers as well as the > >>> corresponding quality information. > >>> The approach that i am using is proving to be dirty and buggy, > >>> > >>> For example given: > >>> 1.A list of sequences in fasta file format > >>> 2.A list of 4 possible primer patterns. (no idea which sequence might > >>> contain which primer) > >>> 3.A list of quality data in phred format for each sequence, > >>> > >>> The task is to remove the possible primers from the sequences and > >>> anything before or after the primer. 
> >>> Each sequence has at least 2 combination of primes. one on the 5' and > >>> the other on the 3' end. > >>> > >>> Return a list of sequences with primer ends removed and the > >>> corresponding quality data for the primers removed. > >>> > >>> What would be a nice way to approach this problem. > >>> > >>> > >>> > >>> > >>> -- > >>> --------------- > >>> Sincerely > >>> George > >>> PhD Student > >>> KEMRI/Wellcome-Trust Research Program > >>> Skype: george_g2 > >>> Blog: http://biorelated.wordpress.com/ > >>> _______________________________________________ > >>> BioRuby Project - http://www.bioruby.org/ > >>> BioRuby mailing list > >>> BioRuby at lists.open-bio.org > >>> http://lists.open-bio.org/mailman/listinfo/bioruby > >>> > >> > > > > > > > > -- > > --------------- > > Sincerely > > George > > PhD Student > > KEMRI/Wellcome-Trust Research Program > > Skype: george_g2 > > Blog: http://biorelated.wordpress.com/ > > > > > > -- > --------------- > Sincerely > George > PhD Student > KEMRI/Wellcome-Trust Research Program > Skype: george_g2 > Blog: http://biorelated.wordpress.com/ > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From georgkam at gmail.com Fri Feb 19 05:41:25 2010 From: georgkam at gmail.com (George Githinji) Date: Fri, 19 Feb 2010 08:41:25 +0300 Subject: [BioRuby] removing primers and corresponding quality data from sequences In-Reply-To: <20100217143751.05A651CBC63A@idnmail.gen-info.osaka-u.ac.jp> References: <55915f821002120035r7a4f5810o209d28e5a366a7d7@mail.gmail.com> <55915f821002120056u46827792v838059ef110fa945@mail.gmail.com> <55915f821002120057n1d7ab14dnc41aee20dea3da4f@mail.gmail.com> <20100217143751.05A651CBC63A@idnmail.gen-info.osaka-u.ac.jp> Message-ID: <55915f821002182141y1e9735b0qed0c832bcdd643c0@mail.gmail.com> Thank you so much Naohisa! Found the approach quite useful. 
It would be good not to mask the whole sequence if only one primer is
present, though. Very grateful!

On Wed, Feb 17, 2010 at 5:37 PM, Naohisa GOTO wrote:
> Hi,
>
> On Fri, 12 Feb 2010 11:57:54 +0300
> George Githinji wrote:
>
>> Hi
>>
>> I would like to remove both the primer and the portion before the 5'
>> end and one after the 3' end
>>
>> def primers
>>   ['G*CACG[A|C]AGTTT[C|T]GC','GC[G|A]AAACT[T|G]CGTGC','G*CCCATTC[G|C]TCGAACCA','TGGTTCGA[C|G]GAATGGGC']
>>   #primers.collect! { |primer| create_regexp(primer) }
>> end
>
> The above regular expressions might be different from what
> you really want. For example, /G*C/ matches "C", "GC",
> "GGC", "GGGC", "GGGGC", ..., and /[C|T]/ matches "C", "|",
> or "T". Please check the syntax of regular expressions in Ruby.
>
>>
>> def bioentries(reads_file)
>>   Bio::FlatFile.auto(reads_file){ |f| f.map {|entry| entry} }
>> end
>>
>> def remove_primers(file_name)
>>   reg1 = Regexp.new(primers[0])
>>   bioentries(file_name).map do |entry|
>>     # puts ">#{entry.definition}"
>>     #puts entry.seq
>>
>>     puts entry.seq.gsub(reg1,'')
>>
>>   end
>> end
>>
>> would remove the primers but not the portion before the 5' end
>>
>> Secondly, it does not give me the corresponding co-ordinates so that I
>> can remove the associated quality data for the removed file
>>
>> third the approach seems 'dirty'
>
> One of the simplest approaches is to mask the primer sequences
> with "X" (or any special character you want) without changing
> the original sequence length. I suppose many programs for
> cutting vector sequences do the same.
>
>     #puts entry.seq.gsub(reg1,'')
>
>     seq = Bio::Sequence::NA.new(entry.seq)
>
>     # regs contains regular expressions in an array,
>     # for example: regs = [ /ACGTACGT/, /ATATATAT/ ]
>     # Note that primer sequences are expected to be
>     # completely different from each other.
>     #
>     regs.each do |reg|
>       seq.gsub!(reg) { |x| "X" * x.length }
>     end
>
>     # After that, all 5' bases before "X" are replaced
>     # with "X".
>
>     seq.sub!(/\A[^X]+X/) { |x| "X" * x.length }
>
>     # All 3' bases after "X" are also replaced with "X".
>
>     seq.sub!(/X[^X]+\z/) { |x| "X" * x.length }
>
>     # Then, start and end positions of the unmasked region
>     # can be obtained.
>
>     start_pos = seq.index(/[^X]/)
>     end_pos = seq.rindex(/[^X]/)
>
> Be careful: the code does no error checking.
> If only one of the 5' or 3' primers is detected in a sequence,
> the whole sequence will be filled with "X". If neither the 5' nor
> the 3' primer is found, the sequence will be kept unchanged.
>
> In addition, the above code ignores partial primer sequences
> at the 3' end (and sometimes at the 5' end). Sequencing errors
> are also ignored.
>
> Sincerely,
>
> Naohisa Goto
> ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
>
>
>>
>> On Fri, Feb 12, 2010 at 11:56 AM, George Githinji wrote:
>> > Hi, I would like to remove both the primer and the portion before the 5'
>> > end and one after the 3' end
>> >
>> > def primers
>> >   ['G*CACG[A|C]AGTTT[C|T]GC','GC[G|A]AAACT[T|G]CGTGC','G*CCCATTC[G|C]TCGAACCA','TGGTTCGA[C|G]GAATGGGC']
>> >   #primers.collect! { |primer| create_regexp(primer) }
>> > end
>> >
>> > def bioentries(reads_file)
>> >   Bio::FlatFile.auto(reads_file){ |f| f.map {|entry| entry} }
>> > end
>> >
>> > def remove_primers(file_name)
>> >   reg1 = Regexp.new(primers[0])
>> >   bioentries(file_name).map do |entry|
>> >     # puts ">#{entry.definition}"
>> >     #puts entry.seq
>> >
>> >     puts entry.seq.gsub(reg1,'')
>> >
>> >   end
>> > end
>> >
>> > would remove the primers but not the portion before the 5' end
>> >
>> > Secondly, it does not give me the corresponding co-ordinates so that I
>> > can remove the associated quality data for the removed file
>> >
>> > third the approach seems 'dirty'
>> >
>> >
>> >
>> > On Fri, Feb 12, 2010 at 11:46 AM, Andrew Grimm wrote:
>> >> I can't really help, but is it primers that you want removed, or the
>> >> portion of sequence that's before the 5' primer or after the 3'
>> >> primer?
>> >>
>> >> Andrew
>> >>
>> >> On Fri, Feb 12, 2010 at 7:35 PM, George Githinji wrote:
>> >>> Hi All,
>> >>> I have a list of sequences and corresponding quality files for the
>> >>> same data. I would like to remove the primers as well as the
>> >>> corresponding quality information.
>> >>> The approach that I am using is proving to be dirty and buggy.
>> >>>
>> >>> For example, given:
>> >>> 1. A list of sequences in FASTA file format
>> >>> 2. A list of 4 possible primer patterns (no idea which sequence might
>> >>> contain which primer)
>> >>> 3. A list of quality data in phred format for each sequence
>> >>>
>> >>> The task is to remove the possible primers from the sequences and
>> >>> anything before or after the primer.
>> >>> Each sequence has a combination of at least 2 primers, one on the 5'
>> >>> and the other on the 3' end.
>> >>>
>> >>> Return a list of sequences with primer ends removed and the
>> >>> corresponding quality data for the primers removed.
>> >>>
>> >>> What would be a nice way to approach this problem?
>> >>>
>> >>> --
>> >>> ---------------
>> >>> Sincerely
>> >>> George
>> >>> PhD Student
>> >>> KEMRI/Wellcome-Trust Research Program
>> >>> Skype: george_g2
>> >>> Blog: http://biorelated.wordpress.com/
>> >>> _______________________________________________
>> >>> BioRuby Project - http://www.bioruby.org/
>> >>> BioRuby mailing list
>> >>> BioRuby at lists.open-bio.org
>> >>> http://lists.open-bio.org/mailman/listinfo/bioruby

--
---------------
Sincerely
George
KEMRI/Wellcome-Trust Research Program
Skype: george_g2
Blog: http://biorelated.wordpress.com/
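
On the single-primer case raised at the top of this message: the following is a minimal sketch in plain Ruby (no BioRuby calls), not a tested pipeline. Instead of masking with "X", it computes the coordinates of the region to keep, so the same range can be applied to the quality values. Everything named here (PRIMERS, keep_range, trim_read) is made up for illustration. The two patterns are loosely adapted from the strings quoted above, with [A|C] written as the character class [AC] and the leading G* dropped, which is only a guess at the intent. The sketch also assumes that a primer hit nearer the start of a read is the 5' primer and a hit nearer the end is the 3' primer. Like the masking code, it ignores partial primers and sequencing errors.

```ruby
# Hypothetical primer patterns (assumption: adapted from the thread).
# Inside [], alternatives are listed directly: [AC] means "A or C".
PRIMERS = [
  /CACG[AC]AGTTT[CT]GC/i,
  /TGGTTCGA[CG]GAATGGGC/i
].freeze

# Returns [first, last] (0-based, inclusive) of the bases to keep,
# or nil if trimming leaves nothing.
def keep_range(seq, primers = PRIMERS)
  first = 0
  stop = seq.length # exclusive upper bound
  primers.each do |re|
    seq.scan(re) do
      m = Regexp.last_match
      left_flank = m.begin(0)
      right_flank = seq.length - m.end(0)
      if left_flank < right_flank
        # Assumed 5' primer: drop it and everything before it.
        first = [first, m.end(0)].max
      else
        # Assumed 3' primer: drop it and everything after it.
        stop = [stop, m.begin(0)].min
      end
    end
  end
  first < stop ? [first, stop - 1] : nil
end

# Trims a sequence string and its per-base quality array with the same range.
def trim_read(seq, quals, primers = PRIMERS)
  range = keep_range(seq, primers)
  return ["", []] if range.nil?
  first, last = range
  [seq[first..last], quals[first..last]]
end

# Made-up read: 5' flank, primer, insert, primer, 3' flank.
read = "AAAA" + "CACGAAGTTTCGC" + "TTTTTTTTTTGGGGGGGGGG" +
       "TGGTTCGACGAATGGGC" + "CC"
quals = Array.new(read.length) { |i| i } # stand-in for phred scores
trimmed, trimmed_quals = trim_read(read, quals)
puts trimmed
puts trimmed_quals.length
```

With only one primer found, only the flank on that primer's side is removed and the rest of the read is kept. With no primer found, the read is returned unchanged.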