From ngoto at gen-info.osaka-u.ac.jp Mon Sep 3 04:10:12 2012
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Mon, 03 Sep 2012 17:10:12 +0900
Subject: [BioRuby] Removal of Bio::DBGET and Bio::Ensembl which use discontinued API
Message-ID: <20120903171010.A2E0.EEF6E030@gen-info.osaka-u.ac.jp>

Hi all,

I'd like to remove the following two obsolete classes, which rely on
discontinued API access via the internet.

Bio::DBGET in bio/io/dbget.rb (and sample/dbget):
  Reason: It does not work because it uses the old original protocol,
  which was discontinued about 8 years ago.
  Alternatives: The DBGET system is still available via the web.
  http://www.genome.jp/en/gn_dbget.html
  However, no API code is written in Ruby.

Bio::Ensembl in bio/io/ensembl.rb (and test code):
  Reason: It has not worked since the renewal of the Ensembl web site in 2008.
  Alternatives: the bio-ensembl gem, which supports the current Ensembl API.
  http://rubygems.org/gems/bio-ensembl

Regards,
--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From p.j.a.cock at googlemail.com Mon Sep 3 09:08:51 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 3 Sep 2012 14:08:51 +0100
Subject: [BioRuby] [GSoC] GSoC final report
In-Reply-To: <56EAF480-2694-4825-999E-14763A6A3ACB@drycafe.net>
References: <20120820095559.GB2453@thebird.nl>
 <118F034CF4C3EF48A96F86CE585B94BF33B73325@CHIMBX5.ad.uillinois.edu>
 <56EAF480-2694-4825-999E-14763A6A3ACB@drycafe.net>
Message-ID:

On Mon, Aug 27, 2012 at 3:14 AM, Hilmar Lapp wrote:
> Indeed, congratulations to all of OBF's 2012 GSoC students
> and mentors - great job!
>
> It'd be great to have a summary blog post on the OBF news
> blog - anyone up for composing that?
>
> -hilmar

I agree it is a good idea. I'm in Japan for the 2012 BioHackathon,
and have spoken with Pjotr, Raul and Francesco - I think we can
work on a blog post together this week (I have editing rights).

Brad - would you like to contribute/preview the text?
Shall we ask your co-mentors too?

Regards,

Peter

From ngoto at gen-info.osaka-u.ac.jp Tue Sep 4 04:40:04 2012
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Tue, 04 Sep 2012 17:40:04 +0900
Subject: [BioRuby] Remove classes that does not work: Bio::NCBI::SOAP, Bio::KEGG::Taxonomy
Message-ID: <20120904173948.79E3.EEF6E030@gen-info.osaka-u.ac.jp>

Hi all,

I'd like to remove the following two classes, which are currently broken
and which I think have no hope of being fixed.

Bio::NCBI::SOAP
  Bio::NCBI::SOAP (in lib/bio/io/ncbisoap.rb) always raises an error
  while parsing the WSDL files provided by NCBI. The error occurs both
  with Ruby 1.8.X (with the bundled SOAP4R) and Ruby 1.9.X (with the
  soap4r-ruby1.9 gem). Solving the error may require modifying SOAP4R,
  which I think is difficult. Fortunately, there is already an
  alternative class, Bio::NCBI::REST, a REST client class for the NCBI
  EUtil web services.

Bio::KEGG::Taxonomy
  Bio::KEGG::Taxonomy (in lib/bio/db/kegg/taxonomy.rb) raises an error,
  or the returned data seems to be broken. Running the sample script
  sample/demo_kegg_taxonomy.rb shows an error or falls into an infinite
  loop. Moreover, KEGG has closed its public FTP site and the file
  "taxonomy" can only be obtained by paid subscribers. So, I cannot test
  the class with the latest data, and thus I am giving up on fixing it.

Of course, patches to solve the above problems are welcome.

--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From ngoto at gen-info.osaka-u.ac.jp Thu Sep 6 05:14:04 2012
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Thu, 06 Sep 2012 18:14:04 +0900
Subject: [BioRuby] Remove broken Bio.method_missing
Message-ID: <20120906181404.BDBE.EEF6E030@gen-info.osaka-u.ac.jp>

Hi all,

There is Bio.method_missing, the hook for undefined methods.
In the existing code, the Bio::Shell method corresponding to the given
method name is called. The expected behavior is to provide shortcuts
to Bio::Shell methods under a shorter name, without typing "Shell".

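For readers unfamiliar with the hook, the forwarding pattern described above can be sketched roughly as follows. This is only an illustration: `Toolkit`, `Toolkit::Shell` and `revcomp` are hypothetical stand-ins, not BioRuby's actual `Bio` / `Bio::Shell` code, and the real hook may differ in detail.

```ruby
# Hypothetical stand-ins for illustration; NOT BioRuby's Bio / Bio::Shell.
module Toolkit
  module Shell
    # An explicitly defined helper that the shortcut is meant to reach.
    def self.revcomp(seq)
      seq.reverse.tr('acgtACGT', 'tgcaTGCA')
    end
  end

  # Forward unknown module-level calls to Shell when Shell defines them;
  # otherwise fall through to the default NoMethodError.
  def self.method_missing(name, *args, &block)
    if Shell.respond_to?(name)
      Shell.public_send(name, *args, &block)
    else
      super
    end
  end

  # Keep respond_to? consistent with the forwarding above.
  def self.respond_to_missing?(name, include_private = false)
    Shell.respond_to?(name) || super
  end
end

puts Toolkit.revcomp('aacg')         # shortcut, resolved via method_missing
puts Toolkit::Shell.revcomp('aacg')  # explicit call, no hook involved
```

Both calls return the same string; the explicit `Toolkit::Shell.revcomp` form is the one that keeps working if the hook is removed.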
However, currently, most methods raise errors, partly because the
initialization procedure is bypassed.

Our experience of writing and using BioRuby suggests that
method_missing should normally be avoided unless it is really
necessary, partly because it tends to fail catastrophically,
especially when an exception is raised. In the case of
Bio.method_missing, I think it is not necessary to use the hook here.

So, I will remove Bio.method_missing.
Alternatively, use Bio::Shell.xxxxx (xxxxx is a method name).

--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From p.j.a.cock at googlemail.com Mon Sep 10 04:39:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 10 Sep 2012 09:39:30 +0100
Subject: [BioRuby] Most buildbot slaves down
Message-ID:

Hi all,

For those of you actively monitoring the nightly BuildBot
for Biopython and/or BioRuby, all the buildslaves at my
institute are currently effectively offline. A new stricter
firewall policy was introduced last week while I was away.
I hope we'll have the necessary outgoing ports opened
again soon.

In the meantime, additional buildslaves hosted elsewhere
would be very useful. The machines need to be online
and are typically only used once every 24 hours for the
scheduled builds. Non-Linux machines are particularly
important for cross-platform testing (while for Linux
the TravisCI testing seems to be working nicely overall).

Any volunteers?

Thanks,

Peter

From tiagoantao at gmail.com Mon Sep 10 04:50:41 2012
From: tiagoantao at gmail.com (Tiago Antão)
Date: Mon, 10 Sep 2012 09:50:41 +0100
Subject: [BioRuby] Most buildbot slaves down
In-Reply-To:
References:
Message-ID:

Hi,

Not much help on the non-Linux front, but I noticed that my machine
was down for some reason. I restarted it and it is doing at least a
few of the builds.

Tiago

On Mon, Sep 10, 2012 at 9:39 AM, Peter Cock wrote:
> Hi all,
>
> For those of you actively monitoring the nightly BuildBot
> for Biopython and/or BioRuby, all the buildslaves at my
> institute are currently effectively offline. A new stricter
> firewall policy was introduced last week while I was away.
> I hope we'll have the necessary outgoing ports opened
> again soon.
>
> In the meantime, additional buildslaves hosted elsewhere
> would be very useful. The machines need to be online
> and are typically only used once every 24 hours for the
> scheduled builds. Non-Linux machines are particularly
> important for cross-platform testing (while for Linux
> the TravisCI testing seems to be working nicely overall).
>
> Any volunteers?
>
> Thanks,
>
> Peter
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

--
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From joshearl1 at hotmail.com Thu Sep 13 13:50:34 2012
From: joshearl1 at hotmail.com (Josh Earl)
Date: Thu, 13 Sep 2012 13:50:34 -0400
Subject: [BioRuby] Genbank file parsing question
Message-ID:

Hello all,

I'm trying to use bioruby to write a small program that will allow me
to rearrange the contigs in a genbank file (based on either a list of
contig names, mauve output, or whatever). The idea is that the
annotation service that we use (RAST - http://rast.nmpdr.org )
produces Genbank files, but not exactly the same format that Bioruby
is expecting. They seem to outright break the
Bio::FlatFile.new(Bio::GenBank, 'file') format, and confuse the
Bio::GenBank.open('file') function.

My question is, what should I do? Write my own parser, or try and
fiddle with the Bioruby implementation or something else entirely?
I'm fairly new to ruby, but I've been programming for a long time.

~josh

P.S.
Here is a short section of what the RAST GenBank file looks like (just a single short contig): LOCUS ctg7180000000028 4191 bp DNA linear UNK DEFINITION Contig ctg7180000000028 from Atopobium vaginae B758ACCESSION unknownFEATURES Location/Qualifiers source 1..4191 /mol_type="genomic DNA" /db_xref="taxon: 82135" /genome_md5="" /project="earl_82135" /genome_id="82135.3" /organism="Atopobium vaginae B758" CDS complement(10..1740) /translation="MKLAQLKMVCRGENAGFACIQLCEKPALLKVHAHTKDTNMPCPA RLVCLDELYGVRSSIDDAATGSWWVVIIPLLSVDCVVELSITRASQGLSSDTWSFVFG PHTSRYMSRLLTLRHPQAAALLRRIVHSAAYVHHQLNLIGIWNAAAQTSDSPDVPMRI WRFEARFTCDNSPAFEYFPISCCVLSSTGEPIRARVITLEEQTAAVPDDSAECVRRAV FSIALPHACTHAVVCARLNFKHVASLSCDGGDRARALQKASASTYEAFYTIFPAAAAA RIAEAERFSRDCAHDPHYERWFDEHAATSEQCAMQTRRYEEACACMEHREESTTHADQ PAQPVQLAQPAQPAQTSFDDALAHMGISVVLPVFSASTTLLARSINAMIHQSFPAWQL IVLDCTHMRTPNQQTDIARWLHSYTKTDARIMYVRMNVEQSQKNQAVAGESLSDSGAH AAGVAQPTDDIIDASRDHYAHHSSYLAYACSFIQNPYVYIMSEGAAPTPDALWHIAQT VAQHIAKGTPCDVVHVDEDELTPQGCTKPHVSYAASMIGLEGTNYLGHSLVLRTALLD ELRAPCDVAT" /product="hypothetical protein" CDS complement(1759..1875) /translation="MPRMYNAHADKVLQKRSKRRCTRLAPAEPHVVRVLLNL" /product="hypothetical protein" CDS complement(1844..2461) /db_xref="GO:0008830" /translation="MTQRTETSQAIQSGNFIFTPTSIRDVIIVDTKQYGDARGYFMET YKASDFAAGGISTTFVQDNQSSSTKGVLRGLHFQIEHPQAKLVRVVRGCVFDVAVDLR AGSETFGAWEGVELSAENHRQFYIPRGFAHGFFVLSDEAEFCYKCDDVYHPGDEGGLM WNDPDLAISWPAPCGCDSFSPSQVILSDKDTHHESFAAYVQRTRG" /product="dTDP-4-dehydrorhamnose 3,5-epimerase (EC 5.1.3.13)" /EC_number="5.1.3.13" CDS complement(2586..2741) /translation="MRTLSTYTSIFEQICISLGRALVLSCSMAARKIALADAPAGARC VALCAFP" /product="hypothetical protein" CDS complement(2798..3193) /translation="MSAILAIVPAYNEQQCIQQTIDELRRVCPGVDYLIVNDGSRDET AAICRRRHFNYINVPINCGLASGVQAGMKYAERNGYSAVVQFDADGQHKPEYIVPMYE HMQKTGADVVIGSRFVDDALLLVVCRHNC" /product="Glycosyltransferase involved in cell wall biogenesis (EC 2.4.-.-)" CDS 3238..3393 /translation="MRTLSTYTSIFEQICISLGRALVLSCSMAARKIALADAPAGARC VALCAFP" 
/product="hypothetical protein" CDS 3518..4135 /db_xref="GO:0008830" /translation="MTQRTETSQAIQSGNFIFTPTSIRDVIIVDTKQYGDARGYFMET YKASDFAAGGISTTFVQDNQSSSTKGVLRGLHFQIEHPQAKLVRVVRGCVFDVAVDLR AGSETFGAWEGVELSAENHRQFYIPRGFAHGFFVLSDEAEFCYKCDDVYHPGDEGGLM WNDPDLAISWPAPCGCDSFSPSQVILSDKDTHHESFAAYVQRTRG" /product="dTDP-4-dehydrorhamnose 3,5-epimerase (EC 5.1.3.13)" /EC_number="5.1.3.13"BASE COUNT 1077 a 1055 c 1036 g 1023 tORIGIN 1 aatcgcgctt catgttgcaa catcgcatgg cgcgcgaagt tcatccaaaa gcgcagttct 61 caacacgagt gaatgtccaa gatagttagt gccttcaagc ccaatcatgc ttgctgcata 121 actcacgtga ggctttgtgc agccttgggg cgtgagctca tcttcatcaa catgtacaac 181 atcgcagggt gtaccttttg ctatgtgctg tgctaccgtt tgtgcaatat gccacagggc 241 atcgggcgtg ggagctgccc cctcactcat aatgtaaacg tacgggtttt gtataaacga 301 acatgcatac gcaagatacg agctatggtg cgcataatgg tcgcgagatg catcgataat 361 atcatctgta ggctgtgcta cgccagcagc atgcgcgcct gaatctgaca atgattcccc 421 agctacagct tggtttttct gtgattgttc cacgttcata cgcacataca taatgcgcgc 481 atcggtctta gtatagctat gaagccagcg tgcaatatct gtttgttggt tgggagtgcg 541 catgtgtgta caatcgagca cgataagctg ccatgccgga aaactctgat gtatcatcgc 601 gttaatactg cgcgcaagca gtgtagtcga tgctgaaaaa acgggcagta ccaccgaaat 661 acccatgtgt gcaagcgcat catcaaacga tgtctgtgca ggttgtgcag gttgcgcaag 721 ctggacaggt tgtgcaggct ggtcagcatg cgtggtgctc tcttcgcggt gttccataca 781 cgcgcacgcc tcttcgtacc tgcgtgtttg catagcgcac tgctcagacg tagctgcatg 841 ctcatcaaac cagcgctcat agtgaggatc gtgagcgcaa tcgcgactaa aacgctcggc 901 ctcagcaatg cgcgcggccg cagcagcagg aaaaatggta taaaaggctt cgtaggtaga 961 tgcagacgct ttttgcaacg cgcgggctcg atctccaccg tcgcagctca acgatgccac 1021 atgcttaaag ttgaggcgcg cgcacacaac agcatgcgtg cacgcgtgcg gcaatgcaat 1081 cgaaaacacc gcacgacgaa cgcattcagc gctgtcatcc ggaacagctg ccgtttgctc 1141 ttctagcgta attacacgtg cacgtatggg ctcacctgta ctgctcagca cgcagcagct 1201 tataggaaaa tattcaaacg caggcgaatt gtcgcaggta aaccgcgctt caaatcgcca 1261 tatacgcatc ggcacatcgg gagagtcgct tgtttgcgca gcggcattcc aaataccaat 1321 aagattaagc tgatggtgca 
catatgcagc actgtgaacg atgcggcgta gcaacgcggc 1381 cgcttgagga tggcggagcg taagtaaacg cgacatgtag cgcgacgtat gaggaccaaa 1441 aacaaacgac cacgtatcag aactcagacc ctgcgacgca cgtgttatgc tgagctcaac 1501 cacacaatca acgctcaaca gcggaataat aacaacccac cacgaccccg ttgcagcatc 1561 atcaatgctc gaacgaacgc catagagctc gtctaaacac acaagacgcg caggacacgg 1621 catattcgta tctttggtat gtgcatgcac tttaagcaat gcgggctttt cgcacagctg 1681 aatgcaggca aaccctgcgt tttcgccgcg gcaaaccatt tttaattgcg cgagtttcat 1741 gcagtcccct tactgttgtt aaagattaag cagcacgcgc acaacatgcg gctctgcggg 1801 cgcaagtcgt gtgcagcgcc tttttgagcg cttttggagc accttatccg cgtgtgcgtt 1861 gtacatacgc ggcaaacgat tcatgatggg tatctttatc agacaaaata acctgcgagg 1921 gcgaaaagct atcgcagcca caaggcgcag gccagctaat agcaagatcg ggatcgttcc 1981 acataaggcc accttcatcg cctggatgat acacgtcgtc gcacttatag caaaattctg 2041 cctcatctga gagtacaaaa aatccgtgag caaagccgcg cggtatatag aattgtcgat 2101 gattttcggc cgataattca acgccttccc atgcaccaaa ggtctctgaa cctgcgcgca 2161 agtctaccgc aacatcaaac acacagccac gcacaacacg aacgagtttt gcttgagggt 2221 gttcaatctg aaaatgcagg ccacgaagca cgccttttgt ggagctcgat tggttatcct 2281 gtacaaacgt agtagaaata ccacccgcag caaaatcgga tgctttgtac gtttccataa 2341 agtacccgcg cgcgtcacca tactgtttgg tatcaacaat aataacgtcg cgaatagatg 2401 taggtgtaaa aataaaattg cccgattgaa tagcctgaga tgtttctgta cgctgtgtca 2461 taccttaact cctttagcgc gcgctccttt agcgcacaaa tatgcgctaa cgaattgtgc 2521 aatacctgca acggtttatt tcattgtagc gcacgaatat acgcctacga tatgttcata 2581 caatgctacg gaaatgcaca cagcgcgaca caccgtgcgc ccgcaggcgc atctgccagt 2641 gcaatcttcc ttgccgccat gctacacgaa agaacaagtg cgcgccccaa tgatatgcag 2701 atttgctcaa aaatagaggt atacgtgctc aaagtacgca atggtgcggt atacttgcac 2761 agatatacca actttatgga gaactatgtc tgcaatatta gcaattgtgc ctgcatacaa 2821 cgagcagcag tgcatcgtca acaaaacgcg aaccaataac aacgtcggca cctgtttttt 2881 gcatgtgctc gtacatgggc acgatgtact cgggtttgtg ctgaccatcg gcatcaaatt 2941 gcacaactgc agaatagcca ttgcgctctg catatttcat gccagcttga acgcccgaag 3001 caaggccgca gttaatgggc acatttatgt 
agttaaaatg gcgcctacgg catatagctg 3061 cggtttcgtc gcgcgagccg tcgtttacaa taaggtagtc tacgccgggg cacacgcggc 3121 gcagctcatc gattgtttgt tgtatgcact gctgctcgtt gtatgcaggc acaattgcta 3181 atattgcaga catagttctc cataaagttg gtatatctgt gcaagtatac cgcaccattg 3241 cgtactttga gcacgtatac ctctattttt gagcaaatct gcatatcatt ggggcgcgca 3301 cttgttcttt cgtgtagcat ggcggcaagg aagattgcac tggcagatgc gcctgcgggc 3361 gcacggtgtg tcgcgctgtg tgcatttccg tagcattgta tgaacatatc gtaggcgtat 3421 attcgtgcgc tacaatgaaa taaaccgttg caggtattgc acaattcgtt agcgcatatt 3481 tgtgcgctaa aggagcgcgc gctaaaggag ttaaggtatg acacagcgta cagaaacatc 3541 tcaggctatt caatcgggca attttatttt tacacctaca tctattcgcg acgttattat 3601 tgttgatacc aaacagtatg gtgacgcgcg cgggtacttt atggaaacgt acaaagcatc 3661 cgattttgct gcgggtggta tttctactac gtttgtacag gataaccaat cgagctccac 3721 aaaaggcgtg cttcgtggcc tgcattttca gattgaacac cctcaagcaa aactcgttcg 3781 tgttgtgcgt ggctgtgtgt ttgatgttgc ggtagacttg cgcgcaggtt cagagacctt 3841 tggtgcatgg gaaggcgttg aattatcggc cgaaaatcat cgacaattct atataccgcg 3901 cggctttgct cacggatttt ttgtactctc agatgaggca gaattttgct ataagtgcga 3961 cgacgtgtat catccaggcg atgaaggtgg ccttatgtgg aacgatcccg atcttgctat 4021 tagctggcct gcgccttgtg gctgcgatag cttttcgccc tcgcaggtta ttttgtctga 4081 taaagatacc catcatgaat cgtttgccgc gtatgtacaa cgcacacgcg gataaggtgc 4141 tccaaaagcg ctcaaaaagg cgctgcacac gacttgcgcc cggcagagcc g// Center for Genomic Sciences (412)-359-8341 From throwern at msu.edu Fri Sep 14 13:26:30 2012 From: throwern at msu.edu (Nick Thrower) Date: Fri, 14 Sep 2012 13:26:30 -0400 Subject: [BioRuby] Genbank file parsing question In-Reply-To: References: Message-ID: <988BDAA2-B026-429F-BCE4-06290F5AFEB9@msu.edu> Hi Josh, I've used the Bio gem to parse several Genbank files from NCBI. The snippet you provided looks like it should be handled correctly; except it is missing newlines. Could you provide more specific details about the errors you are receiving? 
-Nick

--
Nick Thrower
Information Technologist
Michigan State University
Great Lakes Bioenergy Research Center
East Lansing MI 48824

From joshearl1 at hotmail.com Mon Sep 17 11:39:44 2012
From: joshearl1 at hotmail.com (Josh Earl)
Date: Mon, 17 Sep 2012 11:39:44 -0400
Subject: [BioRuby] Genbank file parsing question
In-Reply-To:
References:
Message-ID:

Hi Nick,

Yeah, sorry about the genbank example, it appears to have lost all the formatting when I sent the email. This might be more handy: http://pastebin.com/N1D7jUuu

I'm running into several issues.
The first: whenever I load the file that the above excerpt comes from and call methods on it, this is what happens (for example):

bioruby> gb = Bio::FlatFile.new(Bio::GenBank, '/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk')
  ==> #, @dbclass=Bio::GenBank, @splitter=#, @entry_pos_flag=nil, @delimiter="\n//\n", @header="LOCUS ", @delimiter_overrun=nil>, @skip_leader_mode=:firsttime, @firsttime_flag=true, @raw=false>

But, if I try to call any methods on this:

bioruby> gb.first
NoMethodError: private method `gets' called for "/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk":String
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile/buffer.rb:251:in `gets'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile/splitter.rb:161:in `skip_leader'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile.rb:283:in `next_entry'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile.rb:335:in `each_entry'
        from (irb):4:in `first'
        from (irb):4
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:41:in `block in '
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:40:in `catch'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:40:in `'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/bioruby:19:in `load'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/bioruby:19:in `'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/ruby_noexec_wrapper:14:in `eval'
        from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/ruby_noexec_wrapper:14:in `

Opening these files with Bio::FlatFile.auto('Atopobium_vaginae.gbk') seems to work inconsistently, but this particular file opens OK.

Also, Bio::GenBank.new('Atopobium_vaginae.gbk') will open this file and seems to work the most consistently. Loading into this object truncates the locus id from ctg7180000000048 to ctg7180000, i.e.

bioruby> gb.first.locus.entry_id
  ==> "ctg7180000"

And if I attempt something like:

bioruby> gb.first.organism
  ==> ""

it is just an empty string. Does this variable not get set for each genbank entry? The organism is listed under the "source" attribute in the file.

Not all of these are really errors per se, but odd behavior.

~josh

> Hi Josh,
>
> I've used the Bio gem to parse several Genbank files from NCBI. The snippet you provided looks like it should be handled correctly; except it is missing newlines.
>
> Could you provide more specific details about the errors you are receiving?
>
> -Nick
**************************************

From throwern at msu.edu Mon Sep 17 13:28:56 2012
From: throwern at msu.edu (Nick Thrower)
Date: Mon, 17 Sep 2012 13:28:56 -0400
Subject: [BioRuby] Genbank file parsing question
In-Reply-To:
References:
Message-ID: <03A1944A-AD1F-4C94-B026-74443E716242@msu.edu>

Hi Josh,

1.)
You are getting an error because you must pass an open stream to the 'new' method
http://bioruby.org/rdoc/Bio/FlatFile.html#method-c-new

If you want to supply a file location you should use the 'open' method
http://bioruby.org/rdoc/Bio/FlatFile.html#method-c-open

gb = Bio::FlatFile.open(Bio::GenBank,'/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk')

2.)
The locus line is position parsed, and it looks like your locus is beyond the hard coded bounds:
http://bioruby.org/rdoc/Bio/GenBank/Locus.html (look at the source for 'new')

Maybe somebody else could help with that?

3.)
To access the organism line you need to drill down through the data. A Genbank file is made up of several entries. Each entry has many features, and each feature has many qualifiers.

gb.first.features.first.qualifiers.select{|f| f.qualifier=='organism'}
=> [#]

-Nick

--
Nick Thrower
Information Technologist
Michigan State University
Great Lakes Bioenergy Research Center
East Lansing MI 48824

>
> Hi Nick,
> Yeah, sorry about the genbank example, it appears to have lost all the formatting when I sent the email. This might be more handy:
> http://pastebin.com/N1D7jUuu
> I'm running into several issues. The first is if I try and load the file from which the above excerpt is from, whenever I load the file, and call methods on it, this is what happens (for example):
> bioruby> gb = Bio::FlatFile.new(Bio::GenBank, '/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk')
>   ==> #, @dbclass=Bio::GenBank, @splitter=#, @entry_pos_flag=nil, @delimiter="\n//\n", @header="LOCUS ", @delimiter_overrun=nil>, @skip_leader_mode=:firsttime, @firsttime_flag=true, @raw=false>
> But, if I try to call any methods on this:
> bioruby> gb.first
> NoMethodError: private method `gets' called for "/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk":String
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile/buffer.rb:251:in `gets'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile/splitter.rb:161:in `skip_leader'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile.rb:283:in `next_entry'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile.rb:335:in `each_entry'
>   from (irb):4:in `first'
>   from (irb):4
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:41:in `block in '
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:40:in `catch'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:40:in `'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/bioruby:19:in `load'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/bioruby:19:in `'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/ruby_noexec_wrapper:14:in `eval'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/ruby_noexec_wrapper:14:in `
> opening these files with Bio::FlatFile.auto('Atopobium_vaginae.gbk') seems to work inconsistently, but for this file it opens ok. Also, Bio::GenBank.new('Atopobium_vaginae.gbk') will open this file and seems to work the most consistently.
> Loading into this object truncates the Locus id from:
> ctg7180000000048 to ctg7180000
> i.e.
> bioruby> gb.first.locus.entry_id
>   ==> "ctg7180000"
> And if I attempt to say something like:
> bioruby> gb.first.organism
>   ==> ""
> It is just an empty string. Does this variable not get set for each genbank entry? The organism is listed under the "source" attribute in the file.
> Not all of these are really errors per se, but odd behavior.
> ~josh

From joshearl1 at hotmail.com Mon Sep 17 14:46:21 2012
From: joshearl1 at hotmail.com (Josh Earl)
Date: Mon, 17 Sep 2012 14:46:21 -0400
Subject: [BioRuby] Genbank file parsing question
In-Reply-To: <03A1944A-AD1F-4C94-B026-74443E716242@msu.edu>
References: , <03A1944A-AD1F-4C94-B026-74443E716242@msu.edu>
Message-ID:

Hey Nick,

Wow, that was incredibly helpful, thanks. One of the reasons I was confused about the Bio::FlatFile.new method is that the tutorial at http://thebird.nl/bioruby/Tutorial.rd.html is a bit confusing in that regard (it always uses the .new method instead of .open for Bio::FlatFile). Is the usage in the tutorial correct, or was I just interpreting it incorrectly?

Is there a reason for hard-coding the locations instead of just splitting the line into an array? I suppose there are issues with both, since if there are missing values, the locations in the array would be incorrect, and if the locus name is too long that might move the other fields out of position.

And thanks for clarifying how to get access to the organism. It does seem a bit odd to me that the first genbank record would have an "organism" method that wasn't set, even if there is an organism name listed. For instance
gb.first
refers to a single genbank record, right? So, what is
gb.first.organism
referring to, if not the organism of that record? I guess I have similar questions about the .classification and .taxonomy methods (both of which return empty values on the first genbank record). It seems odd that you would have to dig into the record like that to get the information, especially if the methods are available on a record. Maybe they refer to something other than the items listed in the "source" feature?

~josh

Center for Genomic Sciences
(412)-359-8341

From ngoto at gen-info.osaka-u.ac.jp Tue Sep 18 09:22:13 2012
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 18 Sep 2012 22:22:13 +0900
Subject: [BioRuby] Genbank file parsing question
In-Reply-To:
References: <03A1944A-AD1F-4C94-B026-74443E716242@msu.edu>
Message-ID: <201209181327.q8IDRfqa031985@portal.open-bio.org>

Hi,

On Mon, 17 Sep 2012 14:46:21 -0400
Josh Earl wrote:

> Hey Nick,
> Wow, that was incredibly helpful, thanks. One of the reasons I was confused about the Bio::FlatFile.new method is that the tutorial at
> http://thebird.nl/bioruby/Tutorial.rd.html is a bit confusing in that regard (it always uses the .new method instead of .open for Bio::FlatFile). Is the usage in the tutorial correct, or was I just interpreting it incorrectly?

The usage in the tutorial is right. As you can see, it only
teaches Bio::FlatFile.new, but this does not mean there are no
other methods. Indeed, I think many useful methods, classes,
modules, and usages of them are not yet described in the tutorial.
Thanks for giving us an idea to improve the tutorial.

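A minimal plain-Ruby sketch of the position-based LOCUS parsing discussed in this thread, assuming the column ranges given in section 3.4.4 of the GenBank release notes (13-28 locus name, 30-40 sequence length, 48-53 molecule type, and so on). LOCUS_COLUMNS, parse_locus_line and build_locus_line are hypothetical names used only for illustration; this is not BioRuby's actual implementation, and the sample line (length, division, date) is constructed rather than taken from a real record.

```ruby
# Column ranges (1-based, inclusive) as listed in section 3.4.4 of the
# GenBank release notes. LOCUS_COLUMNS and the helpers below are
# hypothetical names for this sketch, not part of BioRuby.
LOCUS_COLUMNS = {
  name:     13..28,
  length:   30..40,
  strand:   45..47,
  molecule: 48..53,
  topology: 56..63,
  division: 65..67,
  date:     69..79
}.freeze

# Slice each field out of the line by its fixed column range.
def parse_locus_line(line)
  LOCUS_COLUMNS.each_with_object({}) do |(field, cols), out|
    # Convert 1-based columns to a 0-based string slice.
    out[field] = line[(cols.first - 1)..(cols.last - 1)].to_s.strip
  end
end

# Build a made-up LOCUS line that follows the column layout, as long as
# the name fits into 16 characters.
def build_locus_line(name)
  format("%-12s%-16s %11d bp %-3s%-6s  %-8s %-3s %s",
         "LOCUS", name, 4191, "", "DNA", "linear", "BCT", "17-SEP-2012")
end

ok      = parse_locus_line(build_locus_line("ctg7180000000048"))   # 16 chars
shifted = parse_locus_line(build_locus_line("ctg7180000000048X"))  # 17 chars

p ok[:name], ok[:length]            # "ctg7180000000048", "4191"
p shifted[:name], shifted[:length]  # "ctg7180000000048", "419"
```

The second call illustrates why an over-long locus name is a problem for any fixed-column reader: a 17-character name pushes every later field one column to the right, so the name is cut to 16 characters and the length is read back as 419 instead of 4191.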
> Is there a reason for hard-coding the locations instead of just splitting the line into an array?

Because the positions are officially defined by NCBI.
See section 3.4.4 in the NCBI GenBank Release Note.
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
(current version: Release 191.0)

It says:
>> Positions  Contents
>> ---------  --------
>> 01-05      'LOCUS'
>> 06-12      spaces
>> 13-28      Locus name
>> 29-29      space
>> 30-40      Length of sequence, right-justified
>> 41-41      space
>> 42-43      bp
>> 44-44      space
>> 45-47      spaces, ss- (single-stranded), ds- (double-stranded), or
>>            ms- (mixed-stranded)
>> 48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
>>            mRNA (messenger RNA), uRNA (small nuclear RNA).
>>            Left justified.
>> 54-55      space
>> 56-63      'linear' followed by two spaces, or 'circular'
>> 64-64      space
>> 65-67      The division code (see Section 3.3)
>> 68-68      space
>> 69-79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)

> I suppose there are issues with both, since if there are missing values, the locations in the array would be incorrect, and if the locus name is too long that might move the other fields out of position.

A locus name longer than 16 characters is not officially allowed
in the GenBank format.

It is not so easy to allow parsing of non-standard GenBank format
that breaks the above definition, partly to avoid potential
conflicts with future versions of the NCBI GenBank format.
Only NCBI has the right to change the format definition.
In addition, non-standard means that the format definition is
ambiguous and not fixed, which also makes such data difficult
to parse.

> And thanks for clarifying how to get access to the organism. It does seem a bit odd to me that the first genbank record would have an "organism" method that wasn't set, even if there is an organism name listed. For instance
> gb.first
> refers to a single genbank record, right? So, what is
> gb.first.organism
> referring to, if not the organism of that record? I guess I have similar questions about the .classification and .taxonomy methods (both of which return empty values on the first genbank record).

Each GenBank entry provided by NCBI has a SOURCE field and an ORGANISM
subkeyword. See sections 3.4.2 and 3.4.10 in the GenBank Release
Note. (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
According to section 3.4.2, SOURCE is a mandatory keyword.
The Bio::GenBank#organism, source, common_name, taxonomy and
classification methods get their contents from SOURCE and
ORGANISM, not from the "source" feature in the feature table.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From joshearl1 at hotmail.com Tue Sep 18 11:19:27 2012
From: joshearl1 at hotmail.com (Josh Earl)
Date: Tue, 18 Sep 2012 11:19:27 -0400
Subject: [BioRuby] Genbank file parsing question
In-Reply-To: <201209181327.q8IDRfqa031985@portal.open-bio.org>
References: , <03A1944A-AD1F-4C94-B026-74443E716242@msu.edu>, , <201209181327.q8IDRfqa031985@portal.open-bio.org>
Message-ID:

Thanks! This was all great information, especially
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt which I had never tracked down before. This will help a lot with any issues I run into with non-standard genbank formats.

My confusion with the tutorial is:

Bio::FlatFile.new(Bio::GenBank, ARGF)

works when it is part of a script and you pass the script a path/filename, but

Bio::FlatFile.new(Bio::GenBank, "path/filename")

doesn't work. I looked at the documentation for Bio::FlatFile.new:

new(dbclass, stream)
Same as ::open, except that 'stream' should be a opened stream object (IO, File, ..., who have the 'gets' method).

So, why is ARGF (which would just be a string passed to the script) working, if it should be a stream? Shouldn't I have to open the file? For instance this works:

Bio::FlatFile.new(Bio::GenBank, File.open("path/filename"))

Is there some ruby magic going on?
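The difference between the two calls can be checked in plain Ruby, without BioRuby. This sketch assumes only the standard library and shows that a String (and ARGV, which is an Array of Strings) has no public gets method, while ARGF and other IO-like objects do. The file name is a placeholder borrowed from this thread and is never opened; the checks hash is just an illustrative name.

```ruby
require "stringio"

path = "Atopobium_vaginae.gbk"  # a plain String, as in the failing call

# Kernel#gets is private, so calling it on a String with an explicit
# receiver raises NoMethodError, matching the error seen in the thread.
string_error =
  begin
    path.gets
    nil
  rescue NoMethodError => e
    e
  end

# Which objects offer a public gets, i.e. can act as the "stream" argument?
checks = {
  "String"   => path.respond_to?(:gets),
  "ARGV"     => ARGV.respond_to?(:gets),
  "ARGF"     => ARGF.respond_to?(:gets),
  "StringIO" => StringIO.new("LOCUS ...\n//\n").respond_to?(:gets)
}

puts "String#gets raised: #{string_error.class}"
checks.each { |name, ok| puts "#{name}: public gets? #{ok}" }
```

ARGF reads from the files named on the command line (or from standard input when none are given), which is why it can stand in for an opened File here while the bare path string cannot.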
~josh Center for Genomic Sciences (412)-359-8341 > Date: Tue, 18 Sep 2012 22:22:13 +0900 > From: ngoto at gen-info.osaka-u.ac.jp > To: bioruby at lists.open-bio.org > Subject: Re: [BioRuby] Genbank file parsing question > > Hi, > > On Mon, 17 Sep 2012 14:46:21 -0400 > Josh Earl wrote: > > > Hey Nick, > > Wow, that was incredibly helpful, thanks. One of the reasons I was confused about with the Bio::FlatFile.new method is the > > http://thebird.nl/bioruby/Tutorial.rd.html is a bit confusing in that regard (always uses the .new method instead of .open for Bio::FlatFile). Is it the correct usage on the tutorial, or was I just interpreting that incorrectly? > > The usage in the tutorial is right. As you can see, it only > teaches Bio::FlatFile.new, but this does not mean there are no > other methods. Indeed, I think many useful methods, classes, > modules, and usages of them are not yet described in the tutorial. > Thanks giving us an idea to improve the tutorial. > > > Is there a reason to go specifically in hard-coding the locations instead of just splitting it into an array? > > Because the positions are officially defined by NCBI. > See section 3.4.4 in the NCBI GenBank Release Note. > ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt > (current version: Release 191.0) > > It says: > >> Positions Contents > >> --------- -------- > >> 01-05 'LOCUS' > >> 06-12 spaces > >> 13-28 Locus name > >> 29-29 space > >> 30-40 Length of sequence, right-justified > >> 41-41 space > >> 42-43 bp > >> 44-44 space > >> 45-47 spaces, ss- (single-stranded), ds- (double-stranded), or > >> ms- (mixed-stranded) > >> 48-53 NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), > >> mRNA (messenger RNA), uRNA (small nuclear RNA). > >> Left justified. 
> >> 54-55 space > >> 56-63 'linear' followed by two spaces, or 'circular' > >> 64-64 space > >> 65-67 The division code (see Section 3.3) > >> 68-68 space > >> 69-79 Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991) > > > > I suppose there are issues with both, since if there are missing values, the locations in the array would be incorrect, and if the locus name is too long that might move the other fields out of position. > > Locus name longer than 16 characters is not officially allowed > in the GenBank format. > > It is not so easy to allow parsing of non-standard GenBank format > that breaks the above definition, partly because of avoiding > potential conflicts with future versions of NCBI GenBank format. > Only NCBI has the right to change the format definition. > In addition, non-standard means that the format definition is > ambiguous and not fixed. This also makes difficult to parse > such kind of data. > > > And thanks for clarifying how to get access to the organism. It does seem a bit odd to me that the first genbank record would have an "organism" method that wasn't set, even if there is an organism name listed. For instance > > gb.first > > refers to a single genbank record, right? So, what is > > gb.first.organism > > referring to, if not the organism of that record? I guess I have similar questions about the .classification and .taxonomy methods (both of which return empty values on the first genbank record). > > Each GenBank entry provided by NCBI has SOURCE field and ORGANISM > subkeyward. See sections 3.4.2 and 3.4.10 in the GenBank Release > Note. (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt) > According to the section 3.4.2, SOURCE is mandatory keyword. > Bio::GenBank#organism, source, common_name, taxonomy and > classification methods get their contents from the SOURCE and > ORGANISM, not from the "source" feature in the feature table. 
> > > Naohisa Goto > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby From pjotr.public14 at thebird.nl Tue Sep 18 11:40:46 2012 From: pjotr.public14 at thebird.nl (Pjotr Prins) Date: Tue, 18 Sep 2012 17:40:46 +0200 Subject: [BioRuby] Genbank file parsing question In-Reply-To: References: <03A1944A-AD1F-4C94-B026-74443E716242@msu.edu> <201209181327.q8IDRfqa031985@portal.open-bio.org> Message-ID: <20120918154046.GA30842@thebird.nl> ARGF is a stream. On Tue, Sep 18, 2012 at 11:19:27AM -0400, Josh Earl wrote: > > Thanks! This was all great information, especially > ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt which I had never tracked down before. This will help a lot with any issues I run into with non-standard genbank formats. > My confusion with the tutorial is: > Bio::FlatFile.new(Bio::GenBank, ARGF)works when part of a script and you pass the script a path/filename. > Bio::FlatFile.new(Bio::GenBank, "path/filename") doesn't work. > I looked at the code for Bio::FlatFile.newnew(dbclass, stream)Same as ::open, except that ?stream? should be a opened stream object (IO, File, ?, who have the ?gets? method).So.. why is ARGF (which would just be a string passed to the script) working, if it should be a stream? Shouldn't I have to open the file? For instance this works: > Bio::FlatFile.new(Bio::GenBank, File.open("path/filename")) > > Is there some ruby magic going on? > ~josh > > Center for Genomic Sciences > (412)-359-8341 > > > Date: Tue, 18 Sep 2012 22:22:13 +0900 > > From: ngoto at gen-info.osaka-u.ac.jp > > To: bioruby at lists.open-bio.org > > Subject: Re: [BioRuby] Genbank file parsing question > > > > Hi, > > > > On Mon, 17 Sep 2012 14:46:21 -0400 > > Josh Earl wrote: > > > > > Hey Nick, > > > Wow, that was incredibly helpful, thanks. 
One of the reasons I was confused about with the Bio::FlatFile.new method is the > > > http://thebird.nl/bioruby/Tutorial.rd.html is a bit confusing in that regard (always uses the .new method instead of .open for Bio::FlatFile). Is it the correct usage on the tutorial, or was I just interpreting that incorrectly? > > > > The usage in the tutorial is right. As you can see, it only > > teaches Bio::FlatFile.new, but this does not mean there are no > > other methods. Indeed, I think many useful methods, classes, > > modules, and usages of them are not yet described in the tutorial. > > Thanks giving us an idea to improve the tutorial. > > > > > Is there a reason to go specifically in hard-coding the locations instead of just splitting it into an array? > > > > Because the positions are officially defined by NCBI. > > See section 3.4.4 in the NCBI GenBank Release Note. > > ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt > > (current version: Release 191.0) > > > > It says: > > >> Positions Contents > > >> --------- -------- > > >> 01-05 'LOCUS' > > >> 06-12 spaces > > >> 13-28 Locus name > > >> 29-29 space > > >> 30-40 Length of sequence, right-justified > > >> 41-41 space > > >> 42-43 bp > > >> 44-44 space > > >> 45-47 spaces, ss- (single-stranded), ds- (double-stranded), or > > >> ms- (mixed-stranded) > > >> 48-53 NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), > > >> mRNA (messenger RNA), uRNA (small nuclear RNA). > > >> Left justified. > > >> 54-55 space > > >> 56-63 'linear' followed by two spaces, or 'circular' > > >> 64-64 space > > >> 65-67 The division code (see Section 3.3) > > >> 68-68 space > > >> 69-79 Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991) > > > > > > > I suppose there are issues with both, since if there are missing values, the locations in the array would be incorrect, and if the locus name is too long that might move the other fields out of position. 
> > > > Locus name longer than 16 characters is not officially allowed > > in the GenBank format. > > > > It is not so easy to allow parsing of non-standard GenBank format > > that breaks the above definition, partly because of avoiding > > potential conflicts with future versions of NCBI GenBank format. > > Only NCBI has the right to change the format definition. > > In addition, non-standard means that the format definition is > > ambiguous and not fixed. This also makes difficult to parse > > such kind of data. > > > > > And thanks for clarifying how to get access to the organism. It does seem a bit odd to me that the first genbank record would have an "organism" method that wasn't set, even if there is an organism name listed. For instance > > > gb.first > > > refers to a single genbank record, right? So, what is > > > gb.first.organism > > > referring to, if not the organism of that record? I guess I have similar questions about the .classification and .taxonomy methods (both of which return empty values on the first genbank record). > > > > Each GenBank entry provided by NCBI has SOURCE field and ORGANISM > > subkeyward. See sections 3.4.2 and 3.4.10 in the GenBank Release > > Note. (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt) > > According to the section 3.4.2, SOURCE is mandatory keyword. > > Bio::GenBank#organism, source, common_name, taxonomy and > > classification methods get their contents from the SOURCE and > > ORGANISM, not from the "source" feature in the feature table. 
> > > > > > Naohisa Goto > > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org > > _______________________________________________ > > BioRuby Project - http://www.bioruby.org/ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > _______________________________________________ > BioRuby Project - http://www.bioruby.org/ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > From joshearl1 at hotmail.com Tue Sep 18 11:55:04 2012 From: joshearl1 at hotmail.com (Josh Earl) Date: Tue, 18 Sep 2012 11:55:04 -0400 Subject: [BioRuby] Genbank file parsing question In-Reply-To: <20120918154046.GA30842@thebird.nl> References: , <03A1944A-AD1F-4C94-B026-74443E716242@msu.edu>, , <201209181327.q8IDRfqa031985@portal.open-bio.org>, , <20120918154046.GA30842@thebird.nl> Message-ID: ahhh. I see, I was confusing it with ARGV. I'm new to ruby. Thanks for the heads up. Center for Genomic Sciences (412)-359-8341 > Date: Tue, 18 Sep 2012 17:40:46 +0200 > From: pjotr.public14 at thebird.nl > To: joshearl1 at hotmail.com > CC: bioruby at lists.open-bio.org > Subject: Re: [BioRuby] Genbank file parsing question > > ARGF is a stream. > > On Tue, Sep 18, 2012 at 11:19:27AM -0400, Josh Earl wrote: > > > > Thanks! This was all great information, especially > > ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt which I had never tracked down before. This will help a lot with any issues I run into with non-standard genbank formats. > > My confusion with the tutorial is: > > Bio::FlatFile.new(Bio::GenBank, ARGF)works when part of a script and you pass the script a path/filename. > > Bio::FlatFile.new(Bio::GenBank, "path/filename") doesn't work. > > I looked at the code for Bio::FlatFile.newnew(dbclass, stream)Same as ::open, except that ?stream? should be a opened stream object (IO, File, ?, who have the ?gets? method).So.. 
why is ARGF (which would just be a string passed to the script) working, if it should be a stream? Shouldn't I have to open the file? For instance this works: > > Bio::FlatFile.new(Bio::GenBank, File.open("path/filename")) > > > > Is there some ruby magic going on? > > ~josh > > > > Center for Genomic Sciences > > (412)-359-8341 > > > > > Date: Tue, 18 Sep 2012 22:22:13 +0900 > > > From: ngoto at gen-info.osaka-u.ac.jp > > > To: bioruby at lists.open-bio.org > > > Subject: Re: [BioRuby] Genbank file parsing question > > > > > > Hi, > > > > > > On Mon, 17 Sep 2012 14:46:21 -0400 > > > Josh Earl wrote: > > > > > > > Hey Nick, > > > > Wow, that was incredibly helpful, thanks. One of the reasons I was confused about with the Bio::FlatFile.new method is the > > > > http://thebird.nl/bioruby/Tutorial.rd.html is a bit confusing in that regard (always uses the .new method instead of .open for Bio::FlatFile). Is it the correct usage on the tutorial, or was I just interpreting that incorrectly? > > > > > > The usage in the tutorial is right. As you can see, it only > > > teaches Bio::FlatFile.new, but this does not mean there are no > > > other methods. Indeed, I think many useful methods, classes, > > > modules, and usages of them are not yet described in the tutorial. > > > Thanks giving us an idea to improve the tutorial. > > > > > > > Is there a reason to go specifically in hard-coding the locations instead of just splitting it into an array? > > > > > > Because the positions are officially defined by NCBI. > > > See section 3.4.4 in the NCBI GenBank Release Note. 
> > > ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt > > > (current version: Release 191.0) > > > > > > It says: > > > >> Positions Contents > > > >> --------- -------- > > > >> 01-05 'LOCUS' > > > >> 06-12 spaces > > > >> 13-28 Locus name > > > >> 29-29 space > > > >> 30-40 Length of sequence, right-justified > > > >> 41-41 space > > > >> 42-43 bp > > > >> 44-44 space > > > >> 45-47 spaces, ss- (single-stranded), ds- (double-stranded), or > > > >> ms- (mixed-stranded) > > > >> 48-53 NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA), > > > >> mRNA (messenger RNA), uRNA (small nuclear RNA). > > > >> Left justified. > > > >> 54-55 space > > > >> 56-63 'linear' followed by two spaces, or 'circular' > > > >> 64-64 space > > > >> 65-67 The division code (see Section 3.3) > > > >> 68-68 space > > > >> 69-79 Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991) > > > > > > > > > > I suppose there are issues with both, since if there are missing values, the locations in the array would be incorrect, and if the locus name is too long that might move the other fields out of position. > > > > > > Locus name longer than 16 characters is not officially allowed > > > in the GenBank format. > > > > > > It is not so easy to allow parsing of non-standard GenBank format > > > that breaks the above definition, partly because of avoiding > > > potential conflicts with future versions of NCBI GenBank format. > > > Only NCBI has the right to change the format definition. > > > In addition, non-standard means that the format definition is > > > ambiguous and not fixed. This also makes difficult to parse > > > such kind of data. > > > > > > > And thanks for clarifying how to get access to the organism. It does seem a bit odd to me that the first genbank record would have an "organism" method that wasn't set, even if there is an organism name listed. For instance > > > > gb.first > > > > refers to a single genbank record, right? 
> > > > So, what is
> > > > gb.first.organism
> > > > referring to, if not the organism of that record? I guess I have similar questions about the .classification and .taxonomy methods (both of which return empty values on the first genbank record).
> > >
> > > Each GenBank entry provided by NCBI has a SOURCE field and an ORGANISM
> > > subkeyword. See sections 3.4.2 and 3.4.10 in the GenBank Release
> > > Note. (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
> > > According to section 3.4.2, SOURCE is a mandatory keyword.
> > > Bio::GenBank#organism, source, common_name, taxonomy and
> > > classification methods get their contents from the SOURCE and
> > > ORGANISM, not from the "source" feature in the feature table.
> > >
> > >
> > > Naohisa Goto
> > > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org
> > > _______________________________________________
> > > BioRuby Project - http://www.bioruby.org/
> > > BioRuby mailing list
> > > BioRuby at lists.open-bio.org
> > > http://lists.open-bio.org/mailman/listinfo/bioruby
> >
>

From pjotr.public14 at thebird.nl Tue Sep 25 02:08:50 2012
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 25 Sep 2012 08:08:50 +0200
Subject: [BioRuby] [GSoC] GSoC week 2 status report
Message-ID: <20120925060850.GA1143@thebird.nl>

Hi John,

Congrats from the BioRuby panel and community on winning the Ruby Association Grant!

http://sciruby.com/blog/2012/09/24/sciruby-receives-ruby-association-grant--fellowships-available/

Pj.

From pjotr.public14 at thebird.nl Sun Sep 30 12:29:58 2012
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sun, 30 Sep 2012 18:29:58 +0200
Subject: [BioRuby] Price for BioRuby/biogems character!
Message-ID: <20120930162958.GB23298@thebird.nl>

Hi list,

We are looking for a cartoon character, Japanese style, to represent
BioRuby and biogems, and make the website(s) attractive to a young(er)
audience. We will credit the creator on the website, and he/she will
win a prize.

Note: there should be no hampering copyright on the cartoon. Best to
create one yourself.

Pj.

From ngoto at gen-info.osaka-u.ac.jp Mon Sep 3 08:10:12 2012
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Mon, 03 Sep 2012 17:10:12 +0900
Subject: [BioRuby] Removal of Bio::DBGET and Bio::Ensembl wihch use discontinued API
Message-ID: <20120903171010.A2E0.EEF6E030@gen-info.osaka-u.ac.jp>

Hi all,

I'd like to remove the following two old obsolete classes that
use discontinued API access via the internet.

Bio::DBGET in bio/io/dbget.rb (and sample/dbget):
  Reason: It does not work because it uses old original protocol
          that was discontinued about 8 years ago.
  Alternatives: The DBGET system is still available via the web.
          http://www.genome.jp/en/gn_dbget.html
          However, no API code is written in Ruby.

Bio::Ensembl in bio/io/ensembl.rb (and test codes):
  Reason: It does not work after the renewal of Ensembl web site
          in 2008.
  Alternatives: bio-ensembl gem which supports current ensembl API.
          http://rubygems.org/gems/bio-ensembl

Regards,
--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From p.j.a.cock at googlemail.com Mon Sep 3 13:08:51 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 3 Sep 2012 14:08:51 +0100
Subject: [BioRuby] [GSoC] GSoC final report
In-Reply-To: <56EAF480-2694-4825-999E-14763A6A3ACB@drycafe.net>
References: <20120820095559.GB2453@thebird.nl>
 <118F034CF4C3EF48A96F86CE585B94BF33B73325@CHIMBX5.ad.uillinois.edu>
 <56EAF480-2694-4825-999E-14763A6A3ACB@drycafe.net>
Message-ID:

On Mon, Aug 27, 2012 at 3:14 AM, Hilmar Lapp wrote:
> Indeed, congratulations to all of OBF's 2012 GSoC students
> and mentors - great job!
>
> It'd be great to have a summary blog post on the OBF news
> blog - anyone up for composing that?
>
> -hilmar

I agree it is a good idea. I'm in Japan for the 2012 BioHackathon,
and have spoken with Pjotr, Raul and Francesco - I think we can
work on a blog post together this week (I have editing rights).

Brad - would you like to contribute/preview the text? Shall we
ask your co-mentors too?

Regards,

Peter

From ngoto at gen-info.osaka-u.ac.jp Tue Sep 4 08:40:04 2012
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Tue, 04 Sep 2012 17:40:04 +0900
Subject: [BioRuby] Remove classes that does not work: Bio::NCBI::SOAP, Bio::KEGG::Taxonomy
Message-ID: <20120904173948.79E3.EEF6E030@gen-info.osaka-u.ac.jp>

Hi all,

I'd like to remove the following two classes that are currently
broken and that I think have no hope of being fixed.

Bio::NCBI::SOAP

Bio::NCBI::SOAP (in lib/bio/io/ncbisoap.rb) always raises an error
while parsing the WSDL files provided by NCBI. The error occurs
both with Ruby 1.8.X (with bundled SOAP4R) and Ruby 1.9.X (with
the soap4r-ruby1.9 gem). Solving the error may require modifying
SOAP4R, which I think is difficult. Fortunately, there is already
an alternative class, Bio::NCBI::REST, a REST client class for the
NCBI EUtil web services.

Bio::KEGG::Taxonomy

Bio::KEGG::Taxonomy (in lib/bio/db/kegg/taxonomy.rb) raises an error,
or the returned data seems to be broken. Running the sample script
sample/demo_kegg_taxonomy.rb shows an error or falls into an infinite
loop. Moreover, KEGG has closed its public FTP site and the file
"taxonomy" can only be obtained by paid subscribers. So, I cannot
test the class with the latest data, and thus I have given up
fixing it.

Of course, patches to solve the above problems are welcome.
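
For readers unfamiliar with the REST alternative mentioned above: the NCBI E-utilities are plain HTTP endpoints driven by query parameters. The following is a minimal sketch of how such request URLs are formed, using only Ruby's standard library. It is not the implementation of Bio::NCBI::REST; the helper names esearch_url and efetch_url and the default parameter values are assumptions chosen for illustration, and no network request is made.

```ruby
require "uri"

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

# Build an ESearch URL that asks a database for the IDs matching a term.
# (Hypothetical helper; defaults are illustrative.)
def esearch_url(term, db: "nucleotide", retmax: 20)
  query = URI.encode_www_form(db: db, term: term, retmax: retmax)
  "#{EUTILS_BASE}/esearch.fcgi?#{query}"
end

# Build an EFetch URL that asks for the records behind one or more IDs,
# here as GenBank flat-file text. (Hypothetical helper.)
def efetch_url(ids, db: "nucleotide", rettype: "gb", retmode: "text")
  query = URI.encode_www_form(db: db, id: Array(ids).join(","),
                              rettype: rettype, retmode: retmode)
  "#{EUTILS_BASE}/efetch.fcgi?#{query}"
end

puts esearch_url("Atopobium vaginae[Organism]")
puts efetch_url(%w[1 2])
```

Fetching either URL with Net::HTTP would return XML (ESearch) or flat-file text (EFetch); the sketch stops at URL construction so that it runs without network access.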
--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From ngoto at gen-info.osaka-u.ac.jp Thu Sep 6 09:14:04 2012
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa Goto)
Date: Thu, 06 Sep 2012 18:14:04 +0900
Subject: [BioRuby] Remove broken Bio.method_missing
Message-ID: <20120906181404.BDBE.EEF6E030@gen-info.osaka-u.ac.jp>

Hi all,

There is Bio.method_missing, the hook for undefined methods.
In the existing code, the Bio::Shell method corresponding to the given
method name is called. The expected behavior is to provide shortcuts
to Bio::Shell methods with shorter names, without typing "Shell".

However, currently, most methods raise errors, partly because the
initialization procedure is bypassed.

Our experience of writing and using BioRuby suggests that the
use of method_missing should normally be avoided unless it is
really necessary, partly because it tends to cause catastrophe,
especially when an exception is raised.

In the case of Bio.method_missing, I think it is not necessary to
use the method here. So, I will remove Bio.method_missing.
As an alternative, use Bio::Shell.xxxxx (xxxxx is a method name).

--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From p.j.a.cock at googlemail.com Mon Sep 10 08:39:30 2012
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 10 Sep 2012 09:39:30 +0100
Subject: [BioRuby] Most buildbot slaves down
Message-ID:

Hi all,

For those of you actively monitoring the nightly BuildBot
for Biopython and/or BioRuby, all the buildslaves at my
institute are currently effectively offline. A new stricter
firewall policy was introduced last week while I was away.
I hope we'll have the necessary outgoing ports opened
again soon.

In the meantime, additional buildslaves hosted elsewhere
would be very useful. The machines need to be online
and are typically only used once every 24 hours for the
scheduled builds.
Non-Linux machines are particularly
important for cross-platform testing (while for Linux
the TravisCI testing seems to be working nicely overall).

Any volunteers?

Thanks,

Peter

From tiagoantao at gmail.com Mon Sep 10 08:50:41 2012
From: tiagoantao at gmail.com (Tiago Antão)
Date: Mon, 10 Sep 2012 09:50:41 +0100
Subject: [BioRuby] Most buildbot slaves down
In-Reply-To:
References:
Message-ID:

Hi,

Not much helpful in the non-linux front, but I noticed that my machine
was down for some reason, restarted it and it is doing at least a few
of the builds.

Tiago

On Mon, Sep 10, 2012 at 9:39 AM, Peter Cock wrote:
> Hi all,
>
> For those of you actively monitoring the nightly BuildBot
> for Biopython and/or BioRuby, all the buildslaves at my
> institute are currently effectively offline. A new stricter
> firewall policy was introduced last week while I was away.
> I hope we'll have the necessary outgoing ports opened
> again soon.
>
> In the meantime, additional buildslaves hosted elsewhere
> would be very useful. The machines need to be online
> and are typically only used once every 24 hours for the
> scheduled builds. Non-Linux machines are particularly
> important for cross-platform testing (while for Linux
> the TravisCI testing seems to be working nicely overall).
>
> Any volunteers?
>
> Thanks,
>
> Peter
> _______________________________________________
> BioRuby Project - http://www.bioruby.org/
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

--
"Liberty for wolves is death to the lambs" - Isaiah Berlin

From joshearl1 at hotmail.com Thu Sep 13 17:50:34 2012
From: joshearl1 at hotmail.com (Josh Earl)
Date: Thu, 13 Sep 2012 13:50:34 -0400
Subject: [BioRuby] Genbank file parsing question
Message-ID:

Hello all,
I'm trying to use bioruby to write a small program that will allow me to rearrange the contigs in a genbank file (based on either a list of contig names, mauve output, or whatever).
The idea is that the annotation service that we use (RAST - http://rast.nmpdr.org ) produces Genbank files, but not exactly the same format that Bioruby is expecting. They seem to outright break the Bio::FlatFile.new(Bio::GenBank, 'file') format, and confuse the Bio::GenBank.open('file') function. My question is, what should I do? Write my own parser, or try and fiddle with the Bioruby implementation or something else entirely? I'm fairly new to ruby, but I've been programming for a long time. ~josh P.S. Here is a short section of what the RAST GenBank file looks like (just a single short contig): LOCUS ctg7180000000028 4191 bp DNA linear UNK DEFINITION Contig ctg7180000000028 from Atopobium vaginae B758ACCESSION unknownFEATURES Location/Qualifiers source 1..4191 /mol_type="genomic DNA" /db_xref="taxon: 82135" /genome_md5="" /project="earl_82135" /genome_id="82135.3" /organism="Atopobium vaginae B758" CDS complement(10..1740) /translation="MKLAQLKMVCRGENAGFACIQLCEKPALLKVHAHTKDTNMPCPA RLVCLDELYGVRSSIDDAATGSWWVVIIPLLSVDCVVELSITRASQGLSSDTWSFVFG PHTSRYMSRLLTLRHPQAAALLRRIVHSAAYVHHQLNLIGIWNAAAQTSDSPDVPMRI WRFEARFTCDNSPAFEYFPISCCVLSSTGEPIRARVITLEEQTAAVPDDSAECVRRAV FSIALPHACTHAVVCARLNFKHVASLSCDGGDRARALQKASASTYEAFYTIFPAAAAA RIAEAERFSRDCAHDPHYERWFDEHAATSEQCAMQTRRYEEACACMEHREESTTHADQ PAQPVQLAQPAQPAQTSFDDALAHMGISVVLPVFSASTTLLARSINAMIHQSFPAWQL IVLDCTHMRTPNQQTDIARWLHSYTKTDARIMYVRMNVEQSQKNQAVAGESLSDSGAH AAGVAQPTDDIIDASRDHYAHHSSYLAYACSFIQNPYVYIMSEGAAPTPDALWHIAQT VAQHIAKGTPCDVVHVDEDELTPQGCTKPHVSYAASMIGLEGTNYLGHSLVLRTALLD ELRAPCDVAT" /product="hypothetical protein" CDS complement(1759..1875) /translation="MPRMYNAHADKVLQKRSKRRCTRLAPAEPHVVRVLLNL" /product="hypothetical protein" CDS complement(1844..2461) /db_xref="GO:0008830" /translation="MTQRTETSQAIQSGNFIFTPTSIRDVIIVDTKQYGDARGYFMET YKASDFAAGGISTTFVQDNQSSSTKGVLRGLHFQIEHPQAKLVRVVRGCVFDVAVDLR AGSETFGAWEGVELSAENHRQFYIPRGFAHGFFVLSDEAEFCYKCDDVYHPGDEGGLM WNDPDLAISWPAPCGCDSFSPSQVILSDKDTHHESFAAYVQRTRG" /product="dTDP-4-dehydrorhamnose 
3,5-epimerase (EC 5.1.3.13)" /EC_number="5.1.3.13" CDS complement(2586..2741) /translation="MRTLSTYTSIFEQICISLGRALVLSCSMAARKIALADAPAGARC VALCAFP" /product="hypothetical protein" CDS complement(2798..3193) /translation="MSAILAIVPAYNEQQCIQQTIDELRRVCPGVDYLIVNDGSRDET AAICRRRHFNYINVPINCGLASGVQAGMKYAERNGYSAVVQFDADGQHKPEYIVPMYE HMQKTGADVVIGSRFVDDALLLVVCRHNC" /product="Glycosyltransferase involved in cell wall biogenesis (EC 2.4.-.-)" CDS 3238..3393 /translation="MRTLSTYTSIFEQICISLGRALVLSCSMAARKIALADAPAGARC VALCAFP" /product="hypothetical protein" CDS 3518..4135 /db_xref="GO:0008830" /translation="MTQRTETSQAIQSGNFIFTPTSIRDVIIVDTKQYGDARGYFMET YKASDFAAGGISTTFVQDNQSSSTKGVLRGLHFQIEHPQAKLVRVVRGCVFDVAVDLR AGSETFGAWEGVELSAENHRQFYIPRGFAHGFFVLSDEAEFCYKCDDVYHPGDEGGLM WNDPDLAISWPAPCGCDSFSPSQVILSDKDTHHESFAAYVQRTRG" /product="dTDP-4-dehydrorhamnose 3,5-epimerase (EC 5.1.3.13)" /EC_number="5.1.3.13"BASE COUNT 1077 a 1055 c 1036 g 1023 tORIGIN 1 aatcgcgctt catgttgcaa catcgcatgg cgcgcgaagt tcatccaaaa gcgcagttct 61 caacacgagt gaatgtccaa gatagttagt gccttcaagc ccaatcatgc ttgctgcata 121 actcacgtga ggctttgtgc agccttgggg cgtgagctca tcttcatcaa catgtacaac 181 atcgcagggt gtaccttttg ctatgtgctg tgctaccgtt tgtgcaatat gccacagggc 241 atcgggcgtg ggagctgccc cctcactcat aatgtaaacg tacgggtttt gtataaacga 301 acatgcatac gcaagatacg agctatggtg cgcataatgg tcgcgagatg catcgataat 361 atcatctgta ggctgtgcta cgccagcagc atgcgcgcct gaatctgaca atgattcccc 421 agctacagct tggtttttct gtgattgttc cacgttcata cgcacataca taatgcgcgc 481 atcggtctta gtatagctat gaagccagcg tgcaatatct gtttgttggt tgggagtgcg 541 catgtgtgta caatcgagca cgataagctg ccatgccgga aaactctgat gtatcatcgc 601 gttaatactg cgcgcaagca gtgtagtcga tgctgaaaaa acgggcagta ccaccgaaat 661 acccatgtgt gcaagcgcat catcaaacga tgtctgtgca ggttgtgcag gttgcgcaag 721 ctggacaggt tgtgcaggct ggtcagcatg cgtggtgctc tcttcgcggt gttccataca 781 cgcgcacgcc tcttcgtacc tgcgtgtttg catagcgcac tgctcagacg tagctgcatg 841 ctcatcaaac cagcgctcat agtgaggatc gtgagcgcaa tcgcgactaa aacgctcggc 901 ctcagcaatg 
cgcgcggccg cagcagcagg aaaaatggta taaaaggctt cgtaggtaga 961 tgcagacgct ttttgcaacg cgcgggctcg atctccaccg tcgcagctca acgatgccac 1021 atgcttaaag ttgaggcgcg cgcacacaac agcatgcgtg cacgcgtgcg gcaatgcaat 1081 cgaaaacacc gcacgacgaa cgcattcagc gctgtcatcc ggaacagctg ccgtttgctc 1141 ttctagcgta attacacgtg cacgtatggg ctcacctgta ctgctcagca cgcagcagct 1201 tataggaaaa tattcaaacg caggcgaatt gtcgcaggta aaccgcgctt caaatcgcca 1261 tatacgcatc ggcacatcgg gagagtcgct tgtttgcgca gcggcattcc aaataccaat 1321 aagattaagc tgatggtgca catatgcagc actgtgaacg atgcggcgta gcaacgcggc 1381 cgcttgagga tggcggagcg taagtaaacg cgacatgtag cgcgacgtat gaggaccaaa 1441 aacaaacgac cacgtatcag aactcagacc ctgcgacgca cgtgttatgc tgagctcaac 1501 cacacaatca acgctcaaca gcggaataat aacaacccac cacgaccccg ttgcagcatc 1561 atcaatgctc gaacgaacgc catagagctc gtctaaacac acaagacgcg caggacacgg 1621 catattcgta tctttggtat gtgcatgcac tttaagcaat gcgggctttt cgcacagctg 1681 aatgcaggca aaccctgcgt tttcgccgcg gcaaaccatt tttaattgcg cgagtttcat 1741 gcagtcccct tactgttgtt aaagattaag cagcacgcgc acaacatgcg gctctgcggg 1801 cgcaagtcgt gtgcagcgcc tttttgagcg cttttggagc accttatccg cgtgtgcgtt 1861 gtacatacgc ggcaaacgat tcatgatggg tatctttatc agacaaaata acctgcgagg 1921 gcgaaaagct atcgcagcca caaggcgcag gccagctaat agcaagatcg ggatcgttcc 1981 acataaggcc accttcatcg cctggatgat acacgtcgtc gcacttatag caaaattctg 2041 cctcatctga gagtacaaaa aatccgtgag caaagccgcg cggtatatag aattgtcgat 2101 gattttcggc cgataattca acgccttccc atgcaccaaa ggtctctgaa cctgcgcgca 2161 agtctaccgc aacatcaaac acacagccac gcacaacacg aacgagtttt gcttgagggt 2221 gttcaatctg aaaatgcagg ccacgaagca cgccttttgt ggagctcgat tggttatcct 2281 gtacaaacgt agtagaaata ccacccgcag caaaatcgga tgctttgtac gtttccataa 2341 agtacccgcg cgcgtcacca tactgtttgg tatcaacaat aataacgtcg cgaatagatg 2401 taggtgtaaa aataaaattg cccgattgaa tagcctgaga tgtttctgta cgctgtgtca 2461 taccttaact cctttagcgc gcgctccttt agcgcacaaa tatgcgctaa cgaattgtgc 2521 aatacctgca acggtttatt tcattgtagc gcacgaatat acgcctacga tatgttcata 2581 caatgctacg gaaatgcaca 
cagcgcgaca caccgtgcgc ccgcaggcgc atctgccagt 2641 gcaatcttcc ttgccgccat gctacacgaa agaacaagtg cgcgccccaa tgatatgcag 2701 atttgctcaa aaatagaggt atacgtgctc aaagtacgca atggtgcggt atacttgcac 2761 agatatacca actttatgga gaactatgtc tgcaatatta gcaattgtgc ctgcatacaa 2821 cgagcagcag tgcatcgtca acaaaacgcg aaccaataac aacgtcggca cctgtttttt 2881 gcatgtgctc gtacatgggc acgatgtact cgggtttgtg ctgaccatcg gcatcaaatt 2941 gcacaactgc agaatagcca ttgcgctctg catatttcat gccagcttga acgcccgaag 3001 caaggccgca gttaatgggc acatttatgt agttaaaatg gcgcctacgg catatagctg 3061 cggtttcgtc gcgcgagccg tcgtttacaa taaggtagtc tacgccgggg cacacgcggc 3121 gcagctcatc gattgtttgt tgtatgcact gctgctcgtt gtatgcaggc acaattgcta 3181 atattgcaga catagttctc cataaagttg gtatatctgt gcaagtatac cgcaccattg 3241 cgtactttga gcacgtatac ctctattttt gagcaaatct gcatatcatt ggggcgcgca 3301 cttgttcttt cgtgtagcat ggcggcaagg aagattgcac tggcagatgc gcctgcgggc 3361 gcacggtgtg tcgcgctgtg tgcatttccg tagcattgta tgaacatatc gtaggcgtat 3421 attcgtgcgc tacaatgaaa taaaccgttg caggtattgc acaattcgtt agcgcatatt 3481 tgtgcgctaa aggagcgcgc gctaaaggag ttaaggtatg acacagcgta cagaaacatc 3541 tcaggctatt caatcgggca attttatttt tacacctaca tctattcgcg acgttattat 3601 tgttgatacc aaacagtatg gtgacgcgcg cgggtacttt atggaaacgt acaaagcatc 3661 cgattttgct gcgggtggta tttctactac gtttgtacag gataaccaat cgagctccac 3721 aaaaggcgtg cttcgtggcc tgcattttca gattgaacac cctcaagcaa aactcgttcg 3781 tgttgtgcgt ggctgtgtgt ttgatgttgc ggtagacttg cgcgcaggtt cagagacctt 3841 tggtgcatgg gaaggcgttg aattatcggc cgaaaatcat cgacaattct atataccgcg 3901 cggctttgct cacggatttt ttgtactctc agatgaggca gaattttgct ataagtgcga 3961 cgacgtgtat catccaggcg atgaaggtgg ccttatgtgg aacgatcccg atcttgctat 4021 tagctggcct gcgccttgtg gctgcgatag cttttcgccc tcgcaggtta ttttgtctga 4081 taaagatacc catcatgaat cgtttgccgc gtatgtacaa cgcacacgcg gataaggtgc 4141 tccaaaagcg ctcaaaaagg cgctgcacac gacttgcgcc cggcagagcc g// Center for Genomic Sciences (412)-359-8341 From throwern at msu.edu Fri Sep 14 17:26:30 2012 From: throwern at 
msu.edu (Nick Thrower)
Date: Fri, 14 Sep 2012 13:26:30 -0400
Subject: [BioRuby] Genbank file parsing question
In-Reply-To:
References:
Message-ID: <988BDAA2-B026-429F-BCE4-06290F5AFEB9@msu.edu>

Hi Josh,

I've used the Bio gem to parse several Genbank files from NCBI. The snippet you provided looks like it should be handled correctly; except it is missing newlines.

Could you provide more specific details about the errors you are receiving?

-Nick

--
Nick Thrower
Information Technologist
Michigan State University
Great Lakes Bioenergy Research Center
East Lansing MI 48824

On Sep 14, 2012, at 12:00 PM, bioruby-request at lists.open-bio.org wrote:

> Send BioRuby mailing list submissions to
> bioruby at lists.open-bio.org
>
> To subscribe or unsubscribe via the World Wide Web, visit
> http://lists.open-bio.org/mailman/listinfo/bioruby
> or, via email, send a message with subject or body 'help' to
> bioruby-request at lists.open-bio.org
>
> You can reach the person managing the list at
> bioruby-owner at lists.open-bio.org
>
> When replying, please edit your Subject line so it is more specific
> than "Re: Contents of BioRuby digest..."
> Today's Topics:
>
> 1. Genbank file parsing question (Josh Earl)
>
> From: Josh Earl
> Date: September 13, 2012 1:50:34 PM EDT
> To:
> Subject: [BioRuby] Genbank file parsing question
>
>
>
> Hello all,
> I'm trying to use bioruby to write a small program that will allow me to rearrange the contigs in a genbank file (based on either a list of contig names, mauve output, or whatever). The idea is that the annotation service that we use (RAST -
> http://rast.nmpdr.org ) produces Genbank files, but not exactly the same format that Bioruby is expecting. They seem to outright break the Bio::FlatFile.new(Bio::GenBank, 'file') format, and confuse the Bio::GenBank.open('file') function. My question is, what should I do? Write my own parser, or try and fiddle with the Bioruby implementation or something else entirely?
I'm fairly new to ruby, but I've been programming for a long time.
> ~josh
>
> Center for Genomic Sciences
> (412)-359-8341
>
> _______________________________________________
> BioRuby mailing list
> BioRuby at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioruby

From joshearl1 at hotmail.com Mon Sep 17 15:39:44 2012
From: joshearl1 at hotmail.com (Josh Earl)
Date: Mon, 17 Sep 2012 11:39:44 -0400
Subject: [BioRuby] Genbank file parsing question
In-Reply-To:
References:
Message-ID:

Hi Nick,

Yeah, sorry about the genbank example, it appears to have lost all the formatting when I sent the email. This might be more handy: http://pastebin.com/N1D7jUuu

I'm running into several issues.
The first is if I try and load the file from which the above excerpt is from, whenever I load the file, and call methods on it, this is what happens (for example):

bioruby> gb = Bio::FlatFile.new(Bio::GenBank, '/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk')
==> #, @dbclass=Bio::GenBank, @splitter=#, @entry_pos_flag=nil, @delimiter="\n//\n", @header="LOCUS ", @delimiter_overrun=nil>, @skip_leader_mode=:firsttime, @firsttime_flag=true, @raw=false>

But, if I try to call any methods on this:

bioruby> gb.first
NoMethodError: private method `gets' called for "/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk":String
 from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile/buffer.rb:251:in `gets'
 from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile/splitter.rb:161:in `skip_leader'
 from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile.rb:283:in `next_entry'
 from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile.rb:335:in `each_entry'
 from (irb):4:in `first'
 from (irb):4
 from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:41:in `block in '
 from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:40:in `catch'
 from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:40:in `'
 from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/bioruby:19:in `load'
 from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/bioruby:19:in `'
 from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/ruby_noexec_wrapper:14:in `eval'
 from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/ruby_noexec_wrapper:14:in `
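
The NoMethodError in the traceback above is consistent with Bio::FlatFile.new being handed a path String where it expects an IO-like stream: in Ruby, Kernel#gets is a private method, so a String does not respond to a public gets, while File, StringIO and ARGF objects do. The following is a minimal sketch in plain Ruby that shows the difference without BioRuby; the helper first_line and the sample data are hypothetical.

```ruby
require "stringio"

# Return the first line from an IO-like source, or raise a clear error
# when given something that cannot be read from (for example a path String).
def first_line(source)
  unless source.respond_to?(:gets)
    raise ArgumentError, "expected an IO-like object, got #{source.class}"
  end
  source.gets
end

path   = "Atopobium_vaginae.gbk"                 # only a file name
stream = StringIO.new("LOCUS       demo\n//\n")  # stand-in for File.open(path)

puts path.respond_to?(:gets)    # false: Kernel#gets is private on String
puts stream.respond_to?(:gets)  # true
puts ARGF.respond_to?(:gets)    # true: ARGF reads the files named in ARGV
puts first_line(stream)
```

This also matches the observation elsewhere in the thread that passing File.open("path/filename") works, since an open File is a stream.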
opening these files with Bio::FlatFile.auto('Atopobium_vaginae.gbk') seems to work inconsistently, but for this file it opens ok. Also, Bio::GenBank.new('Atopobium_vaginae.gbk') will open this file and seems to work the most consistently. Loading into this object truncates the Locus id from:

ctg7180000000048
to
ctg7180000

i.e.
bioruby> gb.first.locus.entry_id
==> "ctg7180000"

And if I attempt to say something like:
bioruby> gb.first.organism
==> ""

It is just an empty string. Does this variable not get set for each genbank entry? The organism is listed under the "source" attribute in the file. Not all of these are really errors per se, but odd behavior.

~josh

> Hi Josh,
>
> I've used the Bio gem to parse several Genbank files from NCBI. The snippet you provided looks like it should be handled correctly; except it is missing newlines.
>
> Could you provide more specific details about the errors you are receiving?
>
> -Nick
>
> --
> Nick Thrower
> Information Technologist
> Michigan State University
> Great Lakes Bioenergy Research Center
> East Lansing MI 48824
>
> On Sep 14, 2012, at 12:00 PM, bioruby-request at lists.open-bio.org wrote:
>
> > Send BioRuby mailing list submissions to
> > bioruby at lists.open-bio.org
> >
> > To subscribe or unsubscribe via the World Wide Web, visit
> > http://lists.open-bio.org/mailman/listinfo/bioruby
> > or, via email, send a message with subject or body 'help' to
> > bioruby-request at lists.open-bio.org
> >
> > You can reach the person managing the list at
> > bioruby-owner at lists.open-bio.org
> >
> > When replying, please edit your Subject line so it is more specific
> > than "Re: Contents of BioRuby digest..."
> > Today's Topics:
> >
> > 1.
Genbank file parsing question (Josh Earl)
> >
> > From: Josh Earl
> > Date: September 13, 2012 1:50:34 PM EDT
> > To:
> > Subject: [BioRuby] Genbank file parsing question
> >
> >
> >
> > Hello all,
> > I'm trying to use bioruby to write a small program that will allow me to rearrange the contigs in a genbank file (based on either a list of contig names, mauve output, or whatever). The idea is that the annotation service that we use (RAST -
> > http://rast.nmpdr.org ) produces Genbank files, but not exactly the same format that Bioruby is expecting. They seem to outright break the Bio::FlatFile.new(Bio::GenBank, 'file') format, and confuse the Bio::GenBank.open('file') function. My question is, what should I do? Write my own parser, or try and fiddle with the Bioruby implementation or something else entirely? I'm fairly new to ruby, but I've been programming for a long time.
> > ~josh
> > ataccaat 1321 aagattaagc tgatggtgca catatgcagc actgtgaacg atgcggcgta gcaacgcggc 1381 cgcttgagga tggcggagcg taagtaaacg cgacatgtag cgcgacgtat gaggaccaaa 1441 aacaaacgac cacgtatcag aactcagacc ctgcgacgca cgtgttatgc tgagctcaac 1501 cacacaatca acgctcaaca gcggaataat aacaacccac cacgaccccg ttgcagcatc 1561 atcaatgctc gaacgaacgc catagagctc gtctaaacac acaagacgcg caggacacgg 1621 catattcgta tctttggtat gtgcatgcac tttaagcaat gcgggctttt cgcacagctg 1681 aatgcaggca aaccctgcgt tttcgccgcg gcaaaccatt tttaattgcg cgagtttcat 1741 gcagtcccct tactgttgtt aaagattaag cagcacgcgc acaacatgcg gctctgcggg 1801 cgcaagtcgt gtgcagcgcc tttttgagcg cttttggagc accttatccg cgtgtgcgtt 1861 gtacatacgc ggcaaacgat tcatgatggg tatctttatc agacaaaata acctgcgagg 1921 gcgaaaagct atcgcagcca caaggcgcag gccagctaat agcaagatcg ggatcgttcc 1981 acataaggcc accttcatcg cctggatgat acacgtcgtc gcacttatag caaaattctg 2041 cctcatctga gagtacaaaa aatccgtgag caaagccgcg cggtatatag aattgtcgat ! > ! > > 2101 gattttcggc cgataattca acgccttccc atgcaccaaa ggtctctgaa cctgcgcgca > > 2161 agtctaccgc aacatcaaac acacagccac gcacaacacg aacgagtttt gcttgagggt 2221 gttcaatctg aaaatgcagg ccacgaagca cgccttttgt ggagctcgat tggttatcct 2281 gtacaaacgt agtagaaata ccacccgcag caaaatcgga tgctttgtac gtttccataa 2341 agtacccgcg cgcgtcacca tactgtttgg tatcaacaat aataacgtcg cgaatagatg 2401 taggtgtaaa aataaaattg cccgattgaa tagcctgaga tgtttctgta cgctgtgtca 2461 taccttaact cctttagcgc gcgctccttt agcgcacaaa tatgcgctaa cgaattgtgc 2521 aatacctgca acggtttatt tcattgtagc gcacgaatat acgcctacga tatgttcata 2581 caatgctacg gaaatgcaca cagcgcgaca caccgtgcgc ccgcaggcgc atctgccagt 2641 gcaatcttcc ttgccgccat gctacacgaa agaacaagtg cgcgccccaa tgatatgcag 2701 atttgctcaa aaatagaggt atacgtgctc aaagtacgca atggtgcggt atacttgcac 2761 agatatacca actttatgga gaactatgtc tgcaatatta gcaattgtgc ctgcatacaa 2821 cgagcagcag tgcatcgtca acaaaacgcg aaccaataac aacgtcggca cctgtttttt 2881 gcatgtgctc gtacatgggc acgatgtact cgggtttgtg ctgaccatcg gcatcaaatt 2941 gca! > c! 
> > aactgc agaatagcca ttgcgctctg catatttcat gccagcttga acgcccgaag 3001 caaggccgca gttaatgggc acatttatgt agttaaaatg gcgcctacgg catatagctg 3061 cggtttcgtc gcgcgagccg tcgtttacaa taaggtagtc tacgccgggg cacacgcggc 3121 gcagctcatc gattgtttgt tgtatgcact gctgctcgtt gtatgcaggc acaattgcta 3181 atattgcaga catagttctc cataaagttg gtatatctgt gcaagtatac cgcaccattg 3241 cgtactttga gcacgtatac ctctattttt gagcaaatct gcatatcatt ggggcgcgca 3301 cttgttcttt cgtgtagcat ggcggcaagg aagattgcac tggcagatgc gcctgcgggc 3361 gcacggtgtg tcgcgctgtg tgcatttccg tagcattgta tgaacatatc gtaggcgtat 3421 attcgtgcgc tacaatgaaa taaaccgttg caggtattgc acaattcgtt agcgcatatt 3481 tgtgcgctaa aggagcgcgc gctaaaggag ttaaggtatg acacagcgta cagaaacatc 3541 tcaggctatt caatcgggca attttatttt tacacctaca tctattcgcg acgttattat 3601 tgttgatacc aaacagtatg gtgacgcgcg cgggtacttt atggaaacgt acaaagcatc 3661 cgattttgct gcgggtggta tttctactac gtttgtacag gataaccaat cgagctccac 3721 aaaaggcgtg cttcg! > t! > > ggcc tgcattttca gattgaacac cctcaagcaa aactcgttcg 3781 tgttgtgcgt g > > gctgtgtgt ttgatgttgc ggtagacttg cgcgcaggtt cagagacctt 3841 tggtgcatgg gaaggcgttg aattatcggc cgaaaatcat cgacaattct atataccgcg 3901 cggctttgct cacggatttt ttgtactctc agatgaggca gaattttgct ataagtgcga 3961 cgacgtgtat catccaggcg atgaaggtgg ccttatgtgg aacgatcccg atcttgctat 4021 tagctggcct gcgccttgtg gctgcgatag cttttcgccc tcgcaggtta ttttgtctga 4081 taaagatacc catcatgaat cgtttgccgc gtatgtacaa cgcacacgcg gataaggtgc 4141 tccaaaagcg ctcaaaaagg cgctgcacac gacttgcgcc cggcagagcc g// > > Center for Genomic Sciences > > (412)-359-8341 > > > > > > > > _______________________________________________ > > BioRuby mailing list > > BioRuby at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/bioruby > > > > > > > > > ------------------------------ > > _______________________________________________ > BioRuby mailing list > BioRuby at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioruby > > > End of BioRuby Digest, Vol 84, Issue 6 > 
**************************************

From throwern at msu.edu Mon Sep 17 17:28:56 2012
From: throwern at msu.edu (Nick Thrower)
Date: Mon, 17 Sep 2012 13:28:56 -0400
Subject: [BioRuby] Genbank file parsing question
In-Reply-To:
References:
Message-ID: <03A1944A-AD1F-4C94-B026-74443E716242@msu.edu>

Hi Josh,

1.)
You are getting an error because you must pass an open stream to the 'new' method:
http://bioruby.org/rdoc/Bio/FlatFile.html#method-c-new

If you want to supply a file location, you should use the 'open' method:
http://bioruby.org/rdoc/Bio/FlatFile.html#method-c-open

gb = Bio::FlatFile.open(Bio::GenBank, '/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk')

2.)
The locus line is parsed by position, and it looks like your locus name is beyond the hard-coded bounds:
http://bioruby.org/rdoc/Bio/GenBank/Locus.html (look at the source for 'new')

Maybe somebody else could help with that?

3.)
To access the organism line you need to drill down through the data. A GenBank file is made up of several entries. Each entry has many features, and each feature has many qualifiers.

gb.first.features.first.qualifiers.select{|f| f.qualifier=='organism'}
=> [#]

-Nick

--
Nick Thrower
Information Technologist
Michigan State University
Great Lakes Bioenergy Research Center
East Lansing MI 48824

>
> Hi Nick,
> Yeah, sorry about the genbank example, it appears to have lost all the formatting when I sent the email. This might be more handy:
> http://pastebin.com/N1D7jUuu
> I'm running into several issues.
> The first is if I try and load the file from which the above excerpt is from, whenever I load the file, and call methods on it, this is what happens (for example):
> bioruby> gb = Bio::FlatFile.new(Bio::GenBank, '/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk')
>   ==> #, @dbclass=Bio::GenBank, @splitter=#, @entry_pos_flag=nil, @delimiter="\n//\n", @header="LOCUS ", @delimiter_overrun=nil>, @skip_leader_mode=:firsttime, @firsttime_flag=true, @raw=false>
> But, if I try to call any methods on this:
> bioruby> gb.first
> NoMethodError: private method `gets' called for "/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk":String
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile/buffer.rb:251:in `gets'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile/splitter.rb:161:in `skip_leader'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile.rb:283:in `next_entry'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile.rb:335:in `each_entry'
>   from (irb):4:in `first'
>   from (irb):4
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:41:in `block in '
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:40:in `catch'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:40:in `'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/bioruby:19:in `load'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/bioruby:19:in `'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/ruby_noexec_wrapper:14:in `eval'
>   from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/ruby_noexec_wrapper:14:in `
> opening these files with Bio::FlatFile.auto('Atopobium_vaginae.gbk') seems to work inconsistently, but for this file it opens ok. Also, Bio::GenBank.new('Atopobium_vaginae.gbk') will open this file and seems to work the most consistently.
> Loading into this object truncates the Locus id from:
> ctg7180000000048 to ctg7180000
> i.e.
> bioruby> gb.first.locus.entry_id
>   ==> "ctg7180000"
> And if I attempt to say something like:
> bioruby> gb.first.organism
>   ==> ""
> It is just an empty string. Does this variable not get set for each genbank entry? The organism is listed under the "source" attribute in the file.
> Not all of these are really errors per se, but odd behavior.
> ~josh

From joshearl1 at hotmail.com Mon Sep 17 18:46:21 2012
From: joshearl1 at hotmail.com (Josh Earl)
Date: Mon, 17 Sep 2012 14:46:21 -0400
Subject: [BioRuby] Genbank file parsing question
In-Reply-To: <03A1944A-AD1F-4C94-B026-74443E716242@msu.edu>
References: , <03A1944A-AD1F-4C94-B026-74443E716242@msu.edu>
Message-ID:

Hey Nick,
Wow, that was incredibly helpful, thanks. One of the reasons I was confused about the Bio::FlatFile.new method is that
http://thebird.nl/bioruby/Tutorial.rd.html is a bit confusing in that regard (it always uses the .new method instead of .open for Bio::FlatFile). Is the usage in the tutorial correct, or was I just interpreting it incorrectly?
Is there a reason for hard-coding the locations instead of just splitting the line into an array? I suppose there are issues with both, since if there are missing values, the locations in the array would be incorrect, and if the locus name is too long that might move the other fields out of position.
And thanks for clarifying how to get access to the organism. It does seem a bit odd to me that the first genbank record would have an "organism" method that wasn't set, even if there is an organism name listed. For instance
gb.first
refers to a single genbank record, right?
So, what is
gb.first.organism
referring to, if not the organism of that record? I guess I have similar questions about the .classification and .taxonomy methods (both of which return empty values on the first genbank record). It seems odd that you would have to dig into the record like that to get the information, especially if the methods are available on a record. Maybe they refer to something other than the items listed in the "source" feature?
~josh

Center for Genomic Sciences
(412)-359-8341

From ngoto at gen-info.osaka-u.ac.jp Tue Sep 18 13:22:13 2012
From: ngoto at gen-info.osaka-u.ac.jp (Naohisa GOTO)
Date: Tue, 18 Sep 2012 22:22:13 +0900
Subject: [BioRuby] Genbank file parsing question
In-Reply-To:
References: <03A1944A-AD1F-4C94-B026-74443E716242@msu.edu>
Message-ID: <201209181327.q8IDRfqa031985@portal.open-bio.org>

Hi,

On Mon, 17 Sep 2012 14:46:21 -0400
Josh Earl wrote:

> Hey Nick,
> Wow, that was incredibly helpful, thanks. One of the reasons I was confused about the Bio::FlatFile.new method is that
> http://thebird.nl/bioruby/Tutorial.rd.html is a bit confusing in that regard (it always uses the .new method instead of .open for Bio::FlatFile). Is the usage in the tutorial correct, or was I just interpreting it incorrectly?

The usage in the tutorial is right. As you can see, it only
teaches Bio::FlatFile.new, but this does not mean there are no
other methods. Indeed, I think many useful methods, classes,
modules, and usages of them are not yet described in the tutorial.
Thanks for giving us an idea to improve the tutorial.
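
A minimal sketch of the stream-versus-path distinction discussed above, using only Ruby's standard library. TinyFlatFile is a hypothetical stand-in written for illustration, not BioRuby's implementation; it only mirrors the behavior described in this thread, where .new expects an opened stream that responds to gets and .open accepts a path. The "\n//\n" delimiter is taken from the inspect output quoted earlier.

```ruby
require "stringio"

# Hypothetical stand-in for a flat-file reader (NOT BioRuby's code).
# As described in this thread for Bio::FlatFile.new, it wants an object
# that answers to #gets rather than a path string.
class TinyFlatFile
  DELIMITER = "\n//\n" # a GenBank entry ends with a line holding only "//"

  # Takes an already opened stream (IO, File, StringIO, ...).
  def initialize(stream)
    unless stream.respond_to?(:gets)
      raise ArgumentError, "expected a stream that responds to gets, got #{stream.class}"
    end
    @stream = stream
  end

  # Takes a path, opens the file, and hands the stream to .new.
  def self.open(path)
    new(File.open(path))
  end

  # Returns the next entry as a String, or nil at end of input.
  def next_entry
    @stream.gets(DELIMITER)
  end
end

reader = TinyFlatFile.new(StringIO.new("LOCUS       A\n//\nLOCUS       B\n//\n"))
puts reader.next_entry
```

Passing a plain String such as "path/filename" to TinyFlatFile.new raises ArgumentError immediately, because Kernel#gets is private on a String. That is the same reason the real call in this thread failed later with "private method `gets' called".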

> Is there a reason for hard-coding the locations instead of just splitting the line into an array?

Because the positions are officially defined by NCBI.
See section 3.4.4 in the NCBI GenBank Release Note.
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
(current version: Release 191.0)

It says:
>> Positions  Contents
>> ---------  --------
>> 01-05      'LOCUS'
>> 06-12      spaces
>> 13-28      Locus name
>> 29-29      space
>> 30-40      Length of sequence, right-justified
>> 41-41      space
>> 42-43      bp
>> 44-44      space
>> 45-47      spaces, ss- (single-stranded), ds- (double-stranded), or
>>            ms- (mixed-stranded)
>> 48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
>>            mRNA (messenger RNA), uRNA (small nuclear RNA).
>>            Left justified.
>> 54-55      space
>> 56-63      'linear' followed by two spaces, or 'circular'
>> 64-64      space
>> 65-67      The division code (see Section 3.3)
>> 68-68      space
>> 69-79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)

> I suppose there are issues with both, since if there are missing values, the locations in the array would be incorrect, and if the locus name is too long that might move the other fields out of position.

A locus name longer than 16 characters is not officially allowed
in the GenBank format.

It is not so easy to allow parsing of a non-standard GenBank format
that breaks the above definition, partly to avoid
potential conflicts with future versions of the NCBI GenBank format.
Only NCBI has the right to change the format definition.
In addition, non-standard means that the format definition is
ambiguous and not fixed. This also makes such data difficult
to parse.

> And thanks for clarifying how to get access to the organism. It does seem a bit odd to me that the first genbank record would have an "organism" method that wasn't set, even if there is an organism name listed. For instance
> gb.first
> refers to a single genbank record, right? So, what is
> gb.first.organism
> referring to, if not the organism of that record? I guess I have similar questions about the .classification and .taxonomy methods (both of which return empty values on the first genbank record).

Each GenBank entry provided by NCBI has a SOURCE field and an ORGANISM
subkeyword. See sections 3.4.2 and 3.4.10 in the GenBank Release
Note. (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
According to section 3.4.2, SOURCE is a mandatory keyword.
The Bio::GenBank#organism, source, common_name, taxonomy and
classification methods get their contents from SOURCE and
ORGANISM, not from the "source" feature in the feature table.

Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From joshearl1 at hotmail.com Tue Sep 18 15:19:27 2012
From: joshearl1 at hotmail.com (Josh Earl)
Date: Tue, 18 Sep 2012 11:19:27 -0400
Subject: [BioRuby] Genbank file parsing question
In-Reply-To: <201209181327.q8IDRfqa031985@portal.open-bio.org>
References: , <03A1944A-AD1F-4C94-B026-74443E716242@msu.edu>, , <201209181327.q8IDRfqa031985@portal.open-bio.org>
Message-ID:

Thanks! This was all great information, especially
ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt which I had never tracked down before. This will help a lot with any issues I run into with non-standard genbank formats.
My confusion with the tutorial is:
Bio::FlatFile.new(Bio::GenBank, ARGF)
works when part of a script and you pass the script a path/filename.
Bio::FlatFile.new(Bio::GenBank, "path/filename") doesn't work.
I looked at the documentation for Bio::FlatFile.new:

new(dbclass, stream)
Same as ::open, except that 'stream' should be a opened stream object (IO, File, ..., who have the 'gets' method).

So.. why is ARGF (which would just be a string passed to the script) working, if it should be a stream? Shouldn't I have to open the file? For instance this works:
Bio::FlatFile.new(Bio::GenBank, File.open("path/filename"))

Is there some ruby magic going on?
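
On the question above, a minimal sketch using only Ruby's standard library of how ARGV and ARGF differ. ARGV is the array of command-line strings, while ARGF is a stream over the files named in ARGV, so it responds to gets and can be handed to code that expects an opened stream. The temporary file and the ARGV.replace call are assumptions made only to simulate running a script with a path argument.

```ruby
require "tempfile"

# Write a tiny stand-in file so the sketch is self-contained.
file = Tempfile.new(["example", ".gbk"])
file.write("LOCUS       ctg1\n//\n")
file.close

# Simulate: ruby script.rb <path to the file above>
ARGV.replace([file.path])

puts ARGV.class                      # ARGV is a plain Array of Strings
puts ARGV.first.respond_to?(:gets)   # a path String is not a stream
puts ARGF.respond_to?(:gets)         # ARGF behaves like an opened stream

# ARGF opens the file named in ARGV behind the scenes on first read.
first_line = ARGF.gets
puts first_line
```

In a real script, the tutorial-style call Bio::FlatFile.new(Bio::GenBank, ARGF) quoted above relies on the same behavior: ARGF does the opening, so no explicit File.open is needed.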
~josh

Center for Genomic Sciences
(412)-359-8341

From pjotr.public14 at thebird.nl Tue Sep 18 15:40:46 2012
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 18 Sep 2012 17:40:46 +0200
Subject: [BioRuby] Genbank file parsing question
In-Reply-To:
References: <03A1944A-AD1F-4C94-B026-74443E716242@msu.edu> <201209181327.q8IDRfqa031985@portal.open-bio.org>
Message-ID: <20120918154046.GA30842@thebird.nl>

ARGF is a stream.

From joshearl1 at hotmail.com Tue Sep 18 15:55:04 2012
From: joshearl1 at hotmail.com (Josh Earl)
Date: Tue, 18 Sep 2012 11:55:04 -0400
Subject: [BioRuby] Genbank file parsing question
In-Reply-To: <20120918154046.GA30842@thebird.nl>
References: , <03A1944A-AD1F-4C94-B026-74443E716242@msu.edu>, , <201209181327.q8IDRfqa031985@portal.open-bio.org>, , <20120918154046.GA30842@thebird.nl>
Message-ID:

Ahhh, I see. I was confusing it with ARGV. I'm new to Ruby. Thanks for the heads up.

Center for Genomic Sciences
(412)-359-8341

> Date: Tue, 18 Sep 2012 17:40:46 +0200
> From: pjotr.public14 at thebird.nl
> To: joshearl1 at hotmail.com
> CC: bioruby at lists.open-bio.org
> Subject: Re: [BioRuby] Genbank file parsing question
>
> ARGF is a stream.
>
> On Tue, Sep 18, 2012 at 11:19:27AM -0400, Josh Earl wrote:
> >
> > Thanks! This was all great information, especially
> > ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt which I had never tracked down before. This will help a lot with any issues I run into with non-standard genbank formats.
> > My confusion with the tutorial is:
> > Bio::FlatFile.new(Bio::GenBank, ARGF)
> > works when part of a script and you pass the script a path/filename.
> > Bio::FlatFile.new(Bio::GenBank, "path/filename") doesn't work.
> > I looked at the code for Bio::FlatFile.new
> > new(dbclass, stream)
> > Same as ::open, except that 'stream' should be a opened stream object (IO, File, ..., who have the 'gets' method).
> > So..
> > why is ARGF (which would just be a string passed to the script) working, if it should be a stream? Shouldn't I have to open the file? For instance this works:
> > Bio::FlatFile.new(Bio::GenBank, File.open("path/filename"))
> >
> > Is there some ruby magic going on?
> > ~josh
> >
> > Center for Genomic Sciences
> > (412)-359-8341
> >
> > > Date: Tue, 18 Sep 2012 22:22:13 +0900
> > > From: ngoto at gen-info.osaka-u.ac.jp
> > > To: bioruby at lists.open-bio.org
> > > Subject: Re: [BioRuby] Genbank file parsing question
> > >
> > > Hi,
> > >
> > > On Mon, 17 Sep 2012 14:46:21 -0400
> > > Josh Earl wrote:
> > >
> > > > Hey Nick,
> > > > Wow, that was incredibly helpful, thanks. One of the reasons I was confused about the Bio::FlatFile.new method is that
> > > > http://thebird.nl/bioruby/Tutorial.rd.html is a bit confusing in that regard (it always uses the .new method instead of .open for Bio::FlatFile). Is that the correct usage in the tutorial, or was I just interpreting it incorrectly?
> > >
> > > The usage in the tutorial is right. As you can see, it only
> > > teaches Bio::FlatFile.new, but this does not mean there are no
> > > other methods. Indeed, I think many useful methods, classes,
> > > modules, and usages of them are not yet described in the tutorial.
> > > Thanks for giving us an idea to improve the tutorial.
> > >
> > > > Is there a reason to go specifically in hard-coding the locations instead of just splitting it into an array?
> > >
> > > Because the positions are officially defined by NCBI.
> > > See section 3.4.4 in the NCBI GenBank Release Note.
> > > ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt
> > > (current version: Release 191.0)
> > >
> > > It says:
> > > >> Positions  Contents
> > > >> ---------  --------
> > > >> 01-05      'LOCUS'
> > > >> 06-12      spaces
> > > >> 13-28      Locus name
> > > >> 29-29      space
> > > >> 30-40      Length of sequence, right-justified
> > > >> 41-41      space
> > > >> 42-43      bp
> > > >> 44-44      space
> > > >> 45-47      spaces, ss- (single-stranded), ds- (double-stranded), or
> > > >>            ms- (mixed-stranded)
> > > >> 48-53      NA, DNA, RNA, tRNA (transfer RNA), rRNA (ribosomal RNA),
> > > >>            mRNA (messenger RNA), uRNA (small nuclear RNA).
> > > >>            Left justified.
> > > >> 54-55      space
> > > >> 56-63      'linear' followed by two spaces, or 'circular'
> > > >> 64-64      space
> > > >> 65-67      The division code (see Section 3.3)
> > > >> 68-68      space
> > > >> 69-79      Date, in the form dd-MMM-yyyy (e.g., 15-MAR-1991)
> > >
> > >
> > > > I suppose there are issues with both, since if there are missing values, the locations in the array would be incorrect, and if the locus name is too long that might move the other fields out of position.
> > >
> > > Locus name longer than 16 characters is not officially allowed
> > > in the GenBank format.
> > >
> > > It is not so easy to allow parsing of non-standard GenBank format
> > > that breaks the above definition, partly because of avoiding
> > > potential conflicts with future versions of NCBI GenBank format.
> > > Only NCBI has the right to change the format definition.
> > > In addition, non-standard means that the format definition is
> > > ambiguous and not fixed. This also makes difficult to parse
> > > such kind of data.
> > >
> > > > And thanks for clarifying how to get access to the organism. It does seem a bit odd to me that the first genbank record would have an "organism" method that wasn't set, even if there is an organism name listed. For instance
> > > > gb.first
> > > > refers to a single genbank record, right?
> > > > So, what is
> > > > gb.first.organism
> > > > referring to, if not the organism of that record? I guess I have similar questions about the .classification and .taxonomy methods (both of which return empty values on the first genbank record).
> > >
> > > Each GenBank entry provided by NCBI has a SOURCE field and an ORGANISM
> > > subkeyword. See sections 3.4.2 and 3.4.10 in the GenBank Release
> > > Note. (ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt)
> > > According to section 3.4.2, SOURCE is a mandatory keyword.
> > > Bio::GenBank#organism, source, common_name, taxonomy and
> > > classification methods get their contents from the SOURCE and
> > > ORGANISM, not from the "source" feature in the feature table.
> > >
> > >
> > > Naohisa Goto
> > > ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org

From pjotr.public14 at thebird.nl Tue Sep 25 06:08:50 2012
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Tue, 25 Sep 2012 08:08:50 +0200
Subject: [BioRuby] [GSoC] GSoC week 2 status report
Message-ID: <20120925060850.GA1143@thebird.nl>

Hi John,

Congrats from the BioRuby panel and community on winning the Ruby Association Grant!

http://sciruby.com/blog/2012/09/24/sciruby-receives-ruby-association-grant--fellowships-available/

Pj.

From pjotr.public14 at thebird.nl Sun Sep 30 16:29:58 2012
From: pjotr.public14 at thebird.nl (Pjotr Prins)
Date: Sun, 30 Sep 2012 18:29:58 +0200
Subject: [BioRuby] Price for BioRuby/biogems character!
Message-ID: <20120930162958.GB23298@thebird.nl>

Hi list,

We are looking for a cartoon character, Japanese style, to represent BioRuby and biogems, and to make the website(s) attractive to a young(er) audience. We will credit the creator on the website, and he/she will win a prize.

Note: the cartoon must not carry any restrictive copyright. Best to create one yourself.

Pj.
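
On the ARGF question from the "Genbank file parsing question" thread: the difference between ARGV, ARGF, a path string and an opened file can be checked with the Ruby standard library alone. The sketch below does not call BioRuby. The first_line helper is a hypothetical stand-in for any parser that, like the documentation quoted in that thread says of Bio::FlatFile.new, expects an opened stream with a "gets" method. The sample line is made up.

```ruby
require "stringio"
require "tempfile"

# Hypothetical stand-in for a parser that needs an opened stream:
# it accepts anything that responds to #gets and rejects everything else.
def first_line(stream)
  unless stream.respond_to?(:gets)
    raise ArgumentError, "expected an opened stream, got #{stream.class}"
  end

  stream.gets
end

# ARGV is an Array of argument strings; ARGF is a stream over the files
# named in ARGV (or standard input when ARGV is empty).
puts "ARGV has #gets: #{ARGV.respond_to?(:gets)}"
puts "ARGF has #gets: #{ARGF.respond_to?(:gets)}"

Tempfile.create(["example", ".gb"]) do |tmp|
  tmp.puts "LOCUS       EXAMPLE01"
  tmp.flush

  # An opened File is a stream, so this works.
  File.open(tmp.path) { |io| puts first_line(io) }

  # A path is only a String, so it is rejected.
  begin
    first_line(tmp.path)
  rescue ArgumentError => e
    puts "rejected: #{e.message}"
  end
end

# Any object with #gets will do, for example StringIO.
puts first_line(StringIO.new("LOCUS       EXAMPLE01\n"))
```

So there is no special magic: a script that passes ARGF hands over a stream, while a script that passes "path/filename" hands over a String. Going by the quoted documentation ("Same as ::open, except that ..."), Bio::FlatFile.open is presumably the call to use when only a path is at hand.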