[BioRuby] Genbank file parsing question
Josh Earl
joshearl1 at hotmail.com
Mon Sep 17 15:39:44 UTC 2012
Hi Nick,
Yeah, sorry about the genbank example, it appears to have lost all the formatting when I sent the email. This might be more handy:
http://pastebin.com/N1D7jUuu
I'm running into several issues. The first: when I load the file that the above excerpt comes from and then call methods on it, this is what happens (for example):
bioruby> gb = Bio::FlatFile.new(Bio::GenBank, '/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk')
  ==> #<Bio::FlatFile:0x00000005237800 @stream=#<Bio::FlatFile::BufferedInputStream:0x000000051bd3c0 @io="/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk", @path=nil, @buffer="">, @dbclass=Bio::GenBank, @splitter=#<Bio::FlatFile::Splitter::Default:0x000000050f3778 @dbclass=Bio::GenBank, @stream=#<Bio::FlatFile::BufferedInputStream:0x000000051bd3c0 @io="/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk", @path=nil, @buffer="">, @entry_pos_flag=nil, @delimiter="\n//\n", @header="LOCUS ", @delimiter_overrun=nil>, @skip_leader_mode=:firsttime, @firsttime_flag=true, @raw=false>

But, if I try to call any methods on this:

bioruby> gb.first
NoMethodError: private method `gets' called for "/mnt/p/o_drive/Homes/jearl/Magee/Atopobium_vaginae.gbk":String
  from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile/buffer.rb:251:in `gets'
  from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile/splitter.rb:161:in `skip_leader'
  from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile.rb:283:in `next_entry'
  from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/lib/bio/io/flatfile.rb:335:in `each_entry'
  from (irb):4:in `first'
  from (irb):4
  from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:41:in `block in <top (required)>'
  from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:40:in `catch'
  from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/gems/bio-1.4.3/bin/bioruby:40:in `<top (required)>'
  from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/bioruby:19:in `load'
  from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/bioruby:19:in `<main>'
  from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/ruby_noexec_wrapper:14:in `eval'
  from /home/josh/.rvm/gems/ruby-1.9.2-p290@proj2/bin/ruby_noexec_wrapper:14:in `<main>'
Opening these files with Bio::FlatFile.auto('Atopobium_vaginae.gbk') seems to work inconsistently, but this particular file opens OK. Also, Bio::GenBank.new('Atopobium_vaginae.gbk') will open this file and seems to work the most consistently.
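
A note on the NoMethodError: the trace ends with `gets' being called on a String, which suggests that Bio::FlatFile.new was given a file name where it expects an already-open IO stream. BioRuby also offers Bio::FlatFile.open(Bio::GenBank, path), which takes a path. The sketch below reproduces the same failure mode with the standard library only. LineReader is a hypothetical stand-in for a buffered stream wrapper, not BioRuby code, and StringIO stands in for File.open(path).

```ruby
require 'stringio'

# Hypothetical stand-in for a reader that wraps an IO-like object.
class LineReader
  def initialize(io)
    @io = io
  end

  # Delegates to the wrapped object's public #gets, as an IO wrapper would.
  def first_line
    @io.gets
  end
end

# Passing a path String: String has no public #gets (Kernel#gets is private),
# so the call raises NoMethodError, much like the trace in this message.
begin
  LineReader.new('Atopobium_vaginae.gbk').first_line
  string_result = :no_error
rescue NoMethodError => e
  string_result = e.class
end

# Passing an IO-like object works; StringIO stands in for File.open(path).
io_result = LineReader.new(StringIO.new("LOCUS demo\n//\n")).first_line

puts string_result
puts io_result
```

If that reading is right, either opening the file first and passing the File object, or using the path-accepting open call, should avoid the error.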
Loading into this object truncates the locus ID from:
  ctg7180000000048
to:
  ctg7180000
i.e.
  bioruby> gb.first.locus.entry_id
    ==> "ctg7180000"
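
The truncated ID is exactly ten characters long, which would be consistent with the locus name being read from a fixed-width field of ten columns rather than being split on whitespace. That is an assumption about the cause, not something confirmed against the BioRuby source here. A minimal sketch of the difference follows; the column spacing of the sample LOCUS line is assumed, because the original spacing was lost in the email.

```ruby
# Sample LOCUS line built from the contig quoted in this thread.
# The spacing is assumed: "LOCUS" padded to 12 columns, then the name.
locus_line = 'LOCUS'.ljust(12) + 'ctg7180000000028    4191 bp    DNA     linear   UNK'

# Fixed-width read of columns 13-22 (0-based 12..21): at most 10 characters.
fixed_width_id = locus_line[12..21].strip

# Whitespace-delimited read: the second token, whatever its length.
token_id = locus_line.split(/\s+/)[1]

puts fixed_width_id
puts token_id
```

A whitespace-based read keeps the full 16-character contig name, at the cost of no longer following a strict column layout.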
And if I attempt to say something like:
  bioruby> gb.first.organism
    ==> ""
It is just an empty string. Does this variable not get set for each genbank entry? The organism is listed under the "source" attribute in the file.
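
In the excerpt, the organism appears only as an /organism qualifier on the "source" feature; there is no top-level SOURCE/ORGANISM header block, which may be why the value comes back empty (an assumption, not verified against BioRuby). One possible workaround is to read the qualifier from the feature table directly. The sketch below uses only the standard library; organism_from_features is a hypothetical helper name, and the whitespace layout of the sample is assumed.

```ruby
# Feature-table excerpt based on the RAST record quoted in this thread.
# The layout is assumed; the email flattened the original whitespace.
features = <<~GENBANK
  FEATURES             Location/Qualifiers
       source          1..4191
                       /mol_type="genomic DNA"
                       /db_xref="taxon: 82135"
                       /organism="Atopobium vaginae B758"
GENBANK

# Hypothetical helper: returns the first /organism qualifier value, or nil.
def organism_from_features(text)
  match = text.match(/\/organism="([^"]*)"/)
  match && match[1]
end

puts organism_from_features(features)
```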
Not all of these are really errors per se, but odd behavior.
~josh
> Hi Josh,
>
> I've used the Bio gem to parse several GenBank files from NCBI. The snippet you provided looks like it should be handled correctly, except that it is missing newlines.
>
> Could you provide more specific details about the errors you are receiving?
>
> -Nick
>
> --
> Nick Thrower
> Information Technologist
> Michigan State University
> Great Lakes Bioenergy Research Center
> East Lansing MI 48824
>
> On Sep 14, 2012, at 12:00 PM, bioruby-request at lists.open-bio.org wrote:
>
> > Today's Topics:
> >
> > 1. Genbank file parsing question (Josh Earl)
> >
> > From: Josh Earl <joshearl1 at hotmail.com>
> > Date: September 13, 2012 1:50:34 PM EDT
> > To: <bioruby at lists.open-bio.org>
> > Subject: [BioRuby] Genbank file parsing question
> >
> >
> >
> > Hello all,
> > I'm trying to use BioRuby to write a small program that will allow me to rearrange the contigs in a GenBank file (based on either a list of contig names, Mauve output, or whatever). The idea is that the annotation service that we use (RAST - http://rast.nmpdr.org ) produces GenBank files, but not in exactly the format that BioRuby is expecting. They seem to outright break the Bio::FlatFile.new(Bio::GenBank, 'file') call, and confuse the Bio::GenBank.open('file') function. My question is, what should I do? Write my own parser, try to fiddle with the BioRuby implementation, or something else entirely? I'm fairly new to Ruby, but I've been programming for a long time.
> > ~josh
> >
> > P.S. Here is a short section of what the RAST GenBank file looks like (just a single short contig):
> > LOCUS ctg7180000000028 4191 bp DNA linear UNK DEFINITION Contig ctg7180000000028 from Atopobium vaginae B758ACCESSION unknownFEATURES Location/Qualifiers source 1..4191 /mol_type="genomic DNA" /db_xref="taxon: 82135" /genome_md5="" /project="earl_82135" /genome_id="82135.3" /organism="Atopobium vaginae B758" CDS complement(10..1740) /translation="MKLAQLKMVCRGENAGFACIQLCEKPALLKVHAHTKDTNMPCPA RLVCLDELYGVRSSIDDAATGSWWVVIIPLLSVDCVVELSITRASQGLSSDTWSFVFG PHTSRYMSRLLTLRHPQAAALLRRIVHSAAYVHHQLNLIGIWNAAAQTSDSPDVPMRI WRFEARFTCDNSPAFEYFPISCCVLSSTGEPIRARVITLEEQTAAVPDDSAECVRRAV FSIALPHACTHAVVCARLNFKHVASLSCDGGDRARALQKASASTYEAFYTIFPAAAAA
> > RIAEAERFSRDCAHDPHYERWFDEHAATSEQCAMQTRRYEEACACMEHREESTTHADQ PAQPVQLAQPAQPAQTSFDDALAHMGISVVLPVFSASTTLLARSINAMIHQSFPAWQL IVLDCTHMRTPNQQTDIARWLHSYTKTDARIMYVRMNVEQSQKNQAVAGESLSDSGAH AAGVAQPTDDIIDASRDHYAHHSSYLAYACSFIQNPYVYIMSEGAAPTPDALWHIAQT VAQHIAKGTPCDVVHVDEDELTPQGCTKPHVSYAASMIGLEGTNYLGHSLVLRTALLD ELRAPCDVAT" /product="hypothetical protein" CDS complement(1759..1875) /translation="MPRMYNAHADKVLQKRSKRRCTRLAPAEPHVVRVLLNL" /product="hypothetical protein" CDS complement(1844..2461) /db_xref="GO:0008830" /translation="MTQRTETSQAIQSGNFIFTPTSIRDVIIVDTKQYGDARGYFMET YKASDFAAGGISTTFVQDNQSSSTKGVLRGLHFQIEHPQAKLVRVVRGCVFDVAVDLR AGSETFGAWEGVELSAENHRQFYIPRGFAHGFFVLSDEAEFCYKCDDVYHPGDEGGLM WNDPDLAISWPAPCGCDSFSPSQVILSDKDTHHESFAAYVQRTRG"
> > /product="dTDP-4-dehydrorhamnose 3,5-epimerase (EC 5.1.3.13)" /EC_number="5.1.3.13" CDS complement(2586..2741) /translation="MRTLSTYTSIFEQICISLGRALVLSCSMAARKIALADAPAGARC VALCAFP" /product="hypothetical protein" CDS complement(2798..3193) /translation="MSAILAIVPAYNEQQCIQQTIDELRRVCPGVDYLIVNDGSRDET AAICRRRHFNYINVPINCGLASGVQAGMKYAERNGYSAVVQFDADGQHKPEYIVPMYE HMQKTGADVVIGSRFVDDALLLVVCRHNC" /product="Glycosyltransferase involved in cell wall biogenesis (EC 2.4.-.-)" CDS 3238..3393 /translation="MRTLSTYTSIFEQICISLGRALVLSCSMAARKIALADAPAGARC VALCAFP" /product="hypothetical protein" CDS 3518..4135 /db_xref="GO:0008830" /translation="MTQRTETSQAIQSGNFIFTPTSIRDVIIVDTKQYGDARGYFMET
> > YKASDFAAGGISTTFVQDNQSSSTKGVLRGLHFQIEHPQAKLVRVVRGCVFDVAVDLR AGSETFGAWEGVELSAENHRQFYIPRGFAHGFFVLSDEAEFCYKCDDVYHPGDEGGLM WNDPDLAISWPAPCGCDSFSPSQVILSDKDTHHESFAAYVQRTRG" /product="dTDP-4-dehydrorhamnose 3,5-epimerase (EC 5.1.3.13)" /EC_number="5.1.3.13"BASE COUNT 1077 a 1055 c 1036 g 1023 tORIGIN 1 aatcgcgctt catgttgcaa catcgcatgg cgcgcgaagt tcatccaaaa gcgcagttct 61 caacacgagt gaatgtccaa gatagttagt gccttcaagc ccaatcatgc ttgctgcata 121 actcacgtga ggctttgtgc agccttgggg cgtgagctca tcttcatcaa catgtacaac 181 atcgcagggt gtaccttttg ctatgtgctg tgctaccgtt tgtgcaatat gccacagggc 241 atcgggcgtg ggagctgccc cctcactcat aatgtaaacg tacgggtttt gtataaacga 301 acatgcatac gcaagatacg agctatggtg cgcataatgg tcgcgagatg catcgataat 361 atcatctgta ggctgtgcta cgccagcagc atgcgcgcct gaatctgaca atgattcccc 421 agctacagct tggtttttct gtgattgttc cacgttcata cgcacataca taatgcgcgc
> > 481 atcggtctta gtatagctat gaagccagcg tgcaatatct gtttgttggt tgggagtgcg 541 catgtgtgta caatcgagca cgataagctg ccatgccgga aaactctgat gtatcatcgc 601 gttaatactg cgcgcaagca gtgtagtcga tgctgaaaaa acgggcagta ccaccgaaat 661 acccatgtgt gcaagcgcat catcaaacga tgtctgtgca ggttgtgcag gttgcgcaag 721 ctggacaggt tgtgcaggct ggtcagcatg cgtggtgctc tcttcgcggt gttccataca 781 cgcgcacgcc tcttcgtacc tgcgtgtttg catagcgcac tgctcagacg tagctgcatg 841 ctcatcaaac cagcgctcat agtgaggatc gtgagcgcaa tcgcgactaa aacgctcggc 901 ctcagcaatg cgcgcggccg cagcagcagg aaaaatggta taaaaggctt cgtaggtaga 961 tgcagacgct ttttgcaacg cgcgggctcg atctccaccg tcgcagctca acgatgccac 1021 atgcttaaag ttgaggcgcg cgcacacaac agcatgcgtg cacgcgtgcg gcaatgcaat 1081 cgaaaacacc gcacgacgaa cgcattcagc gctgtcatcc ggaacagctg ccgtttgctc 1141 ttctagcgta attacacgtg cacgtatggg ctcacctgta ctgctcagca cgcagcagct 1201 tataggaaaa tattcaaacg caggcgaatt gtcgcaggta aaccgcgctt caaatcgcca 1261 tatacgcatc ggcacatcgg gagagtcgct tgtttgcgca gcggcattcc aaataccaat
> > 1321 aagattaagc tgatggtgca catatgcagc actgtgaacg atgcggcgta gcaacgcggc 1381 cgcttgagga tggcggagcg taagtaaacg cgacatgtag cgcgacgtat gaggaccaaa 1441 aacaaacgac cacgtatcag aactcagacc ctgcgacgca cgtgttatgc tgagctcaac 1501 cacacaatca acgctcaaca gcggaataat aacaacccac cacgaccccg ttgcagcatc 1561 atcaatgctc gaacgaacgc catagagctc gtctaaacac acaagacgcg caggacacgg 1621 catattcgta tctttggtat gtgcatgcac tttaagcaat gcgggctttt cgcacagctg 1681 aatgcaggca aaccctgcgt tttcgccgcg gcaaaccatt tttaattgcg cgagtttcat 1741 gcagtcccct tactgttgtt aaagattaag cagcacgcgc acaacatgcg gctctgcggg 1801 cgcaagtcgt gtgcagcgcc tttttgagcg cttttggagc accttatccg cgtgtgcgtt 1861 gtacatacgc ggcaaacgat tcatgatggg tatctttatc agacaaaata acctgcgagg 1921 gcgaaaagct atcgcagcca caaggcgcag gccagctaat agcaagatcg ggatcgttcc 1981 acataaggcc accttcatcg cctggatgat acacgtcgtc gcacttatag caaaattctg 2041 cctcatctga gagtacaaaa aatccgtgag caaagccgcg cggtatatag aattgtcgat
> > 2101 gattttcggc cgataattca acgccttccc atgcaccaaa ggtctctgaa cctgcgcgca 2161 agtctaccgc aacatcaaac acacagccac gcacaacacg aacgagtttt gcttgagggt 2221 gttcaatctg aaaatgcagg ccacgaagca cgccttttgt ggagctcgat tggttatcct 2281 gtacaaacgt agtagaaata ccacccgcag caaaatcgga tgctttgtac gtttccataa 2341 agtacccgcg cgcgtcacca tactgtttgg tatcaacaat aataacgtcg cgaatagatg 2401 taggtgtaaa aataaaattg cccgattgaa tagcctgaga tgtttctgta cgctgtgtca 2461 taccttaact cctttagcgc gcgctccttt agcgcacaaa tatgcgctaa cgaattgtgc 2521 aatacctgca acggtttatt tcattgtagc gcacgaatat acgcctacga tatgttcata 2581 caatgctacg gaaatgcaca cagcgcgaca caccgtgcgc ccgcaggcgc atctgccagt 2641 gcaatcttcc ttgccgccat gctacacgaa agaacaagtg cgcgccccaa tgatatgcag 2701 atttgctcaa aaatagaggt atacgtgctc aaagtacgca atggtgcggt atacttgcac 2761 agatatacca actttatgga gaactatgtc tgcaatatta gcaattgtgc ctgcatacaa 2821 cgagcagcag tgcatcgtca acaaaacgcg aaccaataac aacgtcggca cctgtttttt 2881 gcatgtgctc gtacatgggc acgatgtact cgggtttgtg ctgaccatcg gcatcaaatt 2941 gcacaactgc agaatagcca ttgcgctctg catatttcat gccagcttga acgcccgaag
> > 3001 caaggccgca gttaatgggc acatttatgt agttaaaatg gcgcctacgg catatagctg 3061 cggtttcgtc gcgcgagccg tcgtttacaa taaggtagtc tacgccgggg cacacgcggc 3121 gcagctcatc gattgtttgt tgtatgcact gctgctcgtt gtatgcaggc acaattgcta 3181 atattgcaga catagttctc cataaagttg gtatatctgt gcaagtatac cgcaccattg 3241 cgtactttga gcacgtatac ctctattttt gagcaaatct gcatatcatt ggggcgcgca 3301 cttgttcttt cgtgtagcat ggcggcaagg aagattgcac tggcagatgc gcctgcgggc 3361 gcacggtgtg tcgcgctgtg tgcatttccg tagcattgta tgaacatatc gtaggcgtat 3421 attcgtgcgc tacaatgaaa taaaccgttg caggtattgc acaattcgtt agcgcatatt 3481 tgtgcgctaa aggagcgcgc gctaaaggag ttaaggtatg acacagcgta cagaaacatc 3541 tcaggctatt caatcgggca attttatttt tacacctaca tctattcgcg acgttattat 3601 tgttgatacc aaacagtatg gtgacgcgcg cgggtacttt atggaaacgt acaaagcatc 3661 cgattttgct gcgggtggta tttctactac gtttgtacag gataaccaat cgagctccac 3721 aaaaggcgtg cttcgtggcc tgcattttca gattgaacac cctcaagcaa aactcgttcg
> > 3781 tgttgtgcgt ggctgtgtgt ttgatgttgc ggtagacttg cgcgcaggtt cagagacctt 3841 tggtgcatgg gaaggcgttg aattatcggc cgaaaatcat cgacaattct atataccgcg 3901 cggctttgct cacggatttt ttgtactctc agatgaggca gaattttgct ataagtgcga 3961 cgacgtgtat catccaggcg atgaaggtgg ccttatgtgg aacgatcccg atcttgctat 4021 tagctggcct gcgccttgtg gctgcgatag cttttcgccc tcgcaggtta ttttgtctga 4081 taaagatacc catcatgaat cgtttgccgc gtatgtacaa cgcacacgcg gataaggtgc 4141 tccaaaagcg ctcaaaaagg cgctgcacac gacttgcgcc cggcagagcc g//
> > Center for Genomic Sciences
> > (412)-359-8341
> >
> >
> >
> > _______________________________________________
> > BioRuby mailing list
> > BioRuby at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/bioruby
>
> End of BioRuby Digest, Vol 84, Issue 6
> **************************************