[BioRuby] Genbank file parsing question
Josh Earl
joshearl1 at hotmail.com
Thu Sep 13 17:50:34 UTC 2012
Hello all,
I'm trying to use BioRuby to write a small program that will let me rearrange the contigs in a GenBank file (based on a list of contig names, Mauve output, or whatever). The annotation service that we use (RAST, http://rast.nmpdr.org ) produces GenBank files, but not in exactly the format that BioRuby expects. They seem to outright break Bio::FlatFile.new(Bio::GenBank, 'file') and confuse Bio::GenBank.open('file'). My question is, what should I do? Write my own parser, try to fiddle with the BioRuby implementation, or something else entirely? I'm fairly new to Ruby, but I've been programming for a long time.
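To make the goal concrete, here is a minimal sketch of the reordering step that sidesteps the parser altogether: it treats the file as plain text, cuts it into records at each `//` line, and writes the records back out in the requested order. The method names (split_records, locus_name, reorder_records) and the file names in the usage comment are made up for illustration. It assumes every record starts with a LOCUS line whose second field is the contig name and ends with a line containing only `//`, which is an assumption about RAST output rather than something verified across files.

```ruby
# Cut GenBank text into records. A record ends at a line that is just "//".
def split_records(text)
  records = []
  current = String.new
  text.each_line do |line|
    current << line
    if line.strip == "//"
      records << current
      current = String.new
    end
  end
  # Keep any trailing text that lacks a terminator instead of dropping it.
  records << current unless current.strip.empty?
  records
end

# Contig name = second whitespace-separated field of the LOCUS line
# (assumption: RAST always writes the name there). Returns nil if no LOCUS line.
def locus_name(record)
  line = record.each_line.find { |l| l.start_with?("LOCUS") }
  line && line.split[1]
end

# Return the text with records in wanted_order first; records that are not
# listed follow in their original order.
def reorder_records(text, wanted_order)
  by_name = {}
  split_records(text).each { |rec| by_name[locus_name(rec)] = rec }

  missing = wanted_order.reject { |name| by_name.key?(name) }
  raise ArgumentError, "unknown contigs: #{missing.join(', ')}" unless missing.empty?

  listed = wanted_order.map { |name| by_name[name] }
  rest = by_name.reject { |name, _| wanted_order.include?(name) }.values
  (listed + rest).join
end

# Hypothetical usage (file names are placeholders):
#   order = File.readlines("contig_order.txt", chomp: true)
#   File.write("reordered.gbk", reorder_records(File.read("rast.gbk"), order))
```

Because nothing inside a record is interpreted, this leaves the annotations byte-for-byte intact, but it also cannot validate them; anything beyond reordering (editing features, renumbering) would still need a real parser.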
~josh
P.S. Here is a short section of what the RAST GenBank file looks like (just a single short contig):
LOCUS       ctg7180000000028        4191 bp    DNA     linear   UNK
DEFINITION  Contig ctg7180000000028 from Atopobium vaginae B758
ACCESSION   unknown
FEATURES             Location/Qualifiers
     source          1..4191
                     /mol_type="genomic DNA"
                     /db_xref="taxon: 82135"
                     /genome_md5=""
                     /project="earl_82135"
                     /genome_id="82135.3"
                     /organism="Atopobium vaginae B758"
     CDS             complement(10..1740)
                     /translation="MKLAQLKMVCRGENAGFACIQLCEKPALLKVHAHTKDTNMPCPA
                     RLVCLDELYGVRSSIDDAATGSWWVVIIPLLSVDCVVELSITRASQGLSSDTWSFVFG
                     PHTSRYMSRLLTLRHPQAAALLRRIVHSAAYVHHQLNLIGIWNAAAQTSDSPDVPMRI
                     WRFEARFTCDNSPAFEYFPISCCVLSSTGEPIRARVITLEEQTAAVPDDSAECVRRAV
                     FSIALPHACTHAVVCARLNFKHVASLSCDGGDRARALQKASASTYEAFYTIFPAAAAA
                     RIAEAERFSRDCAHDPHYERWFDEHAATSEQCAMQTRRYEEACACMEHREESTTHADQ
                     PAQPVQLAQPAQPAQTSFDDALAHMGISVVLPVFSASTTLLARSINAMIHQSFPAWQL
                     IVLDCTHMRTPNQQTDIARWLHSYTKTDARIMYVRMNVEQSQKNQAVAGESLSDSGAH
                     AAGVAQPTDDIIDASRDHYAHHSSYLAYACSFIQNPYVYIMSEGAAPTPDALWHIAQT
                     VAQHIAKGTPCDVVHVDEDELTPQGCTKPHVSYAASMIGLEGTNYLGHSLVLRTALLD
                     ELRAPCDVAT"
                     /product="hypothetical protein"
     CDS             complement(1759..1875)
                     /translation="MPRMYNAHADKVLQKRSKRRCTRLAPAEPHVVRVLLNL"
                     /product="hypothetical protein"
     CDS             complement(1844..2461)
                     /db_xref="GO:0008830"
                     /translation="MTQRTETSQAIQSGNFIFTPTSIRDVIIVDTKQYGDARGYFMET
                     YKASDFAAGGISTTFVQDNQSSSTKGVLRGLHFQIEHPQAKLVRVVRGCVFDVAVDLR
                     AGSETFGAWEGVELSAENHRQFYIPRGFAHGFFVLSDEAEFCYKCDDVYHPGDEGGLM
                     WNDPDLAISWPAPCGCDSFSPSQVILSDKDTHHESFAAYVQRTRG"
                     /product="dTDP-4-dehydrorhamnose 3,5-epimerase (EC 5.1.3.13)"
                     /EC_number="5.1.3.13"
     CDS             complement(2586..2741)
                     /translation="MRTLSTYTSIFEQICISLGRALVLSCSMAARKIALADAPAGARC
                     VALCAFP"
                     /product="hypothetical protein"
     CDS             complement(2798..3193)
                     /translation="MSAILAIVPAYNEQQCIQQTIDELRRVCPGVDYLIVNDGSRDET
                     AAICRRRHFNYINVPINCGLASGVQAGMKYAERNGYSAVVQFDADGQHKPEYIVPMYE
                     HMQKTGADVVIGSRFVDDALLLVVCRHNC"
                     /product="Glycosyltransferase involved in cell wall biogenesis (EC 2.4.-.-)"
     CDS             3238..3393
                     /translation="MRTLSTYTSIFEQICISLGRALVLSCSMAARKIALADAPAGARC
                     VALCAFP"
                     /product="hypothetical protein"
     CDS             3518..4135
                     /db_xref="GO:0008830"
                     /translation="MTQRTETSQAIQSGNFIFTPTSIRDVIIVDTKQYGDARGYFMET
                     YKASDFAAGGISTTFVQDNQSSSTKGVLRGLHFQIEHPQAKLVRVVRGCVFDVAVDLR
                     AGSETFGAWEGVELSAENHRQFYIPRGFAHGFFVLSDEAEFCYKCDDVYHPGDEGGLM
                     WNDPDLAISWPAPCGCDSFSPSQVILSDKDTHHESFAAYVQRTRG"
                     /product="dTDP-4-dehydrorhamnose 3,5-epimerase (EC 5.1.3.13)"
                     /EC_number="5.1.3.13"
BASE COUNT     1077 a   1055 c   1036 g   1023 t
ORIGIN
        1 aatcgcgctt catgttgcaa catcgcatgg cgcgcgaagt tcatccaaaa gcgcagttct
       61 caacacgagt gaatgtccaa gatagttagt gccttcaagc ccaatcatgc ttgctgcata
      121 actcacgtga ggctttgtgc agccttgggg cgtgagctca tcttcatcaa catgtacaac
      181 atcgcagggt gtaccttttg ctatgtgctg tgctaccgtt tgtgcaatat gccacagggc
      241 atcgggcgtg ggagctgccc cctcactcat aatgtaaacg tacgggtttt gtataaacga
      301 acatgcatac gcaagatacg agctatggtg cgcataatgg tcgcgagatg catcgataat
      361 atcatctgta ggctgtgcta cgccagcagc atgcgcgcct gaatctgaca atgattcccc
      421 agctacagct tggtttttct gtgattgttc cacgttcata cgcacataca taatgcgcgc
      481 atcggtctta gtatagctat gaagccagcg tgcaatatct gtttgttggt tgggagtgcg
      541 catgtgtgta caatcgagca cgataagctg ccatgccgga aaactctgat gtatcatcgc
      601 gttaatactg cgcgcaagca gtgtagtcga tgctgaaaaa acgggcagta ccaccgaaat
      661 acccatgtgt gcaagcgcat catcaaacga tgtctgtgca ggttgtgcag gttgcgcaag
      721 ctggacaggt tgtgcaggct ggtcagcatg cgtggtgctc tcttcgcggt gttccataca
      781 cgcgcacgcc tcttcgtacc tgcgtgtttg catagcgcac tgctcagacg tagctgcatg
      841 ctcatcaaac cagcgctcat agtgaggatc gtgagcgcaa tcgcgactaa aacgctcggc
      901 ctcagcaatg cgcgcggccg cagcagcagg aaaaatggta taaaaggctt cgtaggtaga
      961 tgcagacgct ttttgcaacg cgcgggctcg atctccaccg tcgcagctca acgatgccac
     1021 atgcttaaag ttgaggcgcg cgcacacaac agcatgcgtg cacgcgtgcg gcaatgcaat
     1081 cgaaaacacc gcacgacgaa cgcattcagc gctgtcatcc ggaacagctg ccgtttgctc
     1141 ttctagcgta attacacgtg cacgtatggg ctcacctgta ctgctcagca cgcagcagct
     1201 tataggaaaa tattcaaacg caggcgaatt gtcgcaggta aaccgcgctt caaatcgcca
     1261 tatacgcatc ggcacatcgg gagagtcgct tgtttgcgca gcggcattcc aaataccaat
     1321 aagattaagc tgatggtgca catatgcagc actgtgaacg atgcggcgta gcaacgcggc
     1381 cgcttgagga tggcggagcg taagtaaacg cgacatgtag cgcgacgtat gaggaccaaa
     1441 aacaaacgac cacgtatcag aactcagacc ctgcgacgca cgtgttatgc tgagctcaac
     1501 cacacaatca acgctcaaca gcggaataat aacaacccac cacgaccccg ttgcagcatc
     1561 atcaatgctc gaacgaacgc catagagctc gtctaaacac acaagacgcg caggacacgg
     1621 catattcgta tctttggtat gtgcatgcac tttaagcaat gcgggctttt cgcacagctg
     1681 aatgcaggca aaccctgcgt tttcgccgcg gcaaaccatt tttaattgcg cgagtttcat
     1741 gcagtcccct tactgttgtt aaagattaag cagcacgcgc acaacatgcg gctctgcggg
     1801 cgcaagtcgt gtgcagcgcc tttttgagcg cttttggagc accttatccg cgtgtgcgtt
     1861 gtacatacgc ggcaaacgat tcatgatggg tatctttatc agacaaaata acctgcgagg
     1921 gcgaaaagct atcgcagcca caaggcgcag gccagctaat agcaagatcg ggatcgttcc
     1981 acataaggcc accttcatcg cctggatgat acacgtcgtc gcacttatag caaaattctg
     2041 cctcatctga gagtacaaaa aatccgtgag caaagccgcg cggtatatag aattgtcgat
     2101 gattttcggc cgataattca acgccttccc atgcaccaaa ggtctctgaa cctgcgcgca
     2161 agtctaccgc aacatcaaac acacagccac gcacaacacg aacgagtttt gcttgagggt
     2221 gttcaatctg aaaatgcagg ccacgaagca cgccttttgt ggagctcgat tggttatcct
     2281 gtacaaacgt agtagaaata ccacccgcag caaaatcgga tgctttgtac gtttccataa
     2341 agtacccgcg cgcgtcacca tactgtttgg tatcaacaat aataacgtcg cgaatagatg
     2401 taggtgtaaa aataaaattg cccgattgaa tagcctgaga tgtttctgta cgctgtgtca
     2461 taccttaact cctttagcgc gcgctccttt agcgcacaaa tatgcgctaa cgaattgtgc
     2521 aatacctgca acggtttatt tcattgtagc gcacgaatat acgcctacga tatgttcata
     2581 caatgctacg gaaatgcaca cagcgcgaca caccgtgcgc ccgcaggcgc atctgccagt
     2641 gcaatcttcc ttgccgccat gctacacgaa agaacaagtg cgcgccccaa tgatatgcag
     2701 atttgctcaa aaatagaggt atacgtgctc aaagtacgca atggtgcggt atacttgcac
     2761 agatatacca actttatgga gaactatgtc tgcaatatta gcaattgtgc ctgcatacaa
     2821 cgagcagcag tgcatcgtca acaaaacgcg aaccaataac aacgtcggca cctgtttttt
     2881 gcatgtgctc gtacatgggc acgatgtact cgggtttgtg ctgaccatcg gcatcaaatt
     2941 gcacaactgc agaatagcca ttgcgctctg catatttcat gccagcttga acgcccgaag
     3001 caaggccgca gttaatgggc acatttatgt agttaaaatg gcgcctacgg catatagctg
     3061 cggtttcgtc gcgcgagccg tcgtttacaa taaggtagtc tacgccgggg cacacgcggc
     3121 gcagctcatc gattgtttgt tgtatgcact gctgctcgtt gtatgcaggc acaattgcta
     3181 atattgcaga catagttctc cataaagttg gtatatctgt gcaagtatac cgcaccattg
     3241 cgtactttga gcacgtatac ctctattttt gagcaaatct gcatatcatt ggggcgcgca
     3301 cttgttcttt cgtgtagcat ggcggcaagg aagattgcac tggcagatgc gcctgcgggc
     3361 gcacggtgtg tcgcgctgtg tgcatttccg tagcattgta tgaacatatc gtaggcgtat
     3421 attcgtgcgc tacaatgaaa taaaccgttg caggtattgc acaattcgtt agcgcatatt
     3481 tgtgcgctaa aggagcgcgc gctaaaggag ttaaggtatg acacagcgta cagaaacatc
     3541 tcaggctatt caatcgggca attttatttt tacacctaca tctattcgcg acgttattat
     3601 tgttgatacc aaacagtatg gtgacgcgcg cgggtacttt atggaaacgt acaaagcatc
     3661 cgattttgct gcgggtggta tttctactac gtttgtacag gataaccaat cgagctccac
     3721 aaaaggcgtg cttcgtggcc tgcattttca gattgaacac cctcaagcaa aactcgttcg
     3781 tgttgtgcgt ggctgtgtgt ttgatgttgc ggtagacttg cgcgcaggtt cagagacctt
     3841 tggtgcatgg gaaggcgttg aattatcggc cgaaaatcat cgacaattct atataccgcg
     3901 cggctttgct cacggatttt ttgtactctc agatgaggca gaattttgct ataagtgcga
     3961 cgacgtgtat catccaggcg atgaaggtgg ccttatgtgg aacgatcccg atcttgctat
     4021 tagctggcct gcgccttgtg gctgcgatag cttttcgccc tcgcaggtta ttttgtctga
     4081 taaagatacc catcatgaat cgtttgccgc gtatgtacaa cgcacacgcg gataaggtgc
     4141 tccaaaagcg ctcaaaaagg cgctgcacac gacttgcgcc cggcagagcc g
//
Center for Genomic Sciences
(412)-359-8341