[BioPython] genbank annotation

Karin Lagesen karin.lagesen at medisin.uio.no
Fri Oct 8 09:21:15 EDT 2004


Hi!

I have two GenBank genome files: one of the old kind, where each region is
annotated twice (once as a gene and once as a CDS), and one where each
region appears only once.

What I would like to extract from this is the feature information, in
this sort of format:

Type    start   stop    direction       name


In the first case, where almost all regions are annotated twice, I'd like
to have only one of them included in the list.
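
To make the wanted result concrete, here is a rough sketch in plain Python
of the merging I have in mind. It does not use the Biopython parser at all
(that is the part I am missing). The Feature tuple, collapse_gene_cds and
format_row are names made up for illustration, and the rule "a gene entry
is a duplicate when a CDS on the same strand lies inside it" is only my
assumption about how the doubled regions should be recognised.

```python
from typing import List, NamedTuple


class Feature(NamedTuple):
    kind: str    # feature key as written in the file, e.g. "gene", "CDS", "RBS"
    start: int   # first base of the feature (1-based, as in the file)
    stop: int    # last base of the feature (inclusive)
    strand: str  # "+" for forward, "-" for complement
    name: str    # value of /gene, or "" when the feature has none


def collapse_gene_cds(features: List[Feature]) -> List[Feature]:
    """Keep one entry per region.

    Assumption: a "gene" feature is a duplicate when a CDS on the same
    strand lies inside it. The CDS is kept and takes the gene's name
    if it has no name of its own.
    """
    genes = [f for f in features if f.kind == "gene"]
    absorbed = set()
    kept = []
    for feat in features:
        if feat.kind == "CDS":
            for gene in genes:
                if (gene.strand == feat.strand
                        and gene.start <= feat.start
                        and feat.stop <= gene.stop):
                    absorbed.add(gene)
                    if not feat.name:
                        feat = feat._replace(name=gene.name)
                    break
        kept.append(feat)
    # Drop every gene entry that was matched by some CDS.
    return [f for f in kept if f not in absorbed]


def format_row(feat: Feature) -> str:
    """One tab-separated line: Type, start, stop, direction, name."""
    return "\t".join(
        [feat.kind, str(feat.start), str(feat.stop), feat.strand, feat.name]
    )


# The features of the first file shown below, typed in by hand.
features = [
    Feature("gene", 305, 1673, "+", "dnaA"),
    Feature("RBS", 305, 310, "+", ""),
    Feature("CDS", 318, 1673, "+", ""),
    Feature("gene", 1856, 3062, "+", "dnaN"),
    Feature("RBS", 1856, 1860, "+", ""),
    Feature("CDS", 1867, 3012, "+", ""),
]

rows = collapse_gene_cds(features)
for row in rows:
    print(format_row(row))
```

For the second file nothing would be dropped, since it only has CDS
entries and each already carries its own /gene name.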


Biopython has a GenBank parser which I'd like to use; however, I cannot
figure out how to use it to do this.


The files:

The first:

     source          1..2944528
                     /organism="Listeria monocytogenes"
                     /mol_type="genomic DNA"
                     /strain="EGD-e"
                     /db_xref="taxon:1639"
     gene            305..1673
                     /gene="dnaA"
     RBS             305..310
     CDS             318..1673
                     /codon_start=1
                     /transl_table=11
                     /product="Chromosomal replication initiation protein DnaA"
                     /protein_id="CAC98216.1"
                     /db_xref="GI:16409360"
                     /db_xref="GOA:Q8YAW2"
                     /db_xref="UniProt/Swiss-Prot:Q8YAW2"
                     /translation="MQSIEDIWQETLQIVKKNMSKPSYDTWMKSTTAHSLEGNTFIIS
                     APNNFVRDWLEKSYTQFIANILQEITGRLFDVRFIDGEQEENFEYTVIKPNPALDEDG
                     IEIGKHMLNPRYVFDTFVIGSGNRFAHAASLAVAEAPAKAYNPLFIYGGVGLGKTHLM
                     HAVGHYVQQHKDNAKVMYLSSEKFTNEFISSIRDNKTEEFRTKYRNVDVLLIDDIQFL
                     AGKEGTQEEFFHTFNTLYDEQKQIIISSDRPPKEIPTLEDRLRSRFEWGLITDITPPD
                     LETRIAILRKKAKADGLDIPNEVMLYIANQIDSNIRELEGALIRVVAYSSLVNKDITA
                     GLAAEALKDIIPSSKSQVITISGIQEAVGEYFHVRLEDFKAKKRTKSIAFPRQIAMYL
                     SRELTDASLPKIGDEFGGRDHTTVIHAHEKISQLLKTDQVLKNDLAEIEKNLRKAQNM
                     F"
     gene            1856..3062
                     /gene="dnaN"
     RBS             1856..1860
     CDS             1867..3012
                     /codon_start=1
                     /transl_table=11
                     /product="DNA polymerase III, beta chain"
                     /protein_id="CAC98217.1"
                     /db_xref="GI:16409361"
                     /db_xref="GOA:Q8YAW1"
                     /db_xref="UniProt/TrEMBL:Q8YAW1"
                     /translation="MKFVIERDRLVQAVNEVTRAISARTTIPILTGIKIVVNDEGVTL
                     TGSDSDISIEAFIPLIENDEVIVEVESFGGIVLQSKYFGDIVRRLPEENVEIEVTSNY
                     QTNISSGQASFTLNGLDPMEYPKLPEVTDGKTIKIPINVLKNIVRQTVFAVSAIEVRP
                     VLTGVNWIIKENKLSAVATDSHRLALREIPLETDIDEEYNIVIPGKSLSELNKLLDDA
                     SESIEMTLANNQILFKLKDLLFYSRLLEGSYPDTSRLIPTDTKSELVINSKAFLQAID
                     RASLLARENRNNVIKLMTLENGQVEVSSNSPEVGNVSENVFSQSFTGEEIKISFNGKY
                     MMDALRAFEGDDIQISFSGTMRPFVLRPKDAANPNEILQLITPVRTY"



The second:

     source          1..4214630
                     /strain=168
                     /organism="Bacillus subtilis subsp. subtilis str.
                     168"
                     /mol_type="genomic DNA"
                     /db_xref="taxon:224308"
     CDS             410..1750
                     /function="initiation of chromosome replication (DNA
                     synthesis)"
                     /gene="dnaA"
                     /protein_id="CAB11777.1"
                     /locus_tag="BSU00010"
                     /transl_table=11
                     /translation="MENILDLWNQALAQIEKKLSKPSFETWMKSTKAHSLQGDTLTIT
                     APNEFARDWLESRYLHLIADTIYELTGEELSIKFVIPQNQDVEDFMPKPQVKKAVKED
                     TSDFPQNMLNPKYTFDTFVIGSGNRFAHAASLAVAEAPAKAYNPLFIYGGVGLGKTHL
                     MHAIGHYVIDHNPSAKVVYLSSEKFTNEFINSIRDNKAVDFRNRYRNVDVLLIDDIQF
                     LAGKEQTQEEFFHTFNTLHEESKQIVISSDRPPKEIPTLEDRLRSRFEWGLITDITPP
                     DLETRIAILRKKAKAEGLDIPNEVMLYIANQIDSNIRELEGALIRVVAYSSLINKDIN
                     ADLAAEALKDIIPSSKPKVITIKEIQRVVGQQFNIKLEDFKAKKRTKSVAFPRQIAMY
                     LSREMTDSSLPKIGEEFGGRDHTTVIHAHEKISKLLADDEQLQQHVKEIKEQLK"
                     /db_xref="GOA:P05648"
                     /db_xref="SUBTILIS:BG10065"
                     /db_xref="SWISS-PROT:P05648"
                     /note="alternate gene name: dnaH, dnaJ, dnaK"
     CDS             1939..3075
                     /locus_tag="BSU00020"
                     /transl_table=11
                     /translation="MKFTIQKDRLVESVQDVLKAVSSRTTIPILTGIKIVASDDGVSF
                     TGSDSDISIESFIPKEEGDKEIVTIEQPGSIVLQARFFSEIVKKLPMATVEIEVQNQY
                     LTIIRSGKAEFNLNGLDADEYPHLPQIEEHHAIQIPTDLLKNLIRQTVFAVSTSETRP
                     ILTGVNWKVEQSELLCTATDSHRLALRKAKLDIPEDRSYNVVIPGKSLTELSKILDDN
                     QELVDIVITETQVLFKAKNVLFFSRLLDGNYPDTTSLIPQDSKTEIIVNTKEFLQAID
                     RASLLAREGRNNVVKLSAKPAESIEISSNSPEIGKVVEAIVADQIEGEELNISFSPKY
                     MLDALKVLEGAEIRVSFTGAMRPFLIRTPNDETIVQLILPVRTY"
                     /product="DNA polymerase III (beta subunit)"
                     /function="DNA synthesis"
                     /gene="dnaN"
                     /EC_number="2.7.7.7"
                     /protein_id="CAB11778.1"
                     /db_xref="GOA:P05649"
                     /db_xref="SUBTILIS:BG10066"
                     /db_xref="SWISS-PROT:P05649"
                     /note="alternate gene name: dnaG, dnaK"



Karin
-- 
Karin Lagesen, PhD student
karin.lagesen at medisin.uio.no
http://www.cmbn.no/rognes/

