[BioPython] Sorry, one more time: extract data from a large .gbk file

Michael Cariaso cariaso at yahoo.com
Mon Jan 2 17:05:43 EST 2006


I can't be sure until we see some code, but what you are doing seems 
well within BioPython's abilities. I wonder if perhaps your code looks 
like this:

alist = readWholeFile(filename)
for record in alist:
     process(record)


which would explain the performance problems. If your code does 
resemble the above, try changing it so that it looks more like:

recordIterator = createIterator(filename)
for record in recordIterator:
     process(record)


or

fileobj = open(filename)
done = False
while not done:
     record = readNextRecord(fileobj)
     if record:
         process(record)
     else:
         done = True


The first form reads the whole GenBank file into memory and might crash 
your machine. The second and third forms read in one record at a time 
and process it before moving on to the next. This requires far less 
memory.
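
To make the one-at-a-time idea concrete, here is a minimal sketch in plain 
Python. It is hand-rolled and is not BioPython's own parser. The names 
iter_cds and find_translation are made up for illustration, and the code 
assumes the standard GenBank column layout (feature keys starting in 
column 6, qualifiers in column 22), which mail clients tend to mangle. 
It only ever holds one CDS feature in memory at a time:

```python
def _gene_and_protein(quals):
    """Return (gene name, protein sequence) from raw qualifier strings."""
    gene = quals.get("gene", "").strip('"')
    # Multi-line translations were joined with spaces; remove all whitespace.
    protein = "".join(quals.get("translation", "").strip('"').split())
    return gene, protein


def iter_cds(lines):
    """Yield (gene, translation) for each CDS feature, one at a time.

    `lines` can be any iterable of text lines, e.g. an open file object.
    Simplification: a quoted value is treated as finished at the first
    line that ends with a double quote.
    """
    quals = None      # qualifiers of the CDS being read, or None
    open_key = None   # qualifier whose quoted value continues on later lines
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith(" " * 21):
            if quals is None:
                continue                  # belongs to a non-CDS feature
            text = line.strip()
            if open_key is not None:      # continuation of a quoted value
                quals[open_key] += " " + text
                if text.endswith('"'):
                    open_key = None
            elif text.startswith("/"):    # a new qualifier
                key, _, value = text[1:].partition("=")
                quals[key] = value
                if value.startswith('"') and (
                    len(value) == 1 or not value.endswith('"')
                ):
                    open_key = key
            continue                      # anything else: wrapped location
        # A less-indented line ends the feature we were reading.
        if quals is not None:
            yield _gene_and_protein(quals)
        quals, open_key = None, None
        if line[:5] == "     " and line[5:21].strip() == "CDS":
            quals = {}
    if quals is not None:
        yield _gene_and_protein(quals)


def find_translation(path, gene):
    """Return the protein sequence of `gene`, or None if it is not found."""
    with open(path) as handle:            # read lazily, line by line
        for name, protein in iter_cds(handle):
            if name == gene:
                return protein            # stop reading at the first match
    return None
```

Calling find_translation("genome.gbk", "murF") (the file name is just a 
placeholder) would scan the file until it reaches the murF CDS and then 
stop. If you need many genes, it is probably better to loop over 
iter_cds once and collect the ones you want into a dictionary.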

Hans Meier wrote:
> Dear friends,
>  
>  I apologize for bothering you once more with this.
>  But maybe we can make it now clear.
>  All I want to do is extract data from a whole genome .gbk file on my disk. 
>  The file has about 5000(!) entries like the one shown below.
>  All I want to do is:
>  
>  Give me the protein sequence (= "/translation") (or whatever)
>  of gene (= "/gene") soandso. 
>  
>  Speed matters.
>  
>  Though I believe I'm not a total dummy in programming
>  and I tried several approaches taken from the web
>  I could not program this so that the request is finished
>  within a reasonable time or without crushing my box 
>  (P3,700MHz,256MB)
>  
>  Since this is an important question for me but 
>  I don't want to bother you with this any further,
>  maybe someone could just post a code snippet
>  how to accomplish this trivial(?) task?
>  
>  
>  Thanks a lot for all your work and your help, Harald
>  
>  
>  ###### a typical .gbk entry ###########
>   gene            94650..96008
>                       /gene="murF"
>                       /locus_tag="b0086"
>                       /note="synonyms: mra, EG10622, b0086"
>                       /db_xref="GeneID:944813"
>  CDS             94650..96008
>                       /gene="murF"
>                       /locus_tag="b0086"
>                       /EC_number="6.3.2.15"
>                       /function="enzyme; Murein sacculus, peptidoglycan"
>                       /note="go_component: cytoplasm [goid 0005737];
>                       go_process: peptidoglycan biosynthesis [goid 0009252];
>                       go_process: peptidoglycan metabolism [goid 0000270]"
>                       /codon_start=1
>                       /transl_table=11
>                       /product="D-alanine:D-alanine-adding enzyme"
>                       /protein_id="NP_414628.1"
>                       /db_xref="ASAP:313"
>                       /db_xref="GI:16128079"
>                       /db_xref="GeneID:944813"
>                       /translation="MISVTLSQLTDILNGELQGADITLDAVTTDTRKLTPGCLFVALK
>                       GERFDAHDFADQAKAGGAGALLVSRPLDIDLPQLIVKDTRLAFGELAAWVRQQVPARV
>                       VALTGSSGKTSVKEMTAAILSQCGNTLYTAGNLNNDIGVPMTLLRLTPEYDYAVIELG
>                       ANHQGEIAWTVSLTRPEAALVNNLAAAHLEGFGSLAGVAKAKGEIFSGLPENGIAIMN
>                       ADNNDWLNWQSVIGSRKVWRFSPNAANSDFTATNIHVTSHGTEFTLQTPTGSVDVLLP
>                       LPGRHNIANALAAAALSMSVGATLDAIKAGLANLKAVPGRLFPIQLAENQLLLDDSYN
>                       ANVGSMTAAVQVLAEMPGYRVLVVGDMAELGAESEACHVQVGEAAKAAGIDRVLSVGK
>                       QSHAISTASGVGEHFADKTALITRLKLLIAEQQVITILVKGSRSAAMEEVVRALQENG
>                       TC"
>  ########end of the example#####################
>  
> 
> 		
> _______________________________________________
> BioPython mailing list  -  BioPython at biopython.org
> http://biopython.org/mailman/listinfo/biopython
> 


