[BioPython] Sorry, one more time: extract data from a large .gbk file
Michael Cariaso
cariaso at yahoo.com
Mon Jan 2 17:05:43 EST 2006
Until we see some code, I can't be sure. But what you are doing seems
well within BioPython's abilities. I wonder if perhaps your code looks
like this:
    alist = readWholeFile(filename)
    for record in alist:
        process(record)
which is what is causing the performance problems. If your code does
resemble the above, try to change it so that it looks more like:
    recordIterator = createIterator(filename)
    for record in recordIterator:
        process(record)
or
    fileobj = open(filename)
    done = False
    while not done:
        record = readNextRecord(fileobj)
        if record:
            process(record)
        else:
            done = True
The first form reads the whole GenBank file into memory, and might crush
your machine. The second and third forms read in one record at a time and
process it before moving on to the next. This requires far less memory.
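To make the iterator form concrete, here is a minimal sketch in plain Python. It is not Biopython's own parser: iter_genbank_records is a made-up helper standing in for createIterator above, and it assumes only that each GenBank record ends with a line starting with "//".

```python
def iter_genbank_records(path):
    """Yield one GenBank record at a time, as a list of its lines.

    Hypothetical helper, not a Biopython function. Only the record
    currently being read is held in memory.
    """
    record = []
    with open(path) as handle:
        for line in handle:                  # the file is read line by line
            record.append(line.rstrip("\n"))
            if line.startswith("//"):        # "//" terminates a record
                yield record
                record = []
    if any(item.strip() for item in record):  # last record lacked a "//"
        yield record

# Usage, with process() being whatever you do per record:
#     for record in iter_genbank_records(filename):
#         process(record)
```

Because this is a generator, the loop never holds more than one record, which is the whole point of the second and third forms.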
Hans Meier wrote:
> Dear friends,
>
> I apologize for bothering you once more with this.
> But maybe we can make it now clear.
> All I want to do is extract data from a whole genome .gbk file on my disk.
> The file has about 5000(!) entries like the one shown below.
> All I want to do is:
>
> Give me the protein sequence (="/translation) (or whatever)
> of gene (="/gene") soandso.
>
> Speed matters.
>
> Though I believe I'm not a total dummy in programming
> and I tried several approaches taken from the web
> I could not program this so that the request is finished
> within a reasonable time or without crushing my box
> (P3,700MHz,256MB)
>
> Since this is an important question for me but
> I don't want to bother you with this any further,
> maybe someone could just post a code snippet
> how to accomplish this trivial(?) task?
>
>
> Thanks a lot for all your work and your help, Harald
>
>
> ###### a typical .gbk entry ###########
> gene 94650..96008
> /gene="murF"
> /locus_tag="b0086"
> /note="synonyms: mra, EG10622, b0086"
> /db_xref="GeneID:944813"
> CDS 94650..96008
> /gene="murF"
> /locus_tag="b0086"
> /EC_number="6.3.2.15"
> /function="enzyme; Murein sacculus, peptidoglycan"
> /note="go_component: cytoplasm [goid 0005737];
> go_process: peptidoglycan biosynthesis [goid 0009252];
> go_process: peptidoglycan metabolism [goid 0000270]"
> /codon_start=1
> /transl_table=11
> /product="D-alanine:D-alanine-adding enzyme"
> /protein_id="NP_414628.1"
> /db_xref="ASAP:313"
> /db_xref="GI:16128079"
> /db_xref="GeneID:944813"
> /translation="MISVTLSQLTDILNGELQGADITLDAVTTDTRKLTPGCLFVALK
> GERFDAHDFADQAKAGGAGALLVSRPLDIDLPQLIVKDTRLAFGELAAWVRQQVPARV
> VALTGSSGKTSVKEMTAAILSQCGNTLYTAGNLNNDIGVPMTLLRLTPEYDYAVIELG
> ANHQGEIAWTVSLTRPEAALVNNLAAAHLEGFGSLAGVAKAKGEIFSGLPENGIAIMN
> ADNNDWLNWQSVIGSRKVWRFSPNAANSDFTATNIHVTSHGTEFTLQTPTGSVDVLLP
> LPGRHNIANALAAAALSMSVGATLDAIKAGLANLKAVPGRLFPIQLAENQLLLDDSYN
> ANVGSMTAAVQVLAEMPGYRVLVVGDMAELGAESEACHVQVGEAAKAAGIDRVLSVGK
> QSHAISTASGVGEHFADKTALITRLKLLIAEQQVITILVKGSRSAAMEEVVRALQENG
> TC"
> ########end of the example#####################
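For the request itself (the /translation of a given /gene), here is a rough sketch that scans the file one line at a time. A whole-genome .gbk may well be one large record with thousands of features, in which case going feature by feature matters more than going record by record. This is plain Python, not the Biopython parser; find_translation is a made-up name, and it assumes qualifiers laid out as in the entry above and quoted values with no embedded double quotes.

```python
def find_translation(lines, gene):
    """Return the /translation of the first feature whose /gene equals
    `gene`, or None if there is none. `lines` is any iterable of text
    lines, for example an open file object."""
    current_gene = None   # /gene value of the feature being scanned
    name = None           # qualifier whose quoted value is still open
    parts = []            # pieces of that value collected so far
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if name is not None:                # continuation of a quoted value
            parts.append(line)
            if not line.endswith('"'):
                continue
        elif line.startswith("/"):          # new qualifier: /name=value
            name, _, value = line[1:].partition("=")
            parts = [value]
            if value.startswith('"') and (len(value) == 1
                                          or not value.endswith('"')):
                continue                    # value continues on the next line
        else:                               # any other line starts a new feature
            current_gene = None
            continue
        # A complete qualifier value is available at this point.
        joiner = "" if name == "translation" else " "
        text = joiner.join(parts).strip('"')
        if name == "gene":
            current_gene = text
        elif name == "translation" and current_gene == gene:
            return text.replace(" ", "")    # drop any stray spaces
        name = None
    return None

# Usage ("genome.gbk" is a placeholder file name):
#     with open("genome.gbk") as handle:
#         protein = find_translation(handle, "murF")
```

It stops reading as soon as the gene is found and never keeps more than one qualifier in memory, so it should be comfortable even on a small machine.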