[Biopython] Suggestions for working with sequence data

Thu Apr 30 14:34:48 UTC 2015

Hi There,

I am new to Biopython but have some experience with Python. Python is 
my favourite language, and I really want to learn and apply Biopython.

I need to work on a large dataset: the complete genome sequences of all 
bacteria (~2000) from GenBank, about 80 GB in total. Each folder in the 
dataset corresponds to one organism and contains several files for 
different types of data.

My task is easy to describe, but I need some suggestions on how to work 
on the whole dataset efficiently with Biopython:

1. Load the FASTA file .ffn (protein-coding genes) in each folder, which 
looks like this:

>gi|158303474|gb|CP000828.1|:3233-4009 Acaryochloris marina MBIC11017, complete genome
ATGCTAGGTGCAATTGC....

Using the coordinates of this gene (3233-4009), I then go to the .fna 
file in the same folder, which holds the whole genome sequence, and 
read the 20 base pairs upstream (3213-3232) as the approximate promoter 
for the gene. Finally I construct a new sequence covering 3213-4009, 
which is just the original gene sequence with those 20 bp added in 
front.

2. Search all of those extended gene sequences for motifs, for instance 
"TACGTC", to find out how often each motif appears across all bacterial 
protein-coding genes.
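
To make the two steps concrete, here is a rough sketch of the logic in 
plain Python (strings only, not real Biopython code yet). The function 
names are my own placeholders. It assumes every .ffn record ID ends in 
":start-end" with 1-based inclusive coordinates, as in the example 
above, and it only handles genes on the forward strand and motif hits 
on the given strand.

```python
import re

# Trailing ":start-end" of an .ffn record ID (forward strand only; assumption).
COORD_RE = re.compile(r":(\d+)-(\d+)$")


def parse_coords(record_id):
    """Return (start, end) as 1-based inclusive coordinates from a record ID."""
    match = COORD_RE.search(record_id)
    if match is None:
        raise ValueError(f"no forward-strand coordinates in {record_id!r}")
    return int(match.group(1)), int(match.group(2))


def gene_with_upstream(genome, start, end, upstream=20):
    """Return the gene plus `upstream` bases in front of it."""
    first = max(start - upstream, 1)  # clamp at the start of the genome
    return genome[first - 1:end]      # 1-based inclusive -> Python slice


def count_motif(sequence, motif):
    """Count occurrences of `motif`, including overlapping ones."""
    count = 0
    pos = sequence.find(motif)
    while pos != -1:
        count += 1
        pos = sequence.find(motif, pos + 1)
    return count
```

For the example record, parse_coords gives (3233, 4009), and 
gene_with_upstream then returns the 797 bases at positions 3213-4009.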

I hope my description is clear. In short, the problem is that I need to 
work on two files in each of the 2000 folders. How can I load and write 
the files efficiently?
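
What I have in mind for the file handling is roughly the following 
sketch. It assumes that matching .ffn and .fna files share the same 
file name apart from the extension, and that each .fna file holds 
exactly one sequence; both are my guesses about the layout. 
paired_files and iter_gene_records are hypothetical helper names, while 
SeqIO.read and SeqIO.parse are the Biopython functions I would use to 
read the records one at a time.

```python
from pathlib import Path


def paired_files(root):
    """Yield (ffn_path, fna_path) pairs that share a file stem, per folder."""
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue
        for ffn in sorted(folder.glob("*.ffn")):
            fna = ffn.with_suffix(".fna")
            if fna.exists():  # skip gene files without a genome file
                yield ffn, fna


def iter_gene_records(ffn, fna):
    """Yield (genome_record, gene_record) pairs, one gene at a time."""
    # Imported here so that paired_files can be used without Biopython.
    from Bio import SeqIO

    genome = SeqIO.read(str(fna), "fasta")  # assumes a single sequence
    for gene in SeqIO.parse(str(ffn), "fasta"):
        yield genome, gene
```

Because both functions are generators, only one genome and one gene 
record would be held in memory at any time.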

Would anyone give some hints?

Thanks in advance,
Linlin