[Biopython] Suggestions for working sequence data

Tommy Carstensen tommy.carstensen at gmail.com
Thu Apr 30 15:11:27 UTC 2015


Hi Linlin,

I think you need to subdivide your problem into smaller problems. Some
of the questions are better suited to a forum such as
stackoverflow.com. I'm happy to help you out if you can break it into
smaller questions/tasks.

Tommy

On Thu, Apr 30, 2015 at 3:34 PM, Linlin Zhao <Linlin.Zhao at hhu.de> wrote:
> Hi There,
>
> I am new to Biopython, but have some experience with Python. Python is my
> favourite language, and I really want to learn and apply Biopython.
>
> I need to work on a large dataset: the complete genome sequences of all
> bacteria (~2000) from GenBank, about 80 GB in total. Each folder within
> the dataset is one organism and contains several files for different
> types of data.
>
> My task is easy to describe, but I need some suggestions on how to work on
> the whole dataset efficiently with Biopython:
>
> 1. Load the FASTA file .ffn (protein-coding genes) in each folder, which
> looks like this:
>
> >gi|158303474|gb|CP000828.1|:3233-4009 Acaryochloris marina MBIC11017, complete genome
>
> ATGCTAGGTGCAATTGC....
>
> Using the coordinates (3233-4009) of this gene, I then go to the .fna file
> in the same folder, which holds the whole genome sequence, and read the 20
> base pairs upstream (3213-3232) as the approximate promoter for the gene.
> Finally, I construct a new gene sequence with coordinates 3213-4009, i.e.
> the original gene sequence with 20 bp prepended.
>
> 2. Search all of those gene sequences for motifs, for instance "TACGTC",
> to find how frequently they occur across all bacterial protein-coding genes.
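
The two steps above could be split into small functions along these lines.
This is only a rough sketch, not a tested pipeline. The function names
(parse_coords, gene_with_promoter, count_motif, extended_genes) are made up
for illustration. It assumes forward-strand headers of the ":start-end" form
shown above with 1-based inclusive coordinates, and one genome record per
.fna file. Entries whose coordinates do not match that form (for example
complement-strand genes) are simply skipped here and would need their own
handling.

```python
import re

# Assumption: forward-strand coordinates appear as ":start-end" in the header.
COORD_RE = re.compile(r":(\d+)-(\d+)")


def parse_coords(header):
    """Return (start, end) as 1-based inclusive ints, or None if not found."""
    match = COORD_RE.search(header)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def gene_with_promoter(genome, start, end, upstream=20):
    """Return the gene plus `upstream` bases before it, as a plain string.

    `genome` is the whole genome as a str. The slice is clipped at the
    start of the sequence (no wrap-around for circular genomes).
    """
    begin = max(start - 1 - upstream, 0)
    return genome[begin:end]


def count_motif(seq, motif):
    """Count occurrences of `motif` in `seq`, including overlapping ones."""
    count = 0
    pos = seq.find(motif)
    while pos != -1:
        count += 1
        pos = seq.find(motif, pos + 1)
    return count


def extended_genes(ffn_path, fna_path, upstream=20):
    """Yield (gene id, extended sequence) for one .ffn/.fna pair."""
    # Imported here so the helpers above work without Biopython installed.
    from Bio import SeqIO

    # Assumption: the .fna file holds exactly one record (SeqIO.read
    # raises ValueError otherwise).
    genome = str(SeqIO.read(fna_path, "fasta").seq)
    for record in SeqIO.parse(ffn_path, "fasta"):
        coords = parse_coords(record.description)
        if coords is None:
            continue  # e.g. complement-strand entries, not handled here
        yield record.id, gene_with_promoter(genome, *coords, upstream=upstream)
```

With the example header, parse_coords gives (3233, 4009), and
gene_with_promoter returns bases 3213-4009 of the genome. Counting a motif
is then count_motif(extended_sequence, "TACGTC") per gene, summed however
you need.
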
>
> I hope my description is clear. In short, the problem is that I need to
> work on two files in each of the 2000 folders. How can I load and write
> the files efficiently?
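
For walking the folders, one possible approach is a generator that yields
one .ffn/.fna pair at a time, so only a single genome needs to be in memory
at any point. A minimal sketch follows. The name paired_files is made up,
and it assumes that the two files in a folder share the same base name and
differ only in extension, which you would need to check against your copy
of the data.

```python
from pathlib import Path


def paired_files(root):
    """Yield (ffn_path, fna_path) for every organism folder under `root`.

    Assumption: matching files share a base name, e.g. X.ffn and X.fna.
    """
    for folder in sorted(Path(root).iterdir()):
        if not folder.is_dir():
            continue  # ignore stray files next to the organism folders
        for ffn in sorted(folder.glob("*.ffn")):
            fna = ffn.with_suffix(".fna")
            if fna.exists():
                yield ffn, fna
```

You would then loop over paired_files(your_dataset_folder), process each
pair, and append the results (for example one line of motif counts per
gene) to a single output file, rather than holding everything in memory.
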
>
> Would anyone give some hints?
>
> Thanks in advance,
> Linlin
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython

