[BioPython] relative entropy

Iddo Friedberg idoerg at burnham.org
Fri Jan 23 12:31:35 EST 2004


Ernesto,

Your question is both technical (how to do things with Biopython) and
scientific (how to calculate expected frequencies). Technical stuff first:


In general, you should take the following steps:

# 1) Parse the BLAST output using Biopython's BLAST parser.

# http://biopython.org/docs/tutorial/Tutorial004.html

# 2) From the parsed output create a multiple alignment object.
from Bio import Alphabet
from Bio.Align.Generic import Alignment

# Pass an alphabet instance, not the class itself.
my_alignment = Alignment(alphabet=Alphabet.ProteinAlphabet())

# your_blast_sequence_list is assumed to hold (sequence_id, sequence)
# pairs collected from the parsed BLAST hits.
for sequence_id, seq in your_blast_sequence_list:
    my_alignment.add_sequence(sequence_id, seq)

# 3) Use the IC module as shown in section 3.5.5 of the tutorial.
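
If it helps to see what step 3 boils down to, here is a minimal sketch in plain Python (no Biopython) of relative entropy per alignment column. It is not the code from the tutorial; the function names, the gap characters and the toy alignment are my own assumptions for illustration.

```python
import math
from collections import Counter


def column_relative_entropy(column, background, gap_chars="-."):
    """Relative entropy (in bits) of one alignment column.

    column     -- string with the residue of every sequence at one site
    background -- dict mapping residue -> expected (prior) frequency
    Gap characters are skipped; a residue missing from `background`
    raises KeyError.
    """
    residues = [c.upper() for c in column if c not in gap_chars]
    if not residues:
        return 0.0
    total = len(residues)
    score = 0.0
    for residue, count in Counter(residues).items():
        observed = count / total
        score += observed * math.log2(observed / background[residue])
    return score


def alignment_relative_entropy(sequences, background):
    """Score every column of equal-length aligned sequences."""
    return [column_relative_entropy("".join(col), background)
            for col in zip(*sequences)]


# Toy DNA alignment (hypothetical data) scored against a uniform prior.
uniform_dna = {base: 0.25 for base in "ACGT"}
alignment = ["ACGT", "ACGA", "ACCT", "AGTC"]
scores = alignment_relative_entropy(alignment, uniform_dna)
```

A fully conserved column scores 2 bits against the uniform DNA prior, and a column where all four bases are equally frequent scores 0.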


Now for the science:

Ernesto wrote:
> Dear Iddo,
> thank you for your answer. I'm a beginner and I don't know all of
> Biopython. My problem is to calculate the relative information content
> (entropy) for each site of an input fasta multiple alignment. I wrote a
> script (without Biopython) to calculate the information content per site of
> a fasta alignment, but I don't know how to implement the expected
> frequencies to calculate the relative formula.

1) Are you doing DNA or Protein multiple alignments?

2) In case it is protein: are you sure you need the prior (=expected)
frequencies for the amino-acid distribution? This is a tricky issue. See
the following paper:

http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=11327757&dopt=Abstract

For DNA, just use the base distribution of the genome you are analysing,
or 0.25/0.25/0.25/0.25 if you are not confined to a single genome. If you
are analysing a bunch of GC-rich organisms, use the genomic GC
distribution of the putative genes instead (intergenic regions tend to
skew the GC ratio; I'm assuming you are analysing coding regions only).
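
As a rough sketch of the GC-based option: the function below measures the GC fraction of a set of coding sequences and turns it into four expected frequencies. Splitting the GC fraction evenly between G and C (and the rest evenly between A and T) is my simplifying assumption, and the function name and example sequences are made up.

```python
def gc_background(coding_sequences):
    """Expected A/C/G/T frequencies derived from the GC fraction.

    Assumption: G and C share the GC fraction equally, and A and T
    share the remainder equally. Ambiguous bases (N etc.) and gaps
    are ignored.
    """
    gc = 0
    total = 0
    for seq in coding_sequences:
        for base in seq.upper():
            if base in "GC":
                gc += 1
                total += 1
            elif base in "AT":
                total += 1
    if total == 0:
        # Nothing to count: fall back to the uniform distribution.
        return {base: 0.25 for base in "ACGT"}
    gc_fraction = gc / total
    return {
        "G": gc_fraction / 2,
        "C": gc_fraction / 2,
        "A": (1 - gc_fraction) / 2,
        "T": (1 - gc_fraction) / 2,
    }


# Two short hypothetical coding sequences, 9 of 12 counted bases are G or C.
background = gc_background(["GGCCGA", "GCNNGCAT"])
```

With these toy sequences the GC fraction is 0.75, so G and C each get 0.375 and A and T each get 0.125.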

3) In case you are analyzing proteins and you are convinced you need the
prior frequencies, I would use those from the database from which you
derive your sequences (unless it is nr, which is biased), or just use
the frequencies of amino acids in the entire SwissProt database.
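
Deriving the prior from a database is just counting. A minimal sketch, assuming you already have the protein sequences as plain strings (the function name and the toy "database" are hypothetical; reading the real database file is left out):

```python
from collections import Counter

STANDARD_AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_background(protein_sequences):
    """Frequency of each of the 20 standard amino acids.

    Gaps and non-standard or ambiguous letters (X, B, Z, ...) are
    ignored, so the 20 frequencies sum to 1.
    """
    counts = Counter()
    for seq in protein_sequences:
        counts.update(c for c in seq.upper() if c in STANDARD_AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no standard amino acids found")
    return {aa: counts[aa] / total for aa in STANDARD_AMINO_ACIDS}


# Tiny hypothetical "database"; a real one would hold many sequences.
toy_database = ["MKVLA", "MKXLL-"]
prior = amino_acid_background(toy_database)
```

On a toy set like this some amino acids get a frequency of 0, which would break a log-ratio if that residue shows up in your alignment. With a database the size of SwissProt that should not happen.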

Best,

Iddo

> If you want I can send you a copy of my script. If you have another idea
> with Biopython let me know.
> Thank you very much for your support
> Ernesto
> ----- Original Message -----
> From: "Iddo Friedberg" <idoerg at burnham.org>
> To: "Ernesto" <e.picardi at unical.it>
> Cc: <biopython at biopython.org>
> Sent: Friday, January 23, 2004 3:52 PM
> Subject: Re: [BioPython] relative entropy
> 
> 
> 
>>Hi Ernesto,
>>
>>Look to the Biopython Tutorial & Cookbook, section 3.5 for multiple
>>alignments. 3.5.5 deals with information content. Write back if you have
>>any questions, we can always improve the docs.
>>
>>See:
>>http://biopython.org/docs/tutorial/Tutorial004.html#toc14
>>
>>Best,
>>
>>Iddo
>>
>>--
>>Iddo Friedberg, Ph.D.
>>The Burnham Institute
>>10901 N. Torrey Pines Rd.
>>La Jolla, CA 92037, USA
>>Tel: +1 (858) 646 3100 x3516
>>Fax: +1 (858) 646 3171
>>http://ffas.ljcrf.edu/~iddo
>>
>>On Fri, 23 Jan 2004, Ernesto wrote:
>>
>>
>>>Is it possible to evaluate the relative entropy per site in a multiple
>>>alignment?
>>>Could you give me instructions?
>>>
>>>Thank you
>>>
>>>Ernesto e.picardi at unical.it
>>>
>>>
> 
> 
> 

-- 
Iddo Friedberg, Ph.D.
The Burnham Institute
10901 N. Torrey Pines Rd.
La Jolla, CA 92037
USA
Tel: +1 (858) 646 3100 x3516
Fax: +1 (858) 713 9930
http://ffas.ljcrf.edu/~iddo




