[BioPython] Comment/Suggestion about Bio.PDB.Polypeptide class. How to keep gap information?

Tue May 24 13:26:13 EDT 2005

Hello

Let's imagine we want a FASTA file or a Seq object that describes the
amino acids present in a structure, with gaps where residues are missing.

Example: 1t6b, chain X

Using this code (where ppd is a polypeptide builder such as PPBuilder):
	for pp in ppd.build_peptides(structure[0]['X']):
	    print pp
We get:
	<Polypeptide start=16 end=158>
	<Polypeptide start=175 end=275>
	<Polypeptide start=288 end=303>
	<Polypeptide start=320 end=735>

If we want to join those peptides together, we can define an empty
polypeptide and extend it with each peptide we get:

	pp1=Polypeptide.Polypeptide([])
	for pp in ppd.build_peptides(structurecomplex[0][chaineR]):
	    pp1.extend(pp)
	print pp1
	seq=pp1.get_sequence()
	print seq.tostring()

We get:
	<Polypeptide start=16 end=735>
SQGLLGYYFSDLNFQAPMVVTSSTTGDLSIPSSELENIPSENQYFQSAIWSGFIKVKKSDEYTFATSADNHVTMWVDDQEVINKASNSNKIRLEKGRLYQIKIQYQRENPTEKGLDFKLYWTDSQNKKEVISSDNLQLPELKQVPDRDNDGIPDSLEVEGYTVDVKNKRTFLSPWISNIHEKKGLTKYKSSPEKWSTASDPYSDFEKVTGRIDKNVSPEARHPLVAAYPIVHVDMENIILSKNETISKNTSTSRTHTSEVVSAGFSNSNSSTVAIDHSLSLAGERTWAETMGLNTADTARLNANIRYVNTGTAPIYNVLPTTSLVLGKNQTLATIKAKENQLSQILAPNNYYPSKNLAPIALNAQDDFSSTPITMNYNQFLELEKTKQLRLDTDQVYGNIATYNFENGRVRVDTGSNWSEVLPQIQETTARIIFNGKDLNLVERRIAAVNPSDPLETTKPDMTLKEALKIAFGFNEPNGNLQYQGKDITEFDFNFDQQTSQNIKNQLAELNATNIYTVLDKIKLNAKMNILIRDKRFHYDRNNIAVGADESVVKEAHREVINSSTEGLLLNIDKDIRKILSGYIVEIEDTEGLKEVINDRYDMLNISSLRQDGKTFIDFKKYNDKLPLYISNPNYKVNVYAVTKENTIINPSENGDTSTNGIKKILIFSKKGYEIG

That is, the gap information is completely lost. "pp1" still contains
it, but cannot pass it on to "seq", even when using the gapped
alphabet.
I know it is possible to recover it by iterating over the residues of
the structure. However, I think it would be better if
pp1.get_sequence() filled each gap with an 'X' or a '-', i.e. if the
get_sequence method were changed to handle this case.

Instead of:

	for res in self:
	    resname=res.get_resname()
	    if to_one_letter_code.has_key(resname):
	        resname=to_one_letter_code[resname]
	    else:
	        resname='X'
	    s=s+resname

I think it would be nice to iterate over resseq, so that missing
residue numbers show up as gaps in the sequence.
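
To make the idea concrete, here is a rough sketch of the gap-filling
logic, written as a standalone function rather than as a patch to
get_sequence. The name gapped_sequence, the small THREE_TO_ONE table and
the toy residue list are all made up for illustration and are not part
of Bio.PDB. With a real structure, the (resseq, resname) pairs could
presumably be taken from res.get_id()[1] and res.get_resname().
Insertion codes are ignored here.

```python
# Hypothetical helper, NOT part of Bio.PDB: illustrates gap filling only.
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def gapped_sequence(residues, gap_char="-"):
    """Build a one-letter sequence from (resseq, resname) pairs.

    The pairs must be sorted by resseq. Every missing residue number
    between two consecutive residues is written as gap_char, and an
    unknown residue name is written as 'X'.
    """
    letters = []
    previous = None
    for resseq, resname in residues:
        if previous is not None and resseq > previous + 1:
            # One gap character per missing residue number.
            letters.append(gap_char * (resseq - previous - 1))
        letters.append(THREE_TO_ONE.get(resname, "X"))
        previous = resseq
    return "".join(letters)


# Toy data (invented numbering): residues 19 and 20 are missing,
# and MSE is not in the table above.
example = [(16, "SER"), (17, "GLN"), (18, "GLY"), (21, "LEU"), (22, "MSE")]
print(gapped_sequence(example))  # SQG--LX
```

Passing gap_char="X" instead would give the other variant mentioned
above.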

What do you think?

-- 
Julie BERNAUER
Equipe de Génomique Structurale          http://www.genomics.eu.org
IBBMC - UMR 8619 - U.P.S. Bât.430          Tel. : +33 1 69 15 31 57
91405 Orsay - FRANCE                       Fax. : +33 1 69 85 37 15