[Biopython-dev] [Bug 2323] New functions: GCG Checksum and CRC64

bugzilla-daemon at portal.open-bio.org bugzilla-daemon at portal.open-bio.org
Wed Jun 27 14:13:41 UTC 2007


http://bugzilla.open-bio.org/show_bug.cgi?id=2323





------- Comment #7 from sbassi at gmail.com  2007-06-27 10:13 EST -------
Response to comment #6:

Problem 1: There are two FASTA files, each with several sequences, and
I want to check whether any sequences match between the two files.
IDs can't be used for the comparison since the data come from different
sources and the identifiers do not correspond. The sequences
themselves must be compared.

Solution: To avoid comparing whole sequences and to speed up the
comparisons, I work with a small digest of each sequence.
The SEGUID algorithm is based on SHA-1, which is designed to have
the following property: "it is computationally not feasible to
find two different messages which produce the same message
digest."
(See http://bioinformatics.anl.gov/seguid/overview.aspx for
more information on SEGUID.)
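
For readers who want to see what such a digest looks like, here is a
minimal sketch using only the Python standard library. It assumes the
usual SEGUID recipe (SHA-1 of the upper-cased sequence, base64-encoded,
with the trailing "=" padding removed); the name seguid_sketch is
hypothetical and is not the function proposed in this bug.

```python
import base64
import hashlib


def seguid_sketch(seq):
    """Return a SEGUID-style digest of a sequence.

    Assumed recipe: SHA-1 of the upper-cased sequence, base64-encoded,
    with the '=' padding stripped.
    """
    raw = hashlib.sha1(str(seq).upper().encode("ascii")).digest()
    return base64.b64encode(raw).decode("ascii").rstrip("=")


# Upper- and lower-case input give the same digest.
print(seguid_sketch("ACGT") == seguid_sketch("acgt"))
```

Because a SHA-1 digest is always 20 bytes, the resulting string has a
fixed length of 27 characters, however long the sequence is.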

===========================================================
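from Bio import SeqIO
# Assumes seguid is importable from Bio.SeqUtils.CheckSum
from Bio.SeqUtils.CheckSum import seguid

# Collect the SEGUID digest of every sequence in the first file
seq1=set()
handle=open("14gustavoUniprot.fas","r")
for record in SeqIO.parse(handle,"fasta"):
    seq1.add(seguid(record.seq))
handle.close()

# Do the same for the second file
seq2=set()
handle=open("pdbaa","r")
for record in SeqIO.parse(handle,"fasta"):
    seq2.add(seguid(record.seq))
handle.close()

# Digests present in both files
shared_elements=seq1.intersection(seq2)

# Print the IDs from the first file whose sequence also occurs in the second
handle=open("14gustavoUniprot.fas","r")
for record in SeqIO.parse(handle,"fasta"):
    if seguid(record.seq) in shared_elements:
        print(record.id)
handle.close()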
===========================================================
Output:

P00700|LYSC_COLVI
P02185|MYG_PHYCA
P03521|NCAP_VSIVA
P04050|RPB1_YEAST
P05803|NRAM_IAWHM
P0A5Y6|INHA_MYCTU
P0A5Y7|INHA_MYCBO
P0AA04|PTHP_ECOLI
P0AE72|CHPR_ECOLI
P0C0S5|H2AZ_HUMAN
P14223|ALF_PLAFA
P17313|VG31_BPT4
P17670|SODF_MYCTU
P19821|DPO1_THEAQ
P25786|PSA1_HUMAN
P31939|PUR9_HUMAN
P62314|SMD1_HUMAN
P62826|RAN_HUMAN
Q08129|CATA_MYCTU
Q99497|PARK7_HUMAN
===========================================================
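
The listing above reports only the IDs from the first file. If the
matching IDs from the second file are wanted as well, one possible
variation is to index each file by digest and pair up the IDs. The
sketch below is self-contained: it works on plain (id, sequence)
tuples instead of SeqIO records, the helper names are hypothetical,
and the toy records are made up purely for illustration.

```python
import base64
import hashlib
from collections import defaultdict


def digest(seq):
    """SEGUID-style digest: base64 of SHA-1 of the upper-cased sequence."""
    raw = hashlib.sha1(str(seq).upper().encode("ascii")).digest()
    return base64.b64encode(raw).decode("ascii").rstrip("=")


def index_by_digest(records):
    """Map digest -> list of IDs for an iterable of (id, sequence) pairs."""
    index = defaultdict(list)
    for rec_id, seq in records:
        index[digest(seq)].append(rec_id)
    return index


def matching_ids(records_a, records_b):
    """Return sorted (id_a, id_b) pairs whose sequences have equal digests."""
    index_a = index_by_digest(records_a)
    index_b = index_by_digest(records_b)
    shared = set(index_a) & set(index_b)
    return sorted((a, b) for d in shared
                  for a in index_a[d] for b in index_b[d])


# Made-up toy records, for illustration only.
file_a = [("a1", "MKTAYIAK"), ("a2", "GSHMLEDP")]
file_b = [("b1", "gshmledp"), ("b2", "MKKKK")]
print(matching_ids(file_a, file_b))
```

With real files, the (id, sequence) pairs could be produced from
SeqIO.parse records via record.id and str(record.seq).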

Problem 2: I want to include GCG checksum information in the
description field, both as an integrity check and for comparison
against other GCG files.

Solution: Read the input file, add the GCG checksum into the
description and write the output file in FASTA format.
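
For context, the GCG checksum itself is a simple position-weighted sum.
The sketch below is written under the assumption that the usual
definition applies: each upper-cased character code is multiplied by
its position, counted cyclically from 1 to 57, and the total is taken
modulo 10000. The name gcg_sketch is hypothetical and is not the
function attached to this bug.

```python
def gcg_sketch(seq):
    """Return a GCG-style checksum (an integer from 0 to 9999).

    Assumed definition: sum of position * character code, with the
    position cycling from 1 to 57, modulo 10000.
    """
    index = 0
    checksum = 0
    for char in str(seq):
        index += 1
        checksum += index * ord(char.upper())
        if index == 57:
            index = 0  # restart the position counter every 57 residues
    return checksum % 10000


# "AC" gives 1*65 + 2*67 = 199
print(gcg_sketch("AC"))
```

Unlike a SHA-1 based digest, a four-digit checksum collides easily, so
it suits integrity checks better than it suits identity comparison.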

===========================================================
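from Bio import SeqIO
# Assumes gcg is importable from Bio.SeqUtils.CheckSum
from Bio.SeqUtils.CheckSum import gcg

seqs=[]

# Append the GCG checksum to the description of each record
handle=open("14gustavoUniprot.fas","r")
for record in SeqIO.parse(handle,"fasta"):
    record.description=record.description+" "+str(gcg(record.seq))
    seqs.append(record)
handle.close()

# Write the annotated records back out in FASTA format
output_handle = open("uniprotGCG.fas", "w")
SeqIO.write(seqs, output_handle, "fasta")
output_handle.close()
===========================================================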

