[Biopython-dev] Python script

Joshua Klein mobiusklein at gmail.com
Thu Sep 10 18:32:46 UTC 2015


If I understand correctly, your files are lists of names, one name per line.
Python has a built-in set type, which lets you apply set operations
(intersection, union, difference) to collections. Here’s an example of how to
use them to solve your problem:

database_proteins_file = "sorted.a"
organism_proteins_file = "sorted.b"
# set() iterates over the input, storing unique entries
database_proteins = set(open(database_proteins_file))
organism_proteins = set(open(organism_proteins_file))
# use with statement to ensure the file closes after the block ends
with open("common_proteins", "w") as common_file:
    # the & operator of set objects will compute intersections. Alternatively
    # you could write database_proteins.intersection(organism_proteins)
    for prot in sorted(database_proteins & organism_proteins):
        common_file.write(prot)

However, this just produces a common protein list. To build that matrix,
you can adjust the above code:

database_proteins_file = "sorted.a"
organism_proteins_file = "sorted.b"
# set() iterates over the input, storing unique entries
database_proteins = set(open(database_proteins_file))
organism_proteins = set(open(organism_proteins_file))
with open("output_matrix", "w") as matrix_file:
    for prot in sorted(database_proteins):
        # Trim newline from each entry in the set, since we need to append
        # to the line
        line = prot.replace("\n", "")
        # like for dict objects, the in operator checks for set membership
        # of the first term in the second term
        present = prot in organism_proteins
        # add the textual flag using inline if expression. Also called the
        # ternary operator
        line += " 1\n" if present else " 0\n"
        matrix_file.write(line)
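
One caveat with the scripts above: the entries in each set still carry their trailing newline, so a name on the last line of a file that does not end with a newline will not match the same name in the other file. Here is a minimal sketch of one way around that, stripping whitespace before building the sets. The helper names read_names and presence_rows are made up for illustration, and the input is a small in-memory list rather than your real files:

```python
def read_names(lines):
    """Return the set of non-empty names with surrounding whitespace removed."""
    return {line.strip() for line in lines if line.strip()}


def presence_rows(database_names, organism_names):
    """Yield 'name flag' rows: flag is 1 if the name is in organism_names, else 0."""
    for name in sorted(database_names):
        yield "{} {}".format(name, 1 if name in organism_names else 0)


# Small in-memory example; the last database entry deliberately lacks "\n".
database = read_names(["14-3-3\n", "2-Hacid_dh\n", "2TM"])
organism = read_names(["2TM\n", "120_Rick_ant\n", "14-3-3\n"])
rows = list(presence_rows(database, organism))
```

With real files you could pass the open file objects straight to read_names, for example read_names(open("sorted.a")), and write each row followed by "\n" to the output file.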

On Thu, Sep 10, 2015 at 11:48 AM, Naiane Negri <naiannegri at gmail.com> wrote:

> I'm new to Python, so I'm really struggling to write a script.
>
> What I need is a comparison between two files. One file contains all the
> proteins of a database; the other contains only some of those proteins,
> because it belongs to one organism. So I need to know which proteins of
> the database are present in my organism. For that I want to build an
> output like a matrix, with 0 and 1 for every protein in the database,
> depending on whether or not it is in my organism.
>
> Does anybody have any idea of how I could do that? I'm trying to use
> something like this:
>
> $ cat sorted.a
> A
> B
> C
> D
> $ cat sorted.b
> A
> D
> $ join sorted.a sorted.b | sed 's/^/1 /' && join -v 1 sorted.a sorted.b | sed 's/^/0 /'
> 1 A
> 1 D
> 0 B
> 0 C
>
> But I can't use it, because sometimes a protein is present in both files
> but not on the same line. Here is an example:
>
> 1-cysPrx_C
> 14-3-3
> 2-Hacid_dh
> 2-Hacid_dh_C
> 2-oxoacid_dh
> 2H-phosphodiest
> 2OG-FeII_Oxy
> 2OG-FeII_Oxy_3
> 2OG-FeII_Oxy_4
> 2OG-FeII_Oxy_5
> 2OG-Fe_Oxy_2
> 2TM
> 2_5_RNA_ligase2
>
> comparing with
>
> 1-cysPrx_C
> 120_Rick_ant
> 14-03-2003
> 2-Hacid_dh
> 2-Hacid_dh_C
> 2-oxoacid_dh
> 2-ph_phosp
> 2CSK_N
> 2C_adapt
> 2Fe-2S_Ferredox
> 2H-phosphodiest
> 2HCT
> 2OG-FeII_Oxy
>
> Does anyone have an idea of how I could do that? Thanks so far.