[Biopython] muscle alignment using stdin and stdout - broken pipe error

Wed Oct 28 17:52:41 UTC 2015

Jakub,

This is how I run muscle, without using files:

                seqs = [seq1, seq2]
	        muscle_cmd = ['muscle', "-quiet", "-maxiters", "1", "-diags"]

                muscle = subprocess.Popen(muscle_cmd, stdin=subprocess.PIPE,
                                          stdout=subprocess.PIPE, universal_newlines=True)
                SeqIO.write(seqs, muscle.stdin, "fasta")
                muscle.stdin.close()
                align = AlignIO.read(muscle.stdout, 'fasta')
                muscle.stdout.close()

“import subprocess”, SeqIO, AlignIO, of course.

Brian O.

> On Oct 28, 2015, at 12:23 PM, Jakub Nowak <jakub.nowak at ed.ac.uk> wrote:
> 
> Dear biopython users,
> 
> I am trying to perform consensus motif analysis for my CLIP data. I have list of motifs that I can load to the data frame using pandas module. I can successfully go through the motifs in the column and write them into external fasta file using biopython tools.
> However I wanted to bypass external file and use stdin and stdout to perform alignment with muscle command line. For some reason I am getting pipe error [Errno 32] Broken pipe.
> 
> Any ideas what is the problem and how to fix it?
> 
> Below is the code I used Ipython
> 
> # In[1]:
> import pandas as pd
> import sys
> import subprocess
> from Bio import SeqIO
> from Bio.Seq import Seq
> from Bio.SeqRecord import SeqRecord
> from Bio.Alphabet import generic_dna
> from Bio.Align.Applications import MuscleCommandline
> 
> 
> # In[2]:
> muscle_cline = MuscleCommandline(clwstrict=True)
> print muscle_cline
> 
> 
> # In[3]:
> df = pd.read_csv('Trim25_motifs_k-mers.txt', sep='\t', comment="#", header=None)
> df.columns = ['Motif','Count']
> 
> 
> # In[4]:
> df.head(5)
> Motif	Count
> 0	GAGGAG	1321
> 1	AGGAGG	1168
> 2	GGAGGA	1167
> 3	GGGAGG	1056
> 4	CTGGAG	1038
> 
> # In[5]:
> child = subprocess.Popen(str(muscle_cline),
>                        stdin = subprocess.PIPE,
>                        stdout = subprocess.PIPE,
>                        stderr = subprocess.PIPE,
>                        universal_newlines = True,
>                        shell = (sys.platform!='Darwin'))
> 
> 
> # In[6]:
> Record = []
> for x in range(0, len(df['Motif'])):
>    Record.append(SeqRecord(Seq(df['Motif'].get_value(x), generic_dna), id = "motif_" + str(x+1)))
> SeqIO.write(Record, child.stdin, "fasta")
> 
> 
> # In[7]:
> for x in range(0, len(df['Motif'])):
>    SeqIO.write(SeqRecord(Seq(df['Motif'].get_value(x), generic_dna), id = "motif_" + str(x+1)), child.stdin, 'fasta')
> 
> 
> # In[8]: #this works fine
> Record = []
> for x in range(0, len(df['Motif'])):
>    Record.append(SeqRecord(Seq(df['Motif'].get_value(x), generic_dna), id = "motif_" + str(x+1)))
> SeqIO.write(Record, 'motif.fasta', "fasta")
> 
> 
> In[6] and In[7] generate the same error message as below:
> 
>> IOError
>>                                   Traceback (most recent call last)
>> 
>> <ipython-input-6-90b52142f481> in <module>()
>>      2 for x in range(0, len(df['Motif'])):
>>      3     Record.append(SeqRecord(Seq(df['Motif'].get_value(x), generic_dna), id = "motif_" + str(x+1)))
>> ----> 4 SeqIO.write(Record, child.stdin, "fasta")
>> 
>> /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/__init__.pyc in write(sequences, handle, format)
>>    465         if format in _FormatToWriter:
>>    466             writer_class = _FormatToWriter[format]
>> --> 467             count = writer_class(fp).write_file(sequences)
>>    468         elif format in AlignIO._FormatToWriter:
>>    469             #Try and turn all the records into a single alignment,
>> 
>> /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/Interfaces.pyc in write_file(self, records)
>>    205
>>         """
>>    206         self.write_header()
>> --> 207         count = self.write_records(records)
>>    208         self.write_footer()
>>    209         return count
>> 
>> /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/Interfaces.pyc in write_records(self, records)
>>    190         count = 0
>>    191         for record in records:
>> --> 192             self.write_record(record)
>>    193             count += 1
>>    194         #Mark as true, even if there where no records
>> 
>> /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/Bio/SeqIO/FastaIO.pyc in write_record(self, record)
>>    198         assert "\n" not in title
>>    199         assert "\r" not in title
>> --> 200         self.handle.write(">%s\n" % title)
>>    201 
>>    202         data = self._get_seq_string(record)  # Catches sequence being None
>> 
>> IOError: [Errno 32] Broken pipe
> 
> In[8] Successfully writes to the fasta file all sequences. I guess I can use this file as input but as I said I wanted to bypass external file.
> 
> If you have any suggestion what could have caused that I will be grateful.
> 
> Follow up problem I might run in the future is regarding the occurrence number. If you have some idea how to include occurrence number in alignment calculation as weight. I can populate fasta file with multiple same sequences matching with the number of occurrence events. But this approach sounds quite crude and redundant.
> If you can think of any other solution I will be grateful
> 
> Thanks very much for your help,
> 
> Jakub
> -- 
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
> 
> 
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython