[Biopython] Filter, sort and write FASTA file

Mic mictadlo at gmail.com
Mon Dec 14 09:50:25 UTC 2015


Hello,

   1. I run MACS2 to find out were peaks are located (see attached).
   2. I wrote a script (see attached) which:
   1. read the peaks information
      2. extract the sequence from *genome.fasta* with help of the peak
      information
      3. the extracted sequences are saved in a new FASTA file.

The *MACS_peaks.csv* file contains a column *fold_enrichmen*t and the
script safe this information in *peak.fold_enrichment.*

How is it possible to sort the output sequences with
highest peak.fold_enrichment?

*#!/usr/bin/env python*
*from Bio import SeqIO*
*from Bio.SeqRecord import SeqRecord*
*from collections import defaultdict*
*from collections import OrderedDict*
*from Bio.Alphabet import generic_dna*
*from Bio.Seq import Seq # BioPython Seq object*
*from Bio import SeqIO # BioPython SeqIO for manipulating fasta*
*import itertools*


*class PeakInfo:*

*    start = 0*
*    end = 0*
*    len = 0*
*    pileup = 0.0*
*    pvalue = 0.0*
*    fold_enrichment = 0.0*
*    qvalue = 0.0*
*    peak_name = ""*


*def extract_peaks():*

*    with open('MACS2_peaks.csv') as f:*
*        peaks = defaultdict(PeakInfo)*
*        for line in itertools.islice(f, 26, None):  # start=17, stop=None*

*            try:*
*                parts = [part.strip() for part in line.split(',')]*
*            except IndexError:*
*                continue*

*            peak = PeakInfo()*
*            peak.start = int(parts[1])*
*            peak.end = int(parts[2])*
*            peak.len = int(parts[3])*
*            peak.pvalue = float(parts[4])*
*            peak.pileup = float(parts[5])*
*            peak.fold_enrichment = float(parts[6])*
*            peak.qvalue = float(parts[7])*
*            peak.peak_name = parts[8]*
*            peaks[parts[0]] = peak*

*    return peaks*


*def extract_peak_seqs(fasta_file_name, out_file_name, peaks):*

*    out_file = open(out_file_name, "w")*
*    records = [] # list of SeqRecord to be written*

*    for record in SeqIO.parse(open(fasta_file_name, "rU"), "fasta"):*
*        # an additional check to see if there are contigs with no uniref90
reference genes to begin with*
*        if not record.id <http://record.id> in peaks:*
*            continue*

*        peak = peaks[record.id <http://record.id>]*

*        peak.fold_enrichment*

*
records.append(SeqRecord(Seq(str(record.seq[peak.start-1:peak.end]),
generic_dna),*
*                        id=record.id <http://record.id> + "_" +
peak.peak_name, description=""))*

*    SeqIO.write(records, out_file, "fasta")*

*    out_file.close()*


*if __name__ == "__main__":*

*    peaks = extract_peaks()*

*    extract_peak_seqs("genome.fasta", "peaks.fasta", peaks)*


Thank you in advance.

Mic
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20151214/922977c5/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: MACS2_peaks.csv
Type: text/csv
Size: 24021 bytes
Desc: not available
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20151214/922977c5/attachment-0001.csv>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: extract_genes.py
Type: text/x-python-script
Size: 2082 bytes
Desc: not available
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20151214/922977c5/attachment-0001.bin>


More information about the Biopython mailing list