[Biopython] Filter, sort and write FASTA file
Mic
mictadlo at gmail.com
Mon Dec 14 09:50:25 UTC 2015
Hello,
1. I run MACS2 to find out were peaks are located (see attached).
2. I wrote a script (see attached) which:
1. read the peaks information
2. extract the sequence from *genome.fasta* with help of the peak
information
3. the extracted sequences are saved in a new FASTA file.
The *MACS_peaks.csv* file contains a column *fold_enrichmen*t and the
script safe this information in *peak.fold_enrichment.*
How is it possible to sort the output sequences with
highest peak.fold_enrichment?
*#!/usr/bin/env python*
*from Bio import SeqIO*
*from Bio.SeqRecord import SeqRecord*
*from collections import defaultdict*
*from collections import OrderedDict*
*from Bio.Alphabet import generic_dna*
*from Bio.Seq import Seq # BioPython Seq object*
*from Bio import SeqIO # BioPython SeqIO for manipulating fasta*
*import itertools*
*class PeakInfo:*
* start = 0*
* end = 0*
* len = 0*
* pileup = 0.0*
* pvalue = 0.0*
* fold_enrichment = 0.0*
* qvalue = 0.0*
* peak_name = ""*
*def extract_peaks():*
* with open('MACS2_peaks.csv') as f:*
* peaks = defaultdict(PeakInfo)*
* for line in itertools.islice(f, 26, None): # start=17, stop=None*
* try:*
* parts = [part.strip() for part in line.split(',')]*
* except IndexError:*
* continue*
* peak = PeakInfo()*
* peak.start = int(parts[1])*
* peak.end = int(parts[2])*
* peak.len = int(parts[3])*
* peak.pvalue = float(parts[4])*
* peak.pileup = float(parts[5])*
* peak.fold_enrichment = float(parts[6])*
* peak.qvalue = float(parts[7])*
* peak.peak_name = parts[8]*
* peaks[parts[0]] = peak*
* return peaks*
*def extract_peak_seqs(fasta_file_name, out_file_name, peaks):*
* out_file = open(out_file_name, "w")*
* records = [] # list of SeqRecord to be written*
* for record in SeqIO.parse(open(fasta_file_name, "rU"), "fasta"):*
* # an additional check to see if there are contigs with no uniref90
reference genes to begin with*
* if not record.id <http://record.id> in peaks:*
* continue*
* peak = peaks[record.id <http://record.id>]*
* peak.fold_enrichment*
*
records.append(SeqRecord(Seq(str(record.seq[peak.start-1:peak.end]),
generic_dna),*
* id=record.id <http://record.id> + "_" +
peak.peak_name, description=""))*
* SeqIO.write(records, out_file, "fasta")*
* out_file.close()*
*if __name__ == "__main__":*
* peaks = extract_peaks()*
* extract_peak_seqs("genome.fasta", "peaks.fasta", peaks)*
Thank you in advance.
Mic
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20151214/922977c5/attachment-0001.html>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: MACS2_peaks.csv
Type: text/csv
Size: 24021 bytes
Desc: not available
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20151214/922977c5/attachment-0001.csv>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: extract_genes.py
Type: text/x-python-script
Size: 2082 bytes
Desc: not available
URL: <http://mailman.open-bio.org/pipermail/biopython/attachments/20151214/922977c5/attachment-0001.bin>
More information about the Biopython
mailing list