[Biopython] Filter, sort and write FASTA file

Peter Cock p.j.a.cock at googlemail.com
Mon Dec 14 15:08:59 UTC 2015


On Mon, Dec 14, 2015 at 9:50 AM, Mic <mictadlo at gmail.com> wrote:
> Hello,
>
> I run MACS2 to find out were peaks are located (see attached).
> I wrote a script (see attached) which:
>
> read the peaks information
> extract the sequence from genome.fasta with help of the peak information
> the extracted sequences are saved in a new FASTA file.
>
> The MACS_peaks.csv file contains a column fold_enrichment and the script
> safe this information in peak.fold_enrichment.
>
> How is it possible to sort the output sequences with highest
> peak.fold_enrichment?

You are building a list of SeqRecord objects in memory, so you
would want to sort that before passing it to SeqIO.write.

e.g. Right now your code does something like this:

records = [] # list of SeqRecord to be written
for record in SeqIO.parse(open(fasta_file_name, "rU"), "fasta"):
    if ...:
        records.append(SeqRecord(...))
SeqIO.write(records, out_file, "fasta")



The easiest way to sort the records would be to make a list
of tuples (score, SeqRecord) which you can then sort. Then
once sorted, pull out just the SeqRecord objects for output.


score_and_records = []  # list of (score, record)
for record in SeqIO.parse(open(fasta_file_name, "rU"), "fasta"):
    if ...:
        score_and_records.append((peak.fold_enrichment, SeqRecord(...)))
# This will sort by score since that is first in the tuples:
score_and_records.sort()
records = [record for (score, record) in score_and_records]
SeqIO.write(records, out_file, "fasta")


Does that make sense?

Peter


More information about the Biopython mailing list