[Biopython] Cookbook suggestion
Justin Gibbons
jgibbons1 at mail.usf.edu
Sun Apr 14 17:53:26 UTC 2013
My only goal was to demonstrate how to use SeqIO.write without holding all
of the sequence records in memory by using a generator expression:
SeqIO.write((indexed_fasta[seq_id] for seq_id in seq_ids),
            new_file_path, 'fasta')
Everything else was just to provide context for the SeqIO.write() call,
but it ended up being confusing.
I am assuming that you want to check the individual fasta records against
specific criteria and then write those that match the criteria to a new
file, which is why I wrote this:
for seq_record in SeqIO.parse(file_path, 'fasta'):
    # Filter according to some criteria:
    seq_ids.add(seq_record.id)
For example, you can create individual sets holding the sequence IDs of the
sequences that are within a given size range and of those that aren't
repetitive, so that seq_ids = correct_length_set.intersection(non_repetitive_set).
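For instance, something along these lines (a rough sketch; the length
bounds and the is_repetitive() check are just placeholders for whatever
criteria you actually use):

from Bio import SeqIO

correct_length_set = set()
non_repetitive_set = set()

for seq_record in SeqIO.parse(file_path, 'fasta'):
    # Placeholder criterion 1: length within a given range
    if 100 <= len(seq_record) <= 10000:
        correct_length_set.add(seq_record.id)
    # Placeholder criterion 2: is_repetitive() is whatever check you use
    if not is_repetitive(seq_record.seq):
        non_repetitive_set.add(seq_record.id)

# Keep only the IDs that satisfy both criteria
seq_ids = correct_length_set.intersection(non_repetitive_set)
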
You need the indexed fasta so that you can get a copy of the sequence
records that match your criteria:

indexed_fasta = SeqIO.index(file_path, 'fasta')
# Can be searched by sequence ID but is not held in memory
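Putting it all together, a minimal sketch of the complete workflow
(file_path and new_file_path are placeholder filenames, and the filtering
step here just keeps every record):

from Bio import SeqIO

file_path = 'input.fasta'         # placeholder input filename
new_file_path = 'filtered.fasta'  # placeholder output filename

# Collect the IDs of the records you want to keep
seq_ids = set()
for seq_record in SeqIO.parse(file_path, 'fasta'):
    # Filter according to some criteria (here every record is kept):
    seq_ids.add(seq_record.id)

# Dict-like lookup by ID without loading all the records into memory
indexed_fasta = SeqIO.index(file_path, 'fasta')

# The generator expression means only one SeqRecord is in memory at a time
count = SeqIO.write((indexed_fasta[seq_id] for seq_id in seq_ids),
                    new_file_path, 'fasta')
print('Wrote %i records' % count)
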
On Sat, Apr 13, 2013 at 4:27 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Hi Justin,
>
> On Sat, Apr 13, 2013 at 9:13 PM, Justin Gibbons <jgibbons1 at mail.usf.edu>
> wrote:
> > I want to add the following to the cookbook but I am unable to create an
> > account.
>
> Hmm - we should fix that. Is there a specific error message
> from the wiki?
>
> > #using SeqIO.write() without holding records in memory.
> >
> > from Bio import SeqIO
> >
> >
> > seq_ids=set() #create an empty set to hold the sequence IDs.
> > indexed_fasta = SeqIO.index(file_path, 'fasta')  # Can be searched by
> > sequence ID but is not held in memory
> >
> > for seq_record in SeqIO.parse(file_path, 'fasta'):
> >     # Filter according to some criteria:
> >     seq_ids.add(seq_record.id)
>
> Why do you call SeqIO.index, but then not use it, and instead get
> the ID list by doing a full parse of the file? Note that calling
> SeqIO.index is likely faster than SeqIO.parse because the
> index code doesn't actually load the sequence information
> etc - just the record identifiers. This speed difference is more
> obvious on heavier file formats like GenBank. e.g. These
> single lines both get all the identifiers as a list:
>
> seq_ids = SeqIO.index(file_path, 'fasta').keys()
>
> vs:
>
> seq_ids = [rec.id for rec in SeqIO.parse(file_path, 'fasta')]
>
> Also note that using a set rather than a list for the ids
> means the order is lost - which may be important.
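>
> For example, an (untested) sketch that keeps the IDs in the original
> file order by using a list instead, with keep_record() standing in for
> whatever filter you apply:
>
> seq_ids = [rec.id for rec in SeqIO.parse(file_path, 'fasta')
>            if keep_record(rec)]  # keep_record() is a placeholder filter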
>
> > #write the fasta records to a new file using SeqIO.write()
> >
> > SeqIO.write([indexed_fasta[seq_id] for seq_id in seq_ids], new_file_path,
> > 'fasta')
> >
>
> That last line uses a list comprehension,
> [indexed_fasta[seq_id] for seq_id in seq_ids]
>
> That will therefore load all the records into memory as a list of
> SeqRecord objects, which can be avoided with a generator expression:
>
> (indexed_fasta[seq_id] for seq_id in seq_ids)
>
> i.e. round brackets not square.
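>
> So the write call would look something like this (untested sketch):
>
> SeqIO.write((indexed_fasta[seq_id] for seq_id in seq_ids),
>             new_file_path, 'fasta')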
>
> > So if someone who can edit the cookbook wants to add it, feel free to.
> >
> > Justin Gibbons
>
> Feedback on the documentation and efforts to improve it
> are always welcome. However, I'm not sure what your example
> is trying to do yet - it seems to rewrite a FASTA file with the
> records in a new order (with the order given by however
> Python sorts the set of IDs).
>
> Thanks,
>
> Peter
>