[Biopython] Cookbook suggestion
Justin Gibbons
jgibbons1 at mail.usf.edu
Sun Apr 14 17:58:53 UTC 2013
Sorry I accidentally sent the last email.
You need the indexed fasta to get a copy of the sequence records that match
your criteria:
indexed_fasta=SeqIO.index(file_path, 'fasta')
SeqIO.write( (indexed_fasta[seq_id] for seq_id in seq_ids),
new_file_path,'fasta')
As for editing the wiki when I click on "Login with OpenID" I get sent to a
blank page. I also tried clicking on "Login" and tired to create a new
account and was told "The action you have requested is limited to users in
the group: Administrators<http://biopython.org/w/index.php?title=Biopython:Administrators&action=edit&redlink=1>
."
On Sun, Apr 14, 2013 at 1:53 PM, Justin Gibbons <jgibbons1 at mail.usf.edu>wrote:
> My only goal was to demonstrate how to use SeqIO.write without holding all
> of the sequence records in memory by using a generator expression:
>
> SeqIO.write( (indexed_fasta[seq_id] for seq_id in seq_ids),
> new_file_path,'fasta')
>
> Everything else was just to provide context for the SeqIO.write()
> function, but it just ended up just being confusing.
>
> I am assuming that you want to check the individual fasta records for
> specific criteria and then write those that match the criteria to a new
> file. Which is why I wrote this:
>
> for seq_record in SeqIO.parse(file_path, 'fasta'):
> #Filter according to some critria:
> seq_ids.add(seq_record.id)
>
> For example you can create individual sets holding the sequence IDs of
> sequences that are within a given size range, and aren't repetitive. So
> that seq_ids=correct_length_set.intersection(non_repetitive_set)
>
> You need the indexed fasta so that you can get a copy of the sequence
> records that match your criteria:
>
> ndexed_fasta=SeqIO.index(
> file_path, 'fasta') #Can be searched by sequence
> ID but is not held in memory
>
>
>
>
>
> On Sat, Apr 13, 2013 at 4:27 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:
>
>> Hi Justin,
>>
>> On Sat, Apr 13, 2013 at 9:13 PM, Justin Gibbons <jgibbons1 at mail.usf.edu>
>> wrote:
>> > I want to add the following to the cookbook but I am unable to create an
>> > account.
>>
>> Hmm - we should fix that. Is there a specific error message
>> from the wiki?
>>
>> > #using SeqIO.write() without holding records in memory.
>> >
>> > from Bio import SeqIO
>> >
>> >
>> > seq_ids=set() #create an empty set to hold the sequence IDs.
>> > indexed_fasta=SeqIO.index(file_path, 'fasta') #Can be searched by
>> sequence
>> > ID but is not held in memory
>> >
>> > for seq_record in SeqIO.parse(file_path, 'fasta'):
>> > #Filter according to some critria:
>> > seq_ids.add(seq_record.id)
>>
>> Why do call SeqIO.index, but not use it and instead get
>> the ID list by doing a full parse of the file? Note that calling
>> SeqIO.index is likely faster than SeqIO.parse because the
>> index code doesn't actually load the sequence information
>> etc - just the record identifier. This speed difference is more
>> obvious on heavier file formats like GenBank. e.g. These
>> single lines both get all the identifiers as a list:
>>
>> seq_ids = SeqIO.parse(file_path, 'fasta').keys()
>>
>> vs:
>>
>> seq_ids = [rec.id for rec in SeqIO.parse(file_path, 'fasta')]
>>
>> Also note that using a set rather than a list for the ids
>> means the order is lost - which may be important.
>>
>> > #write the fasta records to a new file using SeqIO.write()
>> >
>> > SeqIO.write([indexed_fasta[seq_id] for seq_id in seq_ids],
>> new_file_path,
>> > 'fasta')
>> >
>>
>> That last line uses a list comprehension,
>> [indexed_fasta[seq_id] for seq_id in seq_ids]
>>
>> That will therefore load all the records into memory as a list of
>> SeqRecord objects, which can be avoided with a list comprehension:
>>
>> (indexed_fasta[seq_id] for seq_id in seq_ids)
>>
>> i.e. round brackets not square.
>>
>> > So if someone who can edit the cookbook wants to add it feel free to.
>> >
>> > Justin Gibbons
>>
>> Feedback on the documentation and efforts to improve it
>> are always welcome. However, I'm not sure what your example
>> is trying to do yet - it seems to rewrite a FASTA file with the
>> records in a new order (with the order given by however
>> Python sorts the set of IDs).
>>
>> Thanks,
>>
>> Peter
>>
>
>
More information about the Biopython
mailing list