[Bioperl-l] Output a subset of FASTA data from a single large file

Fri Jun 9 14:52:59 UTC 2006

Michael Oldham wrote:
> Dear all,
> 
> I am a total Bioperl newbie struggling to accomplish a conceptually simple
> task.  I have a single large fasta file containing about 200,000 probe
> sequences (from an Affymetrix microarray), each of which looks like this:
> 
> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; Antisense;
> TGGCTCCTGCTGAGGTCCCCTTTCC
> 
> What I would like to do is extract from this file a subset of ~130,800
> probes
[snip]
> #!/usr/bin/perl -w
> 
>  # script 1: create the index
> 
>  use Bio::Index::Fasta;
[snip]
> I'm not sure if this is the most sensible approach, and even if it is, I'm
> not sure what to do next.  Any help would be greatly appreciated!

I'd say you're on the right lines. Next, you should continue reading the
rest of the synopsis and description in the docs for Bio::Index::Fasta.

Perhaps it's not clear from the docs, but you don't need to call
$inx->make_index(@ARGV); if you've already provided -file to new() and
are only dealing with one file. However, you can't supply -file to new()
if you want to change the id_parser, which you do, since you need to
tell it how to detect your probe set ID.
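
As a rough illustration of the id_parser idea, here is a minimal sketch of a parser for headers like the one quoted above. The sub name get_probe_set_id is made up for this example, and it assumes the probe set ID is always the third colon-separated field of the header, which the thread does not confirm for every record. Bio::Index::Fasta passes the whole header line, including the leading ">", to the parser.

```perl
use strict;
use warnings;

# Return the probe set ID from a FASTA header line, or an empty list if
# the line does not look like a probe header.
# Assumption: the ID is the third colon-separated field, e.g. "1138_at".
sub get_probe_set_id {
    my ($header) = @_;
    return $header =~ /^>\s*probe:[^:]+:([^:]+):/ ? $1 : ();
}

# Try it on the header quoted in the original post
my $example = '>probe:HG_U95Av2:1138_at:395:301; '
            . 'Interrogation_Position=2631; Antisense;';
print get_probe_set_id($example), "\n";   # prints 1138_at
```

It would be registered with $inx->id_parser(\&get_probe_set_id); before calling make_index(), as the module docs describe. If several probes share one probe set ID, the keys would collide, so a longer key (for example one that also includes the x:y fields) may be needed; that depends on the data and on which IDs you want to look up.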

Having indexed your file, you can then output the desired sequences with
a foreach loop like the one suggested in the synopsis. (You could have
that in the same script.)
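
Putting the two steps together might look something like the sketch below. It follows the calls shown in the Bio::Index::Fasta synopsis (new, id_parser, make_index, fetch, and Bio::SeqIO's write_seq), where -filename names the index file and the FASTA file is handed to make_index(). The sub names build_index and write_subset, the three-argument command line and the inline parser (third colon-separated header field as the key) are assumptions made for illustration, not anything from the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Index::Fasta;
use Bio::SeqIO;

# Build the index. The id_parser is set before make_index() so the
# custom key is used while the FASTA file is being read.
sub build_index {
    my ($index_file, $fasta_file) = @_;
    my $inx = Bio::Index::Fasta->new(-filename   => $index_file,
                                     -write_flag => 1);
    # Assumed key: third colon-separated field of the header line
    $inx->id_parser(sub {
        return $_[0] =~ /^>\s*probe:[^:]+:([^:]+):/ ? $1 : ();
    });
    $inx->make_index($fasta_file);
    return $inx;
}

# Write the records for the wanted IDs to a filehandle in FASTA format.
# Returns the number of records written.
sub write_subset {
    my ($inx, $ids, $fh) = @_;
    my $out = Bio::SeqIO->new(-format => 'Fasta', -fh => $fh);
    my $written = 0;
    foreach my $id (@$ids) {
        # eval guards against versions where fetch() throws on a miss
        my $seq = eval { $inx->fetch($id) };
        unless ($seq) {
            warn "No record found for '$id'\n";
            next;
        }
        $out->write_seq($seq);
        $written++;
    }
    return $written;
}

# Hypothetical usage: script.pl probes.idx probes.fasta wanted_ids.txt
if (@ARGV == 3) {
    my ($index_file, $fasta_file, $id_file) = @ARGV;
    open my $in, '<', $id_file or die "Cannot open $id_file: $!";
    chomp(my @ids = grep { /\S/ } <$in>);   # one ID per line
    close $in;
    my $inx = build_index($index_file, $fasta_file);
    write_subset($inx, \@ids, \*STDOUT);
}
```

Run with an index file name, the FASTA file and a file holding one ID per line, this should print the matching records to STDOUT. Note that fetch() returns a single record per key, so this only extracts every wanted probe if the key is unique per probe.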

One thing I'm not clear on is why it needs -write_flag => 1. Why can't 
it index a read-only database? Even when you set -write_flag allowing it 
to work, it doesn't write anything...