[Bioperl-l] Output a subset of FASTA data from a single large file

Fri Jun 9 15:01:05 UTC 2006

> -----Original Message-----
> From: bioperl-l-bounces at lists.open-bio.org 
> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of 
> Michael Oldham
> Sent: 09 June 2006 03:08
> To: bioperl-l at lists.open-bio.org
> Subject: [Bioperl-l] Output a subset of FASTA data from a 
> single large file
> 
> Dear all,
> 
> I am a total Bioperl newbie struggling to accomplish a 
> conceptually simple task.  I have a single large fasta file 
> containing about 200,000 probe sequences (from an Affymetrix 
> microarray), each of which looks like this:
> 
> >probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631; 
> >Antisense;
> TGGCTCCTGCTGAGGTCCCCTTTCC

Unfortunately that's not Fasta format (which only has a single header
line starting with a '>'.  I'd imagine that most programs which deal
with fasta which read that entry would see it as two sequences, the
first of which is empty.

> What I would like to do is extract from this file a subset of 
> ~130,800 probes (both the header and the sequence) and output 
> this subset into a new fasta file.  These 130,800 probes 
> correspond to 8,175 probe set IDs ("1138_at" is the probe set 
> ID in the header listed above)

If you're only having to do this once then it should be fairly quick to
knock up a one off script to do this.  Since you've only got 8000ish
probeset ids then you can probably just read those into a hash to start
with then parse through your big sequence file with something like;

#!perl
use warnings;
use strict;

my %probe_ids;

# Add real code here to populate your hash
$probe_ids{1138_at} = 1;
##########################################

open (IN,'your_affy_file.txt') or die "Can't read affy file: $!";

open (OUT,'>','probe_list.txt') or die "Can't write output: $!";

while (<IN>) {

  if (/^>probe/) {
    # This assumes there are always 3 lines per probe entry
    if (exists $probe_ids{(split(/:/))[2]}) {
      print OUT;
      print OUT scalar <IN>;
      print OUT scalar <IN>;
    }
  }
}