[Bioperl-l] Output a subset of FASTA data from a single large file

Sun Jun 11 04:32:04 UTC 2006

What happens if you just print $idmatch or $1 (i.e. check to see if  
the regex matches anything)?  If there is nothing printed then either  
the regex isn't working as expected or there is something logically  
wrong.  The problem may be that the captured string must match the id  
exactly, the id being the key to the %ID hash; any extra characters  
picked up by the regex outside of your id key and you will not get  
anything.  Looking at Malcolm's regex it should work just fine, but  
we only had one example sequence to try here.

If your while loop is set up like this won't it only print only the  
matched description lines to the outfile (no sequence) even if there  
is a match?  Or is this what you wanted?   If you want the sequence  
you should add 'print OUT <PROBES>;' after the 'print OUT;' line.

Chris

On Jun 9, 2006, at 8:39 PM, Michael Oldham wrote:

> Thanks to everyone for their helpful advice.  I think I am getting  
> closer,
> but no cigar quite yet.  The script below runs quickly with no  
> errors--but
> the output file is empty.  It seems that the problem must lie  
> somewhere in
> the 'while' loop, and I'm sure it's quite obvious to a more  
> experienced
> eye--but not to mine!  Any suggestions?  Thanks again for your help.
>
> --Mike O.
>
>
> #!/usr/bin/perl -w
>
> use strict;
>
> my $IDs = 'ID.dat.txt';
>
> unless (open(IDFILE, $IDs)) {
> 	print "Could not open file $IDs!\n";
> 	}
>
> my $probes = 'HG_U95Av2_probe_fasta.txt';
>
> unless (open(PROBES, $probes)) {
> 	print "Could not open file $probes!\n";
> 	}
>
> open (OUT,'>','probe_subset.txt') or die "Can't write output: $!";
>
> my @ID = <IDFILE>;
> chomp @ID;
> my %ID = map {($_, 1)} @ID; #Note: This creates a hash with  
> keys=PSIDs and
> all values=1.
>
> 	while (<PROBES>) {
> 		my $idmatch = exists($ID{$1}) if /^>probe:\w+:(\w+):/;
> 		if ($idmatch){
> 			print OUT;
> 		}
> 	}
> exit;
>
>
> -----Original Message-----
> From: Cook, Malcolm [mailto:MEC at stowers-institute.org]
> Sent: Friday, June 09, 2006 7:58 AM
> To: Michael Oldham; bioperl-l at lists.open-bio.org
> Subject: RE: [Bioperl-l] Output a subset of FASTA data from a  
> single large
> file
>
>
>
> I wouldn't bioperl for this, or create an index.  Perl would do  
> fine and
> probably be faster.
>
> Assuming your ids are one per line in a file named id.dat looking like
> this
>
> 1138_at
> 1134_at
> etc..
>
> this should work:
>
> perl -n -e 'BEGIN{open(idfile, shift) or die "no can open"; @ID =
> <idfile>; chomp @ID; %ID = map {($_, 1)} @ID;}  $inmatch =
> exists($ID{$1}) if /^>probe:\w+:(\w+):/; print if $inmatch' id.dat
> mybigfile.fa
>
> good luck
>
> --Malcolm Cook
>
>> -----Original Message-----
>> From: bioperl-l-bounces at lists.open-bio.org
>> [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of
>> Michael Oldham
>> Sent: Thursday, June 08, 2006 9:08 PM
>> To: bioperl-l at lists.open-bio.org
>> Subject: [Bioperl-l] Output a subset of FASTA data from a
>> single large file
>>
>> Dear all,
>>
>> I am a total Bioperl newbie struggling to accomplish a
>> conceptually simple
>> task.  I have a single large fasta file containing about 200,000  
>> probe
>> sequences (from an Affymetrix microarray), each of which looks
>> like this:
>>
>>> probe:HG_U95Av2:1138_at:395:301; Interrogation_Position=2631;
>> Antisense;
>> TGGCTCCTGCTGAGGTCCCCTTTCC
>>
>> What I would like to do is extract from this file a subset of  
>> ~130,800
>> probes (both the header and the sequence) and output this
>> subset into a new
>> fasta file.  These 130,800 probes correspond to 8,175 probe set IDs
>> ("1138_at" is the probe set ID in the header listed above); I
>> have these
>> 8,175 IDs listed in a separate file.  I *think* that I managed
>> to create an
>> index of all 200,000 probes in the original fasta file using
>> the following
>> script:
>>
>> #!/usr/bin/perl -w
>>
>> # script 1: create the index
>>
>> use Bio::Index::Fasta;
>> use strict;
>> my $Index_File_Name = shift;
>> my $inx = Bio::Index::Fasta->new(
>>     -filename => $Index_File_Name,
>>     -write_flag => 1);
>> $inx->make_index(@ARGV);
>>
>> I'm not sure if this is the most sensible approach, and even
>> if it is, I'm
>> not sure what to do next.  Any help would be greatly appreciated!
>>
>> Many thanks,
>> Mike O.
>>
>>
>>
>>
>> --
>> No virus found in this outgoing message.
>> Checked by AVG Free Edition.
>> Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date:  
>> 6/8/2006
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
> --
> No virus found in this incoming message.
> Checked by AVG Free Edition.
> Version: 7.1.394 / Virus Database: 268.8.3/359 - Release Date:  
> 6/8/2006
>
> --
> No virus found in this outgoing message.
> Checked by AVG Free Edition.
> Version: 7.1.394 / Virus Database: 268.8.3/360 - Release Date:  
> 6/9/2006
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign