[Bioperl-l] match patterns to retrieve seqs

Florent Angly florent.angly at gmail.com
Tue Mar 20 01:11:04 UTC 2012


Sure, this is a valid way to approach the problem. However, the line 
that gives you grief is most likely this one:
> if ($fastaseq ->id =~ /(@queries)/)
You cannot use an array like this in a regexp and hope that it will 
match any of its elements. Most likely, it is evaluated as a scalar, 
i.e. the number of elements in the array, just like if you were doing: 
my $val = @queries;

My solution would be to build a regular expression containing all the 
queries:
> my $re;
> for my $query (@queries) {
>    $re .= "$query|"; # query 1, OR query 2, ...
> }
> $re =~ s/\|$//;    # remove trailing |
> $re = '^'.$re.'$'; # anchor the regexp for full-matches only (no 
> partial matches)
> $re = qr/$re/;     # compile regexp for added speed

Then you can use this regular expression to match the sequence IDs:
> if ($fastaseq->id =~ /$re/) {

Regards,
Florent


On 19/03/12 06:36, Philipp Schiffer wrote:
> Sorry,
>
> at least my $queries should read my queries at .
>
>
> Am 18.03.12 17:49, schrieb Philipp Schiffer:
>> Hi!
>>
>> I am trying to write a small script to retrieve such sequences
>> (including headers) from a fasta file that match the headers I listed in
>> another file.
>> The fasta file looks like:
>> >maker-scaffold15249_size4120-snap-gene-0.2-mRNA-1 protein
>> AED:0.0448532948532949 eAED:0.0448532948532949 QI:42|-1|0|1|-1|0|1|1|191
>> MDILSLPLSFVCLMTIAWVVLAAALFLLWENGWSYFNSFYFTVVSFSTVGLGDMTPDYTR............. 
>>
>> >snap_masked-scaffold15249_size4120-abinit-gene-0.1-mRNA-1 protein
>> AED:1 eAED:1 QI:0|0|0|0|1|1|4|0|108
>> MHAMWMIRKFRLQWRWCMHGRPRDEQPEHHCVMGFLAVTHANRAAYNTDTLPTMLTERRK............
>> >snap_masked-scaffold15249_size4120-abinit-gene-0.0-mRNA-1 protein
>> AED:1 eAED:1 QI:55|0|0|0|1|1|2|227|5
>> MLHHR.......
>>
>> while the list file is:
>> maker-scaffold539_size40162-snap-gene-0.4-mRNA-1
>> snap_masked-scaffold2197_size27871-abinit-gene-0.3-mRNA-1
>> maker-scaffold1000_size34843-snap-gene-0.7-mRNA-1
>> maker-scaffold10087_size13457-snap-gene-0.3-mRNA-1
>>
>> My perl so far would be:
>>
>> #! /usr/bin/perl -w
>>
>> use Bio::SeqIO;
>>
>> my $usage = "list_from_fasta.pl fastafile format headerlistfile\n";
>> my $file = shift or die $usage;
>> my $format = shift or die $usage;
>> my $headerlist = shift or die $usage;
>>
>> open LIST, "<$headerlist" or die $!;
>> my $queries = <LIST>; #might be a hash as well, don't know what's better
>>
>> my $fastafile = Bio::SeqIO->new(-file => "<$file", -format => $format);
>> while (my $fastaseq = $fastafile->next_seq){
>>
>> if ($fastaseq ->id =~ /(@queries)/) # here is the problem, after much
>> trying it seems actually to look for matches, but does not find any,
>> still there must be
>>
>> {print $fastaseq-> id,"\n" and print $fastaseq->seq,"\n";}
>> ;
>> };
>>
>> So I am wondering, is this the right way to tackle the problem. Will it
>> work? Should I use another bioPerl module? My thinking is, that the
>> pattern matching operation after the 'if' statement is wrong.
>>
>> Any help highly appreciated!
>>
>> Thanks a lot.
>>
>> Philipp
>>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l




More information about the Bioperl-l mailing list