[Bioperl-l] match patterns to retrieve seqs

Frank Schwach fs5 at sanger.ac.uk
Mon Mar 19 14:36:37 UTC 2012


as you correctly guessed, your regex
/(@queries)/
isn't doing what you were hoping it would do, i.e. match against all 
elements of the array. You could compile a regex with all the 
alternative patterns from your array but I think that's not very 
efficient, so you could try something like this instead:

my $does_match = 0;
$fastaseq ->id =~ /$_/ && $does_match = 1 for @queries;
if ($does_match){
...
}

Frank




On 18/03/12 16:49, Philipp Schiffer wrote:
> Hi!
>
> I am trying to write a small script to retrieve such sequences
> (including headers) from a fasta file that match the headers I listed in
> another file.
> The fasta file looks like:
>  >maker-scaffold15249_size4120-snap-gene-0.2-mRNA-1 protein
> AED:0.0448532948532949 eAED:0.0448532948532949 QI:42|-1|0|1|-1|0|1|1|191
> MDILSLPLSFVCLMTIAWVVLAAALFLLWENGWSYFNSFYFTVVSFSTVGLGDMTPDYTR.............
>  >snap_masked-scaffold15249_size4120-abinit-gene-0.1-mRNA-1 protein
> AED:1 eAED:1 QI:0|0|0|0|1|1|4|0|108
> MHAMWMIRKFRLQWRWCMHGRPRDEQPEHHCVMGFLAVTHANRAAYNTDTLPTMLTERRK............
>  >snap_masked-scaffold15249_size4120-abinit-gene-0.0-mRNA-1 protein
> AED:1 eAED:1 QI:55|0|0|0|1|1|2|227|5
> MLHHR.......
>
> while the list file is:
> maker-scaffold539_size40162-snap-gene-0.4-mRNA-1
> snap_masked-scaffold2197_size27871-abinit-gene-0.3-mRNA-1
> maker-scaffold1000_size34843-snap-gene-0.7-mRNA-1
> maker-scaffold10087_size13457-snap-gene-0.3-mRNA-1
>
> My perl so far would be:
>
> #! /usr/bin/perl -w
>
> use Bio::SeqIO;
>
> my $usage = "list_from_fasta.pl fastafile format headerlistfile\n";
> my $file = shift or die $usage;
> my $format = shift or die $usage;
> my $headerlist = shift or die $usage;
>
> open LIST, "<$headerlist" or die $!;
> my $queries = <LIST>; #might be a hash as well, don't know what's better
>
> my $fastafile = Bio::SeqIO->new(-file => "<$file", -format => $format);
> while (my $fastaseq = $fastafile->next_seq){
>
> if ($fastaseq ->id =~ /(@queries)/) # here is the problem, after much
> trying it seems actually to look for matches, but does not find any,
> still there must be
>
> {print $fastaseq-> id,"\n" and print $fastaseq->seq,"\n";}
> ;
> };
>
> So I am wondering, is this the right way to tackle the problem. Will it
> work? Should I use another bioPerl module? My thinking is, that the
> pattern matching operation after the 'if' statement is wrong.
>
> Any help highly appreciated!
>
> Thanks a lot.
>
> Philipp
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l


-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 



More information about the Bioperl-l mailing list