[Bioperl-l] how to remove identical sequences from a dataset

Shaohua Fan lengjingmao at gmail.com
Tue Aug 5 03:36:23 EDT 2008


Hi there,

I have a dataset of about 200 sequences, some of which are identical. Is there a BioPerl module that can remove the identical sequences?

Thanks a lot.
Yours,
shaohua
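
One possible approach, sketched here in plain Perl rather than with a dedicated BioPerl module: remember each sequence string in a hash and keep only the first record that carries it. The record layout (hash refs with `id` and `seq` keys) and the function name `unique_by_sequence` are assumptions made for this example, and comparing case-insensitively is a design choice, not a requirement.

```perl
use strict;
use warnings;

# Keep the first record for each distinct sequence string.
# Each record is a hash ref { id => ..., seq => ... }; this layout is a
# hypothetical one chosen for the sketch, not a BioPerl structure.
sub unique_by_sequence {
    my @records = @_;
    my %seen;
    # uc() makes the comparison case-insensitive (an assumption).
    return grep { !$seen{ uc $_->{seq} }++ } @records;
}

my @records = (
    { id => 'seq1', seq => 'TTAGCCTAA' },
    { id => 'seq2', seq => 'TTAGCAGAA' },
    { id => 'seq3', seq => 'ttagcctaa' },    # same as seq1 apart from case
);

my @unique = unique_by_sequence(@records);
print join(',', map { $_->{id} } @unique), "\n";
```

With BioPerl, the same `%seen` test could be applied to `$seq->seq` while reading records with `Bio::SeqIO`'s `next_seq` and writing the kept ones with `write_seq`.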
----- Original Message ----- 
From: "Benbo" <btemperton at googlemail.com>
To: <Bioperl-l at lists.open-bio.org>
Sent: Sunday, August 03, 2008 4:05 AM
Subject: [Bioperl-l] Finding possible primers regex


> 
> Hi there, 
> I'm trying to write a perl script to scan an aligned multiple entry fasta
> file and find possible primers. So far I've produced a string which contains
> bases which match all sequences and * where they don't match e.g.
> 1) TTAGCCTAA
> 2) TTAGCAGAA
> 3) TTACCCTAA
> 
> would give TTA*C**AA.
> 
> I want to parse this string and pull out all sequences which are 18-21 bp in
> length and have no more than 4 * in them.
> 
> So far, I've got this:
> 
> while($fragment_match =~ /([GTAC*]{18,21})/g){
> print "$1\n";
> }
> 
> hoping to match all fragments 18-21 characters in length. However, even that
> doesn't work: it essentially chunks the string into 21-character blocks, rather
> than producing the overlapping ranges I hoped for:
> 0-18
> 0-19
> 0-20
> 0-21
> 1-19
> 1-20
> 1-21
> 1-22
> 
> etc.
> 
> Can anyone let me know whether this is already possible in BioPerl, or how one
> would go about it with a regex? Sadly, I'm fairly new to Perl and still getting
> to grips with BioPerl, so please treat me gently :).
> 
> Many thanks,
> 
> Ben
> 
> 
> 
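
The chunking described in the quoted message happens because each /g match consumes the characters it matched, so the next match starts after them. A minimal sketch of one way around this, assuming plain Perl is acceptable: put the capture inside a zero-width lookahead so that matching advances one position at a time, and loop over the lengths explicitly, since a greedy {18,21} would report only the longest window at each start. The function name `candidate_windows` and its parameters are hypothetical; the demo uses the short example string from the message with smaller limits.

```perl
use strict;
use warnings;

# Return every window of $min..$max characters that contains at most
# $max_stars '*' characters, as [start, fragment] pairs (0-based start).
sub candidate_windows {
    my ($consensus, $min, $max, $max_stars) = @_;
    my @hits;
    for my $len ($min .. $max) {
        # The lookahead consumes no characters, so successive matches
        # start one position apart and the windows can overlap.
        while ($consensus =~ /(?=([GTAC*]{$len}))/g) {
            my $fragment = $1;
            my $stars    = ($fragment =~ tr/*//);    # counts '*', changes nothing
            push @hits, [ $-[0], $fragment ] if $stars <= $max_stars;
        }
    }
    return @hits;
}

# Demo on the example string from the message, with small limits:
# windows of 3-4 characters containing at most one '*'.
for my $hit (candidate_windows('TTA*C**AA', 3, 4, 1)) {
    printf "%d\t%s\n", @$hit;
}
```

For the case in the message, the call would be `candidate_windows($fragment_match, 18, 21, 4)`.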