[Bioperl-l] how to remove identical sequences from a dataset

Bernd Web bernd.web at gmail.com
Tue Aug 5 05:49:55 EDT 2008


Hi,

There is a BioPerl utility script that does this.
See http://www.bioperl.org/wiki/Bioperl_scripts under the Utilities header.

"scripts/utilities/bp_nrdb.PLS
    Make a non-redundant database based on sequence, not id.
    Requires Digest::MD5."

Alternatively, you can make a hash using the sequences as keys.
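
A minimal sketch of the hash approach, in case it helps. The record
layout (hash refs with "id" and "seq" keys) and the sub name
unique_by_sequence are made up for illustration, and the
case-insensitive comparison is an assumption. With BioPerl you would
read the records via Bio::SeqIO and use $seq->seq as the key instead.

```perl
use strict;
use warnings;

# Keep the first record seen for each distinct sequence string.
# uc() makes the comparison case-insensitive (an assumption; drop it
# if upper/lower case should count as different sequences).
sub unique_by_sequence {
    my @records = @_;
    my %seen;
    return grep { !$seen{ uc $_->{seq} }++ } @records;
}

# Hypothetical in-memory records, only for demonstration.
my @records = (
    { id => 'seq1', seq => 'TTAGCCTAA' },
    { id => 'seq2', seq => 'TTAGCAGAA' },
    { id => 'seq3', seq => 'TTAGCCTAA' },    # same sequence as seq1
);

my @unique = unique_by_sequence(@records);

# Print the surviving records in FASTA-like form.
print ">$_->{id}\n$_->{seq}\n" for @unique;
```

Here seq3 is dropped because its sequence is already stored under
seq1. For about 200 sequences, keeping the full sequence as the hash
key is fine; the MD5 digest used by bp_nrdb.PLS mainly helps with
memory on large datasets.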


Regards,
Bernd

On Tue, Aug 5, 2008 at 9:36 AM, Shaohua Fan <lengjingmao at gmail.com> wrote:
> Hi there,
>
> I have a sequence dataset which contains about 200 sequences. There are some identical sequences in it. Are there any BioPerl modules which can remove those identical sequences?
>
> thanks a lot.
> yours,
> shaohua
> ----- Original Message -----
> From: "Benbo" <btemperton at googlemail.com>
> To: <Bioperl-l at lists.open-bio.org>
> Sent: Sunday, August 03, 2008 4:05 AM
> Subject: [Bioperl-l] Finding possible primers regex
>
>
>>
>> Hi there,
>> I'm trying to write a perl script to scan an aligned multiple entry fasta
>> file and find possible primers. So far I've produced a string which contains
>> bases which match all sequences and * where they don't match e.g.
>> 1) TTAGCCTAA
>> 2) TTAGCAGAA
>> 3) TTACCCTAA
>>
>> would give TTA*C**AA.
>>
>> I want to parse this string and pull out all sequences which are 18-21 bp in
>> length and have no more than 4 * in them.
>>
>> So far, I've got this:
>>
>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>> print "$1\n";
>> }
>>
>> hoping to match all fragments 18-21 characters in length. However even that
>> doesn't work as it has essentially chunked it into 21 char blocks, rather
>> than what I hoped for of
>> 0-18
>> 0-19
>> 0-20
>> 0-21
>> 1-19
>> 1-20
>> 1-21
>> 1-22
>>
>> etc.
>>
>> Can anyone let me know if this is already possible in BioPerl, or how one
>> would go about it with regex. Sadly I'm fairly new to perl and getting to
>> grips with BioPerl, so please treat me gently :).
>>
>> Many thanks,
>>
>> Ben
>>
>>
>>
>> --
>> View this message in context: http://www.nabble.com/Finding-possible-primers-regex-tp18792782p18792782.html
>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>

