[Bioperl-l] how to remove identical sequences from a dataset
Chris Fields
cjfields at uiuc.edu
Tue Aug 5 11:19:54 EDT 2008
Here are two links that go into more detail (the second is a specific
implementation):
http://en.wikipedia.org/wiki/Sequence_clustering
http://www.bioinformatics.org/cd-hit/
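
For a dataset of about 200 sequences, the simple hash approach Bernd
mentions below can be sketched in plain Perl. This is only a minimal
sketch under stated assumptions, not the bp_nrdb script: read_fasta and
unique_by_sequence are hypothetical helper names, the input is an inline
example rather than a real file, and treating upper- and lower-case
sequences as identical is my assumption. It removes exact duplicates
only; it does no clustering of similar sequences.

```perl
use strict;
use warnings;

# Hypothetical helper: parse FASTA text from a filehandle into
# [header, sequence] pairs, joining wrapped sequence lines.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/\r$//;                 # tolerate Windows line endings
        next if $line =~ /^\s*$/;         # skip blank lines
        if ($line =~ /^>(.*)/) {
            push @records, [$1, ''];      # start a new record
        }
        elsif (@records) {
            $line =~ s/\s+//g;            # drop stray whitespace
            $records[-1][1] .= $line;     # append to current sequence
        }
    }
    return @records;
}

# Hypothetical helper: keep the first record seen for each distinct
# sequence, using the (upper-cased) sequence string as the hash key.
sub unique_by_sequence {
    my @records = @_;
    my %seen;
    return grep { !$seen{ uc($_->[1]) }++ } @records;
}

# Inline example data (assumption: real input would come from a file).
my $fasta = <<'END';
>seq1
TTAGCCTAA
>seq2
TTAGCAGAA
>seq3 duplicate of seq1
ttagcc
taa
END

open my $fh, '<', \$fasta or die "Cannot open in-memory FASTA: $!";
my @unique = unique_by_sequence(read_fasta($fh));
close $fh;

print ">$_->[0]\n$_->[1]\n" for @unique;
```

With real data you would open the FASTA file instead of the in-memory
string. For much larger datasets, keying the hash on a digest of the
sequence (for example md5_hex from Digest::MD5) rather than the full
sequence keeps the keys short.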
chris
On Aug 5, 2008, at 5:28 AM, Diego Mauricio Riano Pachon wrote:
> Hi all,
>
> Or you might try a non-BioPerl solution that works pretty well; see:
>
> http://blast.wustl.edu/pub/nrdb/executables/nrdb.linux-x86
>
> Best,
>
> Diego
>
> Bernd Web wrote:
>> Hi,
>> There is a BioPerl utility script that does this.
>> See http://www.bioperl.org/wiki/Bioperl_scripts under the Utilities
>> header.
>> " scripts/utilities/bp_nrdb.PLS
>> Make a non-redundant database based on sequence, not id. Requires
>> Digest::MD5."
>> Alternatively, you can make a hash using the sequences as keys.
>> Regards,
>> Bernd
>> On Tue, Aug 5, 2008 at 9:36 AM, Shaohua Fan <lengjingmao at gmail.com>
>> wrote:
>>> Hi there,
>>>
>>> I have a sequence dataset which contains about 200 sequences, and
>>> some of them are identical. Is there a BioPerl module which can
>>> remove those identical sequences?
>>>
>>> thanks a lot.
>>> yours,
>>> shaohua
>>> ----- Original Message -----
>>> From: "Benbo" <btemperton at googlemail.com>
>>> To: <Bioperl-l at lists.open-bio.org>
>>> Sent: Sunday, August 03, 2008 4:05 AM
>>> Subject: [Bioperl-l] Finding possible primers regex
>>>
>>>
>>>> Hi there,
>>>> I'm trying to write a Perl script to scan an aligned multiple-entry
>>>> FASTA file and find possible primers. So far I've produced a string
>>>> which contains bases which match all sequences and * where they
>>>> don't match, e.g.
>>>> 1) TTAGCCTAA
>>>> 2) TTAGCAGAA
>>>> 3) TTACCCTAA
>>>>
>>>> would give TTA*C**AA.
>>>>
>>>> I want to parse this string and pull out all sequences which are
>>>> 18-21 bp in length and have no more than 4 * in them.
>>>>
>>>> So far, I've got this:
>>>>
>>>> while($fragment_match =~ /([GTAC*]{18,21})/g){
>>>> print "$1\n";
>>>> }
>>>>
>>>> hoping to match all fragments 18-21 characters in length. However,
>>>> even that doesn't work, as it has essentially chunked the string
>>>> into 21-char blocks, rather than what I hoped for:
>>>> 0-18
>>>> 0-19
>>>> 0-20
>>>> 0-21
>>>> 1-19
>>>> 1-20
>>>> 1-21
>>>> 1-22
>>>>
>>>> etc.
>>>>
>>>> Can anyone let me know if this is already possible in BioPerl, or
>>>> how one would go about it with a regex? Sadly I'm fairly new to Perl
>>>> and still getting to grips with BioPerl, so please treat me gently :).
>>>>
>>>> Many thanks,
>>>>
>>>> Ben
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context: http://www.nabble.com/Finding-possible-primers-regex-tp18792782p18792782.html
>>>> Sent from the Perl - Bioperl-L mailing list archive at Nabble.com.
>>>>
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>
>
> --
> ___________________________________
> Diego Mauricio Riaño Pachón
> Biologist - PhD student
> AG Mueller-Roeber
> Institute for Biochemistry and Biology
> University of Potsdam
>
> Address: Karl-Liebknecht-Str. 24-25
> Haus 20
> 14476 Golm
> Germany
>
> Tel: +49 331 977 2809
> Fax: +49 331 977 2512
>
> web: http://www.geocities.com/dmrp.geo
Christopher Fields
Postdoctoral Researcher
Lab of Dr. Marie-Claude Hofmann
College of Veterinary Medicine
University of Illinois Urbana-Champaign