[Bioperl-l] How to remove redundancy ?

Marc Logghe Marc.Logghe@devgen.com
Fri, 15 Nov 2002 17:34:03 +0100


I think he means a data set that is non-redundant at the sequence level,
like nr and nt of GenBank.
I was actually looking for the same thing a few days ago. In my search I
bumped into 'CleanUP', but that appears to work only on nucleotide
sequences; plus, it is no longer available for download. You can only use
the web interface at http://bighost.area.ba.cnr.it/BIG/CleanUP/. Too bad.
What I did was obtain from NCBI all the gi numbers of the subset I needed
and feed them to fastacmd like this:
fastacmd -d nr -i gi_list | infoseq -filter -only -name | ./make_unique
make_unique is a little Perl script that generates a unique set of
identifiers:
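#!/usr/bin/perl -w
use strict;

# Print each distinct input line once, in order of first appearance.
my %seen;
while (<>)
{
  chomp;
  next if $_ eq '';                  # ignore blank lines
  print "$_\n" unless $seen{$_}++;   # true only the first time a line is seen
}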

That way you can get a non-redundant sequence data set by feeding the
non-redundant identifier list to fastacmd (to get the sequences themselves)
or to NCBI BLAST directly (the -l option, which restricts the search to a
list of gi's).
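
If "redundant" is instead taken to mean identical sequences (an assumption;
the pipeline above only removes duplicate identifiers), the same %seen idea
can be keyed on the sequence itself. Below is a minimal sketch using core
Perl only. The name unique_by_sequence is hypothetical, and comparing
sequences case-insensitively for exact identity is my assumption. It does
not catch near-duplicates or subsequences.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the FASTA records read from $fh whose sequence has not been seen
# before. Each record is [header_line, sequence]; the first occurrence wins.
# NOTE: unique_by_sequence is a hypothetical helper, not a BioPerl function.
sub unique_by_sequence {
    my ($fh) = @_;
    my (%seen, @records, $header, $seq);

    # Keep the pending record unless an identical sequence was already kept.
    my $flush = sub {
        return unless defined $header;
        push @records, [$header, $seq] unless $seen{ uc $seq }++;
    };

    while (my $line = <$fh>) {
        $line =~ s/\r?\n\z//;              # strip Unix or DOS line endings
        if ($line =~ /^>/) {               # a header line starts a new record
            $flush->();
            ($header, $seq) = ($line, '');
        }
        elsif (defined $header) {          # sequence line of the current record
            $line =~ s/\s+//g;
            $seq .= $line;
        }
    }
    $flush->();                            # do not forget the last record
    return @records;
}

# When a file name is given on the command line, act as a simple filter.
if (@ARGV) {
    open my $in, '<', $ARGV[0] or die "Cannot open $ARGV[0]: $!";
    print "$_->[0]\n$_->[1]\n" for unique_by_sequence($in);
    close $in;
}
```

Saved as, say, unique_seqs.pl (again a made-up name), it could be run as
"perl unique_seqs.pl input.fasta > nr.fasta". Everything is held in memory,
so for a very large data set it would probably be better to key %seen on a
digest of each sequence and print records as they are read.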
Hope the 'off-topic'-level is not too high with this answer ;-)
Regards, 
Marc 


> -----Original Message-----
> From: nkuipers [mailto:nkuipers@uvic.ca]
> Sent: Friday, November 15, 2002 5:12 PM
> To: Giuseppe Torelli
> Cc: bioperl-l@bioperl.org
> Subject: RE: [Bioperl-l] How to remove redundancy ?
> 
> 
> Perhaps you could be more specific about what you mean by "redundancy"?
> And what format your data set is in? For example, assuming fasta format
> and redundancy meaning duplications in the data set, are you referring
> to primary IDs, accession numbers, descriptions, or the sequences
> themselves? If that is the case, you could roll a solution with
> Bio::SeqIO. Read in the file, pull out the information of interest (what
> you are defining as redundant) with one of the "get property" sorts of
> methods (like $obj->desc) and test that information against a hash
> populated as you go. If it already exists, move to the next one;
> otherwise write it out to a new file.
> 
> Regards,
> 
> Nathanael Kuipers
> ---
> Center for Biomedical Research,
> Dept. of Biology,
> University of Victoria
> 
> 
> >===== Original Message From Giuseppe Torelli <torelli@alpha.szn.it> =====
> >Hi,
> >
> >which software do you use to remove redundancy
> >from a gene dataset ?
> >
> >Thank you,
> >--
> >Giuseppe Torelli
> >
> >Bioinformatic Programmer
> >Laboratory of Molecular Evolution
> >Stazione Zoologica A. Dohrn
> >Villa Comunale
> >80121 Naples - Italy
> >Tel.  0039 81 5833311
> >Fax: 0039 81 7641355
> >_______________________________________________
> >Bioperl-l mailing list
> >Bioperl-l@bioperl.org
> >http://bioperl.org/mailman/listinfo/bioperl-l