[Bioperl-l] Indexing nr database

Ross KK Leung ross at cuhk.edu.hk
Tue Sep 7 09:18:16 UTC 2010


The reason I want my own index is that I need to retrieve specific
information from the matched sequences, e.g. to extract the 64th amino acid
of the top-matching sequence. Is there any way to achieve that?

-----Original Message-----
From: Hans-Rudolf Hotz [mailto:hrh at fmi.ch] 
Sent: Tuesday, September 07, 2010 5:09 PM
To: bioperl-l at lists.open-bio.org; ross at cuhk.edu.hk
Subject: Re: [Bioperl-l] Indexing nr database

Hi


Why don't you use the pre-indexed BLAST files from NCBI?

ftp://ftp.ncbi.nih.gov/blast/db/

You can use them to fetch individual sequences by GI number or accession
with the tool "blastdbcmd" from the BLAST+ binaries:

ftp://ftp.ncbi.nih.gov/blast/executables/blast+/
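
For example, to pull out a single residue such as the 64th amino acid of a
hit, something along these lines might work. This is only a rough sketch,
not tested against nr: it assumes blastdbcmd is on the PATH and that a
database named "nr" has been downloaded where BLAST+ can find it. The helper
names fetch_sequence and residue_at are made up for this illustration.

```perl
use strict;
use warnings;

# Return the residue at a 1-based position, or undef if out of range.
sub residue_at {
    my ($seq, $pos) = @_;
    return undef if $pos < 1 || $pos > length($seq);
    return substr($seq, $pos - 1, 1);
}

# Fetch one sequence from a local BLAST database via blastdbcmd.
# Assumption: blastdbcmd is installed and the database $db is available.
# "-outfmt %s" asks blastdbcmd to print only the sequence data.
sub fetch_sequence {
    my ($db, $id) = @_;
    open(my $fh, '-|', 'blastdbcmd', '-db', $db, '-entry', $id,
         '-outfmt', '%s')
        or die "cannot run blastdbcmd: $!";
    my $seq = join '', <$fh>;
    close($fh) or die "blastdbcmd failed for entry $id\n";
    $seq =~ s/\s+//g;    # drop newlines and any other whitespace
    return $seq;
}

# Usage: perl residue.pl <gi-or-accession> <position>
if (@ARGV == 2) {
    my ($id, $pos) = @ARGV;
    my $seq = fetch_sequence('nr', $id);
    my $aa  = residue_at($seq, $pos);
    print defined $aa ? "$aa\n" : "position $pos is out of range\n";
}
```

blastdbcmd also has a -range option (e.g. -range 64-64) that returns only
part of a sequence, which may be simpler if just one position is needed.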


regards, Hans



On 09/07/2010 10:28 AM, Ross KK Leung wrote:
> With the following code I want to index the 4 GB nr database. However, the
> index file is already larger than 1 TB, and the job has been running for
> weeks and still hasn't finished. Could anybody tell me how to accomplish
> this? Thanks in advance.
>
>      use strict;
>      use Bio::DB::Flat::BinarySearch;
>
>      my ($baseDir, $dbName, $seqFile, $testId, $testGi) = @ARGV;
>
>      # use single quotes so you don't have to write
>      # regular expressions like "gi\\|(\\d+)"
>      #my $primary_pattern = '^>(\S+)';
>      #if ($fullHeader == 1) {
>      my $primary_pattern = '^>(.+)';
>      #}
>
>      my $string = "gi|41353971|emb|AL123456.2| Mycobacterium tuberculosis H37Rv complete genome";
>      #$string =~ s/$primary_pattern/RRR/g;
>      #print "$string\n";
>
>      # one or more patterns stored in a hash:
>      my $secondary_patterns = {GI => 'gi\|(\d+)'};
>
>      my $db = Bio::DB::Flat::BinarySearch->new(
>          -directory          => $baseDir,
>          -dbname             => $dbName,
>          -write_flag         => 1,
>          -primary_pattern    => $primary_pattern,
>          -primary_namespace  => 'ACC',
>          -secondary_patterns => $secondary_patterns,
>          -verbose            => 1,
>          -format             => 'fasta',
>      );
>
>      $db->build_index($seqFile);
>
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l




