[Bioperl-l] Fwd: gene extraction script
Adam Witney
awitney at sgul.ac.uk
Wed Jul 21 08:02:01 UTC 2010
Hi Nicholas,
If you just want to retrieve sequences from GenBank, try this:
http://github.com/bioperl/bioperl-live/blob/master/examples/db/getGenBank.pl
HTH
adam
On 20 Jul 2010, at 23:05, Florent Angly wrote:
>
>
> -------- Original Message --------
> Subject: gene extraction script
> Date: Tue, 20 Jul 2010 15:56:36 +0100
> From: Nicholas Brown <n.brown at nhm.ac.uk>
> To: <florent.angly at gmail.com>
>
>
>
> Hi,
>
> Sorry to distrub, I'm currently working at the natural history museum on some DNA sequencing techniques and I came accross a script you posted ( see below) which I think maybe helpful. The problem I have is that I have lots of sequnces of the same Mitochondrial genomes of different species aligned and I would like to extract the genes from each of the sequences easily, I wondered if there was an easy modificaiton to your script that I could perform that would allow me to do this? The sequences for the genes are already submitted to genbank, so i was just wondering if this would be a quicker simplier way than doing it manually?
> Thanks so much for your time in reading this.
>
> Kind Regards,
>
> Nicholas Brown
>
> Script
> http://github.com/bioperl/bioperl-live/blob/master/examples/tools/extract_genes.pl<https://webmail.nhm.ac.uk/exchweb/bin/redir.asp?URL=http://github.com/bioperl/bioperl-live/blob/master/examples/tools/extract_genes.pl>
> #!/usr/bin/perl -w
> # $Id$
> =pod
>
>
> =head1 NAME
>
>
> extract_genes.pl - extract genomic sequences from NCBI files
> using BioPerl
>
>
> =head1 DESCRIPTION
>
>
> This script is a simple solution to the problem of
> extracting genomic regions corresponding to genes. There are other
> solutions, this particular approach uses genomic sequence
> files from NCBI and gene coordinates from Entrez Gene.
>
>
> The first time this script is run it will be slow as it will
> extract species-specific data from the gene2accession file and create
> a storable hash (retrieving the positional data from this hash is
> significantly faster than reading gene2accession each time the script
> runs). The subsequent runs should be fast.
>
>
> =head1 INSTALLATION
>
>
> =head2
>
>
> Install BioPerl, full instructions at http://bioperl.org.
>
>
> =head2 Download gene2accession.gz
>
>
> Download this file from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA into
> your working directory and gunzip it.
>
>
> =head2 Download sequence files
>
>
> Create one or more species directories in the working directory, the
> directory names do not have to match those at NCBI (e.g. "Sc", "Hs").
>
>
> Download the nucleotide fasta files for a given species from its CHR*
> directories at ftp://ftp.ncbi.nlm.nih.gov/genomes and put these files into a
> species directory. The sequence files will have the suffix ".fna" or
> "fa.gz", gunzip if necessary.
>
>
> =head2 Determine Taxon id
>
>
> Determine the taxon id for the given species. This id is the first column
> in the gene2accession file. Modify the %species hash in this script
> such that name of your species directory is a key and the taxon id is the
> value.
>
>
> =head2 Command-line options
>
>
> -i Gene id
> -s Name of species directory
> -h Help
>
>
> Example:
>
>
> extract_genes.pl -i 850302 -s Sc
>
>
> =cut
>
>
> use strict;
> use Bio::DB::Fasta;
> use Getopt::Long;
> use Storable;
>
> my %species = ( "Sc" => 4932, # Saccharomyces cerevisiae
> "Ec" => 83333, # Escherichia coli K12
> "Hs" => 9606 # H. sapiens
> );
>
> my ($help,$id,$name);
>
> GetOptions( "s=s" => \$name,
> "i=i" => \$id,
> "h" => \$help );
>
> usage() if ($help || !$id || !$name);
>
> my $storedHash = $name . ".dump";
>
> # create index for a directory of fasta files
> my $db = Bio::DB::Fasta->new($name, -makeid => \&make_my_id);
>
> # extract species-specific data from gene2accession
> unless (-e $storedHash) {
> my $ref;
> # extract species-specific information from gene2accession
> open MYIN,"gene2accession" or die "No gene2accession file\n";
> while (<MYIN>) {
> my @arr = split "\t",$_;
> if ($arr[0] == $species{$name}&& $arr[9] =~ /\d+/&& $arr[10] =~ /\d+/) {
> ($ref->{$arr[1]}->{"start"}, $ref->{$arr[1]}->{"end"},
> $ref->{$arr[1]}->{"strand"}, $ref->{$arr[1]}->{"id"}) =
> ($arr[9], $arr[10], $arr[11], $arr[7]);
> }
> }
> # save species-specific information using Storable
> store $ref, $storedHash;
> }
>
> # retrieve the species-specific data from a stored hash
> my $ref = retrieve($storedHash);
>
> # retrieve sequence and sub-sequence
> if (defined $ref->{$id}) {
> my $chr = $db->get_Seq_by_id($ref->{$id}->{"id"});
> my $seq = $chr->trunc($ref->{$id}->{"start"},$ref->{$id}->{"end"});
> $seq = $seq->revcom if ($ref->{$id}->{"strand"} eq "-");
>
> # Insert SeqIO options here...
> print $seq->seq,"\n";
> } else {
> print "Cannot find id: $id\n";
> }
>
> sub make_my_id {
> my $line = shift;
> $line =~ /ref\|([^|]+)/;
> $1;
> }
>
> sub usage {
> system "perldoc $0";
> exit;
> }
>
> __END__
>
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
More information about the Bioperl-l
mailing list