[Bioperl-l] Fwd: Fwd: gene extraction script
Adam Witney
awitney at sgul.ac.uk
Wed Jul 21 09:43:35 UTC 2010
forwarding to the original poster and the list...
Begin forwarded message:
> From: Laurent MANCHON <lmanchon at univ-montp2.fr>
> Date: 21 July 2010 10:05:16 GMT+01:00
> To: Adam Witney <awitney at sgul.ac.uk>
> Subject: Re: [Bioperl-l] Fwd: gene extraction script
> Reply-To: lmanchon at univ-montp2.fr
>
> another program you can use.
>
>
> #!/usr/bin/perl
> # retrieve sequences from genbank
> # read list of GenBank access number from a file and write to output.txt file
>
> use LWP::UserAgent;
> use Bio::SeqIO;
> use Bio::DB::GenBank;
> $cpt=0;
> $outfile = "outfile.txt";
> chomp($outfile);
> $gb = new Bio::DB::GenBank;
> while (<>) {
> ($Fld1) = split(' ', $_, 9999);
> $out = Bio::SeqIO->new('-file' => ">>$outfile",'-format' => 'fasta');
> $acc = $Fld1;
> $cpt++;
> print $acc,"\t$cpt"," retrieved\n";
> $seqio = $gb->get_Stream_by_acc([$acc]);
> while( $seq = $seqio->next_seq() ) {
> open(LOG,">>$outfile");
> printf LOG '%s ', ">" . $acc;
> $out->write_seq($seq);
> close(LOG);
> }
>
>
>
>
>
> Adam Witney a écrit :
>>
>> Hi Nicholas,
>>
>> If you just want to retrieve sequences from GenBank, try this:
>>
>> http://github.com/bioperl/bioperl-live/blob/master/examples/db/getGenBank.pl
>>
>> HTH
>>
>> adam
>>
>>
>> On 20 Jul 2010, at 23:05, Florent Angly wrote:
>>
>>
>>> -------- Original Message --------
>>> Subject: gene extraction script
>>> Date: Tue, 20 Jul 2010 15:56:36 +0100
>>> From: Nicholas Brown <n.brown at nhm.ac.uk>
>>> To: <florent.angly at gmail.com>
>>>
>>>
>>>
>>> Hi,
>>>
>>> Sorry to distrub, I'm currently working at the natural history museum on some DNA sequencing techniques and I came accross a script you posted ( see below) which I think maybe helpful. The problem I have is that I have lots of sequnces of the same Mitochondrial genomes of different species aligned and I would like to extract the genes from each of the sequences easily, I wondered if there was an easy modificaiton to your script that I could perform that would allow me to do this? The sequences for the genes are already submitted to genbank, so i was just wondering if this would be a quicker simplier way than doing it manually?
>>> Thanks so much for your time in reading this.
>>>
>>> Kind Regards,
>>>
>>> Nicholas Brown
>>>
>>> Script
>>> http://github.com/bioperl/bioperl-live/blob/master/examples/tools/extract_genes.pl<https://webmail.nhm.ac.uk/exchweb/bin/redir.asp?URL=http://github.com/bioperl/bioperl-live/blob/master/examples/tools/extract_genes.pl>
>>> #!/usr/bin/perl -w
>>> # $Id$
>>> =pod
>>>
>>>
>>> =head1 NAME
>>>
>>>
>>> extract_genes.pl - extract genomic sequences from NCBI files
>>> using BioPerl
>>>
>>>
>>> =head1 DESCRIPTION
>>>
>>>
>>> This script is a simple solution to the problem of
>>> extracting genomic regions corresponding to genes. There are other
>>> solutions, this particular approach uses genomic sequence
>>> files from NCBI and gene coordinates from Entrez Gene.
>>>
>>>
>>> The first time this script is run it will be slow as it will
>>> extract species-specific data from the gene2accession file and create
>>> a storable hash (retrieving the positional data from this hash is
>>> significantly faster than reading gene2accession each time the script
>>> runs). The subsequent runs should be fast.
>>>
>>>
>>> =head1 INSTALLATION
>>>
>>>
>>> =head2
>>>
>>>
>>> Install BioPerl, full instructions at http://bioperl.org.
>>>
>>>
>>> =head2 Download gene2accession.gz
>>>
>>>
>>> Download this file from ftp://ftp.ncbi.nlm.nih.gov/gene/DATA into
>>> your working directory and gunzip it.
>>>
>>>
>>> =head2 Download sequence files
>>>
>>>
>>> Create one or more species directories in the working directory, the
>>> directory names do not have to match those at NCBI (e.g. "Sc", "Hs").
>>>
>>>
>>> Download the nucleotide fasta files for a given species from its CHR*
>>> directories at ftp://ftp.ncbi.nlm.nih.gov/genomes and put these files into a
>>> species directory. The sequence files will have the suffix ".fna" or
>>> "fa.gz", gunzip if necessary.
>>>
>>>
>>> =head2 Determine Taxon id
>>>
>>>
>>> Determine the taxon id for the given species. This id is the first column
>>> in the gene2accession file. Modify the %species hash in this script
>>> such that name of your species directory is a key and the taxon id is the
>>> value.
>>>
>>>
>>> =head2 Command-line options
>>>
>>>
>>> -i Gene id
>>> -s Name of species directory
>>> -h Help
>>>
>>>
>>> Example:
>>>
>>>
>>> extract_genes.pl -i 850302 -s Sc
>>>
>>>
>>> =cut
>>>
>>>
>>> use strict;
>>> use Bio::DB::Fasta;
>>> use Getopt::Long;
>>> use Storable;
>>>
>>> my %species = ( "Sc" => 4932, # Saccharomyces cerevisiae
>>> "Ec" => 83333, # Escherichia coli K12
>>> "Hs" => 9606 # H. sapiens
>>> );
>>>
>>> my ($help,$id,$name);
>>>
>>> GetOptions( "s=s" => \$name,
>>> "i=i" => \$id,
>>> "h" => \$help );
>>>
>>> usage() if ($help || !$id || !$name);
>>>
>>> my $storedHash = $name . ".dump";
>>>
>>> # create index for a directory of fasta files
>>> my $db = Bio::DB::Fasta->new($name, -makeid => \&make_my_id);
>>>
>>> # extract species-specific data from gene2accession
>>> unless (-e $storedHash) {
>>> my $ref;
>>> # extract species-specific information from gene2accession
>>> open MYIN,"gene2accession" or die "No gene2accession file\n";
>>> while (<MYIN>) {
>>> my @arr = split "\t",$_;
>>> if ($arr[0] == $species{$name}&& $arr[9] =~ /\d+/&& $arr[10] =~ /\d+/) {
>>> ($ref->{$arr[1]}->{"start"}, $ref->{$arr[1]}->{"end"},
>>> $ref->{$arr[1]}->{"strand"}, $ref->{$arr[1]}->{"id"}) =
>>> ($arr[9], $arr[10], $arr[11], $arr[7]);
>>> }
>>> }
>>> # save species-specific information using Storable
>>> store $ref, $storedHash;
>>> }
>>>
>>> # retrieve the species-specific data from a stored hash
>>> my $ref = retrieve($storedHash);
>>>
>>> # retrieve sequence and sub-sequence
>>> if (defined $ref->{$id}) {
>>> my $chr = $db->get_Seq_by_id($ref->{$id}->{"id"});
>>> my $seq = $chr->trunc($ref->{$id}->{"start"},$ref->{$id}->{"end"});
>>> $seq = $seq->revcom if ($ref->{$id}->{"strand"} eq "-");
>>>
>>> # Insert SeqIO options here...
>>> print $seq->seq,"\n";
>>> } else {
>>> print "Cannot find id: $id\n";
>>> }
>>>
>>> sub make_my_id {
>>> my $line = shift;
>>> $line =~ /ref\|([^|]+)/;
>>> $1;
>>> }
>>>
>>> sub usage {
>>> system "perldoc $0";
>>> exit;
>>> }
>>>
>>> __END__
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>>
>>
>>
>> _______________________________________________
>> Bioperl-l mailing list
>> Bioperl-l at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/bioperl-l
>>
>
>
More information about the Bioperl-l
mailing list