[Bioperl-l] exporting contigs with CDSes, stored via Bio::DB::GFF, into individual GenBank records?

Erich Schwarz schwarz at tenaya.caltech.edu
Mon Sep 29 05:57:43 UTC 2008


Hi all,

    I have newly sequenced contigs, with CDS predictions, loaded 
into a Bio::DB::GFF-readable format (i.e., loaded into a MySQL 
database via Bio::DB::GFF).  I'd like to export each contig, with 
its annotated CDSes, into a single GenBank-formatted record for each 
contig (in order to be able submit this stuff to GenBank, without 
having to waste time with Sequin).  Is there some straightforward 
way of getting Bio::DB::GFF to do that?

    Some time ago, when I last had to decipher BioPerl, I came up 
with code that would let me export protein translations of the 
contigs' CDSes in GenBank format:

-------------------------------------------------------------------

#!/usr/bin/env perl

use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::GFF;

my $query_database = $ARGV[0];
my $dna = q{};
my $db = Bio::DB::GFF->new( -dsn => $query_database);

my $gb_file = 'example.gb';
my $seq_out = Bio::SeqIO->new( -file => ">$gb_file", -format => 'genbank', );

my @contigs = sort 
              map { $_->display_id }
              $db->features( -types => 'contig:assembly' );

foreach my $contig (@contigs) { 
    my $segment1 = $db->segment($contig);
    my @p_txs = $segment1->features('processed_transcript');
    foreach my $p_t (sort @p_txs) {
        $dna = q{};
        my @CDSes = $p_t->CDS;
        my $cds_name = $CDSes[0]->display_id();
        foreach my $cds (@CDSes) { 
            # $cds->seq == Bio::PrimarySeq, *not* plain nt seq.!
            $dna = $dna . $cds->seq->seq;
        }
        my $full_cds = Bio::Seq->new( -display_id => $cds_name, 
                                      -seq => $dna, );
        my $prot = $full_cds->translate;
        $seq_out->write_seq($prot);
    }
}

--------------------------------------------------------------------

    Returning to this, I tried using

    $db->get_Seq_by_id($contig)

to give me a Bio::Seq object for each contig (which I could then 
output into GenBank form), but that proved futile.

    I'm willing to work this out on my own if I have to, but if 
somebody can answer this in 30 seconds, it will save me a lot of 
time -- and be very appreciated!


--Erich 




More information about the Bioperl-l mailing list