[Bioperl-l] exporting contigs with CDSes, stored via Bio::DB::GFF, into individual GenBank records?

Tue Sep 30 12:48:17 EDT 2008

Eric.
CC-ing Gbrowse since this is regarding Gbrowse data-store.

I've definitely done exactly this although I remember I had to tweak  
the features a bit to make sure i had some some of the necessary stuff  
to sequin.

If you want to get a specific segment you just do what you already  
have in your code:
my $segment = $db->segment($contig_name);

Or you can iterate through all the features - depends on how you named  
your segments/contigs/chromsomes, I named mine "contig:scaffold" for  
type:source
  my $iterator = $dbh->get_seq_stream(-type=>'scaffold');
   while (my $s = $iterator->next_seq) {
   }

Now You *should* be able to pass this segment object to $seqio- 
 >write_seq($segment);
However Bio::DB::GFF::Feature doesn't implement the whole SeqI APi so  
you probably have to create your own sequence and move the features  
over:

y $iterator = $dbh->get_seq_stream(-type=>'scaffold');
   while (my $s = $iterator->next_seq) {
	my $seq = Bio::Seq->new();
	$seq->primary_seq($s->seq);
	for my $feature ( $s->features('processed_transcript') ) {
		my $f = Bio::SeqFeature::Generic->new(-location => $feature->location,
						      -primary_tag => $feature->primary_tag,
						      -source_tag  => $feature->primary_tag,
						      -score => $feature->score,
						      -seq_id => $feature->seq_id);
		$f->add_tag_value('locus_tag',$feature->name);
		# might also add all other tag/value pairs from this feature like  
DBXREF, etc.
		# derive a CDS feature from this feature as well.
		# perhaps derive a gene feature that only has start/end for the  
feature
		# might add a translation tag/value pair for CDS features
		$seq->add_SeqFeature($f);
	}
	$out->write_seq($seq);
  }

I suspect you'll have to edit the feature objects some to
a) remove the ones you don't want to output
b) add additional info like translation frame -- I think there is a  
slightly different way that NCBI wants things labeled wrt translation  
frame
c) add in other annotation information that may or may not be encoded  
as tag/values that

It is also possible I am missing some things in here, but hopefully it  
gets you started.  I think it would be quite useful for us to try and  
write a generic script for this but maybe focus on Bio::DB::SeqFeature  
dbs since it will be easier to assume that all the 3-level gene->mRNA- 
 >CDS/exon relationships will be explicitly specified.

-jason
On Sep 28, 2008, at 10:57 PM, Erich Schwarz wrote:

> Hi all,
>
>    I have newly sequenced contigs, with CDS predictions, loaded
> into a Bio::DB::GFF-readable format (i.e., loaded into a MySQL
> database via Bio::DB::GFF).  I'd like to export each contig, with
> its annotated CDSes, into a single GenBank-formatted record for each
> contig (in order to be able submit this stuff to GenBank, without
> having to waste time with Sequin).  Is there some straightforward
> way of getting Bio::DB::GFF to do that?
>
>    Some time ago, when I last had to decipher BioPerl, I came up
> with code that would let me export protein translations of the
> contigs' CDSes in GenBank format:
>
> -------------------------------------------------------------------
>
> #!/usr/bin/env perl
>
> use strict;
> use warnings;
> use Bio::Seq;
> use Bio::SeqIO;
> use Bio::DB::GFF;
>
> my $query_database = $ARGV[0];
> my $dna = q{};
> my $db = Bio::DB::GFF->new( -dsn => $query_database);
>
> my $gb_file = 'example.gb';
> my $seq_out = Bio::SeqIO->new( -file => ">$gb_file", -format =>  
> 'genbank', );
>
> my @contigs = sort
>              map { $_->display_id }
>              $db->features( -types => 'contig:assembly' );
>
> foreach my $contig (@contigs) {
>    my $segment1 = $db->segment($contig);
>    my @p_txs = $segment1->features('processed_transcript');
>    foreach my $p_t (sort @p_txs) {
>        $dna = q{};
>        my @CDSes = $p_t->CDS;
>        my $cds_name = $CDSes[0]->display_id();
>        foreach my $cds (@CDSes) {
>            # $cds->seq == Bio::PrimarySeq, *not* plain nt seq.!
>            $dna = $dna . $cds->seq->seq;
>        }
>        my $full_cds = Bio::Seq->new( -display_id => $cds_name,
>                                      -seq => $dna, );
>        my $prot = $full_cds->translate;
>        $seq_out->write_seq($prot);
>    }
> }
>
> --------------------------------------------------------------------
>
>    Returning to this, I tried using
>
>    $db->get_Seq_by_id($contig)
>
> to give me a Bio::Seq object for each contig (which I could then
> output into GenBank form), but that proved futile.
>
>    I'm willing to work this out on my own if I have to, but if
> somebody can answer this in 30 seconds, it will save me a lot of
> time -- and be very appreciated!
>
>
> --Erich
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

Jason Stajich
jason at bioperl.org