[Bioperl-l] exporting contigs with CDSes, stored via Bio::DB::GFF, into individual GenBank records?
Jason Stajich
jason at bioperl.org
Tue Sep 30 12:48:17 EDT 2008
Eric.
CC-ing Gbrowse since this is regarding Gbrowse data-store.
I've definitely done exactly this although I remember I had to tweak
the features a bit to make sure i had some some of the necessary stuff
to sequin.
If you want to get a specific segment you just do what you already
have in your code:
my $segment = $db->segment($contig_name);
Or you can iterate through all the features - depends on how you named
your segments/contigs/chromsomes, I named mine "contig:scaffold" for
type:source
my $iterator = $dbh->get_seq_stream(-type=>'scaffold');
while (my $s = $iterator->next_seq) {
}
Now You *should* be able to pass this segment object to $seqio-
>write_seq($segment);
However Bio::DB::GFF::Feature doesn't implement the whole SeqI APi so
you probably have to create your own sequence and move the features
over:
y $iterator = $dbh->get_seq_stream(-type=>'scaffold');
while (my $s = $iterator->next_seq) {
my $seq = Bio::Seq->new();
$seq->primary_seq($s->seq);
for my $feature ( $s->features('processed_transcript') ) {
my $f = Bio::SeqFeature::Generic->new(-location => $feature->location,
-primary_tag => $feature->primary_tag,
-source_tag => $feature->primary_tag,
-score => $feature->score,
-seq_id => $feature->seq_id);
$f->add_tag_value('locus_tag',$feature->name);
# might also add all other tag/value pairs from this feature like
DBXREF, etc.
# derive a CDS feature from this feature as well.
# perhaps derive a gene feature that only has start/end for the
feature
# might add a translation tag/value pair for CDS features
$seq->add_SeqFeature($f);
}
$out->write_seq($seq);
}
I suspect you'll have to edit the feature objects some to
a) remove the ones you don't want to output
b) add additional info like translation frame -- I think there is a
slightly different way that NCBI wants things labeled wrt translation
frame
c) add in other annotation information that may or may not be encoded
as tag/values that
It is also possible I am missing some things in here, but hopefully it
gets you started. I think it would be quite useful for us to try and
write a generic script for this but maybe focus on Bio::DB::SeqFeature
dbs since it will be easier to assume that all the 3-level gene->mRNA-
>CDS/exon relationships will be explicitly specified.
-jason
On Sep 28, 2008, at 10:57 PM, Erich Schwarz wrote:
> Hi all,
>
> I have newly sequenced contigs, with CDS predictions, loaded
> into a Bio::DB::GFF-readable format (i.e., loaded into a MySQL
> database via Bio::DB::GFF). I'd like to export each contig, with
> its annotated CDSes, into a single GenBank-formatted record for each
> contig (in order to be able submit this stuff to GenBank, without
> having to waste time with Sequin). Is there some straightforward
> way of getting Bio::DB::GFF to do that?
>
> Some time ago, when I last had to decipher BioPerl, I came up
> with code that would let me export protein translations of the
> contigs' CDSes in GenBank format:
>
> -------------------------------------------------------------------
>
> #!/usr/bin/env perl
>
> use strict;
> use warnings;
> use Bio::Seq;
> use Bio::SeqIO;
> use Bio::DB::GFF;
>
> my $query_database = $ARGV[0];
> my $dna = q{};
> my $db = Bio::DB::GFF->new( -dsn => $query_database);
>
> my $gb_file = 'example.gb';
> my $seq_out = Bio::SeqIO->new( -file => ">$gb_file", -format =>
> 'genbank', );
>
> my @contigs = sort
> map { $_->display_id }
> $db->features( -types => 'contig:assembly' );
>
> foreach my $contig (@contigs) {
> my $segment1 = $db->segment($contig);
> my @p_txs = $segment1->features('processed_transcript');
> foreach my $p_t (sort @p_txs) {
> $dna = q{};
> my @CDSes = $p_t->CDS;
> my $cds_name = $CDSes[0]->display_id();
> foreach my $cds (@CDSes) {
> # $cds->seq == Bio::PrimarySeq, *not* plain nt seq.!
> $dna = $dna . $cds->seq->seq;
> }
> my $full_cds = Bio::Seq->new( -display_id => $cds_name,
> -seq => $dna, );
> my $prot = $full_cds->translate;
> $seq_out->write_seq($prot);
> }
> }
>
> --------------------------------------------------------------------
>
> Returning to this, I tried using
>
> $db->get_Seq_by_id($contig)
>
> to give me a Bio::Seq object for each contig (which I could then
> output into GenBank form), but that proved futile.
>
> I'm willing to work this out on my own if I have to, but if
> somebody can answer this in 30 seconds, it will save me a lot of
> time -- and be very appreciated!
>
>
> --Erich
>
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l
Jason Stajich
jason at bioperl.org
More information about the Bioperl-l
mailing list