[Bioperl-l] concatenate two embl sequence files

Roy Chaudhuri roy at colibase.bham.ac.uk
Fri Jan 27 15:31:50 UTC 2006


> I've committed the code and tests. See t/SeqUtils.t. The idea is to test all 
> methods and a reasonable portion of all edge values to be sure that the 
> method works as it should.
Cool, thanks for that. My first proper contribution to BioPerl 8^).
The tests look good- I'll know better for next time.

> Note that the code does not create a new sequence object. It modifies the 
> existing one. Therefore it is best not to return that object. The users would 
> assign that to a variable that points to the same structure and get confused. 
> The method now returns true upon completeion.
> 
> Creating a new sequence object is problematic because one needs to add one 
> more dependency (e.g. Clone) and will not work anyway if the sequence 
> implementation is using a database back end. It is better the way you have 
> written it.
Yes, that makes sense. Although with that interface it might be more 
natural in Bio::Seq? If it is a method that will modify a sequence in 
place then it seems more intuitive to call $seq->cat(@seqs) [or even 
$seq->append(@seqs)] rather than Bio::SeqUtils->cat($seq, @seqs).


> I added code to move over the annotations from secondary sequences, but did 
> not do anything remove duplicates if the same sequence gets added twice. I 
> wrote a note about this so that users know to be prepared if that affects 
> them.
I'm not convinced about this- perhaps it should be optional? In practice 
many of the annotations for each subsequence are only going to be 
applicable to that sequence, not the concatenated whole. Some of them 
may also be duplicated even between non-identical sequences. I think 
it'd be better by default to keep just the annotation from the first 
sequence (which probably would still need to be changed, but could at 
least act as a placeholder).

There were a couple of problems with renamed variables/subroutines that 
hadn't all been updated, I've fixed those and pasted the new version below.

> No feature coordinates should always be within the sequence. Fuzzy is the 
> correct solution to this.
Okay, I'll have a go and let you know how I get on.

Cheers.
Roy.

--
Dr. Roy Chaudhuri
Bioinformatics Research Fellow
Division of Immunity and Infection
University of Birmingham, U.K.

http://xbase.bham.ac.uk




=head2 cat

   Title   : cat
   Usage   : my $catseq = Bio::SeqUtils->cat(@seqs)
   Function: Concatenates an array of Bio::Seq objects, using the first 
sequence
             as a target. Annotations and sequence features are copied over
             from any additional objects. Adjusts the coordinates of copied
             features.
   Returns : a boolean
   Args    : array of sequence objects

-
Note that annotations have no sequence region. If you concatenate the
same sequence more than once, you will have its annotations
duplicated.

=cut

sub cat {
     my ($self, $seq, @seqs) = @_;
     $self->throw('Object [$seq] '. 'of class ['. ref($seq).
                  '] should be a Bio::PrimarySeqI ')
         unless $seq->isa('Bio::PrimarySeqI');


     for my $catseq (@seqs) {
         $self->throw('Object [$catseq] '. 'of class ['. ref($catseq).
                      '] should be a Bio::PrimarySeqI ')
             unless $catseq->isa('Bio::PrimarySeqI');

         $self->throw('Trying to concatenate sequences with different 
alphabets: '.
                      $seq->display_id. '('. $seq->alphabet. ') and '. 
$catseq->display_id.
                      '('. $catseq->alphabet. ')')
             unless $catseq->alphabet eq $seq->alphabet;


         my $length=$seq->length;
         $seq->seq($seq->seq.$catseq->seq);

         # move annotations
         if ($seq->isa("Bio::AnnotatableI") and 
$catseq->isa("Bio::AnnotatableI")) {
             foreach my $key ( 
$catseq->annotation->get_all_annotation_keys() ) {

                 foreach my $value ( 
$catseq->annotation->get_Annotations($key) ) {
                     $seq->annotation->add_Annotation($key, $value);
                 }
             }
         }

         # move SeqFeatures
         if ( $seq->isa('Bio::SeqI') and $catseq->isa('Bio::SeqI')) {
             for my $feat ($catseq->get_SeqFeatures) {
                 $seq->add_SeqFeature($self->_coord_adjust($feat, $length));
             }
         }

     }
     1;
}


=head2 _coord_adjust

   Title   : _coord_adjust
   Usage   : my $newfeat=Bio::SeqUtils->_coord_adjust($feature, 100);
   Function: Recursive subroutine to adjust the coordinates of a feature
             and all its subfeatures.
   Returns : A Bio::SeqFeatureI compliant object.
   Args    : A Bio::SeqFeatureI compliant object,
             the number of bases to add to the coordinates


=cut

sub _coord_adjust {
     my ($self, $feat, $add)=@_;
     $self->throw('Object [$feat] '. 'of class ['. ref($feat).
                  '] should be a Bio::SeqFeatureI ')
	unless $feat->isa('Bio::SeqFeatureI');
     my @adjsubfeat;
     for my $subfeat ($feat->remove_SeqFeatures) {
	push @adjsubfeat, Bio::SeqUtils->_coord_adjust($add, $subfeat);
     }
     my @loc=$feat->location->each_Location;
     map {
	my @coords=($_->start, $_->end);
	map s/(\d+)/$add+$1/ge, @coords;
	$_->start(shift @coords);
	$_->end(shift @coords);
     } @loc;
     if (@loc==1) {
	$feat->location($loc[0])
     } else {
	my $loc=Bio::Location::Split->new;
	$loc->add_sub_Location(@loc);
	$feat->location($loc);
     }
     $feat->add_SeqFeature($_) for @adjsubfeat;
     return $feat;
}



More information about the Bioperl-l mailing list