[Bioperl-l] concatenate two embl sequence files
Roy Chaudhuri
roy at colibase.bham.ac.uk
Fri Jan 27 10:31:50 EST 2006
> I've committed the code and tests. See t/SeqUtils.t. The idea is to test all
> methods and a reasonable portion of all edge values to be sure that the
> method works as it should.
Cool, thanks for that. My first proper contribution to BioPerl 8^).
The tests look good- I'll know better for next time.
> Note that the code does not create a new sequence object. It modifies the
> existing one. Therefore it is best not to return that object. The users would
> assign that to a variable that points to the same structure and get confused.
> The method now returns true upon completeion.
>
> Creating a new sequence object is problematic because one needs to add one
> more dependency (e.g. Clone) and will not work anyway if the sequence
> implementation is using a database back end. It is better the way you have
> written it.
Yes, that makes sense. Although with that interface it might be more
natural in Bio::Seq? If it is a method that will modify a sequence in
place then it seems more intuitive to call $seq->cat(@seqs) [or even
$seq->append(@seqs)] rather than Bio::SeqUtils->cat($seq, @seqs).
> I added code to move over the annotations from secondary sequences, but did
> not do anything remove duplicates if the same sequence gets added twice. I
> wrote a note about this so that users know to be prepared if that affects
> them.
I'm not convinced about this- perhaps it should be optional? In practice
many of the annotations for each subsequence are only going to be
applicable to that sequence, not the concatenated whole. Some of them
may also be duplicated even between non-identical sequences. I think
it'd be better by default to keep just the annotation from the first
sequence (which probably would still need to be changed, but could at
least act as a placeholder).
There were a couple of problems with renamed variables/subroutines that
hadn't all been updated, I've fixed those and pasted the new version below.
> No feature coordinates should always be within the sequence. Fuzzy is the
> correct solution to this.
Okay, I'll have a go and let you know how I get on.
Cheers.
Roy.
--
Dr. Roy Chaudhuri
Bioinformatics Research Fellow
Division of Immunity and Infection
University of Birmingham, U.K.
http://xbase.bham.ac.uk
=head2 cat
Title : cat
Usage : my $catseq = Bio::SeqUtils->cat(@seqs)
Function: Concatenates an array of Bio::Seq objects, using the first
sequence
as a target. Annotations and sequence features are copied over
from any additional objects. Adjusts the coordinates of copied
features.
Returns : a boolean
Args : array of sequence objects
-
Note that annotations have no sequence region. If you concatenate the
same sequence more than once, you will have its annotations
duplicated.
=cut
sub cat {
my ($self, $seq, @seqs) = @_;
$self->throw('Object [$seq] '. 'of class ['. ref($seq).
'] should be a Bio::PrimarySeqI ')
unless $seq->isa('Bio::PrimarySeqI');
for my $catseq (@seqs) {
$self->throw('Object [$catseq] '. 'of class ['. ref($catseq).
'] should be a Bio::PrimarySeqI ')
unless $catseq->isa('Bio::PrimarySeqI');
$self->throw('Trying to concatenate sequences with different
alphabets: '.
$seq->display_id. '('. $seq->alphabet. ') and '.
$catseq->display_id.
'('. $catseq->alphabet. ')')
unless $catseq->alphabet eq $seq->alphabet;
my $length=$seq->length;
$seq->seq($seq->seq.$catseq->seq);
# move annotations
if ($seq->isa("Bio::AnnotatableI") and
$catseq->isa("Bio::AnnotatableI")) {
foreach my $key (
$catseq->annotation->get_all_annotation_keys() ) {
foreach my $value (
$catseq->annotation->get_Annotations($key) ) {
$seq->annotation->add_Annotation($key, $value);
}
}
}
# move SeqFeatures
if ( $seq->isa('Bio::SeqI') and $catseq->isa('Bio::SeqI')) {
for my $feat ($catseq->get_SeqFeatures) {
$seq->add_SeqFeature($self->_coord_adjust($feat, $length));
}
}
}
1;
}
=head2 _coord_adjust
Title : _coord_adjust
Usage : my $newfeat=Bio::SeqUtils->_coord_adjust($feature, 100);
Function: Recursive subroutine to adjust the coordinates of a feature
and all its subfeatures.
Returns : A Bio::SeqFeatureI compliant object.
Args : A Bio::SeqFeatureI compliant object,
the number of bases to add to the coordinates
=cut
sub _coord_adjust {
my ($self, $feat, $add)=@_;
$self->throw('Object [$feat] '. 'of class ['. ref($feat).
'] should be a Bio::SeqFeatureI ')
unless $feat->isa('Bio::SeqFeatureI');
my @adjsubfeat;
for my $subfeat ($feat->remove_SeqFeatures) {
push @adjsubfeat, Bio::SeqUtils->_coord_adjust($add, $subfeat);
}
my @loc=$feat->location->each_Location;
map {
my @coords=($_->start, $_->end);
map s/(\d+)/$add+$1/ge, @coords;
$_->start(shift @coords);
$_->end(shift @coords);
} @loc;
if (@loc==1) {
$feat->location($loc[0])
} else {
my $loc=Bio::Location::Split->new;
$loc->add_sub_Location(@loc);
$feat->location($loc);
}
$feat->add_SeqFeature($_) for @adjsubfeat;
return $feat;
}
More information about the Bioperl-l
mailing list