[Bioperl-l] concatenate two embl sequence files

Heikki Lehvaslaiho heikki at sanbi.ac.za
Fri Jan 27 10:06:52 UTC 2006


On Thursday 26 January 2006 15:18, Roy Chaudhuri wrote:
> Heikki Lehvaslaiho wrote:
> > Thanks Roy!
> >
> > I'll check to code in tomorrow when I am less sleepy and can go through
> > the code in detail. In principle the code looks good. It definitely needs
> > tests. If you have written any please do post them.
>
> Not too sure about how to go about writing tests, any suggestions?

I've committed the code and tests. See t/SeqUtils.t. The idea is to test all 
methods and a reasonable portion of all edge values to be sure that the 
method works as it should.


Note that the code does not create a new sequence object. It modifies the 
existing one. Therefore it is best not to return that object. The users would 
assign that to a variable that points to the same structure and get confused. 
The method now returns true upon completeion.

Creating a new sequence object is problematic because one needs to add one 
more dependency (e.g. Clone) and will not work anyway if the sequence 
implementation is using a database back end. It is better the way you have 
written it.

I added code to move over the annotations from secondary sequences, but did 
not do anything remove duplicates if the same sequence gets added twice. I 
wrote a note about this so that users know to be prepared if that affects 
them.


> It did occur to me that my _coordAdjust method could be adapted to allow
> the Bio::Seq trunc method to retain sequence features (since there's no
> reason why the $add argument can't be negative). This would probably
> need a bit more work to cope with the situation where a feature overlaps
> the trunc coordinates, for example if we truncate to coordinates 1..400,
> but there's a feature 300..500. I guess the 'correct' behaviour might be
> to convert that feature to a fuzzy location of 300..>400? Or is it
> acceptable to have features with coordinates outside of a sequence?

No feature coordinates should always be within the sequence. Fuzzy is the 
correct solution to this.

> If we did that then an obvious test would be to cat a sequence to
> itself, then trunc to retain just the second half of the new sequence
> and see if you got back what you started with.

Go ahead an try it!

> > A few more checks to make sure seq_>alphabet is the same in all sequences
> > might be a good idea.
>
> That's easy to implement. Just put the line:
> 	$self->throw('Trying to concatenate sequences with different alphabets:
> '.$seq->display_id.' ('.$seq->alphabet.') and ' .$_->display_id.'
> ('.$_->alphabet.')') unless $_->alphabet eq $seq->alphabet;
>
> at the start of the for(@seqs) loop of the cat subroutine.

Added.

Thanks,

	-Heikki
> Roy.
> --
> Dr. Roy Chaudhuri
> Bioinformatics Research Fellow
> Division of Immunity and Infection
> University of Birmingham, U.K.
>
> http://xbase.bham.ac.uk
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/bioperl-l

-- 
______ _/      _/_____________________________________________________
      _/      _/
     _/  _/  _/  Heikki Lehvaslaiho    heikki at_sanbi _ac _za
    _/_/_/_/_/  Associate Professor    skype: heikki_lehvaslaiho
   _/  _/  _/  SANBI, South African National Bioinformatics Institute
  _/  _/  _/  University of Western Cape, South Africa
     _/      Phone: +27 21 959 2096   FAX: +27 21 959 2512
___ _/_/_/_/_/________________________________________________________



More information about the Bioperl-l mailing list