[Bioperl-l] Parsing contig information from HTGs

simon andrews (BI) simon.andrews@bbsrc.ac.uk
Tue, 11 Dec 2001 12:22:11 -0000


Dear All,

I'm trying to parse out contig information from EMBL HTG flatfiles using
BioPerl.  I can read the file OK with Bio::SeqIO and get myself a Bio::Seq
object.  The problem I'm having is getting at the contig info.

I don't think I can use the usual feature methods because, for some bizarre
reason, the contigs are often identified only in the comments section of the
entries, e.g.:

CC   *        1     8591: contig of 8591 bp in length
CC   *                    gap of unknown length
CC   *     8592    28835: contig of 20244 bp in length
CC   *                    gap of unknown length
CC   *    28836    40356: contig of 11521 bp in length
CC   *                    gap of unknown length
CC   *    40357    58902: contig of 18546 bp in length
CC   *                    gap of unknown length
CC   *    58903    61812: contig of 2910 bp in length
CC   *                    gap of unknown length
CC   *    61813    71640: contig of 9828 bp in length
CC   *                    gap of unknown length
CC   *    71641    75199: contig of 3559 bp in length
CC   *                    gap of unknown length
CC   *    75200    91638: contig of 16439 bp in length.

...and this information *doesn't* appear in the feature table!!

Trying to parse this, I've found I can get the comments section from the
Bio::Seq object using:

  my $annot = $seq->annotation();

  foreach my $comment ($annot->each_Comment) {
      print $comment->text . "\n";
  }

...but the each_Comment iterator only returns one comment per database entry,
and this is a concatenation of all of the comment lines from the original
entry.  With the line breaks removed, the resulting string is a lot harder to
process.
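
Even without line breaks, the contig lines have a regular enough shape that a
global regex match may be able to pull them out of the concatenated string.
The following is only a minimal sketch: extract_contigs is a hypothetical
helper name, and the sample string assumes the comment lines come back joined
by single spaces with the leading "CC" removed, which may not be exactly what
each_Comment returns.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): return one hash per contig
# found in a comment string, with or without line breaks.
sub extract_contigs {
    my ($text) = @_;
    my @contigs;
    # Match "<start> <end>: contig of <length> bp in length" repeatedly.
    while ($text =~ /(\d+)\s+(\d+):\s*contig of (\d+) bp in length/g) {
        push @contigs, { start => $1, end => $2, length => $3 };
    }
    return @contigs;
}

# Assumed shape of the concatenated comment text (illustrative only).
my $text = join ' ',
    '*        1     8591: contig of 8591 bp in length',
    '*                    gap of unknown length',
    '*     8592    28835: contig of 20244 bp in length';

for my $c (extract_contigs($text)) {
    printf "%d..%d (%d bp)\n", @{$c}{qw(start end length)};
}
```

In real use the string passed in would be $comment->text from the loop above.
The gap lines carry no coordinates, so this sketch simply skips them.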

So my questions are:

1) Is there a better way to get at the contig information through the
existing objects?  (Wishful thinking?)

2) Am I retrieving the comments the right way?  If so, is there a reason why
the newlines are stripped during processing?  My assumption was that the
each_Comment iterator would give me back the original comments one line at a
time, which I could then process to extract the contig info.
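
If line-by-line access to the comments is what is really needed, one fallback
is to bypass the objects and scan the flatfile directly.  This is a rough
sketch, not a BioPerl method: contigs_by_entry is a hypothetical helper, the
entry name EXAMPLE1 and the sample text are made up for illustration, and it
relies only on EMBL entries starting with an "ID" line and ending with "//".

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): scan an EMBL flatfile handle
# and return { entry name => [ [start, end], ... ] } from CC contig lines.
sub contigs_by_entry {
    my ($fh) = @_;
    my (%contigs, $id);
    while (my $line = <$fh>) {
        if ($line =~ /^ID\s+(\S+)/) {
            $id = $1;
            $id =~ s/;$//;    # some ID lines end the entry name with ';'
        }
        elsif ($line =~ m{^//}) {
            undef $id;        # end of the current entry
        }
        elsif (defined $id
            && $line =~ /^CC\s+\*\s+(\d+)\s+(\d+):\s*contig of/) {
            push @{ $contigs{$id} }, [ $1, $2 ];
        }
    }
    return \%contigs;
}

# Made-up sample entry, read through an in-memory filehandle.
my $sample = <<'EMBL';
ID   EXAMPLE1   standard; DNA; HTG; 28835 BP.
CC   *        1     8591: contig of 8591 bp in length
CC   *                    gap of unknown length
CC   *     8592    28835: contig of 20244 bp in length
//
EMBL

open my $fh, '<', \$sample or die "Cannot open sample: $!";
my $result = contigs_by_entry($fh);
for my $entry (sort keys %$result) {
    print "$entry: $_->[0]..$_->[1]\n" for @{ $result->{$entry} };
}
```

The coordinates collected this way could then be matched back to the Bio::Seq
objects from Bio::SeqIO by entry name, at the cost of reading the file twice.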

This is all using BioPerl 0.7.0 (I think).

Any help is much appreciated.

Simon.

----
Simon Andrews PhD
Bioinformatics Dept
The Babraham Institute

simon.andrews@bbsrc.ac.uk
+44 (0)1223 496463