[Bioperl-l] patch for Bio::SeqIO:embl.pm - read/write 'secondary accession nu mbers'

Malcolm Cook mcook@dna.com
Thu, 11 Oct 2001 11:47:20 -0700


This message is in MIME format. Since your mail reader does not understand
this format, some or all of this message may not be legible.

------_=_NextPart_000_01C15285.287A5B80
Content-Type: text/plain;
	charset="iso-8859-1"

Please excuse the resend - finger trouble on my part....

Ewan, 

I see your name as the carer-forer of Bio::SeqIO::embl.pm.  Can you vet my
changes for me and incorporate them as appropriate?

As far as I can tell, Bio::SeqIO::embl.pm was not parsing the AC line for
semi-colon separated 'secondary accession numbers' as provided for by
http://www.ebi.ac.uk/embl/Documentation/User_manual/ac_line.html

The attached copy of embl.pm,v 1.31 taken from bioperl-live has edits made
by me to accomodate them.  I've marked them in the code with 'mec' for easy
findability.

The code now assumes that the first \S+ on the first AC line encountered in
the sequece is the primary.  All others, whether on the same AC line or one
of the subsequent AC lines (also provided for) are considered 'secondary'.
I'm not sure this is the correct interpretation of the spec (above).  

Also, it alows for space in addition to semi-colon to delimit accession
numbers.  I have run across such entires.

Thanks?  Comments?

-Malcolm Cook


------_=_NextPart_000_01C15285.287A5B80
Content-Type: application/octet-stream;
	name="embl.pm"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: attachment;
	filename="embl.pm"

# $Id: embl.pm,v 1.31 2001/06/29 21:05:49 jason Exp $=0A=
#=0A=
# BioPerl module for Bio::SeqIO::EMBL=0A=
#=0A=
# Cared for by Ewan Birney <birney@ebi.ac.uk>=0A=
#=0A=
# Copyright Ewan Birney=0A=
#=0A=
# You may distribute this module under the same terms as perl itself=0A=
=0A=
# POD documentation - main docs before the code=0A=
=0A=
=3Dhead1 NAME=0A=
=0A=
Bio::SeqIO::embl - EMBL sequence input/output stream=0A=
=0A=
=3Dhead1 SYNOPSIS=0A=
=0A=
It is probably best not to use this object directly, but=0A=
rather go through the AnnSeqIO handler system. Go:=0A=
=0A=
    $stream =3D Bio::SeqIO->new(-file =3D> $filename, -format =3D> =
'EMBL');=0A=
=0A=
    while ( (my $seq =3D $stream->next_seq()) ) {=0A=
	# do something with $seq=0A=
    }=0A=
=0A=
=3Dhead1 DESCRIPTION=0A=
=0A=
This object can transform Bio::Seq objects to and from EMBL flat=0A=
file databases.=0A=
=0A=
There is alot of flexibility here about how to dump things which I =
need=0A=
to document fully.=0A=
=0A=
There should be a common object that this and genbank share =
(probably=0A=
with swissprot). Too much of the magic is identical. =0A=
=0A=
=3Dhead2 Optional functions=0A=
=0A=
=3Dover 3=0A=
=0A=
=3Ditem _show_dna()=0A=
=0A=
(output only) shows the dna or not=0A=
=0A=
=3Ditem _post_sort()=0A=
=0A=
(output only) provides a sorting func which is applied to the =
FTHelpers=0A=
before printing=0A=
=0A=
=3Ditem _id_generation_func()=0A=
=0A=
This is function which is called as =0A=
=0A=
   print "ID   ", $func($annseq), "\n";=0A=
=0A=
To generate the ID line. If it is not there, it generates a sensible =
ID=0A=
line using a number of tools.=0A=
=0A=
=3Dback=0A=
=0A=
=3Dhead1 FEEDBACK=0A=
=0A=
=3Dhead2 Mailing Lists=0A=
=0A=
User feedback is an integral part of the evolution of this=0A=
and other Bioperl modules. Send your comments and suggestions =
preferably=0A=
 to one of the Bioperl mailing lists.=0A=
Your participation is much appreciated.=0A=
=0A=
  bioperl-l@bioperl.org                 - General discussion=0A=
  http://www.bioperl.org/MailList.shtml - About the mailing lists=0A=
=0A=
=3Dhead2 Reporting Bugs=0A=
=0A=
Report bugs to the Bioperl bug tracking system to help us keep track=0A=
 the bugs and their resolution.=0A=
 Bug reports can be submitted via email or the web:=0A=
=0A=
  bioperl-bugs@bio.perl.org=0A=
  http://bio.perl.org/bioperl-bugs/=0A=
=0A=
=3Dhead1 AUTHOR - Ewan Birney=0A=
=0A=
Email birney@ebi.ac.uk=0A=
=0A=
Describe contact details here=0A=
=0A=
=3Dhead1 APPENDIX=0A=
=0A=
The rest of the documentation details each of the object methods. =
Internal methods are usually preceded with a _=0A=
=0A=
=3Dcut=0A=
=0A=
=0A=
# Let the code begin...=0A=
=0A=
=0A=
package Bio::SeqIO::embl;=0A=
use vars qw(@ISA);=0A=
use strict;=0A=
use Bio::Seq::RichSeq;=0A=
use Bio::SeqIO::FTHelper;=0A=
use Bio::SeqFeature::Generic;=0A=
use Bio::Species;=0A=
=0A=
@ISA =3D qw(Bio::SeqIO);=0A=
=0A=
sub _initialize {=0A=
  my($self,@args) =3D @_;=0A=
=0A=
  $self->SUPER::_initialize(@args);  =0A=
  # hash for functions for decoding keys.=0A=
  $self->{'_func_ftunit_hash'} =3D {}; =0A=
  $self->_show_dna(1); # sets this to one by default. People can change =
it =0A=
}=0A=
=0A=
=0A=
=3Dhead2 next_seq=0A=
=0A=
 Title   : next_seq=0A=
 Usage   : $seq =3D $stream->next_seq()=0A=
 Function: returns the next sequence in the stream=0A=
 Returns : Bio::Seq object=0A=
 Args    :=0A=
=0A=
=0A=
=3Dcut=0A=
=0A=
sub next_seq {=0A=
   my ($self,@args) =3D @_;=0A=
   my ($pseq,$c,$line,$name,$desc,$acc,$seqc,$mol,$div, =0A=
       $date, $comment, @date_arr);=0A=
   my $seq =3D Bio::Seq::RichSeq->new(-verbose =
=3D>$self->verbose());=0A=
=0A=
=0A=
   $line =3D $self->_readline;   # This needs to be before the first =
eof() test=0A=
=0A=
   if( !defined $line ) {=0A=
       return undef; # no throws - end of file=0A=
   }=0A=
=0A=
   if( $line =3D~ /^\s+$/ ) {=0A=
       while( defined ($line =3D $self->_readline) ) {=0A=
	   $line =3D~/\S/ && last;=0A=
       }=0A=
   }   =0A=
   if( !defined $line ) {=0A=
       return undef; # end of file=0A=
   }=0A=
   $line =3D~ /^ID\s+\S+/ || $self->throw("EMBL stream with no ID. Not =
embl in my book");=0A=
   $line =3D~ /^ID\s+(\S+)\s+\S+\;\s+(\S+)\;\s+(\S+)\;/;=0A=
   $name =3D $1;=0A=
   $mol =3D $2;=0A=
   $div =3D $3;=0A=
   if(! $name) {=0A=
       $name =3D "unknown id";=0A=
   }=0A=
    # this is important to have the id for display in e.g. FTHelper, =
otherwise=0A=
    # you won't know which entry caused an error=0A=
   $seq->display_id($name);=0A=
   if($mol) {=0A=
       $seq->molecule($mol);=0A=
       my $moltype;=0A=
       if (defined $seq->molecule) {=0A=
	   my $mol =3D$seq->molecule;=0A=
	   if ($mol =3D~ /DNA/) {=0A=
	       $moltype=3D'dna';=0A=
	   }=0A=
	   elsif ($mol =3D~ /RNA/) {=0A=
	       $moltype=3D'rna';=0A=
	   }=0A=
	   elsif ($mol =3D~ /AA/) {=0A=
	       $moltype=3D'protein';=0A=
	   }=0A=
       }=0A=
       if ($moltype) {=0A=
	   $seq->primary_seq->moltype($moltype);=0A=
       }=0A=
=0A=
   }=0A=
   if ($div) {=0A=
       $seq->division($div);=0A=
   }=0A=
   =0A=
#   $self->warn("not parsing upper annotation in EMBL file yet!");=0A=
   my $buffer =3D $line;=0A=
 =0A=
   =0A=
   BEFORE_FEATURE_TABLE :=0A=
   until( !defined $buffer ) {=0A=
       $_ =3D $buffer;=0A=
=0A=
       # Exit at start of Feature table=0A=
       last if /^FH/;=0A=
=0A=
       # Description line(s)=0A=
       if (/^DE\s+(\S.*\S)/) {=0A=
           $desc .=3D $desc ? " $1" : $1;=0A=
       }=0A=
=0A=
       #accession number - mec - patched to handle secondary accession =
numbers=0A=
       if( /^AC\s+(.*)$/ ) {=0A=
			my @accs =3D split /[; ]+/ , $1; # allow space in addition to =
semi-colon (not strictly legal?)=0A=
			$seq->accession_number(shift(@accs)) if "unknown" eq  =
$seq->accession_number;=0A=
			$seq->add_secondary_accession($_) foreach @accs;=0A=
       }=0A=
=0A=
#        #accession number=0A=
#        if( /^AC\s+(\S+);?/ ) {=0A=
# 	   $acc =3D $1;=0A=
# 	   $acc =3D~ s/\;//;=0A=
# 	   $seq->accession_number($acc);=0A=
#        }=0A=
       =0A=
       #version number=0A=
       if( /^SV\s+\S+\.(\d+);?/ ) {=0A=
	   my $sv =3D $1;=0A=
	   $sv =3D~ s/\;//;=0A=
	   $seq->seq_version($sv);=0A=
       }=0A=
=0A=
       #date (NOTE: takes last date line)=0A=
       if( /^DT\s+(.+)$/ ) {=0A=
	   my $date =3D $1;=0A=
	   $date =3D~ s/\;$//;=0A=
	   $seq->add_date($date);=0A=
       }=0A=
       =0A=
       #keywords=0A=
       if( /^KW   (.*)\S*$/ ) {=0A=
	   my $keywords =3D $1;=0A=
	   $seq->keywords($keywords);=0A=
       }=0A=
=0A=
       # Organism name and phylogenetic information=0A=
       elsif (/^O[SC]/) {=0A=
           my $species =3D $self->_read_EMBL_Species(\$buffer);=0A=
           $seq->species( $species );=0A=
       }=0A=
=0A=
       # References=0A=
       elsif (/^R/) {=0A=
	   my @refs =3D $self->_read_EMBL_References(\$buffer);=0A=
	   $seq->annotation->add_Reference(@refs);=0A=
       }=0A=
       =0A=
       # DB Xrefs=0A=
       elsif (/^DR/) {=0A=
	   my @links =3D $self->_read_EMBL_DBLink(\$buffer);=0A=
	   $seq->annotation->add_DBLink(@links);=0A=
       }=0A=
       =0A=
       # Comments=0A=
       elsif (/^CC\s+(.*)/) {=0A=
	   $comment .=3D $1;=0A=
	   $comment .=3D " ";=0A=
	   while (defined ($_ =3D $self->_readline) ) {=0A=
	       if (/^CC\s+(.*)/) {=0A=
		   $comment .=3D $1;=0A=
		   $comment .=3D " ";=0A=
	       }=0A=
	       else { =0A=
		   last;=0A=
	       }=0A=
	   }=0A=
	   my $commobj =3D Bio::Annotation::Comment->new();=0A=
	   $commobj->text($comment);=0A=
	   $seq->annotation->add_Comment($commobj);=0A=
	   $comment =3D "";=0A=
       }=0A=
=0A=
       # Get next line.=0A=
       $buffer =3D $self->_readline;=0A=
   }=0A=
    $seq->desc($desc);=0A=
=0A=
   while( defined ($_ =3D $self->_readline) ) {=0A=
       /^FT   \w/ && last;=0A=
       /^SQ / && last;=0A=
   }=0A=
   $buffer =3D $_;=0A=
      =0A=
   if (defined($buffer) && $buffer =3D~ /^FT /) {=0A=
     until( !defined ($buffer) ) {=0A=
	 my $ftunit =3D $self->_read_FTHelper_EMBL(\$buffer);=0A=
	 # process ftunit=0A=
	 $ftunit->_generic_seqfeature($seq);=0A=
=0A=
	 if( $buffer !~ /^FT/ ) {=0A=
	     last;=0A=
	 }=0A=
     }=0A=
   }=0A=
	=0A=
   if( $buffer !~ /^SQ/  ) {=0A=
       while( defined ($_ =3D $self->_readline) ) {=0A=
	   /^SQ/ && last;=0A=
       }=0A=
   }=0A=
=0A=
   $seqc =3D "";	       =0A=
   while( defined ($_ =3D $self->_readline) ) {=0A=
       /^\/\// && last;=0A=
       $_ =3D uc($_);=0A=
       s/[^A-Za-z]//g;=0A=
       $seqc .=3D $_;=0A=
   }=0A=
=0A=
   $seq->seq($seqc);=0A=
   return $seq;=0A=
}=0A=
=0A=
=3Dhead2 write_seq=0A=
=0A=
 Title   : write_seq=0A=
 Usage   : $stream->write_seq($seq)=0A=
 Function: writes the $seq object (must be seq) to the stream=0A=
 Returns : 1 for success and 0 for error=0A=
 Args    : Bio::Seq=0A=
=0A=
=0A=
=3Dcut=0A=
=0A=
sub write_seq {=0A=
    my ($self,$seq) =3D @_;=0A=
=0A=
    if( !defined $seq ) {=0A=
        $self->throw("Attempting to write with no seq!");=0A=
    }=0A=
=0A=
    if( ! ref $seq || ! $seq->isa('Bio::SeqI') ) {=0A=
        $self->warn(" $seq is not a SeqI compliant module. Attempting =
to dump, but may fail!");=0A=
    }=0A=
=0A=
    my $str =3D $seq->seq;=0A=
=0A=
    my $mol;=0A=
    my $div;=0A=
    my $len =3D $seq->length();=0A=
=0A=
    if ($seq->can('division') && defined $seq->division) {=0A=
        $div =3D $seq->division();=0A=
    }=0A=
    $div ||=3D 'UNK';=0A=
    =0A=
    if ($seq->can('molecule')) {=0A=
        $mol =3D $seq->molecule();=0A=
    }=0A=
    elsif (defined $seq->primary_seq->moltype) {=0A=
	my $moltype =3D$seq->primary_seq->moltype;=0A=
	if ($moltype eq 'dna') {=0A=
	    $mol =3D'DNA';=0A=
	}=0A=
	elsif ($moltype eq 'rna') {=0A=
	    $mol=3D'RNA';=0A=
	}=0A=
	elsif ($moltype eq 'protein') {=0A=
	    $mol=3D'AA';=0A=
	}=0A=
    }=0A=
    $mol ||=3D 'XXX';=0A=
   =0A=
    my $temp_line;=0A=
    if( $self->_id_generation_func ) {=0A=
        $temp_line =3D &{$self->_id_generation_func}($seq);=0A=
    } else {=0A=
        $temp_line =3D sprintf("%-11sstandard; $mol; $div; %d BP.", =
$seq->id(), $len);=0A=
    } =0A=
=0A=
    $self->_print( "ID   $temp_line\n","XX\n");=0A=
=0A=
    # Write the accession line if present=0A=
    my( $acc );=0A=
    {=0A=
        if( my $func =3D $self->_ac_generation_func ) {=0A=
            $acc =3D &{$func}($seq);=0A=
        } elsif( $seq->isa('Bio::Seq::RichSeqI') && =0A=
		 defined($seq->accession_number) ) {=0A=
            $acc =3D $seq->accession_number;=0A=
				$acc =3D join(";", $acc, $seq->get_secondary_accessions); # mec - =
patched to handle secondary accession numbers=0A=
        }=0A=
=0A=
        if (defined $acc) {=0A=
            $self->_print("AC   $acc;\n",=0A=
			  "XX\n");=0A=
        }=0A=
    }=0A=
=0A=
    # Write the sv line if present=0A=
    {=0A=
        my( $sv );=0A=
        if (my $func =3D $self->_sv_generation_func) {=0A=
            $sv =3D &{$func}($seq);=0A=
        } elsif($seq->isa('Bio::Seq::RichSeqI') && =0A=
		defined($seq->seq_version)) {=0A=
            $sv =3D "$acc.". $seq->seq_version();=0A=
        }	=0A=
        if (defined $sv) {=0A=
            $self->_print( "SV   $sv\n",=0A=
			   "XX\n");=0A=
        }=0A=
    }=0A=
=0A=
    # Date lines=0A=
    my $switch=3D0;=0A=
    if( $seq->can('get_dates') ) {=0A=
	foreach my $dt ( $seq->get_dates() ) {=0A=
	    $self->_write_line_EMBL_regex("DT   ","DT   =
",$dt,'\s+|$',80);#'=0A=
            $switch=3D1;=0A=
        }=0A=
        if ($switch =3D=3D 1) {=0A=
            $self->_print("XX\n");=0A=
        }=0A=
    }=0A=
=0A=
    # Description lines=0A=
    $self->_write_line_EMBL_regex("DE   ","DE   =
",$seq->desc(),'\s+|$',80); #'=0A=
    $self->_print( "XX\n");=0A=
=0A=
    # if there, write the kw line=0A=
    {=0A=
        my( $kw );=0A=
        if( my $func =3D $self->_kw_generation_func ) {=0A=
            $kw =3D &{$func}($seq);=0A=
        } elsif( $seq->can('keywords') ) {=0A=
	    $kw =3D $seq->keywords;=0A=
        }=0A=
        if (defined $kw) {=0A=
            $self->_print( "KW   $kw\n",=0A=
			   "XX\n");=0A=
        }=0A=
    }=0A=
   =0A=
    # Organism lines=0A=
=0A=
    if ($seq->can('species') && (my $spec =3D $seq->species)) {=0A=
        my($species, @class) =3D $spec->classification();=0A=
        my $genus =3D $class[0];=0A=
        my $OS =3D "$genus $species";=0A=
	if (my $ssp =3D $spec->sub_species) {=0A=
            $OS .=3D " $ssp";=0A=
        }=0A=
        if (my $common =3D $spec->common_name) {=0A=
            $OS .=3D " ($common)";=0A=
        }=0A=
        $self->_print("OS   $OS\n");=0A=
        my $OC =3D join('; ', reverse(@class)) .'.';=0A=
        $self->_write_line_EMBL_regex("OC   ","OC   ",$OC,'; |$',80); =
#'=0A=
	if ($spec->organelle) {=0A=
	    $self->_write_line_EMBL_regex("OG   ","OG   ",$spec->organelle,'; =
|$',80); #'=0A=
	}=0A=
        $self->_print("XX\n");=0A=
    }=0A=
   =0A=
    # Reference lines=0A=
    my $t =3D 1;=0A=
    if ( defined $seq->annotation ) {=0A=
	foreach my $ref ( $seq->annotation->each_Reference() ) {=0A=
	    $self->_print( "RN   [$t]\n");=0A=
	    =0A=
	    # Having no RP line is legal, but we need both=0A=
	    # start and end for a valid location.=0A=
	    my $start =3D $ref->start;=0A=
	    my $end   =3D $ref->end;=0A=
	    if ($start and $end) {=0A=
		$self->_print( "RP   $start-$end\n");=0A=
	    } elsif ($start or $end) {=0A=
		$self->throw("Both start and end are needed for a valid RP line.  =
Got: start=3D'$start' end=3D'$end'");=0A=
	    }=0A=
	    =0A=
	    if (my $med =3D $ref->medline) {=0A=
		$self->_print( "RX   MEDLINE; $med.\n");=0A=
	    }=0A=
	    if (my $pm =3D $ref->pubmed) {=0A=
		$self->_print( "RX   PUBMED; $pm.\n");=0A=
	    }=0A=
	    $self->_write_line_EMBL_regex("RA   ", "RA   ", =0A=
					  $ref->authors . ";",=0A=
					  '\s+|$', 80); #'=0A=
=0A=
           # If there is no title to the reference, it appears=0A=
           # as a single semi-colon.  All titles must end in=0A=
           # a semi-colon.=0A=
           my $ref_title =3D $ref->title || '';=0A=
           $ref_title =3D~ s/[\s;]*$/;/;=0A=
           $self->_write_line_EMBL_regex("RT   ", "RT   ", $ref_title,  =
  '\s+|$', 80); #'=0A=
	   $self->_write_line_EMBL_regex("RL   ", "RL   ", $ref->location, =
'\s+|$', 80); #'=0A=
           if ($ref->comment) {=0A=
	       $self->_write_line_EMBL_regex("RC   ", "RC   ", $ref->comment, =
'\s+|$', 80); #' =0A=
           }=0A=
           $self->_print("XX\n");=0A=
           $t++;=0A=
       }=0A=
        =0A=
=0A=
       # DB Xref lines=0A=
       if (my @db_xref =3D $seq->annotation->each_DBLink) {=0A=
           foreach my $dr (@db_xref) {=0A=
              my $db_name =3D $dr->database;=0A=
              my $prim    =3D $dr->primary_id;=0A=
              my $opt     =3D $dr->optional_id || '';=0A=
            =0A=
              my $line =3D "$db_name; $prim; $opt.";=0A=
              $self->_write_line_EMBL_regex("DR   ", "DR   ", $line, =
'\s+|$', 80); #'=0A=
          }=0A=
          $self->_print("XX\n");=0A=
       }=0A=
=0A=
       # Comment lines=0A=
       foreach my $comment ( $seq->annotation->each_Comment() ) {=0A=
           $self->_write_line_EMBL_regex("CC   ", "CC   ", =
$comment->text, '\s+|$', 80); #'=0A=
           $self->_print("XX\n");=0A=
       }=0A=
    }=0A=
    # "\\s\+\|\$"=0A=
=0A=
    ## FEATURE TABLE=0A=
=0A=
    $self->_print("FH   Key             Location/Qualifiers\n");=0A=
    $self->_print("FH\n");=0A=
=0A=
    if( defined $self->_post_sort ) {=0A=
        # we need to read things into an array. Process. Sort them. =
Print 'em=0A=
=0A=
        my $post_sort_func =3D $self->_post_sort();=0A=
        my @fth;=0A=
=0A=
        foreach my $sf ( $seq->top_SeqFeatures ) {=0A=
	    push(@fth,Bio::SeqIO::FTHelper::from_SeqFeature($sf,$seq));=0A=
        }=0A=
=0A=
        @fth =3D sort { &$post_sort_func($a,$b) } @fth;=0A=
=0A=
        foreach my $fth ( @fth ) {=0A=
	    $self->_print_EMBL_FTHelper($fth);=0A=
        }=0A=
    } else {=0A=
        # not post sorted. And so we can print as we get them.=0A=
        # lower memory load...=0A=
=0A=
        foreach my $sf ( $seq->top_SeqFeatures ) {=0A=
	    my @fth =3D Bio::SeqIO::FTHelper::from_SeqFeature($sf,$seq);=0A=
	    foreach my $fth ( @fth ) {=0A=
	        $self->_print_EMBL_FTHelper($fth);=0A=
	    }=0A=
        }=0A=
    }=0A=
=0A=
    $self->_print( "XX\n");=0A=
=0A=
    if( $self->_show_dna() =3D=3D 0 ) {=0A=
        return;=0A=
    }=0A=
=0A=
    # finished printing features.=0A=
=0A=
    $str =3D~ tr/A-Z/a-z/;=0A=
    =0A=
    # Count each nucleotide=0A=
    my $alen =3D $str =3D~ tr/a/a/;=0A=
    my $clen =3D $str =3D~ tr/c/c/;=0A=
    my $glen =3D $str =3D~ tr/g/g/;=0A=
    my $tlen =3D $str =3D~ tr/t/t/;=0A=
=0A=
    my $olen =3D $len - ($alen + $tlen + $clen + $glen);=0A=
    if( $olen < 0 ) {=0A=
        $self->warn("Weird. More atgc than bases. Problem!");=0A=
    }=0A=
=0A=
    $self->_print("SQ   Sequence $len BP; $alen A; $clen C; $glen G; =
$tlen T; $olen other;\n");=0A=
    =0A=
    my $nuc =3D 60;               # Number of nucleotides per line=0A=
    my $whole_pat =3D 'a10' x 6;  # Pattern for unpacking a whole =
line=0A=
    my $out_pat   =3D 'A11' x 6;  # Pattern for packing a line=0A=
    my $length =3D length($str);=0A=
    =0A=
    # Calculate the number of nucleotides which fit on whole lines=0A=
    my $whole =3D int($length / $nuc) * $nuc;=0A=
=0A=
    # Print the whole lines=0A=
    my( $i );=0A=
    for ($i =3D 0; $i < $whole; $i +=3D $nuc) {=0A=
        my $blocks =3D pack $out_pat,=0A=
                     unpack $whole_pat,=0A=
                     substr($str, $i, $nuc);=0A=
        $self->_print(sprintf("     $blocks%9d\n", $i + $nuc));=0A=
    }=0A=
=0A=
    # Print the last line=0A=
    if (my $last =3D substr($str, $i)) {=0A=
        my $last_len =3D length($last);=0A=
        my $last_pat =3D 'a10' x int($last_len / 10) .'a'. $last_len % =
10;=0A=
        my $blocks =3D pack $out_pat,=0A=
                     unpack($last_pat, $last);=0A=
        $self->_print(sprintf("     $blocks%9d\n", $length));    # Add =
the length to the end=0A=
    }=0A=
=0A=
    $self->_print( "//\n");=0A=
    return 1;=0A=
}=0A=
=0A=
=3Dhead2 _print_EMBL_FTHelper=0A=
=0A=
 Title   : _print_EMBL_FTHelper=0A=
 Usage   :=0A=
 Function:=0A=
 Example :=0A=
 Returns : =0A=
 Args    :=0A=
=0A=
=0A=
=3Dcut=0A=
=0A=
sub _print_EMBL_FTHelper {=0A=
   my ($self,$fth,$always_quote) =3D @_;=0A=
   $always_quote ||=3D 0;=0A=
   =0A=
   if( ! ref $fth || ! $fth->isa('Bio::SeqIO::FTHelper') ) {=0A=
       $fth->warn("$fth is not a FTHelper class. Attempting to print, =
but there could be tears!");=0A=
   }=0A=
=0A=
=0A=
   #$self->_print( "FH   Key             Location/Qualifiers\n");=0A=
   #$self->_print( sprintf("FT   %-15s  %s\n",$fth->key,$fth->loc));=0A=
   $self->_write_line_EMBL_regex(sprintf("FT   %-15s ",$fth->key),"FT   =
                ",$fth->loc,'\,|$',80); #'=0A=
   foreach my $tag ( keys %{$fth->field} ) {=0A=
       if( ! defined $fth->field->{$tag} ) { next; } =0A=
       foreach my $value ( @{$fth->field->{$tag}} ) {=0A=
	   $value =3D~ s/\"/\"\"/g;=0A=
	   if ($value eq "_no_value") {=0A=
	       $self->_write_line_EMBL_regex("FT                   ","FT       =
            ","/$tag",'.|$',80); #'=0A=
	   }=0A=
           elsif( $always_quote =3D=3D 1 || $value !~ /^\d+$/ ) {=0A=
              my $pat =3D $value =3D~ /\s/ ? '\s|$' : '.|$';=0A=
	      $self->_write_line_EMBL_regex("FT                   ","FT        =
           ","/$tag=3D\"$value\"",$pat,80);=0A=
           }=0A=
           else {=0A=
              $self->_write_line_EMBL_regex("FT                   ","FT =
                  ","/$tag=3D$value",'.|$',80); #'=0A=
           }=0A=
	  # $self->_print( "FT                   /", $tag, "=3D\"", $value, =
"\"\n");=0A=
       }=0A=
   }=0A=
=0A=
}=0A=
=0A=
#'=0A=
=3Dhead2 _read_EMBL_References=0A=
=0A=
 Title   : _read_EMBL_References=0A=
 Usage   :=0A=
 Function: Reads references from EMBL format. Internal function =
really=0A=
 Example :=0A=
 Returns : =0A=
 Args    :=0A=
=0A=
=0A=
=3Dcut=0A=
=0A=
sub _read_EMBL_References {=0A=
   my ($self,$buffer) =3D @_;=0A=
   my (@refs);=0A=
   =0A=
   # assumme things are starting with RN=0A=
=0A=
   if( $$buffer !~ /^RN/ ) {=0A=
       warn("Not parsing line '$$buffer' which maybe important");=0A=
   }=0A=
   my $b1;=0A=
   my $b2;=0A=
   my $title;=0A=
   my $loc;=0A=
   my $au;=0A=
   my $med;=0A=
   my $pm;=0A=
   my $com;=0A=
=0A=
   while( defined ($_ =3D $self->_readline) ) {=0A=
       /^R/ || last;=0A=
       /^RP   (\d+)-(\d+)/ && do {$b1=3D$1;$b2=3D$2;};=0A=
       /^RX   MEDLINE;\s+(\d+)/ && do {$med=3D$1};=0A=
       /^RX   PUBMED;\s+(\d+)/ && do {$pm=3D$1};       =0A=
       /^RA   (.*)/ && do {=0A=
	   $au =3D $self->_concatenate_lines($au,$1); next;=0A=
       };=0A=
       /^RT   (.*)/ && do {=0A=
	   $title =3D $self->_concatenate_lines($title,$1); next;=0A=
       };=0A=
       /^RL   (.*)/ && do {=0A=
	   $loc =3D $self->_concatenate_lines($loc,$1); next;=0A=
       };=0A=
       /^RC   (.*)/ && do {=0A=
	   $com =3D $self->_concatenate_lines($com,$1); next;=0A=
       };=0A=
   }=0A=
   =0A=
   my $ref =3D new Bio::Annotation::Reference;=0A=
   $au =3D~ s/;\s*$//g;=0A=
   $title =3D~ s/;\s*$//g;=0A=
=0A=
   $ref->start($b1);=0A=
   $ref->end($b2);=0A=
   $ref->authors($au);=0A=
   $ref->title($title);=0A=
   $ref->location($loc);=0A=
   $ref->medline($med);=0A=
   $ref->comment($com);=0A=
   $ref->pubmed($pm);=0A=
=0A=
   push(@refs,$ref);=0A=
   $$buffer =3D $_;=0A=
   =0A=
   return @refs;=0A=
}=0A=
=0A=
=3Dhead2 _read_EMBL_Species=0A=
=0A=
 Title   : _read_EMBL_Species=0A=
 Usage   :=0A=
 Function: Reads the EMBL Organism species and classification=0A=
           lines.=0A=
 Example :=0A=
 Returns : A Bio::Species object=0A=
 Args    :=0A=
=0A=
=3Dcut=0A=
=0A=
sub _read_EMBL_Species {=0A=
    my( $self, $buffer ) =3D @_;=0A=
    my $org;=0A=
=0A=
    $_ =3D $$buffer;=0A=
    my( $sub_species, $species, $genus, $common, @class );=0A=
    while (defined( $_ ||=3D $self->_readline )) {=0A=
        =0A=
        if =
(/^OS\s+(\S+)(?:\s+([^\(]\S*))?(?:\s+([^\(]\S*))?(?:\s+\((.*)\))?/) =
{=0A=
            $genus   =3D $1;=0A=
	    $species =3D $2 || 'sp.';=0A=
	    $sub_species =3D $3 if $3;=0A=
            $common      =3D $4 if $4;=0A=
        }=0A=
        elsif (s/^OC\s+//) {=0A=
	    # only split on ';' or '.' so that =0A=
	    # classification that is 2 words will =0A=
	    # still get matched=0A=
	    # use map to remove trailing/leading spaces=0A=
	    chomp;=0A=
            push(@class,  map { s/^\s+//; s/\s+$//; $_; } split =
/[;\.]+/);=0A=
        }=0A=
	elsif (/^OG\s+(.*)/) {=0A=
	    $org =3D $1;=0A=
	}=0A=
        else {=0A=
            last;=0A=
        }=0A=
        =0A=
        $_ =3D undef; # Empty $_ to trigger read of next line=0A=
    }=0A=
    =0A=
    $$buffer =3D $_;=0A=
    =0A=
    # Don't make a species object if it is "Unknown" or "None"=0A=
    return if $genus =3D~ /^(Unknown|None)$/i;=0A=
=0A=
    # Bio::Species array needs array in Species -> Kingdom direction=0A=
    if ($class[$#class] eq $genus) {=0A=
        push( @class, $species );=0A=
    } else {=0A=
        push( @class, $genus, $species );=0A=
    }=0A=
    @class =3D reverse @class;=0A=
    =0A=
    my $make =3D Bio::Species->new();=0A=
    $make->classification( @class );=0A=
    $make->common_name( $common      ) if $common;=0A=
    $make->sub_species( $sub_species ) if $sub_species;=0A=
    $make->organelle  ( $org         ) if $org;=0A=
    return $make;=0A=
}=0A=
=0A=
=3Dhead2 _read_EMBL_DBLink=0A=
=0A=
 Title   : _read_EMBL_DBLink=0A=
 Usage   :=0A=
 Function: Reads the EMBL database cross reference ("DR") lines=0A=
 Example :=0A=
 Returns : A list of Bio::Annotation::DBLink objects=0A=
 Args    :=0A=
=0A=
=3Dcut=0A=
=0A=
sub _read_EMBL_DBLink {=0A=
    my( $self,$buffer ) =3D @_;=0A=
    my( @db_link );=0A=
=0A=
    $_ =3D $$buffer;=0A=
    while (defined( $_ ||=3D $self->_readline )) {=0A=
        =0A=
        if (my($databse, $prim_id, $sec_id)=0A=
                =3D /^DR   ([^\s;]+);\s*([^\s;]+);\s*([^\s;]+)?\.$/) =
{=0A=
            my $link =3D Bio::Annotation::DBLink->new();=0A=
            $link->database   ( $databse );=0A=
            $link->primary_id ( $prim_id );=0A=
            $link->optional_id( $sec_id  ) if $sec_id;=0A=
            push(@db_link, $link);=0A=
	}=0A=
        else {=0A=
            last;=0A=
        }=0A=
        =0A=
        $_ =3D undef; # Empty $_ to trigger read of next line=0A=
    }=0A=
    =0A=
    $$buffer =3D $_;=0A=
    =0A=
    return @db_link;=0A=
}=0A=
=0A=
=3Dhead2 _filehandle=0A=
=0A=
 Title   : _filehandle=0A=
 Usage   : $obj->_filehandle($newval)=0A=
 Function: =0A=
 Example : =0A=
 Returns : value of _filehandle=0A=
 Args    : newvalue (optional)=0A=
=0A=
=0A=
=3Dcut=0A=
=0A=
sub _filehandle{=0A=
   my ($obj,$value) =3D @_;=0A=
   if( defined $value) {=0A=
      $obj->{'_filehandle'} =3D $value;=0A=
    }=0A=
    return $obj->{'_filehandle'};=0A=
=0A=
}=0A=
=0A=
=3Dhead2 _read_FTHelper_EMBL=0A=
=0A=
 Title   : _read_FTHelper_EMBL=0A=
 Usage   : _read_FTHelper_EMBL($buffer)=0A=
 Function: reads the next FT key line=0A=
 Example :=0A=
 Returns : Bio::SeqIO::FTHelper object =0A=
 Args    : filehandle and reference to a scalar=0A=
=0A=
=0A=
=3Dcut=0A=
=0A=
sub _read_FTHelper_EMBL {=0A=
    my ($self,$buffer) =3D @_;=0A=
    =0A=
    my ($key,   # The key of the feature=0A=
        $loc,   # The location line from the feature=0A=
        @qual,  # An arrray of lines making up the qualifiers=0A=
        );=0A=
    =0A=
    if ($$buffer =3D~ /^FT   (\S+)\s+(\S+)/) {=0A=
        $key =3D $1;=0A=
        $loc =3D $2;=0A=
        # Read all the lines up to the next feature=0A=
        while ( defined($_ =3D $self->_readline) ) {=0A=
            if (/^FT(\s+)(.+?)\s*$/) {=0A=
                # Lines inside features are preceeded by 19 spaces=0A=
                # A new feature is preceeded by 3 spaces=0A=
                if (length($1) > 4) {=0A=
                    # Add to qualifiers if we're in the qualifiers=0A=
                    if (@qual) {=0A=
                        push(@qual, $2);=0A=
                    }=0A=
                    # Start the qualifier list if it's the first =
qualifier=0A=
                    elsif (substr($2, 0, 1) eq '/') {=0A=
                        @qual =3D ($2);=0A=
                    }=0A=
                    # We're still in the location line, so append to =
location=0A=
                    else {=0A=
                        $loc .=3D $2;=0A=
                    }=0A=
                } else {=0A=
                    # We've reached the start of the next feature=0A=
                    last;=0A=
                }=0A=
            } else {=0A=
                # We're at the end of the feature table=0A=
                last;=0A=
            }=0A=
        }=0A=
    } else {=0A=
        # No feature key=0A=
        return;=0A=
    } =0A=
    =0A=
    # Put the first line of the next feature into the buffer=0A=
    $$buffer =3D $_;=0A=
=0A=
    # Make the new FTHelper object=0A=
    my $out =3D new Bio::SeqIO::FTHelper(-verbose =3D> =
$self->verbose());=0A=
    $out->key($key);=0A=
    $out->loc($loc);=0A=
=0A=
    # Now parse and add any qualifiers.  (@qual is kept=0A=
    # intact to provide informative error messages.)=0A=
  QUAL: for (my $i =3D 0; $i < @qual; $i++) {=0A=
        $_ =3D $qual[$i];=0A=
        my( $qualifier, $value ) =3D m{^/([^=3D]+)(?:=3D(.+))?}=0A=
            or $self->throw("Can't see new qualifier in: =
$_\nfrom:\n"=0A=
                . join('', map "$_\n", @qual));=0A=
        if (defined $value) {=0A=
            # Do we have a quoted value?=0A=
            if (substr($value, 0, 1) eq '"') {=0A=
                # Keep adding to value until we find the trailing =
quote=0A=
                # and the quotes are balanced=0A=
                while ($value !~ /"$/ or $value =3D~ tr/"/"/ % 2) { =
#"=0A=
                    $i++;=0A=
                    my $next =3D $qual[$i];=0A=
                    unless (defined($next)) {=0A=
                        warn("Unbalanced quote in:\n", map("$_\n", =
@qual),=0A=
                            "No further qualifiers will be added for =
this feature");=0A=
                        last QUAL;=0A=
                    }=0A=
=0A=
                    # Join to value with space if value or next line =
contains a space=0A=
                    $value .=3D (grep /\s/, ($value, $next)) ? " $next" =
: $next;=0A=
                }=0A=
                # Trim leading and trailing quotes=0A=
                $value =3D~ s/^"|"$//g;=0A=
                # Undouble internal quotes=0A=
                $value =3D~ s/""/"/g; #"=0A=
            }=0A=
        } else {=0A=
            $value =3D '_no_value';=0A=
        }=0A=
=0A=
        # Store the qualifier=0A=
        $out->field->{$qualifier} ||=3D [];=0A=
        push(@{$out->field->{$qualifier}},$value);=0A=
    }   =0A=
=0A=
    return $out;=0A=
}=0A=
=0A=
=3Dhead2 _write_line_EMBL=0A=
=0A=
 Title   : _write_line_EMBL=0A=
 Usage   :=0A=
 Function: internal function=0A=
 Example :=0A=
 Returns : =0A=
 Args    :=0A=
=0A=
=0A=
=3Dcut=0A=
=0A=
sub _write_line_EMBL {=0A=
   my ($self,$pre1,$pre2,$line,$length) =3D @_;=0A=
=0A=
   $length || die "Miscalled write_line_EMBL without length. =
Programming error!";=0A=
   my $subl =3D $length - length $pre2;=0A=
   my $linel =3D length $line;=0A=
   my $i;=0A=
=0A=
   my $sub =3D substr($line,0,$length - length $pre1);=0A=
=0A=
   $self->_print( "$pre1$sub\n");=0A=
   =0A=
   for($i=3D ($length - length $pre1);$i < $linel;) {=0A=
       $sub =3D substr($line,$i,($subl));=0A=
       $self->_print( "$pre2$sub\n");=0A=
       $i +=3D $subl;=0A=
   }=0A=
=0A=
}=0A=
=0A=
=3Dhead2 _write_line_EMBL_regex=0A=
=0A=
 Title   : _write_line_EMBL_regex=0A=
 Usage   :=0A=
 Function: internal function for writing lines of specified=0A=
           length, with different first and the next line =0A=
           left hand headers and split at specific points in the=0A=
           text=0A=
 Example :=0A=
 Returns : nothing=0A=
 Args    : file handle, first header, second header, text-line, regex =
for line breaks, total line length=0A=
=0A=
=0A=
=3Dcut=0A=
=0A=
sub _write_line_EMBL_regex {=0A=
    my ($self,$pre1,$pre2,$line,$regex,$length) =3D @_;=0A=
=0A=
    #print STDOUT "Going to print with $line!\n";=0A=
=0A=
    $length || die "Programming error - called write_line_EMBL_regex =
without length.";=0A=
=0A=
    if( length $pre1 !=3D length $pre2 ) {=0A=
        die "Programming error - called write_line_EMBL_regex with =
different length pre1 and pre2 tags!";=0A=
    }=0A=
=0A=
    my $subl =3D $length - (length $pre1) -1 ;=0A=
=0A=
=0A=
=0A=
    my( @lines );=0A=
    while(defined $line && =0A=
	  $line =3D~ m/(.{1,$subl})($regex)/g) {=0A=
	push(@lines, $1.$2);=0A=
    }=0A=
    foreach (@lines) { s/\s+$//; }=0A=
    =0A=
    # Print first line=0A=
    my $s =3D shift(@lines) || '';    =0A=
    $self->_print( "$pre1$s\n");=0A=
    =0A=
    # Print the rest=0A=
    foreach my $s ( @lines ) {=0A=
	$s =3D '' if( !defined $s );=0A=
        $self->_print( "$pre2$s\n");=0A=
    }=0A=
}=0A=
=0A=
=3Dhead2 _post_sort=0A=
=0A=
 Title   : _post_sort=0A=
 Usage   : $obj->_post_sort($newval)=0A=
 Function: =0A=
 Returns : value of _post_sort=0A=
 Args    : newvalue (optional)=0A=
=0A=
=0A=
=3Dcut=0A=
=0A=
sub _post_sort{=0A=
   my $obj =3D shift;=0A=
   if( @_ ) {=0A=
      my $value =3D shift;=0A=
      $obj->{'_post_sort'} =3D $value;=0A=
    }=0A=
    return $obj->{'_post_sort'};=0A=
=0A=
}=0A=
=0A=
=3Dhead2 _show_dna=0A=
=0A=
 Title   : _show_dna=0A=
 Usage   : $obj->_show_dna($newval)=0A=
 Function: =0A=
 Returns : value of _show_dna=0A=
 Args    : newvalue (optional)=0A=
=0A=
=0A=
=3Dcut=0A=
=0A=
sub _show_dna{=0A=
   my $obj =3D shift;=0A=
   if( @_ ) {=0A=
      my $value =3D shift;=0A=
      $obj->{'_show_dna'} =3D $value;=0A=
    }=0A=
    return $obj->{'_show_dna'};=0A=
=0A=
}=0A=
=0A=
=3Dhead2 _id_generation_func=0A=
=0A=
 Title   : _id_generation_func=0A=
 Usage   : $obj->_id_generation_func($newval)=0A=
 Function: =0A=
 Returns : value of _id_generation_func=0A=
 Args    : newvalue (optional)=0A=
=0A=
=0A=
=3Dcut=0A=
=0A=
sub _id_generation_func{=0A=
   my $obj =3D shift;=0A=
   if( @_ ) {=0A=
      my $value =3D shift;=0A=
      $obj->{'_id_generation_func'} =3D $value;=0A=
    }=0A=
    return $obj->{'_id_generation_func'};=0A=
=0A=
}=0A=
=0A=
=3Dhead2 _ac_generation_func=0A=
=0A=
 Title   : _ac_generation_func=0A=
 Usage   : $obj->_ac_generation_func($newval)=0A=
 Function: =0A=
 Returns : value of _ac_generation_func=0A=
 Args    : newvalue (optional)=0A=
=0A=
=0A=
=3Dcut=0A=
=0A=
sub _ac_generation_func{=0A=
   my $obj =3D shift;=0A=
   if( @_ ) {=0A=
      my $value =3D shift;=0A=
      $obj->{'_ac_generation_func'} =3D $value;=0A=
    }=0A=
    return $obj->{'_ac_generation_func'};=0A=
=0A=
}=0A=
=0A=
=3Dhead2 _sv_generation_func=0A=
=0A=
 Title   : _sv_generation_func=0A=
 Usage   : $obj->_sv_generation_func($newval)=0A=
 Function: =0A=
 Returns : value of _sv_generation_func=0A=
 Args    : newvalue (optional)=0A=
=0A=
=0A=
=3Dcut=0A=
=0A=
sub _sv_generation_func{=0A=
   my $obj =3D shift;=0A=
   if( @_ ) {=0A=
      my $value =3D shift;=0A=
      $obj->{'_sv_generation_func'} =3D $value;=0A=
    }=0A=
    return $obj->{'_sv_generation_func'};=0A=
=0A=
}=0A=
=0A=
=3Dhead2 _kw_generation_func=0A=
=0A=
 Title   : _kw_generation_func=0A=
 Usage   : $obj->_kw_generation_func($newval)=0A=
 Function: =0A=
 Returns : value of _kw_generation_func=0A=
 Args    : newvalue (optional)=0A=
=0A=
=0A=
=3Dcut=0A=
=0A=
sub _kw_generation_func{=0A=
   my $obj =3D shift;=0A=
   if( @_ ) {=0A=
      my $value =3D shift;=0A=
      $obj->{'_kw_generation_func'} =3D $value;=0A=
    }=0A=
    return $obj->{'_kw_generation_func'};=0A=
=0A=
}=0A=
=0A=
1;=0A=

------_=_NextPart_000_01C15285.287A5B80--