[Bioperl-l] bioperl article/report

Gabriel Horner cldwalker at chwhat.com
Thu Dec 2 05:13:36 EST 2004


Hi again,
I've written a 1500-word paper on Perl and bioinformatics for a bioinformatics class.
It was geared towards my teacher (surprise ;), meaning I talk about Bioperl
more from a biological, qualitative viewpoint. Feel free to post this
if you think it'll be of use to anyone. I'm enclosing it below and attaching it.

Good day,
Gabriel
				Gabriel Horner
				12/2/4
	Perl and Bioinformatics

	The focus of this paper is to explore what is
available for the bioinformatician in the world of Perl.
First, we'll explore the Bioperl modules, which are Perl's
main organized bioinformatics code base. Then we'll look at
some programs that use Perl. From these two explorations
I aim to show some of Perl's strengths and weaknesses in
the bioinformatics field and help the bioinformatician
better choose when to use Perl.
	The Bioperl modules are under the Bio::* namespace
on CPAN (http://www.cpan.org). The project began in 1995,
when few other biological toolkits existed. After almost a
decade, it has grown to over 300 modules with at least 20
developers. The modules themselves are not meant to be
out-of-the-box programs but rather reusable chunks that can
be combined to create a wide variety of functionality. This
functionality can be divided into the following topics,
each to be covered in more detail:

	Sequences: Manipulate them, i.e. read, write, translate
	Conversion: Convert between different sequence or alignment files
	Databases: Remote and local database access for sequences
	 and references
	Graphics: Draw sequences by displaying discrete ranges on a
	 number line, e.g. annotations and contig maps
	Alignments: Create from sequences, manipulate, multiple alignment
	 analysis
	External analysis: Perform queries through other programs, e.g.
		ClustalW, TCoffee, EMBOSS, BLAST and at least 15 others

	Sequences, usually abbreviated as Seq, are perhaps
the most widely used object. They are used to represent
DNA, RNA or protein sequences. There are several different
types, including Bio::PrimarySeq for lightweight use and
Bio::Seq::RichSeq, which supports richer annotations. A
standard Bio::Seq object can manipulate and save features
and annotations as well as truncate, translate and reverse
complement the sequence. Basic annotations are
implemented using Bio::AnnotationI. Basic features are
represented by Bio::SeqFeature::Generic objects. Features
can be associated with other features, called
sub-features, and with annotations. A feature tied to a
particular position on a sequence holds a location
object. A location is its own object because locations
can be complex. For example, an exon (a feature) may have
multiple locations, or, in an unfinished genome, a
location may have some uncertainty.
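	As a rough sketch of the objects just described, the
following creates a sequence from a string, manipulates it
and attaches a feature with a two-part location. The
sequence, its id and the feature coordinates are made up
for illustration, and the method names reflect my reading
of the Bio::Seq, Bio::SeqFeature::Generic and
Bio::Location::Split documentation, so check them against
your installed Bioperl version:

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqFeature::Generic;
use Bio::Location::Simple;
use Bio::Location::Split;

# A made-up 12-base DNA sequence (hypothetical id and content).
my $seq = Bio::Seq->new(
    -id       => 'demo_seq',
    -alphabet => 'dna',
    -seq      => 'ATGGCCAAGTAA',
);

my $first_codon = $seq->trunc(1, 3)->seq;   # bases 1..3
my $revcom      = $seq->revcom->seq;        # reverse complement
my $protein     = $seq->translate->seq;     # standard genetic code

# A location can have several parts, here two ranges on the forward strand.
my $location = Bio::Location::Split->new;
$location->add_sub_Location(
    Bio::Location::Simple->new(-start => 1, -end => 3, -strand => 1));
$location->add_sub_Location(
    Bio::Location::Simple->new(-start => 7, -end => 12, -strand => 1));

# An illustrative feature that uses the split location.
my $feature = Bio::SeqFeature::Generic->new(
    -primary_tag => 'exon',
    -location    => $location,
    -tag         => { note => 'illustrative feature only' },
);
$seq->add_SeqFeature($feature);

my @features = $seq->get_SeqFeatures;
my @parts    = $feature->location->sub_Location;

print "first codon: $first_codon\n";
print "revcom:      $revcom\n";
print "protein:     $protein\n";
printf "%d feature(s), %d location part(s)\n", scalar @features, scalar @parts;
```

Keeping the location as a separate object is what lets the
same feature class describe both a simple range and a
multi-part one.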
	A sequence is usually read from a file or a
database, although it is possible to simply create a
sequence from a given string. Reading files is done
through Bio::SeqIO. This object also acts as a stream, as
it has a sequence iterator method. Using this class, it
is possible to convert between several different formats
including Ace database, BSML, Chaos XML, EMBL, FASTA,
GenBank, GCG, PIR, PLN, NCBI and SwissProt. Since a SeqIO
object can be tied to a filehandle, converting between
formats is as simple as [1]:

	use strict;
	use warnings;
	use Bio::SeqIO;

	# Read FASTA from a file; with no -file, the output handle writes to STDOUT.
	my $in  = Bio::SeqIO->newFh(-file => "inputfilename", -format => 'Fasta');
	my $out = Bio::SeqIO->newFh(-format => 'EMBL');

	# World's shortest Fasta<->EMBL format converter:
	print $out $_ while <$in>;
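	The iterator method mentioned above can be used
directly instead of the tied filehandle. The sketch below
is my own illustration rather than code from the Bioperl
documentation: it reads two made-up FASTA records from an
in-memory string (so that it needs no input file) and
writes them as EMBL into another string, assuming that
Bio::SeqIO accepts ordinary Perl filehandles through its
-fh argument:

```perl
use strict;
use warnings;
use Bio::SeqIO;

# Two hypothetical FASTA records held in memory instead of a file.
my $fasta = ">seq1 first demo record\nATGGCC\n>seq2 second demo record\nTTGACA\n";
my $embl_text = '';

open my $in_fh,  '<', \$fasta     or die "cannot read from string: $!";
open my $out_fh, '>', \$embl_text or die "cannot write to string: $!";

my $in  = Bio::SeqIO->new(-fh => $in_fh,  -format => 'fasta');
my $out = Bio::SeqIO->new(-fh => $out_fh, -format => 'embl');

# next_seq returns one sequence object per call, and undef at the end.
my @ids;
while (my $seq = $in->next_seq) {
    push @ids, $seq->display_id;
    $out->write_seq($seq);
}
$out->close;   # flush and close the output stream

print "converted: @ids\n";
print $embl_text;
```

The explicit loop is longer than the one-liner but leaves
room to inspect or filter each sequence before writing it.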

	Reading sequences from databases can be
approached in two ways. The first way is to use the
appropriate Bio::DB::* module for the known database type,
i.e. indexed flat-file, local relational or remote
relational. Currently supported remote databases include
GenBank, GenPept, SwissProt, BioFetch and EMBL. For now,
sequences are mainly retrieved by id or accession number.
An example of retrieving a sequence is [2]:

      use Bio::DB::GenBank;

      # Queries GenBank over the network.
      my $gb = Bio::DB::GenBank->new();
      # this returns a Seq object :
      my $seq1 = $gb->get_Seq_by_id('MUSIGHBA1');

Bio::Index::* modules are used to read and write local
flat files. Since the sequences are local and indexed, this
is a fast way of retrieving sequences by unique keys. The
second way of accessing data is via the OBDA (Open
Bioinformatics Data Access) Registry system. This system,
used by Bio{Perl,Java,Ruby,Python} programs, makes it easy
to change a program's database source by changing
parameters in a configuration file.
	Displaying sequences is done through the
Bio::Graphics modules, which are built on the GD module.
These modules can draw 'any type of map in which a set of
discrete ranges need to be laid out on the number line.'
[3] Usually the ranges describe a feature's location on a
sequence and map onto a 'track' spanning the width of the
display. It is possible for more than one feature to
occupy the same track.
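	A minimal sketch of the panel and track idea
follows. The feature names and coordinates are invented,
and the calls (Bio::Graphics::Panel->new, add_track,
add_feature and png) are as I understand them from the
Bio::Graphics documentation, so treat this as an
illustration that assumes Bio::Graphics and GD are
installed:

```perl
use strict;
use warnings;
use Bio::Graphics;
use Bio::SeqFeature::Generic;

# A panel covering a hypothetical 1000-base sequence, drawn 600 pixels wide.
my $panel = Bio::Graphics::Panel->new(-length => 1000, -width => 600);

# One track; both features below are placed on it.
my $track = $panel->add_track(-glyph => 'generic', -label => 1);

for my $range ([100, 300, 'feature_a'], [550, 900, 'feature_b']) {
    my ($start, $end, $name) = @$range;
    $track->add_feature(
        Bio::SeqFeature::Generic->new(
            -display_name => $name,
            -start        => $start,
            -end          => $end,
        )
    );
}

# Render the panel; the result is PNG image data held in a string.
my $png = $panel->png;
printf "rendered %d bytes of PNG data\n", length $png;
```

Writing $png to a file opened in binary mode gives an
image that any viewer can open.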
	A basic sequence alignment is represented by a
Bio::SimpleAlign object. Its methods include adding and
deleting sequences, fetching sequences and manipulating
characters of all sequences. The actual alignment is done
through interfaces to external programs under
Bio::Tools::Run::Alignment::*. There are interfaces to
ClustalW, BLAST's bl2seq, Lagan, TCoffee, StandAloneFasta
and pSW. Like sequences, alignments can be read from files
or other sources via Bio::AlignIO.
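	To make the alignment object concrete, here is a
small sketch that builds a Bio::SimpleAlign by hand from
three already-gapped rows, removes one of them, and writes
the rest in ClustalW format to a string. The rows are
invented and no external aligner is run; the method names
follow my reading of the Bio::SimpleAlign,
Bio::LocatableSeq and Bio::AlignIO documentation:

```perl
use strict;
use warnings;
use Bio::LocatableSeq;
use Bio::SimpleAlign;
use Bio::AlignIO;

# Three hypothetical rows of an existing alignment ('-' marks a gap).
my @rows = (
    ['seq1', 'ATGGCC-AAG'],
    ['seq2', 'ATGGCCTAAG'],
    ['seq3', 'ATG-CCTAAG'],
);

my $aln = Bio::SimpleAlign->new;
for my $row (@rows) {
    my ($id, $gapped) = @$row;
    (my $ungapped = $gapped) =~ tr/-//d;     # residues without gaps
    $aln->add_seq(
        Bio::LocatableSeq->new(
            -id       => $id,
            -seq      => $gapped,
            -alphabet => 'dna',
            -start    => 1,
            -end      => length $ungapped,   # last residue position
        )
    );
}

my $columns = $aln->length;                  # number of alignment columns
my @all     = $aln->each_seq;                # fetch every row

# Delete one sequence, then list what remains.
$aln->remove_seq($all[2]);
my @remaining = map { $_->display_id } $aln->each_seq;

# Write the remaining rows as ClustalW text into a string.
my $clustal = '';
open my $fh, '>', \$clustal or die "cannot write to string: $!";
my $out = Bio::AlignIO->new(-fh => $fh, -format => 'clustalw');
$out->write_aln($aln);
$out->close;

print "columns: $columns, remaining: @remaining\n";
print $clustal;
```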
	Bioperl has an optional bundle of modules on CPAN
called bioperl-run, under Bio::Tools::Run. The modules in
this bundle share a common theme: each provides a Perl
interface to an external program. A majority of the bundle
interfaces to two main applications, EMBOSS and Pise,
which are themselves program bundlers. Pise
(http://www.pasteur.fr/recherche/unites/sis/Pise/) is a
web-interface generator that wraps command-line programs.
By providing a more user-friendly and uniform interface to
a variety of commands, it aims to overcome the difficulty
of learning many different command-line programs. Since it
was developed with bioinformatics commands in mind,
commands that take longer than a minute are assigned a job
id, and the user is emailed when the job is done. It
currently interfaces to about 150 command-line programs
but can easily be extended to others. For a list of
currently interfaced programs, check
http://search.cpan.org/~birney/bioperl-run-1.4/ and look
at everything under the Bio::Tools::Run::PiseApplication
namespace. EMBOSS is a package of about 100 commands aimed
at the molecular biology community. The package covers
areas such as alignment, nucleic structure, enzyme
kinetics, feature tables, phylogeny and protein structure
and composition.
	Having covered the basics of the Bioperl bundle,
let us examine some common Perl bioinformatics
applications. Ensembl, http://www.ensembl.org, is an
open-source project that aims to organize data around the
sequences of large genomes, with an emphasis on human and
mammalian genomes. In other words, it comprehensively
annotates a genome and tries to link as many similar
functional elements across genomes as possible. This is
done thoroughly, as the Ensembl team coordinates
annotations with other groups that specialize in parts of
a genome. When no annotations are provided, the team
generates them. When a feature is difficult to predict,
such as a gene structure, a best guess is calculated and
the evidence leading to this guess is linked for users to
explore. The implementation details of Ensembl are beyond
the scope of this paper, but the writers make a few
interesting points. Perl was chosen for a few reasons, the
main ones being its quick implementation time and
Ensembl's heavy early dependence on the Bioperl toolkit.
Ensembl borrowed heavily from Bioperl's object model and
used its parsing of several sequence formats to its
advantage. Four years after its inception, Ensembl now
barely relies on Bioperl (a few Seq and SeqIO objects when
I grepped inside its main library directory). According to
Stabenau et al. [4], disadvantages of using Perl have been
its absence of compile-time checking of function
prototypes and its reference-count-based garbage
collector. The former has led to many runtime errors.
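	The kind of runtime error meant here can be shown
with plain Perl. The sketch below is my own illustration
and is not taken from Ensembl; the DemoSeq class and the
misspelled method name are invented. The script passes a
syntax check, and the mistake surfaces only when the
faulty line is actually executed:

```perl
use strict;
use warnings;

# A tiny hypothetical class with a single accessor-style method.
package DemoSeq;

sub new {
    my ($class, %args) = @_;
    return bless { seq => $args{seq} }, $class;
}

sub seq_length {
    my ($self) = @_;
    return length $self->{seq};
}

package main;

my $seq = DemoSeq->new(seq => 'ATGGCC');
my $len = $seq->seq_length;

# The method name below is misspelled on purpose. Perl resolves method
# calls when they run, so nothing complains until this line executes.
my $ok    = eval { $seq->seq_lenght; 1 };
my $error = $ok ? '' : $@;

print "length: $len\n";
print "caught at run time: $error";
```

In a large code base such a call may sit on a rarely used
path, so the error appears long after the code was
written.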
	Another Perl annotation program is GBrowse,
http://www.gmod.org/ggb/ . This program is part of a
larger project called GMOD, whose goal is 'to develop
reusable software components for model organism system
databases' [5], or MODs. MODs collect data from research
and experiments in efforts 'to connect genomic features to
the classical biology of the organism' [6]. GBrowse works
towards this goal by giving the biologist the ability to
view public annotations, search the full text of features,
supplement annotations with private ones and publish the
modified annotations. Its code is based mainly on Bioperl:
the rendering of images is handled by Bio::Graphics
modules and the communication with databases by Bio::DB
modules. Perl was chosen because the authors believe
GBrowse's users are more likely to know Perl, and so to be
able to extend GBrowse, than a language like C. Another
reason was Bioperl's richness in the functionality GBrowse
needed: graphics and a variety of database back ends.
	The final program this paper summarizes is MuGeN,
http://www-mig.jouy.inra.fr/bdsi/MuGeN/. MuGeN can
display multiple annotated genome portions from both local
and remote sources. These maps can be combined with
analysis results loaded from XML files. Some of the
functionality overlaps with the previously mentioned
programs as well as Entrez's Map Viewer and UCSC's Human
Genome Browser. Unlike most of these programs, MuGeN also
offers a batch mode from the command line for producing a
series of annotated images. Perl was chosen for this
program because Bioperl offered the parsers for sequence
files and Gtk-Perl offered a decent GUI toolkit.
	From this article, we can see that Perl's strength
in the bioinformatics world is largely due to Bioperl. Of
course it's also due to CPAN, which offers the variety of
modules that made GUI, web and XML programming easy for
the reviewed programs. As for Perl's weakness mentioned by
the Ensembl team, I agree that weak prototyping can cause
significant headaches in a large project. But if you test
thoroughly as you write code, most of the headaches can be
avoided. I must note that not all of Bioperl's
functionality was covered, most notably the representation
of non-sequence data.


Footnotes
	1. From Perl documentation on Bio::SeqIO, http://search.cpan.org/perldoc?Bio::SeqIO
	2. Reference 1
	3. From Perl documentation on Bio::Graphics::Panel,
		http://search.cpan.org/perldoc?Bio::Graphics::Panel
	4. Reference 8
	5. Reference 10, p. 1
	6. Reference 10, p. 1
References
	1. Birney, E. BioPerlTutorial. http://search.cpan.org/~birney/bioperl-1.4/bptutorial.pl
	2. Birney, E. et al. An Overview of Ensembl.
		Genome Research 2004; 14: 925-928.
	3. Hoebeke, M. et al. MuGeN: simultaneous exploration of multiple genomes
		and computer analysis results. Bioinformatics 2003; 19: 859-864.
	4. Letondal, C. Bioperl course. http://www.pasteur.fr/recherche/unites/sis/formation/bioperl/
	5. Letondal, C. A Web interface generator for molecular biology programs in Unix.
		Bioinformatics 2001; 17: 73-82.
	6. Osborne, B. http://bioperl.org/HOWTOs/Feature-Annotation/Feature-Annotation.txt
	7. Rice, P. et al. EMBOSS: The European Molecular Biology Open Software Suite.
		Trends in Genetics 2000; 16(6): 276-277.
	8. Stabenau, A. et al. The Ensembl Core Software Libraries.
		Genome Research 2004; 14: 929-933.
	9. Stajich, J. The Bioperl Toolkit: Perl Modules for the Life Sciences.
	10. Stein, L. et al. The Generic Genome Browser: A Building Block for a Model
		Organism System Database.

-- 
my looovely website -- http://www.chwhat.com
BTW, IF chwhat.com goes down email me at gabriel.horner at cern.ch