[Bioperl-l] discussion/advice on non-bioperl bioinformatics modules

Sean Quinlan seanq@darwin.bu.edu
Wed, 22 Aug 2001 11:54:31 -0400 (EDT)


Good day all!
	I'm posting this message to the list to point out a module announcement I posted to comp.lang.perl.modules and comp.lang.perl.misc. As you will note if you read the announcement or the docs, the modules are for use in bioinformatics. I'll discuss the rationale a bit here as well. First, as a quick introduction, my name is Sean Quinlan. I currently work for both Modular Genetics Inc. (www.modulargenetics.com), a small biotech startup, and for the molecular biology dept. of Mass. General Hospital, primarily involved in the NHLBI PGA work there (pga.mgh.harvard.edu). Most of the work on these modules, however, occurred at Temple F. Smith's BioMolecular Engineering Research Center (bmerc-www.bu.edu) over the last few years.
	A quick description of the modules and their history, and then I'll discuss why I think there is a place for something separate (but not necessarily unlinked) from the bioperl suite. So far I have requested only the namespaces CompBio and CompBio::Simple, but they have not yet been listed, and I am open to suggestions.
	The initial module, CompBio, grew out of a code base that already partially existed when I started at the BMERC nearly three years ago. One of the first tasks I set myself (partially as a learning experience) was to start finding out what functions we used most often, both internally and for the website we were developing, and start moving that code into a module. That module (BMERC::bio) was fairly simple, but made a lot of assumptions about being on our local cluster. Although I have tried to maintain and continue to improve the code and documentation, that module depended on being installed on our local network. By last spring I was starting to need the code elsewhere, and started playing around with a rewrite. After attending this year's YAPC, I returned home inspired, not just by the conference in general, but also by Schwern's and Sean Dauge's (sp?) talks on module development.
	OK, if you're not bored yet, and haven't stormed off to flame me already, below is a quick description of what's in CompBio, pasted from my initial posting (well, I did remove the double spacing):
=from posting
Current functions in CompBio.pm:
# note - table format refers to tab delimited, such as id\tsequence[\n|\tother fields\n]

new - create new CompBio object
check_type - try to determine what format sequence data is in
tbl_to_fa - convert sequence data in table format to fasta
tbl_to_ig - convert sequence data in table format to IntelliGenetics format
fa_to_tbl - convert sequence data in fasta to table format
ig_to_tbl - convert sequence data in IntelliGenetics format to table
dna_to_protein - convert dna sequence to protein sequence
complement - convert dna sequence to its complement
six_frame - translate dna sequence to protein across all six frames
aa_hash - hash lookup of aa using codons as keys - includes ambiguous codes
_stop - internal method used by six_frame
wu_blast - interface to WUBlast; old, ugly and not portable - next project after catching up Simple.pm
_error - internal method for varying error handling behavior without extra typing every time
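
For readers unfamiliar with the table format noted above, here is a minimal sketch of the table/fasta round trip. It is written in Python for illustration and is not taken from CompBio.pm: the function names simply mirror the list, and the `width` parameter, the handling of extra fields as a header description, and the error behavior are all assumptions.

```python
def tbl_to_fa(table_text, width=60):
    """Convert 'id<TAB>sequence[<TAB>other fields]' lines to FASTA text.

    Assumption: any extra fields become the FASTA description, space-joined.
    """
    records = []
    for line in table_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) < 2:
            raise ValueError("expected at least id and sequence: %r" % line)
        seq_id, sequence = fields[0], fields[1]
        description = " ".join(fields[2:])
        header = ">" + seq_id + (" " + description if description else "")
        # Wrap the sequence at a fixed width, as FASTA files usually are.
        wrapped = [sequence[i:i + width] for i in range(0, len(sequence), width)]
        records.append("\n".join([header] + wrapped))
    return "\n".join(records) + "\n" if records else ""


def fa_to_tbl(fasta_text):
    """Convert FASTA text back to 'id<TAB>sequence[<TAB>description]' lines."""
    rows = []  # each row is [id, description, [sequence chunks]]
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            seq_id, _, description = line[1:].partition(" ")
            rows.append([seq_id, description, []])
        elif rows:
            rows[-1][2].append(line)
        else:
            raise ValueError("sequence data before the first FASTA header")
    out = []
    for seq_id, description, chunks in rows:
        fields = [seq_id, "".join(chunks)]
        if description:
            fields.append(description)
        out.append("\t".join(fields))
    return "\n".join(out) + "\n" if out else ""
```

Because the extra fields are space-joined in the header, a round trip preserves the id and sequence exactly but returns all extra fields as a single description field.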

Planned (in most cases some code already exists in BMERC::bio or elsewhere):

ncbi_blast - interface to NCBI's version of the blast tools
parse_blast - simple blast output parser - may need separate versions for WU and NCBI blasts. Return tab delimited data in consistent format, such as score, %identity, start/stop positions of match, etc.
calculate_scores - calculates %equivalent identities and #effective identities from blast output
dnastar_to_tbl - convert sequence data in dna* format to table
tbl_to_dnastar - and back
gcg_to_tbl - convert sequence data in gcg format to table
tbl_to_gcg - and back
ncbi_to_tbl - convert sequence data in ncbi 'format' (as cut and pasted from .gbk reports or ncbi's website) to table
tbl_to_ncbi - and back

As you can see, nothing earth shattering. And for the most part, not very complex (the blast stuff probably being the biggest exception).
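
To give a sense of scale, six-frame translation as described in the list fits in a few lines. The sketch below is Python for illustration only and is not the CompBio code: it assumes the standard genetic code, assumes the three reverse frames are read from the reverse complement, and maps any codon it does not recognize (including ambiguous ones, which the real aa_hash is said to handle) to "X". The frame labels "+1".."-3" are also an assumption.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Complement each base, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; trailing partial codons are dropped."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")  # "X" = unknown codon
                   for i in range(0, len(dna) - 2, 3))


def six_frame(dna):
    """Return a dict of the six translations, keyed '+1'..'+3' and '-1'..'-3'."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames["%s%d" % (strand, offset + 1)] = translate(seq[offset:])
    return frames
```

Stop codons appear as "*" in the output, so `six_frame("ATGGCCTAA")["+1"]` is `"MA*"`.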

CompBio.pm has a very restricted interface, generally only accepting one type of input and returning one type of output, doing very little error checking. The other module, CompBio/Simple.pm, is intended to handle most of the error and sanity checking and more complicated IO.
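
The split between a strict core and a forgiving wrapper can be sketched as follows. This is Python for illustration only, not the CompBio API: `core_complement` and `simple_complement` are hypothetical names, and the accepted input shapes and validation rules are assumptions.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def core_complement(sequence):
    """Strict core (hypothetical): one DNA string in, one string out, no checks."""
    return sequence.translate(_COMPLEMENT)


def simple_complement(data):
    """Forgiving wrapper (hypothetical): accepts a string or an iterable of
    strings, validates the input, then delegates to the strict core."""
    single = isinstance(data, str)
    sequences = [data] if single else list(data)
    for seq in sequences:
        if not isinstance(seq, str) or not seq:
            raise ValueError("each sequence must be a non-empty string")
        unexpected = set(seq.upper()) - set("ACGT")
        if unexpected:
            raise ValueError("unexpected characters: %s" % sorted(unexpected))
    results = [core_complement(seq) for seq in sequences]
    # Return the same shape the caller passed in.
    return results[0] if single else results
```

Keeping the checks in the wrapper means code that already trusts its input can call the core directly and skip the validation cost.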

As I've stated (or at least tried to) in the docs and my postings, I have no intention or desire to replace or compete with bioperl. Nor is it my intention to suggest that my code is better than what's in bioperl. My only point is that bioperl is, well, huge, and I'm lazy (even if perhaps sometimes in the wrong ways). I do believe there are times and situations (and people) where installing and learning something the size and complexity of bioperl is unnecessary. I also feel that because of bioperl's size, its (at least seemingly) complex interdependencies, and its heavy OO style, some people may not use it because they feel it is beyond their ability to understand. I certainly felt that way when I first looked into using bioperl, although that was nearly two years ago, and the project (and docs) have come a long way since.

OK, now I'll try to get to the real reason I am making this post. I'd like to volunteer. Regardless of whether or not they get listed, I would like to offer any code in the modules, or any of the utilities attached, to the bioperl project. Advice on what you all might find useful, and how to go about making a version that fits into the project, should anything there be useful, would be appreciated. I am also very interested in looking at your code for handling blast and perhaps making an adaptation of it for the CompBio modules.

If the CompBio stuff does get listed (and perhaps even if not), I would also like to hear suggestions on how to make sure these modules do not pose problems for 'upgrading' to a full bioperl installation or conflict with bioperl if used alongside it.

Anywho, there's my 2 cents. TIA for any input,
Sean Quinlan
seanq@darwin.bu.edu