[Bioperl-l] GenBankParser comparison to bioperl parser

John Kloss jkloss@sapiens.wustl.edu
Wed, 11 Sep 2002 14:58:53 -0500 (CDT)


I was asked to provided some coarses tests demonstrating the features and speed of the GenBankParser in comparsion to the parser provided in bioperl. I just
learned how to use the bioperl interface, so please let me know if I did 
anything stupid.

In all tests I used the same flat file

	bct1.seq	size gzipped 73679187 bytes.

All tests run on a realtively loaded Sun Enterprise 4500 8 x 400Mhz sparc with
10Gbs ram.

1st test.  Convert GenBank flat file to fasta.

########################################
Bioperl parser code:

#!/usr/local/bin/perl -w

use Bio::SeqIO;

$in = Bio::SeqIO->newFh(
        '-file'         => "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |",
        '-format'       => 'Genbank' );

$out = Bio::SeqIO->newFh(
        '-file'         => "> output1.txt",
        '-format'       => 'fasta' );

print $out $_ while <$in>;

########################################
Time:

real    17m4.008s
user    15m19.500s
sys     0m20.850s


########################################
GenBankParser code:

#!/usr/local/bin/perl -w

use GenBankParser qw( DEFINITION LOCUS ORIGIN VERSION );

open( FH, "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |" );

(new GenBankParser)->parse_file( \*FH, sub {
    my $parser = shift;

    print ">gi|", $parser->VERSION->gi,
	  "|gb|", $parser->VERSION->accession,
	  "."   , $parser->VERSION->version,
	  "|"   , $parser->LOCUS->locus,
	  " "   , $parser->DEFINITION, "\n";

    my   $seq   = $parser->ORIGIN;
    pos( $seq ) = 0;

    print $1, "\n" while $seq =~ m/\G (.{1,60}) /gx;
} );


########################################
Time:

real    1m35.797s
user    1m13.600s
sys     0m6.350s


Next program, dump out the locus identifer, accession number, molecular type,
gi, and definition line.

########################################
Bioperl code:

#!/usr/local/bin/perl -w

use Bio::SeqIO;

$in = Bio::SeqIO->new(
        '-file'         => "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |",
        '-format'       => 'Genbank' );


while (my $seq = $in->next_seq( )) {

    print $seq->display_id( ),          "\n";
    print $seq->accession_number( ),    "\n";
    print $seq->alphabet( ),            "\n";
    print $seq->primary_id( ),          "\n";
    print $seq->desc( ),                "\n";
}

########################################
Time:

real    16m29.771s
user    14m43.560s
sys     0m18.280s


########################################
GenBankParser code:

#!/usr/local/bin/perl -w

use GenBankParser qw( DEFINITION LOCUS VERSION );

open( FH, "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |" );

(new GenBankParser)->parse_file( \*FH, sub {
    my $parser = shift;

    print $parser->LOCUS->locus,        "\n";
    print $parser->VERSION->accession,  "\n";
    print $parser->LOCUS->type,         "\n";
    print $parser->VERSION->gi,         "\n";
    print $parser->DEFINITION,          "\n";
} );


########################################
Time:

real    0m51.068s
user    0m40.300s
sys     0m3.580s


Last program, print out gt2fasta format.  That is protien in fasta format.
First is GenBankParser code:

#!/usr/local/bin/perl -w

use strict;
use GenBankParser qw( DEFINITION FEATURES SOURCE LOCUS );
use GenBankParser::FEATURES qw( CDS );

open( FH, "gzip -cd /am/db/gb-ff/gbbct.seq.gz |");

(new GenBankParser)->parse_file( \*FH, sub {
	my $Parser = shift;

	my $cdscnt = 0;
	for my $cds (@{ $Parser->FEATURES->CDS }) {

		$cdscnt++;

		#
		# Skip pseudo genes
		#
		next if $cds->pseudo;

		my $definition = 
			  $cds->product ||
			  $cds->gene    ||
			(($cds->note) ? join "; ",
				$Parser->DEFINITION, $cds->note :
				$Parser->DEFINITION );
	
		my $organism = $Parser->SOURCE->organism;

		print	">gi|", $cds->gi,
			"|gp|",	$cds->accession,
			"."   , $cds->version,
			"|"   , $Parser->LOCUS->locus,
			"_"   , $cdscnt,
			" "   , $definition,
			" ["  , $organism,
			"]\n";

		my   $seq   = $cds->translation;
		pos( $seq ) = 0;

		print $1, "\n" while $seq =~ m/\G (.{1,58}) /gx;
	}
});


########################################
Time:

real    3m58.278s
user    2m54.370s
sys     0m5.720s

And BioPerl code.  I just learned the bioperl api so I may not be doing things
the most efficient way.  If so please tell me an I'll rewrite and rerun my
tests.

#!/usr/local/bin/perl -w

use Bio::SeqIO;
use Bio::SeqFeature::Generic;

$in = Bio::SeqIO->new(
	'-file' 	=> "gzip -cd /am/db/gb-ff/gbbct1.seq.gz |",
	'-format'	=> 'Genbank' );

while (my $seq = $in->next_seq( )) {

    my $cdscnt = 0;

    foreach my $feature ( $seq->top_SeqFeatures ) {

	if ( $feature->primary_tag( ) eq 'CDS' ) {
	    
	    $cdscnt++;

	    next if $feature->has_tag( 'pseudo' );

	    my $definition = undef;

	    if    ( $feature->has_tag( 'product' ) ) {
		$definition = ($feature->each_tag_value( 'product' ))[0];
	    }
	    elsif ( $feature->has_tag( 'gene' ) ) {
		$definition = ($feature->each_tag_value( 'gene'    ))[0];
	    }
	    elsif ( $feature->has_tag( 'note' ) ) {
		$definition = join "; ",
			       $seq->desc,
			      ($feature->each_tag_value( 'note'    ))[0];
	    }
	    else {
		$definition = $seq->desc;
	    }

	    my $organism = $seq->species->species;

	    my $accession = undef;
	    my $version   = undef;
	    my $gi        = undef;

	    if ( $feature->has_tag( 'db_xref' ) ) {
		foreach my $tag ( $feature->each_tag_value( 'db_xref' ) ) {
		    $gi = $1, last if $tag =~ /^GI:(\d+)$/;
		}
	    }

	    if ( $feature->has_tag( 'protein_id' ) ) {
		($feature->each_tag_value( 'protein_id' ))[0] =~
		    /^([^.]+)\.(\d+)$/;

		$accession = $1;
		$version   = $2;
	    }

	    print ">gi|", $gi,
		  "|gp|", $accession,
		  "."   , $version,
		  "|"   , $seq->display_id,
		  "_"   , $cdscnt,
		  " "   , $definition,
		  " ["  , $organism,
		  "]\n";

	    my    $seq   = ($feature->each_tag_value( 'translation' ))[0];
	    pos ( $seq ) = 0;

	    print $1, "\n" while $seq =~ m/\G (.{1,58}) /gx;
	}
    }
}

########################################
Time:

real    17m17.897s
user    15m23.400s
sys     0m19.490s


So in all cases the bioperl genbank parsers was 5 to 17 times as slow as the
GenBankParser.  I think that's pretty dramatic.

I also feel that the GenBankParser interface is a little more intuitive-- 
mirroring almost exactly the format of the actual flat file.  Of course, I'm
biased in this regard, being the author and all :).

	John Kloss.