[Bioperl-l] strange error parsing a specific NCBI gff file

William Hsiao william.hsiao at gmail.com
Tue Jun 27 19:52:03 UTC 2006


Hi all,
   I've encountered a strange problem while parsing a GFF file from
NCBI using Perl.  I'm hoping that someone on the list may have a
solution even though this is not a BioPerl issue.  Maybe someone
familiar with GFF3 parsing can help :)  Essentially, I'm parsing a GFF
file into a nested hash structure using the following functions:

use strict;
use warnings;
use Data::Dumper;

sub parse_gff {
    my $file = shift;
    my %hash_gff;
    open(my $in, '<', $file) or die "Cannot open file $file: $!\n";
    while (<$in>) {
        next if /^#/;    # skip comment and directive lines
        chomp;
        # the nine tab-separated GFF columns
        my ($seqid, $source, $type, $start, $end, $score, $strand, $phase,
            $attributes) = split /\t/;
        my $attri_ref = process_attributes($attributes);
        my %record = ('seqid'     => $seqid,
                      'source'    => $source,
                      'type'      => $type,
                      'start'     => $start,
                      'end'       => $end,
                      'score'     => $score,
                      'strand'    => $strand,
                      'phase'     => $phase,
                      'attribute' => $attri_ref);
        # group the records by feature type
        push @{ $hash_gff{$type} }, \%record;
    }
    close $in;
    print Dumper(\%hash_gff);
    return \%hash_gff;
}

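sub process_attributes {
    my $attr_string = shift;
    # column 9 is a list of key=value pairs separated by ';'
    my @attributes = split (/;/, $attr_string);
    my %attr;
    foreach (@attributes){
        my ($key, $value) = split /=/;
        if ($value =~ /:/){
            # values such as "GI:50083879" become a nested subkey => subvalue
            my ($subkey, $subvalue) = split (/:/, $value);
            $attr{$key}{$subkey} = $subvalue;
        }
        else{
            # plain values are stored as strings
            $attr{$key} = $value;
        }
    }
    return \%attr;
}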
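
To make the nested structure concrete, here is a small sketch that runs the same attribute logic on a short attribute string assembled from values in the NCBI file (the string itself is shortened for illustration and is not a real line from the file). Plain values end up as strings, while a value containing a colon becomes a nested hash:

```perl
use strict;
use warnings;
use Data::Dumper;

# Same logic as process_attributes in this post.
sub process_attributes {
    my $attr_string = shift;
    my %attr;
    foreach (split /;/, $attr_string) {
        my ($key, $value) = split /=/;
        if ($value =~ /:/) {
            my ($subkey, $subvalue) = split /:/, $value;
            $attr{$key}{$subkey} = $subvalue;
        }
        else {
            $attr{$key} = $value;
        }
    }
    return \%attr;
}

# Shortened, illustrative attribute column (not a complete GFF line).
my $attr = process_attributes(
    'locus_tag=ACIAD0647;transl_table=11;db_xref=GI:50083879');

$Data::Dumper::Sortkeys = 1;    # stable key order in the dump
print Dumper($attr);
```

Here locus_tag and transl_table map to plain strings, and db_xref maps to a hash with the single key GI.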

   It works for all the GFF files we downloaded from NCBI's microbial
genomes RefSeq FTP repository.  However, three lines from one particular
file, NC_005966.gff (Acinetobacter_sp_ADP1), cannot be parsed
properly.  These lines are:

NC_005966.1	RefSeq	CDS	635836	636489	.	-	0	locus_tag=ACIAD0647;function=adaptation%20to%20stress;function=protection%20%28MultiFun:5.5%29;note=Multifun:5.6%0AEvidence%203%20:%20Function%20proposed%20based%20on%20presence%20of%20conserved%20amino%20acid%20motif%2C%20structural%20feature%20or%20limited%20homolgy;inference=non-experimental%20evidence%2C%20no%20additional%20details%20recorded;transl_table=11;product=putative%20antioxidant%20protein;protein_id=YP_045389.1;db_xref=GI:50083879;db_xref=GeneID:2878732;exon_number=1

NC_005966.1	RefSeq	start_codon	636487	636489	.	-	0	locus_tag=ACIAD0647;function=adaptation%20to%20stress;function=protection%20%28MultiFun:5.5%29;note=Multifun:5.6%0AEvidence%203%20:%20Function%20proposed%20based%20on%20presence%20of%20conserved%20amino%20acid%20motif%2C%20structural%20feature%20or%20limited%20homolgy;inference=non-experimental%20evidence%2C%20no%20additional%20details%20recorded;transl_table=11;product=putative%20antioxidant%20protein;protein_id=YP_045389.1;db_xref=GI:50083879;db_xref=GeneID:2878732;exon_number=1

NC_005966.1	RefSeq	stop_codon	635833	635835	.	-	0	locus_tag=ACIAD0647;function=adaptation%20to%20stress;function=protection%20%28MultiFun:5.5%29;note=Multifun:5.6%0AEvidence%203%20:%20Function%20proposed%20based%20on%20presence%20of%20conserved%20amino%20acid%20motif%2C%20structural%20feature%20or%20limited%20homolgy;inference=non-experimental%20evidence%2C%20no%20additional%20details%20recorded;transl_table=11;product=putative%20antioxidant%20protein;protein_id=YP_045389.1;db_xref=GI:50083879;db_xref=GeneID:2878732;exon_number=1

   They generate an error: Can't use string
("adaptation%20to%20stress") as a HASH ref while "strict refs" in use.
The strange part is that all I have to do is replace the word
"function" in front of "=adaptation%20to%20stress;" with another word,
or simply change it to functions or functio or Function, etc., and the
line parses properly.  Retyping the word "function" does not
solve the problem.  For some strange reason, when the word "function"
is there, Perl tries to use "adaptation%20to%20stress" as a hash
reference and fails.  The word "function" is used in other lines as
well, so I don't think the problem is caused by the word alone.
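
One thing that stands out in these three lines is that the key "function" appears twice: first with a plain value, then with a value that contains a colon ("MultiFun:5.5"). The sketch below assumes that this combination is the trigger. It feeds just those two attributes to the same logic and catches the resulting error. The second sub, process_attributes_multi, is a hypothetical variant (not part of the original script) that keeps every value of a repeated key in a list, so a key never has to be both a string and a hash:

```perl
use strict;
use warnings;

# Same logic as process_attributes in this post.
sub process_attributes {
    my $attr_string = shift;
    my %attr;
    foreach (split /;/, $attr_string) {
        my ($key, $value) = split /=/;
        if ($value =~ /:/) {
            my ($subkey, $subvalue) = split /:/, $value;
            $attr{$key}{$subkey} = $subvalue;
        }
        else {
            $attr{$key} = $value;
        }
    }
    return \%attr;
}

# Hypothetical variant: collect all values of a key into an array.
sub process_attributes_multi {
    my $attr_string = shift;
    my %attr;
    foreach my $pair (split /;/, $attr_string) {
        my ($key, $value) = split /=/, $pair, 2;
        push @{ $attr{$key} }, $value;
    }
    return \%attr;
}

# Only the two "function" attributes from the failing lines.
my $attrs = 'function=adaptation%20to%20stress;'
          . 'function=protection%20%28MultiFun:5.5%29';

# The first value is stored as a string; the second one then tries to
# treat that string as a hash reference, which strict refs forbids.
my $ok    = eval { process_attributes($attrs); 1 };
my $error = $ok ? '' : $@;
print "Original parser: ", ($ok ? "ok\n" : "died: $error");

my $multi = process_attributes_multi($attrs);
print "Variant keeps ", scalar @{ $multi->{function} },
      " values for 'function'\n";
```

If this is indeed the cause, it would also be consistent with renaming the first "function" making the line parse, since the two keys then no longer collide.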
    Any suggestions on what might be happening would be greatly
appreciated.  Thank you.

Cheers,

Will

-- 
William Hsiao
PhD Student, Brinkman Laboratory
Department of Molecular Biology and Biochemistry
Simon Fraser University, 8888 University Dr. Burnaby, BC, Canada V5A 1S6
Phone: 604-291-4206 Fax: 604-291-5583


