[Bioperl-l] Homologene again...

Thu, 14 Feb 2002 21:54:56 -0500 (EST)

No, there aren't objects or parsers for this data in bioperl because
homologene is just a cluster of LocusLink IDs and accessions.  I
tried to write a basic parser for my own needs last month and sort of gave
up on the data; I've been happier with the InParanoid orthologs for what I
needed in the end.

Happy to give you what I started writing (note: I was interested in
Drosophila orthologs to human, so this is specific to that).

Hope this helps.  I realize it is not at all sophisticated, and
there are a couple of cases that it fails to parse because, for some reason,
the file doesn't follow the format all the way through.  Go figure...

-jason

#!/usr/bin/perl -w
use strict;

# This is from the Homologene readme
#  The field delimiter is  "|".

#  -The first two fields indicate the organisms from which the sequences
#  originate.
#  -The third field indicates the type of similarity.
#  -The fourth (LocusLink ID), fifth (UniGene ID), and sixth (Accession
#  number) fields correspond to the first organism.  One or both of UG ID
#  and LL ID may be present.  Locus Link and UniGene are in one-to-one
#  correspondence in the latter case, so no ambiguity arises through the
#  choice of set identifier.
#  -The seventh(LL), eighth(UG), and ninth(Accession) fields correspond
#  to the second organism.
#  -The tenth field is the percent identity of the alignment, or a URL to
#  the source of a curated ortholog.

#  A similarity between organisms may be a best match of several
#  different types, with the type of match indicated by the sixth
#  character of the record.

#  t indicates best match from the second field to the first.  (when
#  using the second sequence as query, the first sequence is the best
#  match, with percent identity of alignments over 100 nt the score)

#  f indicates best match to the second field from the first.
#  (when using the first sequence as query, the second sequence is the
#  best match)

#  b indicates reciprocal best match (cluster pairs identified by f and t
#  coincide).

#  B indicates reinforced reciprocal best match (reciprocal best matches
#  between at least three organisms agree).

#  c indicates a curated homology (i.e., one that
#  comes from outside NCBI or from a syntenic association,
#  rather than one that is produced by an automatic process run at NCBI).

#  Nota bene: many curated homologies are between genes rather than
#  between accession numbers; consequently, we've chosen not to display
#  accessions for all curated homologies, since the gene identifier-
#  accession mapping is not always accurately resolvable.

# Include $! so a failed open reports the underlying system error.
open(HGENE, "hmlg.trip.ftp") or die("cannot open hmlg.trip.ftp: $!");

# Read one cluster at a time: each record starts on a line beginning with ">".
$/ ="\n>";
while(my $l = <HGENE>) {
    my @data = split(/\n/,$l);
    my ($title,$gene);
    foreach my $line ( @data ) {
	last if( $gene && $title);
	next if( $line =~ /^>/ );
	if( $line =~ /^TITLE/ && $line =~ /Hs\./ ) {
	    (undef,$title) = split(/\s+/,$line);
	} else {
	    next unless ( $line =~ /Dm/ );
	    my ($speciesa,$speciesb, $matchtype,
		$lla,$uga,$acc_a,undef,
		$llb,$ugb,$acc_b, $pid) = split(/\|/,$line);
	    if( lc($speciesa) eq 'dm' ) {
		$lla =~ s/^\s+(\S+)/$1/;
		$lla =~ s/(\S+)\s+$/$1/;
		$gene = $lla;
	    } elsif( lc($speciesb) eq 'dm' ) {
		$llb =~ s/^\s+(\S+)/$1/;
		$llb =~ s/(\S+)\s+$/$1/;
		$gene = $llb;
	    }
	}
    }
    if( $title && $gene ) {
	print "Title: $title Gene:$gene\n";
    }
}
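
For anyone wanting something less tied to the fly/human case, here is a
minimal sketch that splits one pair line into named fields, following the
README's ten-field description literally.  The field names, the
parse_homologene_line helper, and the example line are all made up for
illustration; they are not part of bioperl or of the Homologene file.  Note
that the script above skips one extra field after the first accession, so
the column positions here may need adjusting against the real file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical field names, in the order the README lists the ten fields.
my @FIELDS = qw(species_a species_b match_type
                ll_a ug_a acc_a
                ll_b ug_b acc_b
                score);

# Match-type codes as described in the README.
my %MATCH_TYPE = (
    t => 'best match from second organism to first',
    f => 'best match from first organism to second',
    b => 'reciprocal best match',
    B => 'reinforced reciprocal best match',
    c => 'curated homology',
);

# Split one "|"-delimited line into a hash ref, trimming whitespace
# around each field.  Returns undef if fewer than ten fields are present.
sub parse_homologene_line {
    my ($line) = @_;
    chomp $line;
    # A limit of -1 keeps trailing empty fields (e.g. an empty score).
    my @values = split /\|/, $line, -1;
    return unless @values >= 10;
    my %rec;
    for my $i (0 .. $#FIELDS) {
        my $v = $values[$i];
        $v =~ s/^\s+//;
        $v =~ s/\s+$//;
        $rec{ $FIELDS[$i] } = $v;
    }
    # Unrecognized or empty codes are reported as 'unknown'.
    $rec{match_desc} = $MATCH_TYPE{ $rec{match_type} } || 'unknown';
    return \%rec;
}

# Made-up example line, NOT real Homologene data.
my $rec = parse_homologene_line(
    'Hs|Dm|b| 1234 |Hs.5678|NM_000001| 4321 ||AE000001|62.5');
printf "%s -> %s (%s)\n", $rec->{ll_a}, $rec->{ll_b}, $rec->{match_desc}
    if $rec;
```

Returning a hash ref keyed by name keeps the caller from having to count
columns, and lines that don't follow the format simply come back as undef
so they can be skipped or logged.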

On Fri, 15 Feb 2002, Andrew Macgregor wrote:

> Hello,
>
> I haven't had any feedback on whether bioperl can parse homologene
> files so I'm guessing maybe it can't. Is this the type of thing that
> you want bioperl to do or is it out of scope?
>
> Can anybody point me to perl scripts that do this? If not, I'll be
> writing something to do the job. Is this something that could/should
> get put in bioperl somewhere, or in scripts central or is there just
> not too much interest in doing this?
>
> Cheers, Andrew.
>

-- 
Jason Stajich
Duke University
jason@cgt.mc.duke.edu