[BioRuby] SPTR problem
Naohisa GOTO
ngoto at gen-info.osaka-u.ac.jp
Fri Jan 15 17:19:12 UTC 2010
Hi,
On Tue, 12 Jan 2010 22:52:42 +1000
Ben Woodcroft <donttrustben at gmail.com> wrote:
> Hi,
>
> While parsing all the yeast UniProt txt files I came across a problem with
> the gn parser - it was returning an array when I expected a hash. Looking at
> the code the problem seems to be this when statement:
>
> when /Name=/,/ORFNames=/
> @data['GN'] = gn_uniprot_parser
> else
> @data['GN'] = gn_old_parser
> end
>
> http://www.uniprot.org/uniprot/A2P2R3.txt has the problem on the 5th line:
>
> GN OrderedLocusNames=YMR084W;
>
> So GN line had OrderedLocusNames= but not Name= or ORFNames=, so it didn't
> use the new parser, like the other entries I came across. Should all 4
> possibilities be tested for in the when statement: (Synonyms= being the
> 4th)?
It seems to be a bug. Perhaps there were no (or very few) entries
that had only OrderedLocusNames= when the code was first written
in 2005. The git commit ID was b5c3342437ed698f215a87ea72c6cabf0575709d.
The GN format was changed in UniProtKB release 2.0 of 05-Jul-2004.
The document http://www.uniprot.org/docs/sp_news.htm says:
| The new format of the GN line is:
|
| GN Name=<name>; Synonyms=<name1>[, <name2>...]; OrderedLocusNames=<name1>[, <name2>...];
| GN ORFNames=<name1>[, <name2>...];
|
| None of the above four tokens are mandatory. But a "Synonyms" token can only be present if there is a "Name" token.
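Given these four tokens, the format check could be sketched as below.
This is only an illustration, not the actual BioRuby code: gn_format
and GN_TOKENS are hypothetical names, and the old-style sample line is
made up. The symbols stand in for calling gn_uniprot_parser and
gn_old_parser.

```ruby
# Hypothetical helper, not part of BioRuby: decides which GN parser
# should handle the text of a GN line.
GN_TOKENS = /\b(?:Name|Synonyms|OrderedLocusNames|ORFNames)=/

def gn_format(gn_text)
  case gn_text
  when GN_TOKENS
    :uniprot # new format; would be handled by gn_uniprot_parser
  else
    :old     # pre-2004 format; would be handled by gn_old_parser
  end
end

puts gn_format("OrderedLocusNames=YMR084W;") # prints "uniprot"
puts gn_format("ABC1 OR ABC2.")              # prints "old" (made-up old-style line)
```

The \b anchor keeps "Name=" from being looked for inside a longer
word; the other three tokens end in "Names=", so they each need their
own alternative.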
You are right: all 4 possibilities should be considered.
Testing for "Synonyms" could be omitted, because it can only appear
together with "Name", but including it is the safe choice.
> Also, while I'm here:
> * why does the returned hash have different keys than are in the file? e.g.
> ORFNames becomes :orfs?
I don't know. I now think that using the same names as in the
original entries would be preferable, too.
> * I also found the parsing process for whole genomes quite slow (multiple
> hours for well annotated ones).
Please use a profiler to find the bottlenecks.
  % ruby -rprofile xxx.rb
> * is there any standard way to handle concatenated UniProt files? I wrote my
> own as it was simple.
What type of "concatenated" do you mean?
For simple concatenation, such as the original file distributed
on the UniProt FTP site, Bio::FlatFile can be used.
ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz
(Please gunzip it before reading!)
  require 'bio'

  # Iterate over every entry in the concatenated flat file
  ff = Bio::FlatFile.open("uniprot_sprot.dat")
  ff.each do |e|
    puts e.entry_id
  end
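For reference, a hand-written splitter can be quite small, because
each entry in the UniProt flat file ends with a line containing only
"//". The following is a rough sketch under that assumption, not what
Bio::FlatFile does internally; each_uniprot_entry is a hypothetical
name and the two sample entries are made up.

```ruby
require 'stringio'

# Hypothetical helper, not part of BioRuby: yields the raw text of each
# entry, assuming every entry is terminated by a "//" line.
def each_uniprot_entry(io)
  return enum_for(:each_uniprot_entry, io) unless block_given?
  io.each_line("\n//\n") do |entry|
    yield entry unless entry.strip.empty? # skip trailing blank chunks
  end
end

# Two made-up, minimal entries standing in for a concatenated file.
sample = StringIO.new(
  "ID   A_TEST\nGN   Name=a;\n//\n" \
  "ID   B_TEST\nGN   ORFNames=b;\n//\n"
)

each_uniprot_entry(sample) do |entry|
  puts entry[/^ID\s+(\S+)/, 1] # prints "A_TEST", then "B_TEST"
end
```

Each yielded string could then be passed to the entry parser one at a
time, so the whole file never has to be held in memory.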
>
> Thanks,
> ben
Thank you.
--
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ng at bioruby.org