[BioSQL-l] Loading long strings to Oracle

Richard HOLLAND hollandr at gis.a-star.edu.sg
Tue Jun 28 11:33:27 EDT 2005


We had similar trouble in BioJava when Oracle 9 and 10 suddenly stopped supporting the use of setString and getString on CLOB columns. Special code was required to force BioJava to detect the database and use the special Oracle CLOB-specific accession methods, just like your 'quick fix' of setting ora_type does below.

Hilmar is the best guy to talk to here as he uses BioPerl with BioSQL and Oracle in his production db at work.

Your annotation fails because you are using UTF16 as the character set in the database. This means that each character is stored as 16 bits or 2 bytes. As the limit in Oracle is 4000 bytes (note bytes not characters) this means that you can only store strings up to 2000 chars long with this encoding.

cheers,
Richard


-----Original Message-----
From:	biosql-l-bounces at portal.open-bio.org on behalf of Teemu Kivioja
Sent:	Tue 6/28/2005 11:19 PM
To:	biosql-l at open-bio.org
Cc:	
Subject:	[BioSQL-l] Loading long strings to Oracle 

Hi,

I have couple of possibly related problems with loading to the Oracle database.

1.) When trying to load yeast proteins from SGD, I get:

perl load_seqdatabase.pl --host sboracle1.ad.vtt.fi  --driver Oracle 
--testonly --dbname BfxDB --format swiss --printerror test.swiss
Loading test.swiss ...
DBD::Oracle::st execute failed: ORA-01461: can bind a LONG value only for 
insert into a LONG column (DBD ERROR: OCIStmtExecute) [for statement 
``UPDATE biosequence SET version = NVL(?,version), length = NVL(?,length), 
alphabet = NVL(?,alphabet), crc = NVL(?,crc), seq = NVL(?,seq), ent_oid = 
NVL(?,ent_oid) WHERE ent_oid = ?'' with params: 
:p5='MAKQRQTTKSSKRYRYSSFKARIDDLKIEPARNLEKRVHDYVESSHFLASFDQWKEINLSAKFTEFAAEIEHDVQTLPQILYHDKKIFNSLVSFINFHDEFSLQPLLDLLAQFCHDLGPDFLKFYEEAIKTLINLLDAAIEFESSNVFEWGFNCLAYIFKYLSKFLVKKLVLTCDLLIPLLSHSKEYLSRFSAEALSFLVRKCPVSNLREFVRSVFEKLEGDDEQTNLYEGLLILFTESMTSTQETLHSKAKAIMSVLLHEALTKSSPERSVSLLSDIWMNISKYASIESLLPVYEVMYQDFNDSLDATNIDRILKVLTTIVFSESGRKIPDWNKITILIERIMSQSENCASLSQDKVAFLFALFIRNSDVKTLTLFHQKLFNYALTNISDCFLE...', 
:p3='protein', :p6='14404', :p1=undef, :p7='14404', :p4='F6ED4E3E9AE0F468', 
:p2=2493]) at /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BaseDriver.pm 
line 1115, <GEN0> line 51.

The file test.swiss only includes the record:
ID   YBL004W        STANDARD;      PRT;   2494 AA.

It seems that I can get rid of this error message by explicitly telling 
that the type of sequence field is CLOB by adding the code
	    if ($slots[$i] eq "seq") {
	      $self->bind_param($sth, $j, $slotvals->[$i],
				{ ora_type => ORA_CLOB });
	    } else {
	      $self->bind_param($sth, $j, $slotvals->[$i]);
	    }


2.) When trying insert InterPro (interpro 10.0
ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml) I get:

perl load_ontology.pl --format 'interpro' --host sboracle1.ad.vtt.fi 
--namespace interpro  --driver Oracle --dbname BfxDB --testonly --fmtargs 
"ontology_engine,simple" interpro.xml
...
11900
Loading ontology InterPro:
         ... terms

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values were 
("IPR000911","Ribosomal protein L11","Ribosomes are ...

ORA-01461: can bind a LONG value only for insert into a LONG column (DBD 
ERROR: error possibly near <*> indicator at char 14 in 'INSERT INTO te<*>rm 
(identifier, name, definition, is_obsolete, ont_oid) VALUES (:p1, :p2, :p3, 
:p4, :p5)')
---------------------------------------------------
Could not store term IPR000911, name 'Ribosomal protein L11':

------------- EXCEPTION  -------------
MSG: create: object (Bio::Ontology::InterProTerm) failed to insert or to be 
found by unique key

Again, the annotation is >2000 characters long but well under the 4000 
character limit.

3.) As others have already reported, the memory usage can be high,  the 
above load_ontology process takes about 2.5GB of memory.

I guess the fact that the problems 1 and 2 already arise with strings that 
are <4000 chars long might be related to the local character coding. The code:

my $hash_ref = $dbh->ora_nls_parameters();
my $database_charset = $hash_ref->{NLS_CHARACTERSET};	
my $national_charset = $hash_ref->{NLS_NCHAR_CHARACTERSET};
print "database charset: $database_charset\n";
print "national charset: $national_charset\n";

gives

database charset: WE8ISO8859P1
national charset: AL16UTF16

and

$  locale LC_CTYPE | head
upper;lower;alpha;digit;xdigit;space;print;graph;blank;cntrl;punct;alnum;combining;combining_level3
toupper;tolower;totitle
16
6
UTF-8
70
84
1
0
1

Some details of the system:
Enterprise Linux, 2.4.21-32.0.1.ELsmp (64-bit)
Oracle 10g, version 10.1.0.3.0 - 64bit
Perl, v5.8.0 built for x86_64-linux-thread-multi
Bioperl 1.4
bioperl-db 0.1
DBD::Oracle 1.16
Biosql-schema downloaded on May 10

What would be the best way to solve these problems?

Best regards,
Teemu Kivioja




------------------------------------------------------------------
Teemu Kivioja, Research Scientist
VTT Biotechnology
P.O. Box 1500, FIN-02044 VTT, Finland
(Street address: Tietotie 2, Espoo, Otaniemi)
Email: Teemu.Kivioja at vtt.fi
Phone: +358 20 722 7111
Fax: +358 20 722 7071

_______________________________________________
BioSQL-l mailing list
BioSQL-l at open-bio.org
http://open-bio.org/mailman/listinfo/biosql-l






More information about the BioSQL-l mailing list