[BioSQL-l] Loading long strings to Oracle

Tue Jun 28 11:42:22 EDT 2005

You're probably not using bioperl-db 0.1, judging from the generated  
query. Make sure you use a recent download from CVS.

There is a test in bioperl-db for inserting and retrieving long  
sequences. Have you run the tests and seen a problem?

This may indeed be due to some problem with the character encoding. The  
Oracle-specific layer of the adaptors deal a bit differently with  
sequences longer than 4000 chars. However, if in your case they are  
encoded in Unicode, then maybe the threshold would be half that size?  
Can you check what happens when you truncate the sequence and the other  
troubling string to less than 2000 chars?

Also, to nail down the problem, you could also try to have database and  
OS run under the same locale/encoding.

	-hilmar

On Jun 28, 2005, at 11:19 AM, Teemu Kivioja wrote:

> Hi,
>
> I have couple of possibly related problems with loading to the Oracle  
> database.
>
> 1.) When trying to load yeast proteins from SGD, I get:
>
> perl load_seqdatabase.pl --host sboracle1.ad.vtt.fi  --driver Oracle  
> --testonly --dbname BfxDB --format swiss --printerror test.swiss
> Loading test.swiss ...
> DBD::Oracle::st execute failed: ORA-01461: can bind a LONG value only  
> for insert into a LONG column (DBD ERROR: OCIStmtExecute) [for  
> statement ``UPDATE biosequence SET version = NVL(?,version), length =  
> NVL(?,length), alphabet = NVL(?,alphabet), crc = NVL(?,crc), seq =  
> NVL(?,seq), ent_oid = NVL(?,ent_oid) WHERE ent_oid = ?'' with params:  
> : 
> p5='MAKQRQTTKSSKRYRYSSFKARIDDLKIEPARNLEKRVHDYVESSHFLASFDQWKEINLSAKFTEFA 
> AEIEHDVQTLPQILYHDKKIFNSLVSFINFHDEFSLQPLLDLLAQFCHDLGPDFLKFYEEAIKTLINLLDA 
> AIEFESSNVFEWGFNCLAYIFKYLSKFLVKKLVLTCDLLIPLLSHSKEYLSRFSAEALSFLVRKCPVSNLR 
> EFVRSVFEKLEGDDEQTNLYEGLLILFTESMTSTQETLHSKAKAIMSVLLHEALTKSSPERSVSLLSDIWM 
> NISKYASIESLLPVYEVMYQDFNDSLDATNIDRILKVLTTIVFSESGRKIPDWNKITILIERIMSQSENCA 
> SLSQDKVAFLFALFIRNSDVKTLTLFHQKLFNYALTNISDCFLE...', :p3='protein',  
> :p6='14404', :p1=undef, :p7='14404', :p4='F6ED4E3E9AE0F468',  
> :p2=2493]) at  
> /usr/lib/perl5/site_perl/5.8.0/Bio/DB/BioSQL/BaseDriver.pm line 1115,  
> <GEN0> line 51.
>
> The file test.swiss only includes the record:
> ID   YBL004W        STANDARD;      PRT;   2494 AA.
>
> It seems that I can get rid of this error message by explicitly  
> telling that the type of sequence field is CLOB by adding the code
> 	    if ($slots[$i] eq "seq") {
> 	      $self->bind_param($sth, $j, $slotvals->[$i],
> 				{ ora_type => ORA_CLOB });
> 	    } else {
> 	      $self->bind_param($sth, $j, $slotvals->[$i]);
> 	    }
>
>
> 2.) When trying insert InterPro (interpro 10.0
> ftp://ftp.ebi.ac.uk/pub/databases/interpro/interpro.xml) I get:
>
> perl load_ontology.pl --format 'interpro' --host sboracle1.ad.vtt.fi  
> --namespace interpro  --driver Oracle --dbname BfxDB --testonly  
> --fmtargs "ontology_engine,simple" interpro.xml
> ...
> 11900
> Loading ontology InterPro:
>         ... terms
>
> -------------------- WARNING ---------------------
> MSG: insert in Bio::DB::BioSQL::TermAdaptor (driver) failed, values  
> were ("IPR000911","Ribosomal protein L11","Ribosomes are ...
>
> ORA-01461: can bind a LONG value only for insert into a LONG column  
> (DBD ERROR: error possibly near <*> indicator at char 14 in 'INSERT  
> INTO te<*>rm (identifier, name, definition, is_obsolete, ont_oid)  
> VALUES (:p1, :p2, :p3, :p4, :p5)')
> ---------------------------------------------------
> Could not store term IPR000911, name 'Ribosomal protein L11':
>
> ------------- EXCEPTION  -------------
> MSG: create: object (Bio::Ontology::InterProTerm) failed to insert or  
> to be found by unique key
>
> Again, the annotation is >2000 characters long but well under the 4000  
> character limit.
>
> 3.) As others have already reported, the memory usage can be high,   
> the above load_ontology process takes about 2.5GB of memory.
>
> I guess the fact that the problems 1 and 2 already arise with strings  
> that are <4000 chars long might be related to the local character  
> coding. The code:
>
> my $hash_ref = $dbh->ora_nls_parameters();
> my $database_charset = $hash_ref->{NLS_CHARACTERSET};	
> my $national_charset = $hash_ref->{NLS_NCHAR_CHARACTERSET};
> print "database charset: $database_charset\n";
> print "national charset: $national_charset\n";
>
> gives
>
> database charset: WE8ISO8859P1
> national charset: AL16UTF16
>
> and
>
> $  locale LC_CTYPE | head
> upper;lower;alpha;digit;xdigit;space;print;graph;blank;cntrl;punct; 
> alnum;combining;combining_level3
> toupper;tolower;totitle
> 16
> 6
> UTF-8
> 70
> 84
> 1
> 0
> 1
>
> Some details of the system:
> Enterprise Linux, 2.4.21-32.0.1.ELsmp (64-bit)
> Oracle 10g, version 10.1.0.3.0 - 64bit
> Perl, v5.8.0 built for x86_64-linux-thread-multi
> Bioperl 1.4
> bioperl-db 0.1
> DBD::Oracle 1.16
> Biosql-schema downloaded on May 10
>
> What would be the best way to solve these problems?
>
> Best regards,
> Teemu Kivioja
>
>
>
>
> ------------------------------------------------------------------
> Teemu Kivioja, Research Scientist
> VTT Biotechnology
> P.O. Box 1500, FIN-02044 VTT, Finland
> (Street address: Tietotie 2, Espoo, Otaniemi)
> Email: Teemu.Kivioja at vtt.fi
> Phone: +358 20 722 7111
> Fax: +358 20 722 7071
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at open-bio.org
> http://open-bio.org/mailman/listinfo/biosql-l
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------