[Bioperl-l] Loading SwissProt Data into Oracle

Hilmar Lapp hlapp at gnf.org
Sun Apr 20 10:31:57 EDT 2003


On Saturday, April 19, 2003, at 11:07  AM, Ewan Birney wrote:

>
>
> On Fri, 18 Apr 2003, Jason Stajich wrote:
>
>> In the biosql CVS repository you want the script
>> bioperl-db/scripts/biosql/load_seqdatabase.pl
>>
>> see http://cvs.open-bio.org/ for more info on our repository.
>>
>> In the nearish future I expect Hilmar/Aaron/Chris will make a
>> bioperl-db release using the current biosql schema.
>
> Yup - in biosql-schema (check out the cvs link above) there are a
> series of docs in the "docs" subdirectory which step through the load
> system in somewhat mind-numbing detail.
>
>
> As you may notice, the details of the Oracle load are "left as an
> exercise to the reader". If you could update the document on how to
> load into Oracle in that real "cookbook" sort of way people like, then
> that would be great.
>

Note that there is an Oracle version of biosql included in
sql/biosql-ora, with a rough walk-through of the instantiation in
sql/biosql-ora/INSTALL. It's certainly not intended for people who are
unfamiliar with Oracle or even SQL, though.

The Oracle version is currently lagging behind the MySQL and PostgreSQL
versions, which means the Oracle schema is pre-Singapore. I am in the
process of bringing it up to date with the Singapore changes, and the
schema itself is more or less done. I've also written a migration script
(I need it myself), which is in the final stages of testing. I expect
to finish the update within the next 2 weeks; what's primarily missing
is updating all the views, triggers, and PL/SQL API packages.

> [...]
> so, if you have someone looking to get every last iota of detail out of
> swissprot you either:
>
>   (a) have to write the schema yourself and your own loader (not
> recommended)
>
> OR
>
>   (b) help finish off these details in (i) the bioperl object model
> (the details would get added to Bio::Seq::RichSeq), (ii) the parser in
> Bio::SeqIO::swiss and (iii) the BioSQL schema data model (it may be that
> many of the details can be stored inside the ontology tables, but ... I
> wonder) and bioperl-db bindings
>

There are a few things that the bioperl-db bindings won't handle yet, 
in the sense that they'll silently ignore those pieces:

	- PubMed ID for references in addition to MEDLINE ID (biosql can't
	  store both either)
	- optional_id() for db_xrefs (biosql is fine)
	- fuzzy locations (will use the start/end as determined by the active
	  CoordinatePolicy) (biosql is fine)
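
To make the last point concrete, here is a minimal sketch of what
"collapsing" a fuzzy location to a single start/end under a coordinate
policy means. It is written in Python purely for illustration and is not
the bioperl or bioperl-db code. The class FuzzyLocation and the functions
widest(), narrowest() and collapse() are hypothetical names; the four
field names loosely follow the min/max start/end accessors of BioPerl
locations.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class FuzzyLocation:
    """Hypothetical stand-in for a location whose ends are ranges.

    A value of None means that boundary is unknown.
    """
    min_start: Optional[int]
    max_start: Optional[int]
    min_end: Optional[int]
    max_end: Optional[int]


Coords = Tuple[Optional[int], Optional[int]]


def _first_known(*values: Optional[int]) -> Optional[int]:
    """Return the first value that is not None, or None if all are."""
    for value in values:
        if value is not None:
            return value
    return None


def widest(loc: FuzzyLocation) -> Coords:
    """Pick the outermost known coordinates."""
    return (_first_known(loc.min_start, loc.max_start),
            _first_known(loc.max_end, loc.min_end))


def narrowest(loc: FuzzyLocation) -> Coords:
    """Pick the innermost known coordinates."""
    return (_first_known(loc.max_start, loc.min_start),
            _first_known(loc.min_end, loc.max_end))


def collapse(loc: FuzzyLocation,
             policy: Callable[[FuzzyLocation], Coords] = widest) -> Coords:
    """Reduce a fuzzy location to one (start, end) pair.

    Only this pair would be stored; the fuzziness itself is lost.
    """
    return policy(loc)
```

For a location spanning (5.10)..(20.30), collapse() with the widest
policy yields (5, 30) and with the narrowest policy (10, 20). Either way
only one start and one end survive, which is why the choice of the active
policy matters before loading.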

-hilmar

>
>
> However, this is just a heads-up on the challenges involved in parsing
> the whole of swissprot. By and large Bioperl does a pretty good job.
>
>
>
>>
>> -jason
>>
>> On Fri, 18 Apr 2003, Neil Evans wrote:
>>
>>> Hello,
>>>
>>> I'm interested in loading the raw swissprot data into an Oracle 
>>> database.  This, of course, involves:
>>> 1. The parsing of the data to produce some DB-friendly format 
>>> (possibly SQL)
>>> 2. The design/loading of DB schema to contain the data
>>> 3. The actual loading of the data
>>>
>>> I've done some reading on BioSQL and I think that may be the way to 
>>> go.  I figure there must be some PERL script out there which can 
>>> parse swissprot, and possibly even perform the SQL generation.
>>>
>>> Any pointers?
>>>
>>> thanks!
>>> -Neil.
>>>
>>> --
>>> ===============================================
>>> Neil.Evans at oracle.com
>>> Senior Software Developer
>>> Oracle Web Services UDDI Registry
>>> Oracle Corporation
>>>
>>>
>>> _______________________________________________
>>> Bioperl-l mailing list
>>> Bioperl-l at bioperl.org
>>> http://bioperl.org/mailman/listinfo/bioperl-l
>>>
>>
>> --
>> Jason Stajich
>> Duke University
>> jason at cgt.mc.duke.edu
>>
>
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------


