[BioPython] Uniprot Parser

Jonathan Boulais biosql at hotmail.com
Mon Feb 25 17:48:11 UTC 2008


I don't think the parser is the problem, Peter; it is more likely the continuous import requests sent to the database.
I've already written a parsing script that parses the entire .dat files (TrEMBL and Swiss-Prot) into text files. Once the parsing is done, it imports those files into a database schema that I built myself, not the BioSQL one. So instead of importing the data after each iteration, it imports everything in one shot once the entire .dat file has been parsed. I've compared the execution times and it's much faster this way (about 1 hour instead of 3 days to parse and import the TrEMBL .dat file).
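
A minimal sketch of the two loading strategies compared above, not the actual script. It assumes sqlite3 as a stand-in database, and the table, column, and function names are hypothetical. The first loader commits after every record; the second collects the parsed rows and imports them in a single transaction.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a hypothetical two-column table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (accession TEXT PRIMARY KEY, description TEXT)"
    )
    return conn


def load_per_record(conn, records):
    """Insert and commit one record at a time (one transaction per record)."""
    count = 0
    for record in records:
        conn.execute("INSERT INTO entry VALUES (?, ?)", record)
        conn.commit()
        count += 1
    return count


def load_in_one_shot(conn, records):
    """Collect all parsed rows first, then import them in one transaction."""
    rows = list(records)
    with conn:  # commits once on success, rolls back on error
        conn.executemany("INSERT INTO entry VALUES (?, ?)", rows)
    return len(rows)


def count_rows(conn):
    """Return the number of rows currently stored in the table."""
    return conn.execute("SELECT COUNT(*) FROM entry").fetchone()[0]
```

Both loaders leave the same rows in the table; only the number of transactions differs. How much time that saves depends on the database engine and its configuration.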

Again Peter, I'm definitely not as good as you guys at scripting, so I used the script lines that you proposed for bug 2390 for the comparison.
But 3 days of parsing and importing is... a little bit too long for me :)

Anyway, I hope this helps,

Jonathan


> Date: Mon, 25 Feb 2008 16:52:31 +0000
> From: biopython at maubp.freeserve.co.uk
> To: biosql at hotmail.com
> Subject: Re: [BioPython] Uniprot Parser
> CC: biopython at lists.open-bio.org
> 
> On Mon, Feb 25, 2008 at 4:32 PM, Jonathan Boulais <biosql at hotmail.com> wrote:
> >
> >  Hi everyone,
> >
> >  I'm a little bit concerned about the speed of the parsing/loading of the Uniprot .dat files
> >  into the Biosql database. It takes a hell of a time...
> 
> What version of Biopython are you using?
> 
> One thing you could try is timing a simple script that only reads in
> the SwissProt file but doesn't do anything with the BioSQL database -
> to try and get a feel for which bit is slow.
> 
> If it's the parsing that is slow, you could try commenting out the bit
> which deals with the EBI ** lines (see bug 2353 for details), namely
> line 359 in CVS, self._skip_starstar(uhandle), and see if that makes a
> big difference.
> 
> Peter
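
The parse-only timing suggested in the quoted message could be sketched roughly as follows. This is an assumption-laden illustration: `iter_flat_records` is a hypothetical stand-in that only splits the flat file on the `//` record terminator, so it measures raw reading speed. To time the Biopython parser itself, pass its record iterator to `time_parse_only` instead.

```python
import io
import time


def iter_flat_records(handle):
    """Yield each record as a list of lines; a '//' line ends a record."""
    record = []
    for line in handle:
        if line.rstrip() == "//":
            yield record
            record = []
        else:
            record.append(line)


def time_parse_only(records):
    """Consume an iterator of records without touching any database.

    Returns (number of records, elapsed seconds).
    """
    start = time.perf_counter()
    count = sum(1 for _ in records)
    return count, time.perf_counter() - start


# Tiny made-up sample in place of a real .dat file.
sample = "ID   TEST1\nAC   P00001;\n//\nID   TEST2\nAC   P00002;\n//\n"
count, elapsed = time_parse_only(iter_flat_records(io.StringIO(sample)))
print("%d records read in %.6f s" % (count, elapsed))
```

Comparing this figure with the time for the full parse-and-load run gives a feel for which part is slow.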
