[Open-bio-l] OBDA redux?
Peter Cock
p.j.a.cock at googlemail.com
Fri Nov 18 06:21:04 EST 2011
On Fri, Nov 18, 2011 at 10:55 AM, Raoul Bonnal <bonnal at ingm.org> wrote:
> On 18/11/11 11.20, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:
>> On Fri, Nov 18, 2011 at 9:35 AM, Raoul Bonnal wrote:
>>> Dear all,
>>> Would it be possible to have a test dataset and clear requirements
>>> and functionality? Not a huge doc, just a few points for benchmarking.
>>
>> I was thinking of using the UniProt SProt and TrEMBL datasets
>> as test cases (FASTA, plain text "swiss", and UniProt-XML format).
>> These have 532,792 and 17,651,715 records respectively (in the version
>> I have on disk - they've just released an update), which is a good
>> size, but not on the scale where we might start to worry about
>> SQLite scaling.
>> ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/
>>
>> So, we'd also want something else, like some big FASTQ files with
>> 100M to 500M records (or more). Perhaps we'll have to combine a
>> couple of SRA data files together for that, which is fine.
>>
>> Also a full GenBank download would be good, e.g. the EST dataset
>> files gbest1.seq.gz to gbest209.seq.gz would make a good test of
>> indexing multiple files together as a single database:
>> ftp://ftp.ncbi.nih.gov/genbank/
>>
> It's a starting point.
>
> And what information do you want to extract once you
> have your index?
>
Biopython and BioPerl have their SeqIO parsers hooked up
to indexing code. This means you can access a record via its
ID, and it is parsed for you on demand - just as if you had
iterated over the file in order, parsing the records one by one.
Biopython (not sure about BioPerl) can also just fetch the raw
text of that record.
I presume BioRuby has something similar using the OBDA
flatfile / BDB indexes?
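
To make the idea concrete, here is a minimal sketch of offset-based
indexing using only the Python standard library. It is not the Biopython
or BioPerl code, and it does not follow the OBDA index layout: the
function names (build_index, get_raw, parse_fasta, demo), the SQLite
table layout and the tiny FASTA records are all assumptions made up for
illustration. It assumes well-formed FASTA headers with unique IDs. It
records where each record starts in one or more files, then fetches the
raw text by ID and parses it only when asked.

```python
import os
import sqlite3
import tempfile


def build_index(db_path, fasta_paths):
    """Store (file, byte offset, length) for every FASTA record in one SQLite table."""
    con = sqlite3.connect(db_path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS offsets "
        "(id TEXT PRIMARY KEY, path TEXT, start INTEGER, length INTEGER)"
    )
    for path in fasta_paths:
        with open(path, "rb") as handle:
            rec_id, start = None, 0
            while True:
                pos = handle.tell()
                line = handle.readline()
                if not line or line.startswith(b">"):
                    # A new header or end of file closes the previous record.
                    if rec_id is not None:
                        # The PRIMARY KEY makes a duplicate ID raise an error.
                        con.execute(
                            "INSERT INTO offsets VALUES (?, ?, ?, ?)",
                            (rec_id, path, start, pos - start),
                        )
                    if not line:
                        break
                    # Assumed convention: the ID is the first word after ">".
                    rec_id = line[1:].split()[0].decode()
                    start = pos
    con.commit()
    return con


def get_raw(con, rec_id):
    """Return the raw bytes of one record without reading the rest of the file."""
    row = con.execute(
        "SELECT path, start, length FROM offsets WHERE id = ?", (rec_id,)
    ).fetchone()
    if row is None:
        raise KeyError(rec_id)
    path, start, length = row
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(length)


def parse_fasta(raw):
    """Parse one raw record on demand into (description, sequence)."""
    header, *seq_lines = raw.decode().splitlines()
    return header[1:], "".join(seq_lines)


def demo():
    """Index two small made-up FASTA files together and fetch records by ID."""
    files = [
        ("a.fasta", ">P1 first\nMKV\nLLA\n>P2 second\nGGH\n"),
        ("b.fasta", ">P3 third\nTTW\n"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        paths = []
        for name, text in files:
            path = os.path.join(tmp, name)
            with open(path, "w", newline="\n") as handle:
                handle.write(text)
            paths.append(path)
        con = build_index(":memory:", paths)
        try:
            raw = get_raw(con, "P2")
            return raw, parse_fasta(raw), parse_fasta(get_raw(con, "P3"))
        finally:
            con.close()
```

Calling demo() indexes both files into one in-memory database and pulls
out records P2 and P3 by ID. Storing byte offsets rather than parsed
records keeps the index small, which is why record count (rather than
file size) is the thing to watch for SQLite scaling.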
Peter