[BioRuby] BioRuby's Bio::FlatFileIndex compatibility with BioPerl's Bio::DB::Flat
Naohisa GOTO
ngoto at gen-info.osaka-u.ac.jp
Sun Jul 22 10:25:00 UTC 2007
Hello,
I'm a maintainer of Bio::FlatFileIndex in bioruby.
On Fri, 20 Jul 2007 14:54:43 -0400
"Aidan Findlater" <aidanfindlater at gmail.com> wrote:
> *Summary:* Attached is a diff that allows Bio::FlatFileIndex to access BDB
> flatfile databases created by BioPerl. I have not changed the way BioRuby
> creates its databases, so this likely breaks access to BioRuby-created
> flatfiles.
>
>
> *Description:* I have some flatfile databases that were created with
> BioPerl, but it seems that BioRuby does things a little differently.
> Specifically, BioRuby tries to get config and fileid information from BDB
> databases; BioPerl stores this information in config.dat.
The OBDA flat-file indexing specification (*1) says that
configuration data is stored in the BDB database, not in config.dat.
(excerpted from indexing.txt (*1))
| 2) The subdirectory contains a file named "config.dat" containing tab
| separated key/value pairs. The first line contains the key "index"
| and value "index\tBerkeleyDB/1". This means the first few characters
| of the config.dat file is "index\tBerkeleyDB/1\n".
|
| There is no other data in this file.
|
| 3) Global configuration data is stored in the database named "config".
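
As a rough illustration of rule 2) above, here is a minimal Ruby sketch that checks whether the contents of a config.dat follow the excerpt. The method names are hypothetical and are not part of Bio::FlatFileIndex; the sketch works on a string so it can be tried without any index on disk.

```ruby
# The prefix that the specification requires at the start of config.dat.
EXPECTED_PREFIX = "index\tBerkeleyDB/1\n"

# True if the contents start with the required "index" line.
def bdb_config_dat?(content)
  content.start_with?(EXPECTED_PREFIX)
end

# True only if the file holds nothing but the required line,
# as the excerpt says ("There is no other data in this file").
def strict_bdb_config_dat?(content)
  content == EXPECTED_PREFIX
end

# Lines found after the required first line, if any.
# (Hypothetical helper: useful to see what an indexer stored in config.dat
# instead of in the "config" database.)
def extra_config_lines(content)
  return [] unless bdb_config_dat?(content)
  content[EXPECTED_PREFIX.length..-1].lines.map(&:chomp)
end

# Example with made-up contents (the second line is only an illustration).
sample = "index\tBerkeleyDB/1\nformat\tfasta\n"
puts bdb_config_dat?(sample)         # loose check
puts strict_bdb_config_dat?(sample)  # strict check
p extra_config_lines(sample)
```

For a real index, the string could be read with File.binread on the config.dat inside the database directory.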
The specification text was last modified five years ago,
and it may have been changed somewhere I don't know about.
Does anyone know of changes to the specification,
or where to get a newer version of the text?
*1 http://cvs.open-bio.org/cgi-bin/viewcvs/viewcvs.cgi/obda-specs/flatfile/indexing.txt?rev=1.3&cvsroot=obf-common&content-type=text/vnd.viewcvs-markup
> As well, it returns sequences shifted one character to the right (the '>'
> from my FASTA file was at the end of the returned sequence, and none was at
> the beginning).
I suppose this is an issue in BioPerl's indexer.
I prepared the file /tmp/flat/tmp.fst as shown below.
-----------------------------------------------------------
>TEST00001 EOL
aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
>TEST00002 EOL
ccccccccccccccccccccccccccccccccccccccccccccccccc
>TEST00003 EOL
ggggggggggggggggggggggggggggggggggggggggggggggggg
>TEST00004 EOL
ttttttttttttttttttttttttttttttttttttttttttttttttt
-----------------------------------------------------------
(Each line of the above file is 50 bytes with UNIX line endings.)
% bp_bioflat_index.pl --create --format fasta \
--location /tmp/flat --dbname testbdb --indextype bdb \
/tmp/flat/tmp.fst
Then I checked the contents of the generated BDB index.
% ruby -r bdb -e 'BDB::Btree.open("/tmp/flat/testbdb/key_ACC").to_a.sort.each { |x| puts x.join("\t") }'
TEST00001 0 0 101
TEST00002 0 101 100
TEST00003 0 201 100
TEST00004 0 301 99
(Each column shows ID, FileID, start position, and size.)
The start positions of TEST00002, TEST00003, and TEST00004
are wrong, and the sizes of TEST00001 and TEST00004 are wrong.
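
For comparison, the expected start positions and sizes can be computed directly from the file. The following is only a sketch: it rebuilds the test file in memory, assuming the header lines are space-padded so that every line is 50 bytes including the newline, and the method names are made up for this example.

```ruby
# Every line of the test file, including its "\n", is 50 bytes.
LINE_BYTES = 50

# Rebuild the four-record test file in memory.
# Assumption: headers are padded with spaces between the ID and "EOL".
def build_test_fasta
  %w[a c g t].each_with_index.map do |base, i|
    id = format("TEST%05d", i + 1)
    header = ">#{id}".ljust(LINE_BYTES - 4) + "EOL"  # 46 + 3 = 49 characters
    "#{header}\n#{base * (LINE_BYTES - 1)}\n"
  end.join
end

# Return [id, start, size] for each FASTA entry, where start is the byte
# offset of the ">" and size runs up to the next ">" (or the end of file).
def expected_index(fasta)
  index = []
  offset = 0
  fasta.each_line do |line|
    if line.start_with?(">")
      id = line[1..-1].split.first
      index << [id, offset, 0]
    end
    index.last[2] += line.bytesize unless index.empty?
    offset += line.bytesize
  end
  index
end

# Print in the same column order as the BDB dump: ID, FileID, start, size.
expected_index(build_test_fasta).each do |id, start, size|
  puts [id, 0, start, size].join("\t")
end
```

Under that assumption, every entry should start at a multiple of 100 and have size 100, which is what the start positions and sizes above are being compared against.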
I'm using BioPerl 1.5.2_102.
% perl -MBio::Root::Version -e 'print $Bio::Root::Version::VERSION,"\n"'
1.005002102
In addition, I also tried the flat index type.
% bp_bioflat_index.pl --create --format fasta \
--location /tmp/flat --dbname testflat --indextype flat \
/tmp/flat/tmp.fst
% cat testflat2/key_ACC.key
19TEST00001 0 0 100 TEST00002 0 100 100TEST00003 0 200 100TEST00004 0 300 50
It seems that the index itself is created correctly.
However, according to the specification (*1),
the first 4 bytes of the key_ACC.key file should be "0019",
but they are space-padded (" 19") in the above index created with BioPerl.
(excerpted from indexing.txt (*1))
| Each record of this file is in a fixed width format. There is no
| special termination character. Instead, the first four bytes of the
| file contain the mapping record size, in bytes, represented as text
| string. The string is left padded with zeros to fit in four bytes, so
| the allowed text strings are "0000", "0001", "0002", ..., "9999".
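
The header rule in this excerpt can be sketched in Ruby as follows. This is only an illustration with hypothetical method names, not the code used in Bio::FlatFileIndex; the lenient variant shows one possible way a reader could also accept a space-padded header.

```ruby
# Format a record size as the specification describes:
# four bytes of text, left-padded with zeros ("0000" .. "9999").
def format_record_size(size)
  raise ArgumentError, "size out of range" unless (0..9999).cover?(size)
  format("%04d", size)
end

# Strict parser: accepts only four ASCII digits, as in the specification.
def parse_record_size(header)
  return nil unless header =~ /\A\d{4}\z/
  header.to_i
end

# Lenient parser (an assumption, not in the specification):
# also accepts four bytes where the digits are left-padded with spaces.
def parse_record_size_lenient(header)
  return nil unless header.bytesize == 4 && header =~ /\A *\d+\z/
  header.to_i
end

puts format_record_size(19)
p parse_record_size("0019")
p parse_record_size("  19")
p parse_record_size_lenient("  19")
```

With a real index, the header would be the first four bytes of the key file, for example File.binread(path, 4).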
Regards,
Naohisa Goto
ngoto at gen-info.osaka-u.ac.jp / ngoto at bioruby.org