[Biojava-l] Sequence retrieval
Keith James
kdj at sanger.ac.uk
Fri Jul 25 10:44:10 EDT 2003
>>>>> "Jeffrey" == Jeffrey Rosenfeld <jeffr at amnh.org> writes:
Jeffrey> I am new to this list, so my question might have already
Jeffrey> been discussed, but I cannot find any reference to it in
Jeffrey> the archive, so here goes: I am trying to find a quick
Jeffrey> java-only way to retrieve sequences from a blast
Jeffrey> database. I am writing a program that needs to obtain
Jeffrey> large amounts of sequences from a fairly large database.
Jeffrey> I have tried using fastacmd, but there is a great
Jeffrey> slowdown because of the need to start up an external
Jeffrey> process for each sequence query. (I cannot execute one
Jeffrey> large fastacmd job because of the large amounts of
Jeffrey> sequence that I am querying.) I know that biojava has
Jeffrey> many different formats for storing sequences, but I don't
Jeffrey> want to have to keep two databases of my sequences
Jeffrey> updated. I am already using the blast database for
Jeffrey> blast, so I don't want another database. Is there a
Jeffrey> simple way to implement fastacmd or something similar in
Jeffrey> java? It should not be too hard to do either using JNI
Jeffrey> or reverse engineering the fastacmd code.
Hi Jeffrey,
This is possible, but you would at least need to make a new
(additional) index of the Blast database. Biojava does not have a
reader for blast indices because their format is different between
ncbi/wu flavours and is also apt to change.
Brief background on the available indices - we started with our own
system (see interfaces org.biojava.bio.seq.db.Index,
org.biojava.bio.seq.db.IndexStore and the TabIndexStore implementation
of IndexStore).
Later an indexing system common to all the Bio* projects was proposed
and implemented (i.e. you can index with Bioperl and read in Biopython
etc). See the obf-common cvs package for a full spec and other docs
via webcvs at http://cvs.open-bio.org. This is quite heavily
integrated with a system-wide registry for local and distributed
databases (also described in obf-common docs), which you won't need
to worry about as you just want a simple lookup.
To use this system, there is an end-user indexing program,
org.biojava.app.BioFlatIndex, which creates the index (actually a
directory containing metadata and offsets into sequence files).
Alternatively, you can index programmatically using the
org.biojava.bio.program.indexdb.IndexTools class. See the unit tests
(in cvs, org.biojava.bio.program.indexdb.IndexToolsTest) for examples
such as:

    public void testIndexFastaDNA() throws Exception
    {
        File [] files = getDBFiles(new String [] { "dna1.fasta",
                                                   "dna2.fasta" });

        IndexTools.indexFasta("test", new File(location),
                              files, SeqIOConstants.DNA);

        SequenceDBLite db = new FlatSequenceDB(location, "dna");

        Sequence seq1 = db.getSequence("id1");
        assertEquals("gatatcgatt", seq1.seqString());
        Sequence seq2 = db.getSequence("id2");
        assertEquals("ggcgcgcgcg", seq2.seqString());
        Sequence seq3 = db.getSequence("id3");
        assertEquals("ccccccccta", seq3.seqString());
        Sequence seq4 = db.getSequence("id4");
        assertEquals("tttttcgatt", seq4.seqString());
        Sequence seq5 = db.getSequence("id5");
        assertEquals("ggttcgcgcg", seq5.seqString());
        Sequence seq6 = db.getSequence("id6");
        assertEquals("nnnnnnttna", seq6.seqString());
    }

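As an aside on what "offsets into sequence files" means in practice,
here is a minimal sketch using only the Java standard library. It is
not the obf-common index format and not Biojava code: the class
FastaOffsetIndex and its build/fetch methods are hypothetical names
made up for illustration, the index lives in memory rather than on
disk, and it assumes a reasonably recent Java (try-with-resources).
The idea is just to scan the FASTA file once, remember the byte
offset of each header line, then seek straight to a record on lookup.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration class; not part of Biojava.
final class FastaOffsetIndex {

    /** Scans the file once and maps each record ID to the byte offset of its '>' line. */
    static Map<String, Long> build(File fasta) throws IOException {
        Map<String, Long> offsets = new HashMap<>();
        try (RandomAccessFile raf = new RandomAccessFile(fasta, "r")) {
            long pos = raf.getFilePointer();
            String line;
            // RandomAccessFile.readLine is unbuffered: simple, but slow on big files.
            while ((line = raf.readLine()) != null) {
                if (line.startsWith(">")) {
                    // Assumption: the ID is the text between '>' and the first whitespace.
                    String id = line.substring(1).trim().split("\\s+")[0];
                    offsets.put(id, pos);
                }
                pos = raf.getFilePointer();
            }
        }
        return offsets;
    }

    /** Seeks to the recorded offset and reads sequence lines up to the next header. */
    static String fetch(File fasta, Map<String, Long> offsets, String id) throws IOException {
        Long offset = offsets.get(id);
        if (offset == null) {
            throw new IllegalArgumentException("Unknown ID: " + id);
        }
        StringBuilder seq = new StringBuilder();
        try (RandomAccessFile raf = new RandomAccessFile(fasta, "r")) {
            raf.seek(offset);
            raf.readLine(); // skip the header line itself
            String line;
            while ((line = raf.readLine()) != null && !line.startsWith(">")) {
                seq.append(line.trim());
            }
        }
        return seq.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java FastaOffsetIndex <file.fasta> <id>");
            return;
        }
        File fasta = new File(args[0]);
        System.out.println(fetch(fasta, build(fasta), args[1]));
    }
}
```

A real index also has to persist the offsets, cope with several
sequence files and notice when a file changes underneath it, which is
what the metadata in the index directory is for.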
Finally, the binary indices created by the Staden package and EMBOSS
(Embl CDROM format) are also supported. If you index your flatfiles
with dbifasta/dbiblast you can read the EMBOSS indices from Biojava
with a little effort. This uses an EmblCDROM implementation of our
old IndexStore interface. The unit tests
(org.biojava.bio.seq.db.EmblCDROMIndexStoreTest) should prove useful:

    URL divURL =
        EmblCDROMIndexStoreTest.class.getResource("emblcd/division.lkp");
    URL entURL =
        EmblCDROMIndexStoreTest.class.getResource("emblcd/entrynam.idx");

    File divisionLkp = new File(divURL.getFile());
    File entryNamIdx = new File(entURL.getFile());

    format  = new FastaFormat();
    alpha   = ProteinTools.getAlphabet();
    parser  = alpha.getTokenization("token");
    factory =
        new FastaDescriptionLineParser.Factory(SimpleSequenceBuilder.FACTORY);

    EmblCDROMIndexStore
        emblCDIndexStore = new EmblCDROMIndexStore(divisionLkp,
                                                   entryNamIdx,
                                                   format,
                                                   factory,
                                                   parser);
    emblCDIndexStore.setPathPrefix(entryNamIdx.getParentFile().getAbsoluteFile());

    SequenceDB
        sequenceDB = new IndexedSequenceDB(emblCDIndexStore);

and later...

    // Test actual sequence fetches
    Sequence seq = sequenceDB.getSequence("NMA0007");
    assertEquals("NMA0007", seq.getName());
    assertEquals(235, seq.length());

    seq = sequenceDB.getSequence("NMA0020");
    assertEquals("NMA0020", seq.getName());
    assertEquals(494, seq.length());

    seq = sequenceDB.getSequence("NMA0030");
    assertEquals("NMA0030", seq.getName());
    assertEquals(245, seq.length());

Hope this is useful,
Keith
--
- Keith James <kdj at sanger.ac.uk> bioinformatics programming support -
- Pathogen Sequencing Unit, The Wellcome Trust Sanger Institute, UK -