[Biojava-l] fileToBiojava question
Bernd Jagla
bernd.jagla at pasteur.fr
Thu Sep 23 11:23:14 UTC 2010
Simon,
thanks a lot!!!
I implemented your way in a separate class and it works. Now I just
have to get it to work within my framework....
Best,
Bernd
On 9/22/2010 3:10 AM, simon rayner wrote:
> sorry for the delay in replying due to time difference.
>
> this is a modified version of your code that uses biojavax. i
> stripped out the pasteur stuff and added code to the *execute* method
> (about line 74). Also marked the imports i added at the top
>
> hope this helps
>
> package cn.cas.wiv.bif.biojava;
>
> import java.io.BufferedReader;
> import java.io.File;
> import java.io.FileReader;
> import java.io.IOException;
> import java.util.Iterator;
> import java.util.NoSuchElementException;
>
> /******************* your biojava imports **********************/
> import org.biojava.bio.BioException;
> import org.biojava.bio.seq.Sequence;
> import org.biojava.bio.seq.SequenceIterator;
> import org.biojava.bio.seq.io.SeqIOTools;
> import org.biojava.bio.seq.io.SymbolTokenization;
> import org.biojava.bio.symbol.Alphabet;
> import org.biojava.bio.symbol.AlphabetManager;
> import org.biojava.bio.symbol.SymbolList;
> import org.biojavax.RichObjectFactory;
> import org.biojavax.bio.seq.io.RichSequenceFormat;
> import org.biojavax.bio.seq.io.EMBLFormat;
> import org.biojavax.bio.seq.io.FastaFormat;
> import org.biojavax.bio.seq.io.GenbankFormat;
> import org.biojavax.bio.seq.io.INSDseqFormat;
> import org.biojavax.bio.seq.io.RichSequenceBuilderFactory;
> import org.biojavax.bio.seq.io.RichStreamReader;
> import org.biojavax.bio.seq.io.UniProtFormat;
>
> /********* added these imports to make things work **********/
> import org.biojavax.SimpleNamespace;
> import org.biojavax.bio.seq.RichSequence;
> import org.biojavax.bio.seq.RichSequenceIterator;
> import org.biojava.bio.seq.*;
> import org.biojava.bio.symbol.*;
>
> /**
> * This is the model implementation of FastAReader. Reads a FASTA file
> * into two columns: seq_name and sequence
> *
> * @author Bernd Jagla
> */
> @SuppressWarnings("deprecation")
> public class FastAReaderNodeModel {
> // the logger instance
> private Alphabet alpha;
> private SequenceIterator iter;
>
> protected void execute(FileReader fp) throws Exception {
>
> /**
> * {@inheritDoc}
> */
> //String form = m_fformat.getStringValue();
> //String alphabet = m_alphabet.getStringValue();
> String form = "genbank";
> String alphabet = "DNA";
> /****************** old way *********************/
> int count = 0;
> BufferedReader br = new BufferedReader(fp);
> SequenceIterator iter = (SequenceIterator)
> SeqIOTools.fileToBiojava(
> form, alphabet, br);
>
> while (iter.hasNext()) {
> // System.out.println(fastq.getSequence());
> Sequence seq = iter.nextSequence();
> String seqName = seq.getName();
> // String seqName = "asdf";
> //String sequence = seq.seqString();
> System.err.println("reading: " + seqName + " " +
> seq.length());
> count++;
> }
> System.err.println("finished reading file");
>
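> /****************** biojavax way *********************/
> // NOTE: br was read to its end by the loop above; run only one of the
> // two approaches per open file, or open a fresh reader before this one.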
> RichSequence refRSequence;
> SimpleNamespace ns = new SimpleNamespace("MTBGB");
> RichSequenceIterator rsi
> = RichSequence.IOTools.readGenbankDNA(br, ns);
> while(rsi.hasNext())
> {
> refRSequence = rsi.nextRichSequence();
> System.out.println("read " + refRSequence.length() + " bases");
> /** if you want the features, use a FeatureFilter and a
> FeatureHolder **/
> FeatureFilter ff = new FeatureFilter.ByType("CDS");
> FeatureHolder fhRef = refRSequence.filter(ff);
> }
>
> br.close();
> fp.close();
> }
>
> /**
> * Makes a <code>SequenceIterator</code> look like an
> * <code>Iterator {@code <Sequence>}</code>
> *
> * @param iter
> * The <CODE>SequenceIterator</CODE>
> * @return An <CODE>Iterator</CODE> that returns only
> <CODE>Sequence</CODE>
> * objects. <B>You cannot call <code>remove()</code> on this
> * iterator!</B>
> */
> public Iterator<Sequence> asIterator(SequenceIterator iter) {
> final SequenceIterator it = iter;
> return new Iterator<Sequence>() {
> public boolean hasNext() {
> return it.hasNext();
> }
>
> public Sequence next() {
> try {
> return it.nextSequence();
> } catch (BioException e) {
> NoSuchElementException ex = new
> NoSuchElementException();
> ex.initCause(e);
> throw ex;
> }
> }
>
> public void remove() {
> throw new UnsupportedOperationException();
> }
> };
> }
>
> public static RichSequenceFormat formatForName(String name)
> throws ClassNotFoundException, InstantiationException,
> IllegalAccessException {
> // determine the format to use
> RichSequenceFormat format;
> if (name.equalsIgnoreCase("fasta")) {
> format = (RichSequenceFormat) new FastaFormat();
> } else if (name.equalsIgnoreCase("genbank")) {
> format = (RichSequenceFormat) new GenbankFormat();
> } else if (name.equalsIgnoreCase("uniprot")) {
> format = new UniProtFormat();
> } else if (name.equalsIgnoreCase("embl")) {
> format = new EMBLFormat();
> } else if (name.equalsIgnoreCase("INSDseq")) {
> format = new INSDseqFormat();
> } else {
> Class formatClass = Class.forName(name);
> format = (RichSequenceFormat) formatClass.newInstance();
> }
> return format;
> }
>
> }
>
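One note on running both loops in the class above from the same reader: a java.io.Reader is single-pass, so once the first loop has read to the end of the file, a second parser handed the same BufferedReader has nothing left to read. Here is a minimal stdlib-only sketch of that behaviour, using a StringReader in place of the file. The class name and the sample text are made up for the illustration and have nothing to do with BioJava itself.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Hypothetical demo class: shows that a Reader can only be consumed once.
class SinglePassReaderDemo {

    /** Reads until end of stream and returns how many lines were seen. */
    public static int countLines(BufferedReader br) {
        try {
            int n = 0;
            while (br.readLine() != null) {
                n++;
            }
            return n;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "LOCUS  demo\nORIGIN\n//\n";

        BufferedReader shared = new BufferedReader(new StringReader(text));
        int first = countLines(shared);   // consumes the whole stream
        int second = countLines(shared);  // stream is already at its end
        System.out.println("same reader:  " + first + " then " + second);

        // Opening a fresh reader per pass gives each pass the full content.
        BufferedReader fresh = new BufferedReader(new StringReader(text));
        System.out.println("fresh reader: " + countLines(fresh));
    }
}
```

For the class above that would mean running only one of the two approaches per open file, or opening a second FileReader before the biojavax loop.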
>
> On Tue, Sep 21, 2010 at 8:47 AM, Bernd Jagla <bernd.jagla at pasteur.fr> wrote:
>
> Sorry for the wrong reply...
> Here is the FULL code; I marked the important passages in red:
>
> Thanks for looking at it!!!!
>
> Bernd
>
>
> package org.pasteur.pf2.biojava;
>
> import java.io.BufferedReader;
> import java.io.File;
> import java.io.FileReader;
> import java.io.IOException;
> import java.util.Iterator;
> import java.util.NoSuchElementException;
>
> import org.biojava.bio.BioException;
> import org.biojava.bio.seq.Sequence;
> import org.biojava.bio.seq.SequenceIterator;
> import org.biojava.bio.seq.io.SeqIOTools;
> import org.biojava.bio.seq.io.SymbolTokenization;
> import org.biojava.bio.symbol.Alphabet;
> import org.biojava.bio.symbol.AlphabetManager;
> import org.biojava.bio.symbol.SymbolList;
> import org.biojavax.RichObjectFactory;
> import org.biojavax.bio.seq.io.RichSequenceFormat;
> import org.knime.core.data.DataCell;
> import org.knime.core.data.DataColumnSpec;
> import org.knime.core.data.DataColumnSpecCreator;
> import org.knime.core.data.DataTableSpec;
> import org.knime.core.data.RowKey;
> import org.knime.core.data.container.BlobDataCell;
> import org.knime.core.data.def.DefaultRow;
> import org.knime.core.data.def.StringCell;
> import org.knime.core.node.BufferedDataContainer;
> import org.knime.core.node.BufferedDataTable;
> import org.knime.core.node.CanceledExecutionException;
> import org.knime.core.node.ExecutionContext;
> import org.knime.core.node.ExecutionMonitor;
> import org.knime.core.node.InvalidSettingsException;
> import org.knime.core.node.NodeLogger;
> import org.knime.core.node.NodeModel;
> import org.knime.core.node.NodeSettingsRO;
> import org.knime.core.node.NodeSettingsWO;
> import org.knime.core.node.defaultnodesettings.SettingsModelString;
> import org.biojavax.bio.seq.io.EMBLFormat;
> import org.biojavax.bio.seq.io.FastaFormat;
> import org.biojavax.bio.seq.io.GenbankFormat;
> import org.biojavax.bio.seq.io.INSDseqFormat;
> import org.biojavax.bio.seq.io.RichSequenceBuilderFactory;
> import org.biojavax.bio.seq.io.RichSequenceFormat;
> import org.biojavax.bio.seq.io.RichStreamReader;
> import org.biojavax.bio.seq.io.UniProtFormat;
> import org.pasteur.pf2.datatypes.*;
> /**
> * This is the model implementation of FastAReader. Reads a FASTA file
> * into two columns: seq_name and sequence
> *
> * @author Bernd Jagla
> */
> @SuppressWarnings("deprecation")
> public class FastAReaderNodeModel extends NodeModel {
> // the logger instance
> private static final NodeLogger logger = NodeLogger
> .getLogger(FastAReaderNodeModel.class);
> private Alphabet alpha;
> private SequenceIterator iter;
>
> /**
> * the settings key which is used to retrieve and store the settings
> * (from the dialog or from a settings file) (package visibility to be
> * usable from the dialog).
> */
> private static final String FAR_name = "far_name";
>
> private static final String FAR_fileFormat = "far_ff";
>
> private static final String FAR_alphabet = "far_alph";
>
> private final SettingsModelString m_fpname = createFAR_fpname();
> private final SettingsModelString m_fformat = createFileFormat();
> private final SettingsModelString m_alphabet = createAlphabet();
>
> /**
> * Constructor for the node model.
> */
> protected FastAReaderNodeModel() {
> super(0, 1);
> }
>
> /**
> * {@inheritDoc}
> */
> @Override
> protected BufferedDataTable[] execute(final
> BufferedDataTable[] inData,
> final ExecutionContext exec) throws Exception {
>
> // TODO do something here
> logger.info("Node Model Stub... this is not yet implemented !");
>
> // the data table spec of the single output table,
> // the table will have one column:
> DataColumnSpec[] allColSpecs = new DataColumnSpec[1];
> allColSpecs[0] = new DataColumnSpecCreator("sequence",
> SequenceDataCell.TYPE)
> .createSpec();
> DataTableSpec outputSpec = new DataTableSpec(allColSpecs);
> // the execution context will provide us with storage capacity, in
> // this case a data container to which we will add rows sequentially
> // Note, this container can also handle arbitrary big data tables, it
> // will buffer to disc if necessary.
> BufferedDataContainer container =
> exec.createDataContainer(outputSpec);
> // let's add m_count rows to it
> // once we are done, we close the container and return its table
> FileReader fp = new FileReader(m_fpname.getStringValue());
>
> exec.checkCanceled();
> //String form = m_fformat.getStringValue();
> //String alphabet = m_alphabet.getStringValue();
> String form = "genbank";
> String alphabet = "DNA";
>
> BufferedReader br = new BufferedReader(fp);
> // String line = br.readLine();
> int count = 0;
>
> SequenceIterator iter = (SequenceIterator)
> SeqIOTools.fileToBiojava(
> form, alphabet, br);
>
> while (iter.hasNext()) {
> exec.checkCanceled();
> RowKey key = new RowKey("Row " + count);
> exec.setProgress("Row " + count);
> // System.out.println(fastq.getSequence());
> Sequence seq = iter.nextSequence();
> String seqName = seq.getName();
> // String seqName = "asdf";
> //String sequence = seq.seqString();
> System.err.println("reading: " + seqName + " " +
> seq.length());
> SequenceDataCell seqCell = new
> SequenceDataCell(seqName, seq);
> container.addRowToTable(new DefaultRow(key, seqCell));
> count++;
> }
> System.err.println("finished reading file");
> br.close();
> fp.close();
> container.close();
> return new BufferedDataTable[] { container.getTable() };
> }
>
> /**
> * Makes a <code>SequenceIterator</code> look like an
> * <code>Iterator {@code <Sequence>}</code>
> *
> * @param iter
> * The <CODE>SequenceIterator</CODE>
> * @return An <CODE>Iterator</CODE> that returns only
> <CODE>Sequence</CODE>
> * objects. <B>You cannot call <code>remove()</code>
> on this
> * iterator!</B>
> */
> public Iterator<Sequence> asIterator(SequenceIterator iter) {
> final SequenceIterator it = iter;
> return new Iterator<Sequence>() {
> public boolean hasNext() {
> return it.hasNext();
> }
>
> public Sequence next() {
> try {
> return it.nextSequence();
> } catch (BioException e) {
> NoSuchElementException ex = new
> NoSuchElementException();
> ex.initCause(e);
> throw ex;
> }
> }
>
> public void remove() {
> throw new UnsupportedOperationException();
> }
> };
> }
>
> public static RichSequenceFormat formatForName(String name)
> throws ClassNotFoundException, InstantiationException,
> IllegalAccessException {
> // determine the format to use
> RichSequenceFormat format;
> if (name.equalsIgnoreCase("fasta")) {
> format = (RichSequenceFormat) new FastaFormat();
> } else if (name.equalsIgnoreCase("genbank")) {
> format = (RichSequenceFormat) new GenbankFormat();
> } else if (name.equalsIgnoreCase("uniprot")) {
> format = new UniProtFormat();
> } else if (name.equalsIgnoreCase("embl")) {
> format = new EMBLFormat();
> } else if (name.equalsIgnoreCase("INSDseq")) {
> format = new INSDseqFormat();
> } else {
> Class formatClass = Class.forName(name);
> format = (RichSequenceFormat) formatClass.newInstance();
> }
> return format;
> }
>
> /**
> * {@inheritDoc}
> */
> @Override
> protected void reset() {
> }
>
> /**
> * {@inheritDoc}
> */
> @Override
> protected DataTableSpec[] configure(final DataTableSpec[] inSpecs)
> throws InvalidSettingsException {
> DataColumnSpec[] allColSpecs = new DataColumnSpec[1];
> allColSpecs[0] = new DataColumnSpecCreator("sequence",
> SequenceDataCell.TYPE)
> .createSpec();
> DataTableSpec outputSpec = new DataTableSpec(allColSpecs);
>
> return new DataTableSpec[] { outputSpec };
>
> }
>
> /**
> * {@inheritDoc}
> */
> @Override
> protected void saveSettingsTo(final NodeSettingsWO settings) {
> m_alphabet.saveSettingsTo(settings);
> m_fformat.saveSettingsTo(settings);
> m_fpname.saveSettingsTo(settings);
> }
>
> /**
> * {@inheritDoc}
> */
> @Override
> protected void loadValidatedSettingsFrom(final NodeSettingsRO
> settings)
> throws InvalidSettingsException {
> m_alphabet.loadSettingsFrom(settings);
> m_fformat.loadSettingsFrom(settings);
> m_fpname.loadSettingsFrom(settings);
> }
>
> /**
> * {@inheritDoc}
> */
> @Override
> protected void validateSettings(final NodeSettingsRO settings)
> throws InvalidSettingsException {
> m_alphabet.validateSettings(settings);
> m_fformat.validateSettings(settings);
> m_fpname.validateSettings(settings);
> }
>
> /**
> * {@inheritDoc}
> */
> @Override
> protected void loadInternals(final File internDir,
> final ExecutionMonitor exec) throws IOException,
> CanceledExecutionException {
> }
>
> /**
> * {@inheritDoc}
> */
> @Override
> protected void saveInternals(final File internDir,
> final ExecutionMonitor exec) throws IOException,
> CanceledExecutionException {
> }
>
> public static SettingsModelString createFAR_fpname() {
> return new SettingsModelString(FAR_name, "");
> }
>
> public static SettingsModelString createFileFormat() {
> return new SettingsModelString(FAR_fileFormat, "FASTA");
> }
>
> public static SettingsModelString createAlphabet() {
> return new SettingsModelString(FAR_alphabet, "RNA");
>
> }
>
> }
>
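On the reflective fallback in formatForName above: Class.newInstance() has been deprecated since Java 9 in favour of getDeclaredConstructor().newInstance(), and a lookup table keeps the known names out of an if/else chain. A minimal stdlib-only sketch of that pattern follows. The SeqFormat interface and the two dummy formats are placeholders invented for this example; they are not BioJava types, and the real method would return RichSequenceFormat instances instead.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical demo class: name-to-format lookup with a reflective fallback.
class FormatRegistryDemo {

    /** Placeholder for a sequence-format type (invented for this sketch). */
    public interface SeqFormat {
        String label();
    }

    public static class DummyFasta implements SeqFormat {
        public String label() { return "fasta"; }
    }

    public static class DummyGenbank implements SeqFormat {
        public String label() { return "genbank"; }
    }

    private static final Map<String, Supplier<SeqFormat>> KNOWN = new HashMap<>();
    static {
        KNOWN.put("fasta", DummyFasta::new);
        KNOWN.put("genbank", DummyGenbank::new);
    }

    /** Looks the name up case-insensitively, else treats it as a class name. */
    public static SeqFormat formatForName(String name) {
        Supplier<SeqFormat> known = KNOWN.get(name.toLowerCase(Locale.ROOT));
        if (known != null) {
            return known.get();
        }
        try {
            Class<? extends SeqFormat> cls =
                    Class.forName(name).asSubclass(SeqFormat.class);
            // Non-deprecated replacement for Class.newInstance()
            return cls.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException | ClassCastException e) {
            throw new IllegalArgumentException("Unknown format: " + name, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(formatForName("GenBank").label());
        System.out.println(formatForName(DummyFasta.class.getName()).label());
    }
}
```

Wrapping the reflection errors in one unchecked exception is only a design choice for the sketch; the original signature with its three checked exceptions works just as well.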
>
> On 9/21/2010 2:40 PM, simon rayner wrote:
>> hi,
>>
>> can you repost to the biojava group along with the full code,
>> (just in case there is a missing import or something). you only
>> replied to me, and not to the biojava mailing list
>>
>> thanks
>>
>> simon
>>
>> On Tue, Sep 21, 2010 at 8:18 PM, Bernd Jagla
>> <bernd.jagla at pasteur.fr> wrote:
>>
>> Thanks for the quick reply!
>>
>> Here is some code that should have all the important parts:
>>
>> String form = "genbank";
>> String alphabet = "dna";
>> BufferedReader br = new BufferedReader(fp);
>> SequenceIterator iter = (SequenceIterator)
>> SeqIOTools.fileToBiojava(
>> form, alphabet, br);
>> while (iter.hasNext()) {
>> Sequence seq = iter.nextSequence();
>> => Exception thrown
>> String seqName = seq.getName();
>> }
>>
>>
>> When trying to simplify the code a bit I now get the
>> following error:
>> Execute failed: Could not initialize class
>> org.biojava.bio.seq.FeatureFilter
>>
>> I assume that in the previous times I had a spelling error??
>> Then the exception got thrown during the initialization of "iter"
>>
>> Thanks,
>>
>> Bernd
>>
>>
>> On 9/21/2010 2:07 PM, simon rayner wrote:
>>> hi,
>>>
>>> can you post the code you are trying to run along with the
>>> full error, it will help to figure out what is happening.
>>> There are now loaders for biojavax as well, which work well
>>> and are described in the biojavax docs here
>>> http://biojava.org/wiki/BioJava:BioJavaXDocs#Example
>>>
>>> but yeah, it's confusing unless you happen to be a real java
>>> guru. i keep having to refer back to the docs because i
>>> keep forgetting which class does what
>>>
>>> On Tue, Sep 21, 2010 at 7:46 PM, Bernd Jagla
>>> <bernd.jagla at pasteur.fr> wrote:
>>>
>>> Hello,
>>>
>>> I am getting a little frustrated with the wiki page (I
>>> guess I don't spend enough time reading and testing). I
>>> have the impression that some of the documentation
>>> relates to version 3 whereas others relate to 1.5 or 1.7.
>>> So sorry if this all sounds a bit confused... ;(
>>>
>>> I believe I am using 1.7.1. (I wasn't able to find a
>>> readme file that contains that information) even though
>>> I would probably like to use version 3. But as I am
>>> stuck with an older Eclipse version I think it will be
>>> even worse when I try that.
>>>
>>> Anyways, I am trying to read in sequence files using
>>> SeqIOTools.fileToBiojava, which seems to be deprecated,
>>> with the following parameters: "genbank", "dna",
>>> bufferedReader.
>>>
>>> somehow this works with "fasta" but with genbank I get
>>> the following exception:
>>> Execute failed: Unknown file type '524300'
>>> in some cases I get:
>>> Unknown file type '262156'
>>>
>>> Does this mean anything to you?
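A small arithmetic check on the two codes may help narrow this down. Both numbers share the same low bits (12) and differ only in a single high bit (bit 19 for 524300, bit 18 for 262156). That pattern would be consistent with a numeric file-type code built by OR-ing a format id with an alphabet flag, so the two errors may come from the same format combined with two different alphabet settings. That reading is an assumption and is not confirmed anywhere in this thread. The class and method names below are made up for the illustration and are not part of BioJava.

```java
// Hypothetical helper, not part of BioJava: splits the integer from the
// "Unknown file type" message into its low 16 bits and its high flag bit.
class FileTypeCodeDecoder {

    /** Low 16 bits of the code (assumed here to carry the format id). */
    public static int lowBits(int code) {
        return code & 0xFFFF;
    }

    /** Index of the lowest set bit above bit 15, or -1 if there is none. */
    public static int highFlagBit(int code) {
        int high = code & ~0xFFFF;
        return high == 0 ? -1 : Integer.numberOfTrailingZeros(high);
    }

    public static void main(String[] args) {
        int[] codes = {524300, 262156};
        for (int code : codes) {
            System.out.println(code + " = 0x" + Integer.toHexString(code)
                    + ": low bits = " + lowBits(code)
                    + ", high flag bit = " + highFlagBit(code));
        }
    }
}
```

If that reading is right, checking which alphabet string actually reaches fileToBiojava would be a reasonable next step, for example a default value left over in a settings object.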
>>>
>>> Or how do you read in a sequence file? I am looking for
>>> a generic way that covers many file types (genbank,
>>> fasta, swissprot...)
>>>
>>> Once I have this I will probably be able to get to the
>>> feature information using the information from the
>>> tutorial.
>>>
>>> Thanks for your time.
>>>
>>> Bernd
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Simon Rayner
>>>
>>> State Key Laboratory of Virology
>>> Wuhan Institute of Virology
>>> Chinese Academy of Sciences
>>> Wuhan, Hubei 430071
>>> P.R.China
>>>
>>> +86 (27) 87199895 (office)
>>> +86 18627113001 (cell)
>>>
>>
>>
>>
>
>
>