[Biojava-dev] Biojava.util package?

David Felty davfelty at gmail.com
Sat Mar 31 17:16:57 UTC 2012


I've been looking at the file parsers for BioPython and BioPerl, and
here are some features I've compiled:
Important features:
- Conversion between file formats
- Lazy IO; useful for large files
- Use Iterable interface so we get Java foreach over sequences
- Index sequences by ID (turn a list of sequences into a map from ID -> seq)
- Fetching from remote databases
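
As a rough illustration of the lazy IO, Iterable, and index-by-ID items,
here is a minimal sketch in plain Java. LazyFasta, Record and indexById are
made-up names for this example, not BioJava classes, and the FASTA handling
is deliberately naive (no validation; the ID is assumed to be the first
whitespace-delimited token of the header line).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

/** Hypothetical sketch (not BioJava API): reads FASTA records one at a time. */
final class LazyFasta implements Iterable<LazyFasta.Record> {

    /** Minimal stand-in for a sequence object: an ID plus its residues. */
    static final class Record {
        final String id;
        final String residues;
        Record(String id, String residues) { this.id = id; this.residues = residues; }
    }

    private final BufferedReader in;

    LazyFasta(Reader source) { this.in = new BufferedReader(source); }

    private String readLine() {
        try {
            return in.readLine();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // Iterator methods cannot throw checked exceptions
        }
    }

    /** Single-pass iterator: the underlying stream is consumed as records are requested. */
    @Override
    public Iterator<Record> iterator() {
        return new Iterator<Record>() {
            // Header line of the record that has not been returned yet, or null at end of input.
            private String header = skipToHeader();

            private String skipToHeader() {
                String line = readLine();
                while (line != null && !line.startsWith(">")) line = readLine();
                return line;
            }

            @Override public boolean hasNext() { return header != null; }

            @Override public Record next() {
                if (header == null) throw new NoSuchElementException();
                // Assumption: the ID is the first whitespace-delimited token after '>'.
                String id = header.substring(1).trim().split("\\s+")[0];
                StringBuilder residues = new StringBuilder();
                String line;
                while ((line = readLine()) != null && !line.startsWith(">")) {
                    residues.append(line.trim());
                }
                header = line; // the next record's header, or null when the input is exhausted
                return new Record(id, residues.toString());
            }
        };
    }

    /** Builds an ID -> record map. Note that this reads every record into memory. */
    static Map<String, Record> indexById(Iterable<Record> records) {
        Map<String, Record> index = new LinkedHashMap<>();
        for (Record r : records) index.put(r.id, r);
        return index;
    }
}
```

With something like this, "for (LazyFasta.Record r : new LazyFasta(reader))"
only holds one record in memory at a time, while indexById trades that
laziness for lookup by ID. An on-disk index would be needed to get both.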

Other features:
- Restrict fields needed to speed up parsing; see
http://bioperl.org/wiki/HOWTO:SeqIO#Speed.2C_Bio::Seq::SeqBuilder
- Auto-detect file format (use file extension)
- General-purpose API with sensible defaults for most cases, and a
more specific but complex API for more control
- Index sequences by a user-defined value
- Store indexed database files locally (BioPython stores as a SQLite database)
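
To make the auto-detection item concrete, here is a small sketch of how a
format could be guessed from the file extension, using an enum rather than
int constants. SeqFormat and fromFileName are hypothetical names, and the
extension lists are only illustrative; a real implementation would probably
also need to look at the file contents.

```java
import java.util.Locale;
import java.util.Optional;

/** Hypothetical enum of sequence file formats; the extensions are illustrative only. */
enum SeqFormat {
    FASTA("fasta", "fa", "fna", "faa"),
    FASTQ("fastq", "fq"),
    GENBANK("gb", "gbk");

    private final String[] extensions;

    SeqFormat(String... extensions) { this.extensions = extensions; }

    /** Guesses the format from the file extension; empty if the extension is missing or unknown. */
    static Optional<SeqFormat> fromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) return Optional.empty();
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        for (SeqFormat format : values()) {
            for (String known : format.extensions) {
                if (known.equals(ext)) return Optional.of(format);
            }
        }
        return Optional.empty();
    }
}
```

Returning an Optional leaves it to the caller to decide what to do when the
format cannot be inferred, for example falling back to an explicit format
argument.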

Does this look like a fair list? I tried to look for common use cases
in BioJava's tutorial, but I only found this page, which comes from
BioJava 1.8: http://biojava.org/wiki/BioJava:Tutorial:Sequence_IO_basics
Are there any other useful sources I could look at? Or perhaps even
some real-world code that makes use of parsers?

Thanks,
David

On Fri, Mar 30, 2012 at 6:52 PM, P. Troshin <to.petr at gmail.com> wrote:
>> But are there any additional features anyone wants me to consider? Once again, I don't
>> have the same experience as many of you, so your input is very
>> helpful!
>
> David, I think this is a pretty good list. Remember that you are getting
> into something more than just a FASTA parser.
>
>> But here is what I've gathered so far from a combination
>> of already-existing code and people's responses:
>
> I think this is a very good approach.
> Look at the existing parsers in BioJava and beyond; the features that
> are common will be the most important. Less common ones will be useful in
> some cases but less so in others. Come up with a set of use cases and try
> using the parsers to achieve them, to see how easy (or indeed possible)
> it is with the various parsers. I appreciate this is a lot of
> work, but this way you'll know by heart what a good parser consists
> of.
> You can learn from many implementations to get your own just right.
> Once you've done this, you are going to be the expert and will be able
> to come up with a list of features in order of importance that your
> parser is going to have, along with some guesstimate of how long it is
> going to take you to implement them. Do not hesitate to ask the
> community if there is something you cannot get your head around.
>
> Good luck,
> Peter
>
>
> On 30 March 2012 00:59, David Felty <davfelty at gmail.com> wrote:
>>>I'd suggest you step aside from the
>>>details of implementation. Think about what features your parser(s)
>>>must have and then how you are going to achieve them
>>
>> Thank you for this! I now realize that I've been concentrating too
>> much on the implementation rather than the features. The
>> implementation will be important when (or if) I actually work on the
>> project during GSoC, but for now, I'll try to focus on features for my
>> proposal.
>>
>> Unfortunately, I'm not very acquainted with the world of computational
>> biology, so I can't be sure what features would be most useful for the
>> file parsers. But here is what I've gathered so far from a combination
>> of already-existing code and people's responses:
>> - Simple API
>> - Robust
>> - Extensible
>> - Good performance
>> - Feature-rich
>> - Wide variety of parsers
>> - Proxy-fetching from remote databases (by ID or location)
>> - Local caching
>> - Auto-detection of data type
>> - Auto-detection of file format
>> - Lazy IO
>> - Random access file reading
>>
>> Obviously, these are not all of equal importance, so I'll have to pick
>> out the most important ones for my proposal. But are there any
>> additional features anyone wants me to consider? Once again, I don't
>> have the same experience as many of you, so your input is very
>> helpful!
>>
>> Thanks,
>> David
>>
>> On Thu, Mar 29, 2012 at 5:49 PM, P. Troshin <to.petr at gmail.com> wrote:
>>> Hi David,
>>>
>>> Great to see such a discussion! You should see how important your work
>>> is going to be for the Bio community.
>>>
>>> Now, what you need to do is to try taking into account what other
>>> people were suggesting and put it into your proposal. It's not going
>>> to be any good just to add a bunch of opinions; you need to come up
>>> with a coherent proposal. For this I'd suggest you step aside from the
>>> details of implementation. Think about what features your parser(s)
>>> must have and then how you are going to achieve them.
>>> I'd suggest that your parsers should be
>>> - easy to use (IMHO this is something BioJava 1 FASTA parser lacked)
>>> - robust
>>> - extensible
>>> - have good performance
>>> - most importantly, have sufficiently rich feature set so that we can
>>> replace other parsers (for the same format) in BioJava with yours.
>>>
>>> Do not forget to split your work in several achievable stages.
>>>
>>> I'd be careful about transferring the design from Python and
>>> especially a decade-old Perl implementation straight to Java. While
>>> high-level concepts may be similar, implementation details should
>>> not be. It's not that there is anything wrong with these parsers; it's
>>> just that the languages are different. It is good to know how things
>>> are done elsewhere, but I'd suggest that for a Java implementation you
>>> should be taking inspiration from some well-known Java feature. For
>>> example, the Java Collections, a set of highly regarded tools for
>>> working with various collections of objects. Also do some reading on
>>> Java enums; your proposed implementation will definitely benefit from
>>> using them.
>>>
>>> Have fun,
>>>
>>> Regards,
>>> Peter
>>>
>>>
>>> On 29 March 2012 16:39, David Felty <davfelty at gmail.com> wrote:
>>>> Hey Andreas,
>>>>
>>>> It wouldn't be too difficult to make a method that can infer the
>>>> file type using the file extension. In fact, it looks like BioPerl's
>>>> SeqIO does something like this. On the other hand, BioPython's SeqIO
>>>> takes the route of "explicit is better than implicit," and requires
>>>> that you explicitly give the format. Perhaps BioJava could take both
>>>> routes, and have an overloaded parse method that infers the file type,
>>>> along with the regular explicit method.
>>>>
>>>> As for non-fasta files, I implemented a couple of FASTQ parsers here:
>>>> http://pastebin.com/KLcpq8Qb
>>>> This would work similarly:
>>>>
>>>> InputStream is = ...
>>>> Iterable<ProteinSequence> seqs = SeqIO.parse(is, SeqIO.FASTQ_SANGER, SeqIO.PROTEIN);
>>>>
>>>>
>>>> It looks like the other sequence readers aren't as clear-cut, so they
>>>> may need a bit more wrapping before they can be adapted to this
>>>> method. A common problem is that sequence readers don't return a
>>>> specific type of sequence, like with
>>>> org.biojava3.core.sequence.loader.UniprotProxySequenceReader, which
>>>> just contains the sequence data in itself. We might want to create
>>>> methods that convert the UniprotProxySequenceReader into sequences
>>>> that make more sense, like DNASequence and ProteinSequence.
>>>>
>>>> I'll look into this more later, I have to go to class.
>>>>
>>>> Regards,
>>>> David
>>>>
>>>> On Thu, Mar 29, 2012 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>
>>>>> Hi David,
>>>>>
>>>>> so far it still feels like a wrapper for what is already there. Try to
>>>>> take it to the next level. Why does the user still need to provide the
>>>>> type of file, can't this be auto-detected? What is the behaviour for
>>>>> non-fasta files, what can be supported and where are the limits, etc.
>>>>>
>>>>> Andreas
>>>>>
>>>>> On Thu, Mar 29, 2012 at 6:55 AM, David Felty <davfelty at gmail.com> wrote:
>>>>> > I've actually been working on something like this for my GSoC proposal,
>>>>> > here's what I came up with:
>>>>> >
>>>>> > public class SeqIO {
>>>>> >    public static final int FASTA = 0;
>>>>> >    public static final int FASTQ = 1;
>>>>> >    public static final Class<DNASequence> DNA = DNASequence.class;
>>>>> >    public static final Class<ProteinSequence> PROTEIN = ProteinSequence.class;
>>>>> >
>>>>> >    @SuppressWarnings("unchecked")
>>>>> >    public static <S extends Sequence> Iterable<S> parse(InputStream is,
>>>>> >            int fileFormat, Class<S> seqType) throws Exception {
>>>>> >        switch (fileFormat) {
>>>>> >            case FASTA:
>>>>> >                if (seqType == DNA) {
>>>>> >                    // readFastaDNASequence returns a map keyed by ID, so expose its values
>>>>> >                    return (Iterable<S>) FastaReaderHelper.readFastaDNASequence(is).values();
>>>>> >                } else if (seqType == PROTEIN) {
>>>>> >                    // etc...
>>>>> >                }
>>>>> >                break;
>>>>> >            case FASTQ:
>>>>> >                // etc...
>>>>> >        }
>>>>> >        // Fallback so that every path either returns or throws
>>>>> >        throw new IllegalArgumentException("Unsupported format or sequence type");
>>>>> >    }
>>>>> > }
>>>>> >
>>>>> > It would be used like so:
>>>>> >
>>>>> > InputStream is = ...
>>>>> > Iterable<DNASequence> seqs = SeqIO.parse(is, SeqIO.FASTA, SeqIO.DNA);
>>>>> > for (DNASequence s : seqs) {
>>>>> >   // do something
>>>>> > }
>>>>> >
>>>>> > Obviously it's not the prettiest and a lot could be changed, but that's my
>>>>> > initial design. I tried to base it off BioPython's SeqIO, but static typing
>>>>> > and the variety of Sequence types forced me to put in some nasty generics.
>>>>> > Any tips would be appreciated!
>>>>> >
>>>>> > David
>>>>> >
>>>>> > On Thu, Mar 29, 2012 at 4:27 AM, Hannes Brandstätter-Müller <
>>>>> > biojava at hannes.oib.com> wrote:
>>>>> >
>>>>> >> Yes, something like a simplifying and unifying wrapper would be what I
>>>>> >> am thinking of.
>>>>> >>
>>>>> >> Hannes
>>>>> >>
>>>>> >> On Thu, Mar 29, 2012 at 05:55, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>> >> > Hi Hannes,
>>>>> >> >
>>>>> >> > I guess this is pretty similar to:
>>>>> >> >
>>>>> >> > http://biojava.org/wiki/BioJava:CookBook:Core:FastaReadWrite
>>>>> >> >
>>>>> >> > we have also been using "proxy" objects to fetch sequence data on the fly
>>>>> >> >
>>>>> >> > http://biojava.org/wiki/BioJava:CookBook:Core:Sequences
>>>>> >> >
>>>>> >> > As such I think we should discuss this a bit more. If we can find a
>>>>> >> > common api that is simple and works with both local files as well as
>>>>> >> > remote proxy objects, that would be nice. There should be no need to
>>>>> >> > change much of the existing code, but perhaps there could be a
>>>>> >> > simplified wrapper for what is already there.
>>>>> >> >
>>>>> >> >  Andreas
>>>>> >> >
>>>>> >> > On Wed, Mar 28, 2012 at 12:04 PM, Hannes Brandstätter-Müller
>>>>> >> > <biojava at hannes.oib.com> wrote:
>>>>> >> >> Hi,
>>>>> >> >>
>>>>> >> >> I browsed around in the sister projects Biopython and Bioperl a bit,
>>>>> >> >> and noticed that much of the user interaction with the code goes
>>>>> >> >> through classes like SeqIO, SearchIO, AlignIO...
>>>>> >> >>
>>>>> >> >> So that got me thinking: how about we create similar "Interface"
>>>>> >> >> classes in Biojava?
>>>>> >> >>
>>>>> >> >> PROS:
>>>>> >> >>
>>>>> >> >>  - easy change for programmers who switch languages
>>>>> >> >>  - easy base interface that can be used in cookbook examples
>>>>> >> >>  - makes code more readable if designed properly
>>>>> >> >>  - easy access to features that are spread over the whole codebase but
>>>>> >> >> are connected anyway, like all file parsers
>>>>> >> >>
>>>>> >> >> CONS:
>>>>> >> >>
>>>>> >> >>  - another thing to maintain
>>>>> >> >>  - creates possible cross-dependencies (but if you don't want that,
>>>>> >> >> just use the existing classes directly)
>>>>> >> >>
>>>>> >> >>
>>>>> >> >> What are your thoughts?
>>>>> >> >>
>>>>> >> >> python from http://biopython.org/wiki/SeqIO:
>>>>> >> >>
>>>>> >> >> from Bio import SeqIO
>>>>> >> >> handle = open("example.fasta", "rU")
>>>>> >> >> for record in SeqIO.parse(handle, "fasta") :
>>>>> >> >>    print record.id
>>>>> >> >> handle.close()
>>>>> >> >>
>>>>> >> >> possible equivalent in biojava (support for streaming API, Iterators, etc?):
>>>>> >> >>
>>>>> >> >> import org.biojava3.util.SeqIO;
>>>>> >> >>
>>>>> >> >> File file = new File("example.fasta");
>>>>> >> >> SeqIO seqIO = new SeqIO(file, SeqIO.FASTA);
>>>>> >> >> while (seqIO.hasNext()) {
>>>>> >> >>    System.out.println(seqIO.next());
>>>>> >> >> }
>>>>> >> >> seqIO.close(); // close the reader; java.io.File has no close()
>>>>> >> >>
>>>>> >> >> Hannes
>>>>> >> >> _______________________________________________
>>>>> >> >> biojava-dev mailing list
>>>>> >> >> biojava-dev at lists.open-bio.org
>>>>> >> >> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>> >> >
>>>>> >> >
>>>>> >> >
>>>>> >> > --
>>>>> >> > -----------------------------------------------------------------------
>>>>> >> > Dr. Andreas Prlic
>>>>> >> > Senior Scientist, RCSB PDB Protein Data Bank
>>>>> >> > University of California, San Diego
>>>>> >> > (+1) 858.246.0526
>>>>> >> > -----------------------------------------------------------------------
>>>>> >>
>>>>> >
>>>>



