[Biojava-dev] Biojava.util package?

Hannes Brandstätter-Müller biojava at hannes.oib.com
Sat Mar 31 20:07:44 UTC 2012


Try:

http://biojava.org/wiki/BioJava:CookBook:Core:FastaReadWrite
http://biojava.org/wiki/BioJava:CookBook:genome:Overview
http://biojava.org/wiki/BioJava:CookBook3:FASTQ

Hannes

On Sat, Mar 31, 2012 at 19:16, David Felty <davfelty at gmail.com> wrote:
> I've been looking at the file parsers for BioPython and BioPerl, and
> here are some features I've compiled:
> Important features:
> - Conversion between file formats
> - Lazy IO; useful for large files
> - Use Iterable interface so we get Java foreach over sequences
> - Index sequences by ID (turn a list of sequences into a map from ID -> seq)
> - Fetching from remote databases
>
> Other features:
> - Restrict fields needed to speed up parsing; see
> http://bioperl.org/wiki/HOWTO:SeqIO#Speed.2C_Bio::Seq::SeqBuilder
> - Auto-detect file format (use file extension)
> - General-purpose API with sensible defaults for most cases, and a
> more specific but complex API for more control
> - Index sequences by a user-defined value
> - Store indexed database files locally (BioPython stores as a SQLite database)
>
> Does this look like a fair list? I tried to look for common use cases
> in BioJava's tutorial, but I only found this page, which comes from
> BioJava 1.8: http://biojava.org/wiki/BioJava:Tutorial:Sequence_IO_basics
> Are there any other useful sources I could look at? Or perhaps even
> some real-world code that makes use of parsers?
>
> Thanks,
> David
>
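
A minimal sketch of two items from the list above (lazy IO through Iterable, and indexing by ID), assuming a plain multi-line FASTA layout. FastaRecord, LazyFastaReader and SequenceIndex are hypothetical names made up for illustration; none of them are existing BioJava classes.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.NoSuchElementException;

// Hypothetical record type for illustration; not a BioJava class.
final class FastaRecord {
    final String id;
    final String sequence;

    FastaRecord(String id, String sequence) {
        this.id = id;
        this.sequence = sequence;
    }
}

// Hypothetical lazy reader: each call to next() parses exactly one record,
// so only one record is held in memory at a time.
final class LazyFastaReader implements Iterable<FastaRecord> {
    private final BufferedReader in;

    LazyFastaReader(Reader source) {
        this.in = new BufferedReader(source);
    }

    // Single-pass: every iterator shares the same underlying reader.
    @Override
    public Iterator<FastaRecord> iterator() {
        return new Iterator<FastaRecord>() {
            // Header line of the next unread record, or null at end of input.
            private String header = skipToHeader();

            private String readLine() {
                try {
                    return in.readLine();
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            }

            private String skipToHeader() {
                String line;
                while ((line = readLine()) != null) {
                    if (line.startsWith(">")) {
                        return line;
                    }
                }
                return null;
            }

            @Override
            public boolean hasNext() {
                return header != null;
            }

            @Override
            public FastaRecord next() {
                if (header == null) {
                    throw new NoSuchElementException();
                }
                // The ID is the first whitespace-delimited token after '>'.
                String id = header.substring(1).trim().split("\\s+")[0];
                StringBuilder residues = new StringBuilder();
                header = null;
                String line;
                while ((line = readLine()) != null) {
                    if (line.startsWith(">")) {
                        header = line; // keep the next record's header for later
                        break;
                    }
                    residues.append(line.trim());
                }
                return new FastaRecord(id, residues.toString());
            }
        };
    }
}

final class SequenceIndex {
    // Turns a stream of records into an insertion-ordered map from ID to record.
    static Map<String, FastaRecord> byId(Iterable<FastaRecord> records) {
        Map<String, FastaRecord> index = new LinkedHashMap<>();
        for (FastaRecord record : records) {
            index.put(record.id, record);
        }
        return index;
    }
}
```

With this shape, `for (FastaRecord r : new LazyFastaReader(reader))` works with the Java foreach loop. Note that SequenceIndex.byId reads every record into memory, so the index trades away the laziness of the reader; for very large files an on-disk index would be the likelier choice.
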
> On Fri, Mar 30, 2012 at 6:52 PM, P. Troshin <to.petr at gmail.com> wrote:
>>> But are there any additional features anyone wants me to consider? Once again, I don't
>>> have the same experience as many of you, so your input is very
>>> helpful!
>>
>> David, I think this is a pretty good list. Remember that you are getting
>> into something more than just a FASTA parser.
>>
>>> But here is what I've gathered so far from a combination
>>> of already-existing code and people's responses:
>>
>> I think this is a very good approach.
>> Look at the existing parsers in BioJava and beyond; the features that
>> are common will be the most important. Less common ones will be useful
>> in some cases but not in others. Come up with a set of use cases and try
>> using the parsers to achieve them; see how easy (or indeed possible)
>> it is going to be with various parsers. I appreciate this is a lot of
>> work, but this way you'll know by heart what constitutes a good
>> parser.
>> You can learn from many implementations to get your own just right.
>> Once you've done this, you are going to be the expert and will be able
>> to come up with a list of features in order of importance that your
>> parser is going to have, and have some guesstimate of how long it is
>> going to take you to implement them. Do not hesitate to ask the
>> community if there is something you cannot get your head around.
>>
>> Good luck,
>> Peter
>>
>>
>> On 30 March 2012 00:59, David Felty <davfelty at gmail.com> wrote:
>>>>I'd suggest you step aside from the
>>>>details of implementation. Think about what features your parser(s)
>>>>must have and then how you are going to achieve them
>>>
>>> Thank you for this! I now realize that I've been concentrating too
>>> much on the implementation rather than the features. The
>>> implementation will be important when (or if) I actually work on the
>>> project during GSoC, but for now, I'll try to focus on features for my
>>> proposal.
>>>
>>> Unfortunately, I'm not very acquainted with the world of computational
>>> biology, so I can't be sure what features would be most useful for the
>>> file parsers. But here is what I've gathered so far from a combination
>>> of already-existing code and people's responses:
>>> - Simple api
>>> - Robust
>>> - Extensible
>>> - Good performance
>>> - Feature-rich
>>> - Wide variety of parsers
>>> - Proxy-fetching from remote databases (by ID or location)
>>> - Local caching
>>> - Auto-detection of data type
>>> - Auto-detection of file format
>>> - Lazy IO
>>> - Random access file reading
>>>
>>> Obviously, these are not all of equal importance, so I'll have to pick
>>> out the most important ones for my proposal. But are there any
>>> additional features anyone wants me to consider? Once again, I don't
>>> have the same experience as many of you, so your input is very
>>> helpful!
>>>
>>> Thanks,
>>> David
>>>
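
The "proxy-fetching" and "local caching" items above can be sketched together. This is only an illustration under stated assumptions: CachingSequenceFetcher is a hypothetical class, the remote lookup is passed in as a plain function rather than a real database client, and the cache lives in memory instead of on disk.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical caching wrapper; not a BioJava class.
final class CachingSequenceFetcher {
    // Stand-in for a remote database call, mapping an ID to sequence text.
    private final Function<String, String> remoteLookup;
    private final Map<String, String> cache = new HashMap<>();
    private int remoteCalls = 0;

    CachingSequenceFetcher(Function<String, String> remoteLookup) {
        this.remoteLookup = remoteLookup;
    }

    // Returns the cached sequence if present, otherwise asks the remote source once.
    String fetch(String id) {
        String cached = cache.get(id);
        if (cached != null) {
            return cached;
        }
        remoteCalls++;
        String sequence = remoteLookup.apply(id);
        if (sequence != null) {
            cache.put(id, sequence); // misses (null) are not cached
        }
        return sequence;
    }

    // Number of times the remote source was actually consulted.
    int remoteCalls() {
        return remoteCalls;
    }
}
```

Fetching the same ID twice through `new CachingSequenceFetcher(id -> lookup(id))` would reach the remote source only once. A persistent cache would swap the HashMap for a file or database store behind the same fetch method.
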
>>> On Thu, Mar 29, 2012 at 5:49 PM, P. Troshin <to.petr at gmail.com> wrote:
>>>> Hi David,
>>>>
>>>> Great to see such a discussion! You can see how important your work
>>>> is going to be for the Bio community.
>>>>
>>>> Now, what you need to do is to try taking into account what other
>>>> people were suggesting and put it into your proposal. It's not going
>>>> to be any good just to add a bunch of opinions; you need to come up
>>>> with a coherent proposal. For this I'd suggest you step aside from the
>>>> details of implementation. Think about what features your parser(s)
>>>> must have and then how you are going to achieve them.
>>>> I'd suggest that your parsers should be
>>>> - easy to use (IMHO this is something BioJava 1 FASTA parser lacked)
>>>> - robust
>>>> - extensible
>>>> - have good performance
>>>> - most importantly, have sufficiently rich feature set so that we can
>>>> replace other parsers (for the same format) in BioJava with yours.
>>>>
>>>> Do not forget to split your work in several achievable stages.
>>>>
>>>> I'd be careful about transferring the design from Python, and
>>>> especially from a decade-old Perl implementation, straight to Java.
>>>> While high-level concepts may be similar, implementation details should
>>>> not be. It's not that there is anything wrong with these parsers, it's
>>>> just that the languages are different. It is good to know how things
>>>> are done elsewhere, but I'd suggest that for a Java implementation you
>>>> should take inspiration from some well-known Java feature. For
>>>> example, the Java Collections, a set of highly regarded tools for
>>>> working with various collections of objects. Also do some reading on
>>>> Java enums; your proposed implementation will definitely benefit from
>>>> using them.
>>>>
>>>> Have fun,
>>>>
>>>> Regards,
>>>> Peter
>>>>
>>>>
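
To illustrate the enum suggestion above: a rough sketch of how a format enum could replace integer constants, with each constant carrying its own behaviour instead of a switch. FileFormat and countRecords are hypothetical names for illustration only, and the FASTQ branch assumes the common four-lines-per-record layout.

```java
import java.util.List;

// Hypothetical format enum for illustration; not part of BioJava.
enum FileFormat {
    FASTA {
        @Override
        int countRecords(List<String> lines) {
            // One record per header line.
            int count = 0;
            for (String line : lines) {
                if (line.startsWith(">")) {
                    count++;
                }
            }
            return count;
        }
    },
    FASTQ {
        @Override
        int countRecords(List<String> lines) {
            // Assumes exactly four lines per record (no wrapped sequences).
            return lines.size() / 4;
        }
    };

    // Each constant supplies its own implementation, so callers never switch on the format.
    abstract int countRecords(List<String> lines);
}
```

With int constants a call such as `parse(is, 7, ...)` compiles and only fails at run time; with an enum parameter the compiler rejects anything that is not a declared format, and adding a format forces its behaviour to be implemented.
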
>>>> On 29 March 2012 16:39, David Felty <davfelty at gmail.com> wrote:
>>>>> Hey Andreas,
>>>>>
>>>>> It wouldn't be too difficult to make a method that can infer the
>>>>> file type using the file extension. In fact, it looks like BioPerl's
>>>>> SeqIO does something like this. On the other hand, BioPython's SeqIO
>>>>> takes the route of "explicit is better than implicit," and requires
>>>>> that you explicitly give the format. Perhaps BioJava could take both
>>>>> routes, and have an overloaded parse method that infers the file type,
>>>>> along with the regular explicit method.
>>>>>
>>>>> As for non-fasta files, I implemented a couple of FASTQ parsers here:
>>>>> http://pastebin.com/KLcpq8Qb
>>>>> This would work similarly:
>>>>>
>>>>> InputStream is = ...
>>>>> Iterable<ProteinSequence> seqs = SeqIO.parse(is, SeqIO.FASTQ_SANGER, SeqIO.PROTEIN);
>>>>>
>>>>>
>>>>> It looks like the other sequence readers aren't as clear-cut, so they
>>>>> may need a bit more wrapping before they can be adapted to this
>>>>> method. A common problem is that sequence readers don't return a
>>>>> specific type of sequence, like with
>>>>> org.biojava3.core.sequence.loader.UniprotProxySequenceReader, which
>>>>> just contains the sequence data in itself. We might want to create
>>>>> methods that convert the UniprotProxySequenceReader into sequences
>>>>> that make more sense, like DNASequence and ProteinSequence.
>>>>>
>>>>> I'll look into this more later, I have to go to class.
>>>>>
>>>>> Regards,
>>>>> David
>>>>>
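
The "both routes" idea above (an explicit format argument plus an overload that infers it from the file extension) might look roughly like this. KnownFormat, FormatResolver and the extension lists are assumptions made for illustration; they are not taken from BioJava, BioPerl or BioPython.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

// Hypothetical format enum with an assumed list of extensions per format.
enum KnownFormat {
    FASTA("fasta", "fa", "fna", "faa"),
    FASTQ("fastq", "fq");

    private final List<String> extensions;

    KnownFormat(String... extensions) {
        this.extensions = Arrays.asList(extensions);
    }

    // Infers the format from the text after the last dot, ignoring case.
    static Optional<KnownFormat> fromFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return Optional.empty();
        }
        String extension = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        for (KnownFormat format : values()) {
            if (format.extensions.contains(extension)) {
                return Optional.of(format);
            }
        }
        return Optional.empty();
    }
}

final class FormatResolver {
    // Explicit route: the caller's choice always wins.
    static KnownFormat resolve(String fileName, KnownFormat explicitFormat) {
        return explicitFormat;
    }

    // Implicit route: infer from the extension, or fail loudly.
    static KnownFormat resolve(String fileName) {
        return KnownFormat.fromFileName(fileName).orElseThrow(
                () -> new IllegalArgumentException("Cannot infer format from: " + fileName));
    }
}
```

Extension matching is only a heuristic: a compressed file such as reads.fastq.gz is not recognised by this sketch, and a mislabelled file would be parsed with the wrong format, which is why the explicit overload stays available.
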
>>>>> On Thu, Mar 29, 2012 at 10:39 AM, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> so far it still feels like a wrapper for what is already there. Try to
>>>>>> take it to the next level. Why does the user still need to provide the
>>>>>> type of file, can't this be auto-detected? What is the behaviour for
>>>>>> non-fasta files, what can be supported and where are the limits, etc.
>>>>>>
>>>>>> Andreas
>>>>>>
>>>>>> On Thu, Mar 29, 2012 at 6:55 AM, David Felty <davfelty at gmail.com> wrote:
>>>>>> > I've actually been working on something like this for my GSoC proposal,
>>>>>> > here's what I came up with:
>>>>>> >
>>>>>> > public class SeqIO {
>>>>>> >    public static final int FASTA = 0;
>>>>>> >    public static final int FASTQ = 1;
>>>>>> >    public static final Class<DNASequence> DNA = DNASequence.class;
>>>>>> >    public static final Class<ProteinSequence> PROTEIN = ProteinSequence.class;
>>>>>> >
>>>>>> >    @SuppressWarnings("unchecked")
>>>>>> >    public static <S extends Sequence> Iterable<S> parse(InputStream is,
>>>>>> >            int fileFormat, Class<S> seqType) throws Exception {
>>>>>> >        switch (fileFormat) {
>>>>>> >            case FASTA:
>>>>>> >                if (seqType == DNA) {
>>>>>> >                    // readFastaDNASequence returns a map of ID -> sequence,
>>>>>> >                    // so iterate over its values
>>>>>> >                    return (Iterable<S>)
>>>>>> >                            FastaReaderHelper.readFastaDNASequence(is).values();
>>>>>> >                } else if (seqType == PROTEIN) {
>>>>>> >                    // etc...
>>>>>> >                }
>>>>>> >                break;
>>>>>> >            case FASTQ:
>>>>>> >                // etc...
>>>>>> >        }
>>>>>> >        // Reached when no branch above returned a result
>>>>>> >        throw new IllegalArgumentException("Unsupported format or sequence type");
>>>>>> >    }
>>>>>> > }
>>>>>> >
>>>>>> > It would be used like so:
>>>>>> >
>>>>>> > InputStream is = ...
>>>>>> > Iterable<DNASequence> seqs = SeqIO.parse(is, SeqIO.FASTA, SeqIO.DNA);
>>>>>> > for (DNASequence s : seqs) {
>>>>>> >   // do something
>>>>>> > }
>>>>>> >
>>>>>> > Obviously it's not the prettiest and a lot could be changed, but that's my
>>>>>> > initial design. I tried to base it off BioPython's SeqIO, but static typing
>>>>>> > and the variety of Sequence types forced me to put in some nasty generics.
>>>>>> > Any tips would be appreciated!
>>>>>> >
>>>>>> > David
>>>>>> >
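
One possible way to soften the "nasty generics" mentioned above is to pass a factory function instead of a Class token, so the compiler can check the element type without an unchecked cast. This is a sketch under assumptions: DnaSeq, ProteinSeq and TypedParser are hypothetical stand-ins, not the BioJava sequence classes, and the input is taken as already-split sequence strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical stand-ins for concrete sequence types.
final class DnaSeq {
    final String residues;

    DnaSeq(String residues) {
        this.residues = residues;
    }
}

final class ProteinSeq {
    final String residues;

    ProteinSeq(String residues) {
        this.residues = residues;
    }
}

final class TypedParser {
    // The factory decides which sequence type is built, so no cast is needed.
    static <S> List<S> parseAll(List<String> rawSequences, Function<String, S> factory) {
        List<S> parsed = new ArrayList<>();
        for (String raw : rawSequences) {
            parsed.add(factory.apply(raw));
        }
        return parsed;
    }
}
```

A call then reads `List<DnaSeq> dna = TypedParser.parseAll(lines, DnaSeq::new);`, and passing `ProteinSeq::new` while assigning to a `List<DnaSeq>` is rejected at compile time.
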
>>>>>> > On Thu, Mar 29, 2012 at 4:27 AM, Hannes Brandstätter-Müller <
>>>>>> > biojava at hannes.oib.com> wrote:
>>>>>> >
>>>>>> >> Yes, something like a simplifying and unifying wrapper would be what I
>>>>>> >> am thinking of.
>>>>>> >>
>>>>>> >> Hannes
>>>>>> >>
>>>>>> >> On Thu, Mar 29, 2012 at 05:55, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>> >> > Hi Hannes,
>>>>>> >> >
>>>>>> >> > I guess this is pretty similar to:
>>>>>> >> >
>>>>>> >> > http://biojava.org/wiki/BioJava:CookBook:Core:FastaReadWrite
>>>>>> >> >
>>>>>> >> > we have also been using "proxy" objects to fetch sequence data on the fly
>>>>>> >> >
>>>>>> >> > http://biojava.org/wiki/BioJava:CookBook:Core:Sequences
>>>>>> >> >
>>>>>> >> > As such I think we should discuss this a bit more. If we can find a
>>>>>> >> > common api that is simple and works with both local files as well as
>>>>>> >> > remote proxy objects, that would be nice. There should be no need to
>>>>>> >> > change much of the existing code, but perhaps there could be a
>>>>>> >> > simplified wrapper for what is already there.
>>>>>> >> >
>>>>>> >> >  Andreas
>>>>>> >> >
>>>>>> >> > On Wed, Mar 28, 2012 at 12:04 PM, Hannes Brandstätter-Müller
>>>>>> >> > <biojava at hannes.oib.com> wrote:
>>>>>> >> >> Hi,
>>>>>> >> >>
>>>>>> >> >> I browsed around in the sister projects Biopython and Bioperl a bit,
>>>>>> >> >> and noticed that much of the user interaction with the code goes
>>>>>> >> >> through classes like SeqIO, SearchIO, AlignIO...
>>>>>> >> >>
>>>>>> >> >> So that got me thinking: how about we create similar "Interface"
>>>>>> >> >> classes in Biojava?
>>>>>> >> >>
>>>>>> >> >> PROS:
>>>>>> >> >>
>>>>>> >> >>  - easy change for programmers who switch languages
>>>>>> >> >>  - easy base interface that can be used in cookbook examples
>>>>>> >> >>  - makes code more readable if designed properly
>>>>>> >> >>  - easy access to features that are spread over the whole codebase but
>>>>>> >> >> are connected anyway, like all file parsers
>>>>>> >> >>
>>>>>> >> >> CONS:
>>>>>> >> >>
>>>>>> >> >>  - another thing to maintain
>>>>>> >> >>  - creates possible cross-dependencies (but if you don't want that,
>>>>>> >> >> just use the existing classes directly)
>>>>>> >> >>
>>>>>> >> >>
>>>>>> >> >> What are your thoughts?
>>>>>> >> >>
>>>>>> >> >> python from http://biopython.org/wiki/SeqIO:
>>>>>> >> >>
>>>>>> >> >> from Bio import SeqIO
>>>>>> >> >> handle = open("example.fasta", "rU")
>>>>>> >> >> for record in SeqIO.parse(handle, "fasta") :
>>>>>> >> >>    print record.id
>>>>>> >> >> handle.close()
>>>>>> >> >>
>>>>>> >> >> possible equivalent in biojava (support for streaming API, Iterators, etc?):
>>>>>> >> >>
>>>>>> >> >> import org.biojava3.util.SeqIO;
>>>>>> >> >>
>>>>>> >> >> File file = new File("example.fasta");
>>>>>> >> >> SeqIO seqIO = new SeqIO(file, SeqIO.FASTA);
>>>>>> >> >> while (seqIO.hasNext()) {
>>>>>> >> >>    System.out.println(seqIO.next());
>>>>>> >> >> }
>>>>>> >> >> seqIO.close(); // java.io.File has no close(); the reader would own the stream
>>>>>> >> >>
>>>>>> >> >> Hannes
>>>>>> >> >> _______________________________________________
>>>>>> >> >> biojava-dev mailing list
>>>>>> >> >> biojava-dev at lists.open-bio.org
>>>>>> >> >> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>>>> >> >
>>>>>> >> >
>>>>>> >> >
>>>>>> >> > --
>>>>>> >> > -----------------------------------------------------------------------
>>>>>> >> > Dr. Andreas Prlic
>>>>>> >> > Senior Scientist, RCSB PDB Protein Data Bank
>>>>>> >> > University of California, San Diego
>>>>>> >> > (+1) 858.246.0526
>>>>>> >> > -----------------------------------------------------------------------
>>>>>> >>



