[Biojava-dev] BioJava 3 code usage examples

Thu Nov 20 22:26:07 UTC 2008

Hi Richard,

I spent some time reading the code today. I found that you had packaged the
biojava3 modules in a different style from the old version. I guess that some
of the reasons are related to the new design philosophy and some are related
to Maven (I am new to Maven). The things that are not clear to me are:

1) It doesn't seem that you want to avoid name conflicts with the old version,
because you continue to use the package name "org.biojava.*"
instead of "org.biojava3.*".

2) The old biojava version arranges sequence-related classes hierarchically,
while in the new version you put the FASTA parsing classes directly under a
first-level node, "org.biojava.fasta", rather than under "org.biojava.seq" as
before. There are tens of popular file formats in the bioinformatics world,
so will all of them crowd the first-level nodes under the root package?

3) The source files are now in much deeper paths. For example, for
the FASTA parser the path is "src/main/java/org/biojava/fasta",
as opposed to the common style "src/org/biojava/fasta", so I am
wondering why it is necessary to add "main/java" in the middle of the
path.

4) It is interesting to see that you keep the source code of each
sub-package separately, so whenever I need to browse the code of some
related classes in Windows Explorer or a Unix shell, I have to go up and
down the tree with many more clicks or keystrokes. The NetBeans IDE
alleviates this problem a little. I understand the idea of separating
independent packages in the new design, but I am wondering whether the
current very fine separation of classes goes too far.

I am not familiar with the new design, so forgive my ignorance. Thanks for your
time.

Hongyu Zhang, Ph.D.
Ceres Inc., Thousand Oaks, CA
Cell: 805-405-5394
Fax: 866-447-8750

________________________________
From: Hongyu Zhang <me at hongyu.org>
To: holland at eaglegenomics.com
Sent: Wednesday, November 19, 2008 10:55:06 AM
Subject: Re: [Biojava-dev] BioJava 3 code usage examples

Thanks for the quick response, Richard. I will dive deeper into your code.

 Best,

Hongyu Zhang, Ph.D.

________________________________
From: Richard Holland <holland at eaglegenomics.com>
To: Hongyu Zhang <me at hongyu.org>
Cc: biojava-dev <biojava-dev at lists.open-bio.org>
Sent: Wednesday, November 19, 2008 4:00:17 AM
Subject: Re: [Biojava-dev] BioJava 3 code usage examples

Hello.

Thanks for your feedback. You are right that we've continued to
provide a Symbol-based alphabet/symbol structure, but it is no longer
a central concept, and using it is not required.

You'll notice that when FASTA is read using the new parser, it reads
the sequence from the FASTA file as a simple String (actually, a
CharSequence). If you want to work with it as a String/CharSequence
and don't want to convert it into Symbols/Lists, you can do so. This
is the big change from the existing BioJava way of doing things, which
automatically converts everything into the BioJava object model
instead of giving the user the choice of what to do with it. This
change is consistent with the part of the design document you quote in
your email.
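
As a rough illustration of the idea (this is NOT the actual BioJava 3
parser API; the class and method names below are made up for the example),
reading the residues of a FASTA record as plain text, with no conversion
into Symbol objects, could look something like this:

```java
public class FastaAsText {

    // Hypothetical helper, not part of BioJava: returns the residues of the
    // first FASTA record as plain text. No Symbol objects are created.
    public static CharSequence firstSequence(String fastaText) {
        StringBuilder residues = new StringBuilder();
        boolean inRecord = false;
        for (String line : fastaText.split("\\r?\\n")) {
            line = line.trim();
            if (line.isEmpty()) {
                continue;               // skip blank lines
            }
            if (line.startsWith(">")) {
                if (inRecord) {
                    break;              // second header: first record is complete
                }
                inRecord = true;        // header of the first record
            } else if (inRecord) {
                residues.append(line);  // sequence lines are simply concatenated
            }
        }
        return residues;
    }

    public static void main(String[] args) {
        String fasta = ">seq1 example\nACGTAC\nGTTA\n>seq2\nGGGG\n";
        CharSequence seq = firstSequence(fasta);
        System.out.println(seq.length()); // 10
        System.out.println(seq);          // ACGTACGTTA
    }
}
```

The caller gets a CharSequence back and can stop there, or pass it on to
whatever object model it prefers.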

So, this is giving users the choice of whether they want to work with
the sequences directly as Strings/CharSequences, or whether they want
to convert them into Symbols/Lists. Users can then tailor their choice
depending on locally observed speed/memory usage issues should they so
wish.
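
To make the two options concrete, here is a minimal sketch. The Nucleotide
enum and the toSymbols method are hypothetical stand-ins for a symbol object
model, not BioJava classes. It runs the same substring search on the plain
String and on a list of symbol objects; it makes no claim about which is
faster, which is something users would have to measure locally:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class TextVersusSymbols {

    // Hypothetical stand-in for a symbol object; not a BioJava class.
    public enum Nucleotide { A, C, G, T }

    // Converts plain text into a list of symbol objects.
    // Throws IllegalArgumentException for characters other than A, C, G, T.
    public static List<Nucleotide> toSymbols(CharSequence seq) {
        List<Nucleotide> symbols = new ArrayList<Nucleotide>(seq.length());
        for (int i = 0; i < seq.length(); i++) {
            char c = Character.toUpperCase(seq.charAt(i));
            symbols.add(Nucleotide.valueOf(String.valueOf(c)));
        }
        return symbols;
    }

    public static void main(String[] args) {
        String seq = "ACGTACGTTA";
        String motif = "GTT";

        // Option 1: stay with the String and use the standard library search.
        int textHit = seq.indexOf(motif);

        // Option 2: convert to symbol objects first, then search the lists.
        int symbolHit = Collections.indexOfSubList(toSymbols(seq), toSymbols(motif));

        System.out.println(textHit);   // 6
        System.out.println(symbolHit); // 6
    }
}
```

Both searches find the motif at the same position; the difference is only
in how much object creation happens before the search.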

cheers,
Richard

2008/11/19 Hongyu Zhang <me at hongyu.org>:
> Hi Richard,
>
> Thanks for your great work! I noticed from your examples that you decided
> to continue to use the Symbol object-based model to represent sequences,
> even though the Biojava3 design page
> ( http://biojava.org/wiki/BioJava3_Design ) says
> "Sequences are perfectly happy as Strings unless you want to do complex
> things like store base quality information, and only at that point
> should you want to convert them into more complex object models."
>
>
> The original Biojava tutorial (
> http://biojava.org/wiki/BioJava:Tutorial:Symbols_and_SymbolLists#Doesn.27t_this_all_waste_memory.3F
> ) discusses the memory difference between the Symbol object-based and
> String-based sequence representations, but it doesn't address speed.
> One advantage of the Java String library is that it is optimized with
> native machine code, so I think a Symbol object-based sequence
> representation would be slower than a String-based one for certain
> operations such as substring search.
>
> Let me know if I missed something. Thanks!
>
> Best,
>
> Hongyu Zhang, Ph.D.
>

-- 
Richard Holland, BSc MBCS
Finance Director, Eagle Genomics Ltd
M: +44 7500 438846 | E: holland at eaglegenomics.com
http://www.eaglegenomics.com/