[Biojava-dev] Fwd: [Biojava-l] file i/o with ArrayList
stefan harjes
stefanharjes at yahoo.de
Fri Feb 13 09:53:40 UTC 2015
Hi Erik,
thanks for the offer, but the little mistake with the enumeration was very quick to find. There is an increment of one added to the start position of each location for some reason. If you change it from 1 to 0, the sequence location start positions are no longer incremented.
you can find it in line 284 of GenericInsdcHeaderFormat.java
CheersStefan
Erik McKee <emckee2006 at gmail.com> schrieb am 0:13 Freitag, 13.Februar 2015:
I'm the author of the genbank writer and would greatly appreciate fixes/improvements... I initially ported it from BioPython and simply kept their function structure intact. I focused on the portions my application needed, so there are likely gaps. I'm sure there are bugs as well, and likely can help look over patches to it. Thanks in advance and sorry for any issues/bugs that may exist in the code...On Feb 12, 2015 6:06 AM, "Paolo Pavan" <paolo.pavan at gmail.com> wrote:
Stefan,
Don't misunderstand me! Anyone effort generally is more than welcome by lead developers and yes, the design is not simple and we have already talked about the idea of a refactoring that probably will be rediscussed and planned in future. Just try to be more graceful!
In the specific:
- patch for multiple loading is great (twin GenbankProxySequenceLoader still needs to be verified, if you want).
- Just review the tags you added, I added one note directly in github. Fell free to add examples in test files if you are conviced.
- about references, database and so, new properties need to be added in AbstractSequence, ask Andreas if there are concern with this
- the Writer subsystem needs to be verified according to recent modifications in loaders. My suspect is that will fail in situation described above, but will not take long to review. I you would like to do this I'm sure that Andreas will be very happy to include your patches
- You're right, initial coordinates seems to need a fix. Moreover the ordering of qualifiers is shuffled respect to the original, but I don't think this is an issue when reloading the file.
This will fix your original use case with limited work, in my opinion.
Bye bye and thank you if you will decide to use your time.
Paolo
2015-02-12 2:32 GMT+01:00 stefan harjes <stefanharjes at yahoo.de>:
Hi Paolo, Andreas,
I am sorry if I sounded disrespectful.
I would like to point out that a new user gets confused by what has been published about the biojava library. There are several places where you strongly indicate that concatenated sequences are the intended design (see citations below).
I interpreted the below citations, that the library is aimed at concatenated sequences. However the actual reading reads only one record. I thought it to be very helpful to correct this discrepancy and submit a patch. Usually patches are reviewed by a maintainer who either accepts or rejects the pull request. And I would like to mention, that I also spent a lot of time to understand and correct the issue.
I would be glad if you find a way how to include contributions.
I would also like to mention, that there are errors during Genbank reading/writing. When I compare an original Genbank sequence to one which has been first read and then written, I can see that there are several differences between the two files. The most urgent of which is that the Location start of each feature is incremented by one for each read/write cycle. There are also some minor issues like: the version field is shortened, references and organism are dropped, keywords and source are not copied etc. So it seems you are in need of additional contributions.
citations:
from the cookbook:
/*
* Method 2: With the GenbankReaderHelper
*/
//Try with the GenbankReaderHelper
File dnaFile = new File("src/test/resources/NM_000266.gb");
File protFile = new File("src/test/resources/BondFeature.gb");
LinkedHashMap<String, DNASequence> dnaSequences = GenbankReaderHelper.readGenbankDNASequence( dnaFile );
for (DNASequence sequence : dnaSequences.values()) {
System.out.println( sequence.getSequenceAsString() );
}
without knowing the contents of 'NM_000266.gb' the reader must assume, that there are several sequences in the file as first: The LinkedHashMap is called 'dnaSequences" with emphasis on the plural. Second if you read only one DNASequence why would you have a LinkedHashMap and why would you loop over one! sequence? Correct me if I am wrong, but in my opinion the cookbook expects concatenated sequences per single file.
For non concatenated sequences speaks, that the method itself is named 'readGenbankDNASequences'. So I looked into the method to gain more clarity.
from the source code of GenbankReader:/*** This method tries to parse maximum <code>max</code> records from* the open File or InputStream, and leaves the underlying resource open.<br>...
The introducing comment of the method clearly speaks of multiple records. Themethod is called with a parameter 'max=-1' to indicate that all records of thefile should be read. Interestingly the parameter max is not mentioned again in the following code and thus not implemented.
So do you not agree, that the design discussion of whether or not concatenatedsequence files are expected is not decided in your library?
Best regardsStefan
Andreas Prlic <andreas at sdsc.edu> schrieb am 9:10 Donnerstag, 12.Februar 2015:
Thanks, Paolo, I would like to second what you said.
We don't have many rules at BioJava, but "respect the work of others" is one that we used to send out to developers in the past before we granted SVN write access. I would like to put up this rule somewhere where it can be noticed. Nobody gets paid to contribute to BioJava and we are depending on contribution made on a volunteer basis. As such it is important to show respect for the work that people contributed previously. This does not mean we can't change things, it just means we do it in a way where everybody can feel respected.
Andreas
On Wed, Feb 11, 2015 at 3:12 AM, Paolo Pavan <paolo.pavan at gmail.com> wrote:
Hi Stefan,
I don't want to talk about the design of the system. I know what you mean because I have already had the need to work on that and, yes, there are some choices that I would have not taken. But my choice, entering in a collaborative project, was to mantain the planned design to be sure to add some extra features without run the risk of getting lost in huge modifications, minimize the impact of modifications on the API and moreover to have some kind of "respect" for the work of people before me.
Unless the system works in reasonable time, in my opinion this is the important thing.
Moreover be aware that there is also a parallel GenbankProxyLoader system that I'm pretty sure that it is not receiving your patches (catenated sequence loading). To be checked.
About section keys, unknown tags and collecting them in a list, I partially agree with you (more no than yes) because the Genbank format is highly specified and formal, please have a look at the link below. The best think would have to throw an Exception in unsupported cases but this would definitely limit the parser usage.
http://www.insdc.org/files/feature_table.html
About reference, authors and dblink, as prevously mentioned, I have no exceptions to add those properties to AbstractSequence even if I feel they make poor sense outside your specific use case.
Andreas has the last word on this, however.
My suggestion is just to carefully check your new added tags since it seems to me that SOURCE_TAG is the same of your DBSOURCE but I know that this topic is also debated in the others bio* projects so it is not trivial.
Happy biojava-ing,
Paolo
2015-02-11 0:13 GMT+01:00 stefan harjes <stefanharjes at yahoo.de>:
Hi Paolo,
come on now the reader returns a LinkedHashMap. If it would only read one sequence, it could simply return the sequence. Also the actual method call contains a integer parameter indicating how many sequences should be fetched. In my pull request I patched the reader so that it actually does that.
The sectionKey tags I was talking about are actually 'PRIMARY', DBREFERENCE' and "DBLINK'. I think it would be better not to ask after every single key, but simply collect all 'not known' keys together with their values and store them in a list. Then you could just read/write them without silently forget about them. Right now if any unknown tag is read, it is simply recognized and then dropped, which I do not consider friendly.
CheersStefan
Paolo Pavan <paolo.pavan at gmail.com> schrieb am 23:56 Dienstag, 10.Februar 2015:
Hi Stefan, thank you for the review.
You are actually surprising me since if I'm not sure that the reader parser supports multiple genbank files catenated I tought instead that all the info now are full filled in the sequence object.
There are just few tags that are not imported (KEYWORDS_TAG, SOURCE_TAG, REFERENCE_TAG, BASE_COUNT_TAG), the documentation says that this is because they are anyway inferrable by different fields. I can also add this is because, as in the case of authors and reference tags, there is not such a property in the AbstractSequence class and I see poor sense to have it, unless you are doing this sort of swapping job. Anyway, it could be certainly added.
About writer and its failure in writing the reported accession: even if I can't go in deep now, it may well be that it is failing in writing InsdcLocations (also known as split locations, for example join(58474..59052,59052..59279) reported in your genome file) since they have been used more sistematically by the last updated genbank reader. It may need a quick review along with db_xref qualifiers as well.
In the end about alignment, if you are using NedlemanWunsch or such in the alignment package, be sure to load them with the proper AmbiguityDNACompoundSet.
Cheers,
Paolo
2015-02-10 10:28 GMT+01:00 stefan harjes <stefanharjes at yahoo.de>:
Hi Paolo, biojava-dev
I had a look myself. First I noticed, that GenbankWriter was actually more sophisticated than the Reader, as it was able to write more than one sequence. I submitted a pull request to patch GenbankReader which enables reading more than one genbank sequence from one file. When we speak of full Genbank reading capability, there are still at least 5 sectionKeys which are just ignored in the reader. I think there should be a way of simply storing them in a List and not asking for each one of them, maybe I will look there later.
The writer is doing pretty well, but you should try to write 'NC_000913.gb' which crashed it in my case (writing nothing/no exception).
I added two more test cases, but I think in order to really test the reader/writer capabilities we need a test where several sequences/proteins are read, merge into an array and written to stream. Upon reading this stream again, we should compare if they are still identical.
Also I noticed, that you can not compare (align) a DNA sequence with non ambiguous nucleotide to a sequence with ambiguous nucleotide compounds even though a matrix dedicated for that exact comparison exists.
CheersStefan
_______________________________________________
biojava-dev mailing list
biojava-dev at mailman.open-bio.org
http://mailman.open-bio.org/mailman/listinfo/biojava-dev
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biojava-dev/attachments/20150213/4dde8ff7/attachment-0001.html>
More information about the biojava-dev
mailing list