[Biojava-l] GenBank parsing
simon rayner
simon.rayner.cn at gmail.com
Wed Jun 3 12:29:06 UTC 2015
Thanks to all for taking the time to answer.
I had already got as far as parsing out the feature information using
something like
LinkedHashMap<String, DNASequence> dnaSequences =
        GenbankReaderHelper.readGenbankDNASequence(dnaFile);
for (DNASequence sequence : dnaSequences.values()) {
    List<FeatureInterface<AbstractSequence<NucleotideCompound>, NucleotideCompound>> fl =
            sequence.getFeatures();
    for (FeatureInterface fi : fl) {
        HashMap<String, Qualifier> quals = fi.getQualifiers();
        for (Map.Entry<String, Qualifier> entry : quals.entrySet()) {
            logger.info("--\t" + entry.getKey() + "\t|\t"
                    + entry.getValue().getName()
                    + " / " + entry.getValue().getValue()
                    + "\\" + entry.getValue().toString());
        }
        logger.info("SHORT\t" + fi.getShortDescription());
        logger.info("SOURCE\t" + fi.getSource());
        logger.info("TYPE\t" + fi.getType());
        logger.info("HASHCODE\t" + fi.hashCode());
        logger.info("-");
    }
}
But I am still stumped as to how to access the annotation information at
the top of a GenBank file.
For example, getAccession() gets me the accession number of the sequence, but
what about all the other data that is there (e.g. the PubMed records)?
In BJ3, there was a RichAnnotation class, but I don't see anything
equivalent in BJ4.
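One way to get at the header without any library support is to scan it
directly. The sketch below is plain Java and does not use any BioJava API.
It assumes the usual GenBank flat-file layout: the keyword sits in the first
12 columns, the value starts at column 13, continuation lines begin with
blanks, and the header ends at the FEATURES line. The class name
GenbankHeaderSketch, the method parseHeader and the sample record are made up
for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of BioJava: collects GenBank header fields.
class GenbankHeaderSketch {

    // Maps each header keyword (e.g. ACCESSION, PUBMED) to its values,
    // in file order. Stops reading at FEATURES, ORIGIN or "//".
    static Map<String, List<String>> parseHeader(BufferedReader reader) {
        Map<String, List<String>> header = new LinkedHashMap<>();
        List<String> current = null; // values of the keyword seen last
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                if (line.startsWith("FEATURES") || line.startsWith("ORIGIN")
                        || line.startsWith("//")) {
                    break; // end of the header block
                }
                if (line.trim().isEmpty()) {
                    continue;
                }
                // Assumption: keyword in columns 1-12, value from column 13.
                int split = Math.min(12, line.length());
                String key = line.substring(0, split).trim();
                String value = line.substring(split).trim();
                if (key.isEmpty()) {
                    // Continuation line: append to the most recent value.
                    if (current != null) {
                        int last = current.size() - 1;
                        current.set(last, (current.get(last) + " " + value).trim());
                    }
                } else {
                    current = header.computeIfAbsent(key, k -> new ArrayList<>());
                    current.add(value);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return header;
    }

    public static void main(String[] args) {
        // Made-up record for illustration only, not real GenBank data.
        String sample =
                "LOCUS       TEST0001     10 bp    DNA\n"
              + "DEFINITION  First line of definition\n"
              + "            second line.\n"
              + "ACCESSION   XX000001\n"
              + "REFERENCE   1  (bases 1 to 10)\n"
              + "  PUBMED    1234567\n"
              + "REFERENCE   2  (bases 1 to 10)\n"
              + "  PUBMED    7654321\n"
              + "FEATURES             Location/Qualifiers\n"
              + "     source          1..10\n";
        Map<String, List<String>> header =
                parseHeader(new BufferedReader(new StringReader(sample)));
        System.out.println("PUBMED: " + header.get("PUBMED"));
        System.out.println("DEFINITION: " + header.get("DEFINITION"));
    }
}
```

Note that this flattens the header: sub-keywords such as PUBMED are collected
in file order but are not tied back to the REFERENCE block they came from, so
it is only a stopgap for pulling out individual fields.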
cheers
Simon
On Wed, Jun 3, 2015 at 12:39 PM, Paolo Pavan <paolo.pavan at gmail.com> wrote:
> Hi Simon,
> I took care of the latest updates to the GenBank parser (reader). As things
> currently stand, there are two ways to read annotated GenBank files: via
> GenbankReader and via GenbankProxySequenceReader.
>
> The first one:
> GenbankReader<ProteinSequence, AminoAcidCompound> GenbankProtein =
>     new GenbankReader<ProteinSequence, AminoAcidCompound>(
>         inStream,
>         new GenericGenbankHeaderParser<ProteinSequence, AminoAcidCompound>(),
>         new ProteinSequenceCreator(AminoAcidCompoundSet.getAminoAcidCompoundSet())
>     );
> LinkedHashMap<String, ProteinSequence> proteinSequences = GenbankProtein.process();
> inStream.close();
>
>
> The second one is:
>
> GenbankProxySequenceReader<AminoAcidCompound> genbankProteinReader =
>     new GenbankProxySequenceReader<AminoAcidCompound>(
>         "/my_directory", "NP_000257",
>         AminoAcidCompoundSet.getAminoAcidCompoundSet());
> ProteinSequence proteinSequence = new ProteinSequence(genbankProteinReader);
>
>
> Just remember to use NucleotideCompound and a
> DNASequenceCreator(DNACompoundSet.getDNACompoundSet()) if you need to parse
> GenBank nucleotide files.
>
> You can access the stored annotation through the getFeatures() family of
> methods on the sequence object you have read. Note that features have
> qualifiers (the entries starting with / in the GenBank file), which must be
> accessed from the feature object with getQualifiers().
> Features can also have complex locations (rare, but they occur); in that
> case you will find nested locations in the retrieved feature.
>
> Does this answer your question?
> Bye bye,
> Paolo
>
> 2015-06-03 10:27 GMT+02:00 Jose Manuel Duarte <jose.duarte at psi.ch>:
>
>> I can't offer much help regarding GenBank parsing itself, but I would at
>> least like to clarify the situation with the different (indeed confusing)
>> versions:
>>
>> BJ4 is the current release, well maintained and under development. BJ3
>> has been completely superseded by BJ4. That means that BJ4 does everything
>> that BJ3 did. In the cookbook and tutorials everything that refers to BJ3
>> should work in BJ4, with the only difference that the namespace of packages
>> has changed from org.biojava.bio/org.biojava3 to org.biojava.nbio.
>>
>> BJ1 and BJX are both legacy projects, with some maintenance but not much
>> active development. I believe that some of the features in them were not
>> ported to BJ3+.
>>
>> Cheers
>>
>> Jose
>>
>>
>>
>> On 02.06.2015 11:40, Simon Rayner wrote:
>>
>>> Hi
>>>
>>> I'm coming back to BioJava (BJ) after a couple of years away and am
>>> somewhat confused by the current collection of cookbooks, tutorials and
>>> APIs. There appear to be a few examples for handling protein structure
>>> data, but relatively little for more mainstream stuff such as parsing
>>> Genbank files, which I first need to get the information I want to
>>> investigate protein structure. But when I look at the relevant code samples
>>> to do this, they refer back to BJ3, BJ1, or even BJX. Even the Wiki page
>>> still refers to BJ3 despite the release of BJ4 back in Feb 2015.
>>>
>>> I have everything working for parsing GenBank data, but I'm still trying
>>> to get the Annotation information out of the top of a GenBank file, and
>>> can't find any way of doing this using BJ4 - the BJ4 API appears to refer
>>> to the RichAnnotation type in the BJX release. Can anyone clarify what you
>>> are supposed to do here? Start mixing in some BJX (and is BJX still active?),
>>> or should I still be using BJ3 until BJ4 stabilizes? I realise this is an
>>> open source project, but some clarification on the current status of things
>>> would be handy if the project is going to appeal to a larger community :)
>>>
>>> Thanks!
>>>
>>>
>>>
>>> _______________________________________________
>>> Biojava-l mailing list - Biojava-l at mailman.open-bio.org
>>> http://mailman.open-bio.org/mailman/listinfo/biojava-l
>>>
>>
>> _______________________________________________
>> Biojava-l mailing list - Biojava-l at mailman.open-bio.org
>> http://mailman.open-bio.org/mailman/listinfo/biojava-l
>>
>
>
> _______________________________________________
> Biojava-l mailing list - Biojava-l at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biojava-l
>