[Biojava-l] Parsing INSDseq Sequences (1.3 & 1.4)
Richard Holland
richard.holland at ebi.ac.uk
Mon Jun 12 08:37:23 UTC 2006
Typo in code. My fault. Try again!
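
Once the other-seqids values do come back, you still have to pick the GI number out of them yourself. A rough sketch of that step, assuming the values have already been pulled out as plain strings; the OtherSeqIds class and findGi method are made-up names for illustration and are not part of BioJava:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not part of BioJava: picks the GI number out of
// INSDSeqid strings such as "gi|17861571".
class OtherSeqIds {

    // Returns the text after "gi|" from the first matching id, if any.
    static Optional<String> findGi(List<String> seqIds) {
        for (String id : seqIds) {
            String trimmed = id.trim();
            if (trimmed.startsWith("gi|")) {
                // Drop the "gi|" prefix and any trailing '|' separator.
                String value = trimmed.substring(3);
                if (value.endsWith("|")) {
                    value = value.substring(0, value.length() - 1);
                }
                return Optional.of(value);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // The two INSDSeqid values from the AY069118 record quoted below.
        List<String> ids = Arrays.asList("gb|AY069118.1|", "gi|17861571");
        System.out.println(findGi(ids).orElse("no GI found"));
    }
}
```

If none of the ids starts with "gi|" the result is empty, which matches the "hope that one of them is the GI number" situation with INSDseq.
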
On Thu, 2006-06-08 at 10:23 -0400, Seth Johnson wrote:
> I'm still getting an empty array back from this:
>
> Note[] myAccs = ((RichAnnotation) rs.getAnnotation())
>         .getProperties(INSDseqFormat.Terms.getOtherSeqIdTerm());
>
> Here's the file that I'm parsing:
> ~~~~~~~~~~~~~~~~~~~~~~
> <?xml version="1.0"?>
> <!DOCTYPE INSDSet PUBLIC "-//NCBI//INSD INSDSeq/EN"
> "http://www.ncbi.nlm.nih.gov/dtd/INSD_INSDSeq.dtd">
> <INSDSet>
> <INSDSeq>
> <INSDSeq_locus>AY069118</INSDSeq_locus>
> <INSDSeq_length>1502</INSDSeq_length>
> <INSDSeq_strandedness>single</INSDSeq_strandedness>
> <INSDSeq_moltype>mRNA</INSDSeq_moltype>
> <INSDSeq_topology>linear</INSDSeq_topology>
> <INSDSeq_division>INV</INSDSeq_division>
> <INSDSeq_update-date>17-DEC-2001</INSDSeq_update-date>
> <INSDSeq_create-date>15-DEC-2001</INSDSeq_create-date>
> <INSDSeq_definition>Drosophila melanogaster GH13089 full length
> cDNA</INSDSeq_definition>
> <INSDSeq_primary-accession>AY069118</INSDSeq_primary-accession>
> <INSDSeq_accession-version>AY069118.1</INSDSeq_accession-version>
> <INSDSeq_other-seqids>
> <INSDSeqid>gb|AY069118.1|</INSDSeqid>
> <INSDSeqid>gi|17861571</INSDSeqid>
> </INSDSeq_other-seqids>
> <INSDSeq_keywords>
> <INSDKeyword>FLI_CDNA</INSDKeyword>
> </INSDSeq_keywords>
> <INSDSeq_source>Drosophila melanogaster (fruit
> fly)</INSDSeq_source>
> <INSDSeq_organism>Drosophila melanogaster</INSDSeq_organism>
> <INSDSeq_taxonomy>Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta;
> Pterygota; Neoptera; Endopterygota; Diptera; Brachycera; Muscomorpha;
> Ephydroidea; Drosophilidae; Drosophila</INSDSeq_taxonomy>
> <INSDSeq_references>
> <INSDReference>
> <INSDReference_reference>1 (bases 1 to
> 1502)</INSDReference_reference>
> <INSDReference_position>1..1502</INSDReference_position>
> <INSDReference_authors>
> <INSDAuthor>Stapleton,M.</INSDAuthor>
> <INSDAuthor>Brokstein,P.</INSDAuthor>
> <INSDAuthor>Hong,L.</INSDAuthor>
> <INSDAuthor>Agbayani,A.</INSDAuthor>
> <INSDAuthor>Carlson,J.</INSDAuthor>
> <INSDAuthor>Champe,M.</INSDAuthor>
> <INSDAuthor>Chavez,C.</INSDAuthor>
> <INSDAuthor>Dorsett,V.</INSDAuthor>
> <INSDAuthor>Farfan,D.</INSDAuthor>
> <INSDAuthor>Frise,E.</INSDAuthor>
> <INSDAuthor>George,R.</INSDAuthor>
> <INSDAuthor>Gonzalez,M.</INSDAuthor>
> <INSDAuthor>Guarin,H.</INSDAuthor>
> <INSDAuthor>Li,P.</INSDAuthor>
> <INSDAuthor>Liao,G.</INSDAuthor>
> <INSDAuthor>Miranda,A.</INSDAuthor>
> <INSDAuthor>Mungall,C.J.</INSDAuthor>
> <INSDAuthor>Nunoo,J.</INSDAuthor>
> <INSDAuthor>Pacleb,J.</INSDAuthor>
> <INSDAuthor>Paragas,V.</INSDAuthor>
> <INSDAuthor>Park,S.</INSDAuthor>
> <INSDAuthor>Phouanenavong,S.</INSDAuthor>
> <INSDAuthor>Wan,K.</INSDAuthor>
> <INSDAuthor>Yu,C.</INSDAuthor>
> <INSDAuthor>Lewis,S.E.</INSDAuthor>
> <INSDAuthor>Rubin,G.M.</INSDAuthor>
> <INSDAuthor>Celniker,S.</INSDAuthor>
> </INSDReference_authors>
> <INSDReference_title>Direct Submission</INSDReference_title>
> <INSDReference_journal>Submitted (10-DEC-2001) Berkeley
> Drosophila Genome Project, Lawrence Berkeley National Laboratory, One
> Cyclotron Road, Berkeley, CA 94720, USA</INSDReference_journal>
> </INSDReference>
> </INSDSeq_references>
> <INSDSeq_comment>Sequence submitted by: Berkeley Drosophila Genome
> Project Lawrence Berkeley National Laboratory Berkeley, CA 94720 This
> clone was sequenced as part of a high-throughput process to sequence
> clones from Drosophila Gene Collection 1 (Rubin et al., Science 2000).
> The sequence has been subjected to integrity checks for sequence
> accuracy, presence of a polyA tail and contiguity within 100 kb in the
> genome. Thus we believe the sequence to reflect accurately this
> particular cDNA clone. However, there are artifacts associated with
> the generation of cDNA clones that may have not been detected in our
> initial analyses such as internal priming, priming from contaminating
> genomic DNA, retained introns due to reverse transcription of
> unspliced precursor RNAs, and reverse transcriptase errors that result
> in single base changes. For further information about this sequence,
> including its location and relationship to other sequences, please
> visit our Web site ( http://fruitfly.berkeley.edu) or send email to
> cdna at fruitfly.berkeley.edu.</INSDSeq_comment>
> <INSDSeq_feature-table>
> <INSDFeature>
> <INSDFeature_key>source</INSDFeature_key>
> <INSDFeature_location>1..1502</INSDFeature_location>
> <INSDFeature_intervals>
> <INSDInterval>
> <INSDInterval_from>1</INSDInterval_from>
> <INSDInterval_to>1502</INSDInterval_to>
> <INSDInterval_accession>AY069118.1</INSDInterval_accession>
> </INSDInterval>
> </INSDFeature_intervals>
> <INSDFeature_quals>
> <INSDQualifier>
> <INSDQualifier_name>organism</INSDQualifier_name>
> <INSDQualifier_value>Drosophila
> melanogaster</INSDQualifier_value>
> </INSDQualifier>
> <INSDQualifier>
> <INSDQualifier_name>mol_type</INSDQualifier_name>
> <INSDQualifier_value>mRNA</INSDQualifier_value>
> </INSDQualifier>
> <INSDQualifier>
> <INSDQualifier_name>strain</INSDQualifier_name>
> <INSDQualifier_value>y; cn bw sp</INSDQualifier_value>
> </INSDQualifier>
> <INSDQualifier>
> <INSDQualifier_name>db_xref</INSDQualifier_name>
> <INSDQualifier_value>taxon:7227</INSDQualifier_value>
> </INSDQualifier>
> <INSDQualifier>
> <INSDQualifier_name>map</INSDQualifier_name>
> <INSDQualifier_value>39B3-39B3</INSDQualifier_value>
> </INSDQualifier>
> </INSDFeature_quals>
> </INSDFeature>
> <INSDFeature>
> <INSDFeature_key>gene</INSDFeature_key>
> <INSDFeature_location>1..1502</INSDFeature_location>
> <INSDFeature_intervals>
> <INSDInterval>
> <INSDInterval_from>1</INSDInterval_from>
> <INSDInterval_to>1502</INSDInterval_to>
> <INSDInterval_accession> AY069118.1</INSDInterval_accession>
> </INSDInterval>
> </INSDFeature_intervals>
> <INSDFeature_quals>
> <INSDQualifier>
> <INSDQualifier_name>gene</INSDQualifier_name>
> <INSDQualifier_value>E2f2</INSDQualifier_value>
> </INSDQualifier>
> <INSDQualifier>
> <INSDQualifier_name>note</INSDQualifier_name>
> <INSDQualifier_value>alignment with genomic scaffold
> AE003669</INSDQualifier_value>
> </INSDQualifier>
> <INSDQualifier>
> <INSDQualifier_name>db_xref</INSDQualifier_name>
>
> <INSDQualifier_value>FLYBASE:FBgn0024371</INSDQualifier_value>
> </INSDQualifier>
> </INSDFeature_quals>
> </INSDFeature>
> <INSDFeature>
> <INSDFeature_key>CDS</INSDFeature_key>
> <INSDFeature_location>189..1301</INSDFeature_location>
> <INSDFeature_intervals>
> <INSDInterval>
> <INSDInterval_from>189</INSDInterval_from>
> <INSDInterval_to>1301</INSDInterval_to>
> <INSDInterval_accession> AY069118.1</INSDInterval_accession>
> </INSDInterval>
> </INSDFeature_intervals>
> <INSDFeature_quals>
> <INSDQualifier>
> <INSDQualifier_name>gene</INSDQualifier_name>
> <INSDQualifier_value>E2f2</INSDQualifier_value>
> </INSDQualifier>
> <INSDQualifier>
> <INSDQualifier_name>note</INSDQualifier_name>
> <INSDQualifier_value>Longest ORF</INSDQualifier_value>
> </INSDQualifier>
> <INSDQualifier>
> <INSDQualifier_name>codon_start</INSDQualifier_name>
> <INSDQualifier_value>1</INSDQualifier_value>
> </INSDQualifier>
> <INSDQualifier>
> <INSDQualifier_name>transl_table</INSDQualifier_name>
> <INSDQualifier_value>1</INSDQualifier_value>
> </INSDQualifier>
> <INSDQualifier>
> <INSDQualifier_name>product</INSDQualifier_name>
> <INSDQualifier_value>GH13089p</INSDQualifier_value>
> </INSDQualifier>
> <INSDQualifier>
> <INSDQualifier_name>protein_id</INSDQualifier_name>
> <INSDQualifier_value>AAL39263.1</INSDQualifier_value>
> </INSDQualifier>
> <INSDQualifier>
> <INSDQualifier_name>db_xref</INSDQualifier_name>
> <INSDQualifier_value>GI:17861572</INSDQualifier_value>
> </INSDQualifier>
> <INSDQualifier>
> <INSDQualifier_name>db_xref</INSDQualifier_name>
>
> <INSDQualifier_value>FLYBASE:FBgn0024371</INSDQualifier_value>
> </INSDQualifier>
> <INSDQualifier>
> <INSDQualifier_name>translation</INSDQualifier_name>
>
> <INSDQualifier_value>MYKRKTASIVKRDSSAAGTTSSAMMMKVDSAETSVRSQSYESTPVSMDTSPDPPTPIKSPSNSQSQSQPGQQRSVGSLVLLTQKFVDLVKANEGSIDLKAATKILDVQKRRIYDITNVLEGIGLIDKGRHCSLVRWRGGGFNNAKDQENYDLARSRTNHLKMLEDDLDRQLEYAQRNLRYVMQDPSNRSYAYVTRDDLLDIFGDDSVFTIPNYDEEVDIKRNHYELAVSLDNGSAIDIRLVTNQGKSTTNPHDVDGFFDYHRLDTPSPSTSSHSSEDGNAPACAGNVITDEHGYSCNPGMKDEMKLLENELTAKIIFQNYLSGHSLRRFYPDDPNLENPPLLQLNPPQEDFNFALKSDEGICELFDVQCS</INSDQualifier_value>
> </INSDQualifier>
> </INSDFeature_quals>
> </INSDFeature>
> </INSDSeq_feature-table>
>
> <INSDSeq_sequence>AAGAATAGAGGGAGAATGAAAAAAATGACATAAATGGCGGAAAGCAAACCTAGCGCCAACATTCGTATTTTCGTTTAATTTTCGCTCCAAAGTGCAATTAATTCCGGCTTCTTGATCGCTGCATATTGAGTGCAGCCACGCAAAGAGTTACAAGGACAGGAGTATAGTCATCGAGTCGATTGCGGACCATGTACAAGCGCAAAACCGCGAGTATTGTTAAAAGAGACAGCTCCGCAGCGGGCACCACCTCCTCGGCTATGATGATGAAGGTGGATTCGGCTGAGACTTCGGTCCGGTCGCAGAGCTACGAGTCTACACCCGTTAGCATGGACACATCACCGGATCCTCCAACGCCAATCAAGTCTCCGTCGAATTCACAATCGCAATCGCAGCCTGGACAACAGCGCTCCGTGGGCTCACTGGTCCTGCTCACACAGAAGTTTGTGGATCTCGTGAAGGCCAACGAAGGATCCATCGACCTGAAAGCGGCAACCAAAATCTTGGACGTACAGAAGCGCCGAATATACGATATTACCAATGTTTTAGAGGGCATTGGACTAATTGATAAGGGCAGACACTGCTCCCTAGTGCGCTGGCGCGGAGGGGGCTTTAACAATGCCAAGGACCAAGAGAACTACGACCTGGCACGTAGCCGGACTAATCATTTGAAAATGTTGGAGGATGACCTAGACAGGCAACTGGAGTATGCACAGCGCAATCTGCGCTACGTTATGCAGGATCCCTCGAATAGGTCGTATGCATATGTGACACGTGATGATCTGCTGGACATCTTTGGAGATGATTCCGTATTCACAATACCTAATTATGACGAGGAAGTAGATATCAAGCGTAATCATTACGAGCTGGCCGTGTCGCTGGACAATGGCAGCGCAATTGACATTCGCCTGGTGACGAACCAAGGAAAGAGTACTACAAATCCGCACGATGTGGATGGGTTCTTTGACTATCACCGTCTGGACACGCCCTCACCCTCGACGTCGTCGCACTCCAGCGAGGATGGTAACGCTCCAGCATGCGCGGGGAACGTGATCACCGACGAGCACGGTTACTCGTGCAATCCCGGGATGAAAGATGAGATGAAACTTTTGGAGAACGAGCTGACGGCCAAGATAATCTTCCAAAATTATCTGTCCGGTCATTCGCTGCGGCGATTTTATCCCGATGATCCGAATCTAGAAAACCCGCCGCTGCTGCAGCTGAATCCTCCGCAGGAAGACTTCAACTTTGCGTTAAAAAGCGACGAAGGTATTTGCGAGCTGTTTGATGTTCAGTGCTCCTAACTGTGGAAGGGGATGTACACCTTAGGACTATAGCTACACTGCAACTGGCCGCGTGCATTGTGCAAATATTTATGATTAGTACAATTTTGACTTTGGATTTCTCTATATCGTCTAGAAATTTTTAATTAGTGTAATACCTTGTAATTTCGCAAATAACAGCAAAACCAATAAATTCGTAAATGCAAAAAAAAAAAAAAAAAA</INSDSeq_sequence>
> </INSDSeq>
> </INSDSet>
> ~~~~~~~~~~~~~~~~~~~~~~
>
> On 6/8/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> Yesterday I think I said I was going to add other-seqids but I forgot to
> do it, so I did it just now. Try it and see. Use the new
> INSDseqFormat.Terms.getOtherSeqIdTerm() term to find them.
>
> cheers,
> Richard
>
> On Wed, 2006-06-07 at 19:48 -0400, Seth Johnson wrote:
> > Hi Richard,
> >
> > I still cannot locate the GI number for the main sequence. After I
> > parse it with readINSDseqDNA, I then use:
> >
> > Note[] myAccs = ((RichAnnotation) rs.getAnnotation())
> >         .getProperties(Terms.getAdditionalAccessionTerm());
> >
> > However, the 'myAccs' appears to be empty. Am I on the wrong track to
> > get to other-seqids???
> >
> > On 6/6/06, Richard Holland <richard.holland at ebi.ac.uk> wrote:
> > GenBank has a separate line for GI number, so it can be parsed out
> > nicely. INSDseq does not, so you have to rely on the other-seqids tag
> > and hope that one of them is the GI number. However it seems I have not
> > included that tag in the parser, so I will include it. This will make
> > the other-seqids values available through the notes with the term
> > Terms.getAdditionalAccessionTerm(), but getIdentifier() will remain
> > null.
> >
> > For your second question, the tutorial makes the mistake in several
> > places of saying getNoteSet(Terms.blahblah()). This was shorthand for:
> >
> > rs.getAnnotation().getProperty(Terms.blahblah())
> > (for single values)
> >
> > or
> >
> > ((RichAnnotation)rs.getAnnotation()).getProperties(Terms.blahblah())
> > (for multiple values)
> >
> > but never got expanded. Maybe someone can fix that one day... :)
> >
> > I'm just updating INSDseq to 1.4 now. The guys next door gave me the
> > details of the changes, and told me that 1.3 is actually no longer
> > supported by them after Friday this week! So I'll make it 1.4 only.
> >
> > cheers,
> > Richard
> >
> --
> Richard Holland (BioMart Team)
> EMBL-EBI
> Wellcome Trust Genome Campus
> Hinxton
> Cambridge CB10 1SD
> UNITED KINGDOM
> Tel: +44-(0)1223-494416
>
>
>
>
> --
> Best Regards,
>
>
> Seth Johnson
> Senior Bioinformatics Associate
>
> Ph: (202) 470-0900
> Fx: (775) 251-0358
--
Richard Holland (BioMart Team)
EMBL-EBI
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
UNITED KINGDOM
Tel: +44-(0)1223-494416