[Bioperl-l] xml sequence download from ncbi

Geer, Lewis (NLM) lewisg@mail.nih.gov
Tue, 5 Sep 2000 12:18:38 -0400


Hi, Ralf,

Thanks, I've reported it to the developer.  There are two issues listed in
your message.  The first is that the record is incomplete.  All sequence
records from NCBI derive from the ASN.1 record, including the text format
you attach below, so the ASN.1 record is not the problem.  Rather, the XML
output below includes only the nucleotide part of a nuc-prot record.  The
information you want is in the part of the record common to both the
nucleotide sequence and protein sequence.  The solution is to output the
common section.
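
A rough way to check a downloaded record for this symptom is sketched here in
Python, using only the standard xml.etree.ElementTree module.  The element
name Seqdesc_title is an assumption: it follows the Seqdesc_* naming pattern
of the quoted XML, and it presumes the missing definition would arrive as a
title descriptor.  The SAMPLE string is a trimmed illustration modelled on
the quoted record, not actual NCBI output.

```python
import xml.etree.ElementTree as ET

# Trimmed illustration modelled on the quoted record (not actual NCBI output).
SAMPLE = """<Seq-entry>
  <Seq-entry_seq>
    <Bioseq>
      <Bioseq_descr>
        <Seq-descr>
          <Seqdesc>
            <Seqdesc_molinfo>
              <MolInfo>
                <MolInfo_biomol value="mRNA">3</MolInfo_biomol>
              </MolInfo>
            </Seqdesc_molinfo>
          </Seqdesc>
        </Seq-descr>
      </Bioseq_descr>
    </Bioseq>
  </Seq-entry_seq>
</Seq-entry>"""


def descriptor_kinds(xml_text):
    """Return the tag of every descriptor choice found under a Seqdesc element."""
    root = ET.fromstring(xml_text)
    return [child.tag for desc in root.iter("Seqdesc") for child in desc]


def has_title(xml_text):
    # "Seqdesc_title" is an assumed element name (see the note above).
    return "Seqdesc_title" in descriptor_kinds(xml_text)


print(descriptor_kinds(SAMPLE))
print(has_title(SAMPLE))
```

If has_title() returns False for a record whose flat file shows a DEFINITION
line, the download is probably missing the common section described above.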

The second issue you report is that there is a temporary ID in the record
(tmpseq_1), which shouldn't be there.  This needs to be deleted.
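
Until the output is corrected at the source, a reader of the XML could drop
local IDs on the client side.  The following is only a sketch of such a
workaround in Python, not the fix that will be made at NCBI.  The function
name drop_local_ids is hypothetical, and the SAMPLE string is cut down from
the Bioseq_id block quoted below.

```python
import xml.etree.ElementTree as ET

# Cut down from the quoted Bioseq_id block; for illustration only.
SAMPLE = """<Bioseq_id>
  <Seq-id>
    <Seq-id_local>
      <Object-id>
        <Object-id_str>tmpseq_1</Object-id_str>
      </Object-id>
    </Seq-id_local>
  </Seq-id>
  <Seq-id>
    <Seq-id_gi>3095101</Seq-id_gi>
  </Seq-id>
</Bioseq_id>"""


def drop_local_ids(xml_text):
    """Remove every Seq-id that wraps a Seq-id_local.

    Returns the modified root element and the list of removed ID strings.
    """
    root = ET.fromstring(xml_text)
    removed = []
    for seq_id in list(root):  # copy, because we mutate root while looping
        is_local = any(child.tag == "Seq-id_local" for child in seq_id)
        if seq_id.tag == "Seq-id" and is_local:
            removed.extend(e.text for e in seq_id.iter("Object-id_str"))
            root.remove(seq_id)
    return root, removed


root, removed = drop_local_ids(SAMPLE)
print(removed)
print([seq_id[0].tag for seq_id in root])
```

The GenBank accession and gi identifiers are left untouched, so the record
can still be looked up by either one.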

Lewis

> -----Original Message-----
> From: Sigmund, Ralf [mailto:Ralf.Sigmund@MPIHAN.MPG.de]
> Sent: Tuesday, September 05, 2000 7:10 AM
> To: Geer, Lewis (NLM); 'Bioperl'
> Subject: AW: [Bioperl-l] xml sequence download from ncbi
> 
> 
> Hi!
> I have been toying around with this.
> I compared what I get when I query Genbank with the gi 3095101
> 
> The genbank result starts with:
>  LOCUS       AF043257     2981 bp    mRNA            ROD       05-MAY-1998
>  DEFINITION  Mus musculus beta5B integrin mRNA, complete cds.
>  ACCESSION   AF043257
>  VERSION     AF043257.1  GI:3095101
> 
> but the downloaded sequence in XML format does not include the DEFINITION
> data.
> The Object ID is tmpseq_1, and there is no way to find out that this entry
> represents an integrin mRNA.
> Now I wonder whether this is due to the ASN.1 format the XML output is
> based on, or to a one-off inconsistency in the database data?
> I append the XML result I got from:
> http://www.ncbi.nlm.nih.gov/entrez/viewer.cgi?cmd&save=on&view=xml&val=3095101
> Thanks for your help!
> Ralf
> 
> <?xml version="1.0"?>
> <!--DOCTYPE Seq-entry PUBLIC "-//NCBI//NCBI Seqset/EN" "NCBI_Seqset.dtd"-->
> <Seq-entry>
>   <Seq-entry_seq>
>     <Bioseq>
>       <Bioseq_id>
>         <Seq-id>
>           <Seq-id_local>
>             <Object-id>
>               <Object-id_str>tmpseq_1</Object-id_str>
>             </Object-id>
>           </Seq-id_local>
>         </Seq-id>
>         <Seq-id>
>           <Seq-id_genbank>
>             <Textseq-id>
>               <Textseq-id_name>AF043257</Textseq-id_name>
>               <Textseq-id_accession>AF043257</Textseq-id_accession>
>               <Textseq-id_version>1</Textseq-id_version>
>             </Textseq-id>
>           </Seq-id_genbank>
>         </Seq-id>
>         <Seq-id>
>           <Seq-id_gi>3095101</Seq-id_gi>
>         </Seq-id>
>       </Bioseq_id>
>       <Bioseq_descr>
>         <Seq-descr>
>           <Seqdesc>
>             <Seqdesc_molinfo>
>               <MolInfo>
>                 <MolInfo_biomol value="mRNA">3</MolInfo_biomol>
>                 <MolInfo_completeness value="complete">1</MolInfo_completeness>
>               </MolInfo>
>             </Seqdesc_molinfo>
>           </Seqdesc>
>           <Seqdesc>
>             <Seqdesc_update-date>
>               <Date>
>                 <Date_std>
>                   <Date-std>
>                     <Date-std_year>1998</Date-std_year>
>                     <Date-std_month>5</Date-std_month>
>                     <Date-std_day>5</Date-std_day>
>                   </Date-std>
>                 </Date_std>
>               </Date>
>             </Seqdesc_update-date>
>           </Seqdesc>
>         </Seq-descr>
>       </Bioseq_descr>
>       <Bioseq_inst>
>         <Seq-inst>
>           <Seq-inst_repr value="raw"/>
>           <Seq-inst_mol value="rna"/>
>           <Seq-inst_length>2981</Seq-inst_length>
>           <Seq-inst_strand value="ss"/>
>           <Seq-inst_seq-data>
>             <Seq-data>
>               <Seq-data_iupacna>
>  
> <IUPACna>GGGGGCTCGGCGAGGTGCGTCCGGAGCAGCGACAACTCCGAGCGTCCCAGCGG
> GCCAGCGAGGAGGA
> TGGTGGCGGCCGGGCGCGGACCAGCCCGGCCGCGGGCGCCGTGAGCCGGAGCGCAGCGCCCG
> GCATGCGGCTGCGG
> TCCCCGGCCTCGGCCCCGCTCCGCCCCCGCCGAGCGCCCCAGCCGAGCGGCGCGCATCATGC
> CGCGGGTGCCCGCG
> ACCCTCTACGCCTGTCTGCTCGGGCTCTGCGCGCTCGTTCCGCGCCTCGCAGGGCTCAACAT
> ATGCACTAGTGGAA
> GTGCCACCTCGTGTGAAGAATGCCTGTTGATCCACCCAAAATGTGCCTGGTGCTCCAAAGAG
> TACTTTGGCAATCC
> ACGGTCCATCACCTCTCGGTGTGACCTGAAGGCAAACCTCATCCGGAATGGCTGTGAAGGTG
> AGATTGAGAGTCCA
> GCCAGCAGCACCCACGTCCTCCGGAACCTACCTCTCAGCAGCAAGGGTTCCAGTGCCACGGG
> CTCTGACGTCATCC
> AGATGACGCCGCAGGAGATTGCAGTGAGCCTCCGGCCAGGCGAGCAGACTACGTTCCAGCTG
> CAGGTGCGCCAGGT
> GGAGGACTACCCTGTAGACCTGTACTACCTGATGGACCTCTCCCTCTCCATGAAGGATGACT
> TGGAGAACATCCGG
> AGCCTGGGCACCAAGCTTGCGGAGGAAATGAGGAAGCTCACTAGTAACTTCCGCTTAGGTTT
> CGGGTCTTTTGTTG
> ACAAGGACATCTCTCCTTTCTCCTACACGGCACCGAGATACCAGACCAATCCGTGTATTGGT
> TACAAGTTATTCCC
> CAACTGCGTCCCCTCCTTCGGGTTCCGGCATCTGCTGCCTCTCACAGACAGAGTCGACAGCT
> TCAACGAGGAAGTG
> AGGAAGCAGAGGGTGTCCCGGAACCGAGATGCCCCCGAGGGGGGGTTTGATGCGGTCCTCCA
> GGCTGCTGTCTGCA
> AGGAGAAGATCGGATGGCGAAAAGATGCTCTGCACTTGCTGGTGTTCACAACAGACGATGTG
> CCCCACATCGCACT
> GGATGGAAAACTGGGTGGCCTGGTCCAGCCCCACGATGGCCAGTGTCACCTGAATGAAGCCA
> ATGAGTACACAGCC
> TCTAACCAGATGGACTATCCATCGCTTGCCTTGCTTGGGGAGAAGCTGGCAGAGAACAATAT
> CAACCTCATTTTTG
> CTGTGACGAAGAACCACTATATGCTCTACAAGAATTTTACAGCCCTGATACCTGGAACCACT
> GTGGAGATTTTGCA
> TGGAGATTCCAAAAATATTATTCAACTGATTATCAATGCGTACAGTAGCATCCGGGCTAAAG
> TGGAGCTGTCAGTG
> TGGGATCAGCCAGAAGACCTTAATCTCTTCTTCACTGCCACCTGCCAAGATGGCATATCTTA
> CCCTGGTCAGAGGA
> AGTGTGAGGGTCTGAAGATTGGGGACACGGCATCCTTTGAAGTGTCCGTGGAGGCTCGGAGC
> TGCCCCGGCAGACA
> AGCAGCACAGTCTTTCACCTTGAGGCCCGTGGGCTTCCGGGACAGTCTGCAGGTGGAAGTCG
> CCTACAATTGCACA
> TGCGGCTGTAGCACGGGGCTGGAGCCCAACAGTGCCAGATGCAGTGGGAATGGAACATACAC
> CTGTGGGCTGTGCG
> AGTGTGACCCCGGCTACCTGGGCACTAGGTGCGAGTGCCAGGAGGGGGAGAACCAGAGCGGG
> TACCAGAACCTGTG
> CCGGGAGGCAGAGGGCAAGCCTCTGTGCAGCGGGCGTGGAGAGTGTAGCTGCAACCAGTGCT
> CCTGCTTCGAGAGT
> GAGTTCGGGAGGATCTACGGACCTTTCTGCGAGTGTGACAGCTTTTCCTGTGCCAGAAACAA
> GGGCGTCCTATGCT
> CAGGCCATGGAGAGTGTCACTGTGGAGAATGCAAATGCCACGCAGGTTACATTGGGGACAAT
> TGTAACTGCTCAAC
> AGACGTCAGCACATGCAAGGCCAAGGATGGGCAGATCTGCAGTGACCGAGGCCGTTGTGTCT
> GTGGGCAGTGCCAG
> TGCACAGAGCCTGGAGCCTTTGGGGAGACGTGTGAGAAGTGCCCAACCTGCCCGGATGCTTG
> CAGCTCTAAGAGAG
> ACTGTGTCGAATGCTTGCTACTTCACCAGGGGAAACCTGACAACCAGACCTGCCACCACCAG
> TGCAAAGATGAGGT
> GATCACGTGGGTAGACACCATCGTCAAAGATGACCAGGAGGCTGTGCTTTGCTTCTACAAAA
> CTGCTAAGGACTGC
> GTTATGATGTTCAGCTACACAGAACTGCCCAATGGGAGGTCCAACTTGACGGTCCTCCGGGA
> GCCAGAATGTGGAA
> GTGCCCCCAATGCCATGACCATCCTGCTGGCTGTGGTTGGCAGCATCCTCCTGATTGGGATG
> GCACTCCTGGCCAT
> CTGGAAGCTGCTCGTCACCATCCACGACCGCCGAGAGTTTGCCAAGTTCCAAAGCCTCAAAC
> CCCCTGTACAGAAA
> GCCCATCTCCACACACACTGTCGATTTCGCCTTCAACAAGTTCAACAAATCCTACAATGGCT
> CAGTGGACTGAGGC
> TCCTGGATGGCTGGAGGGGGACTAAGGATGAAGACTCTGGCGTGCCTTGGACTTCCTGGACC
> ATTTGCTCACGCTA
> GCTAGGCACGCACGGATAATGGAGATGCCCTCCATTGAGCCCTAAGGGACCTGGTAGCCACA
> CAGCGGGCCACAGG
> CACTTGGGGCCACTTCCCTCCAAGCCAGGGAAAGCAAGGAGACTCTGGTGTTCTCAGCTTCC
> CCTCTGCCGCCTCC
> AGCTTGCTGTCTCCATGAACCTCTGAAGGCCTGGCTGCCCTCTTCCCTGCTGGGCCAGACAA
> GAAGGTATCCGGAA
> GAGTCTGTGTGTACAAAGCTAGCGCGCAGCCTGGCTTTTTCCAGTTGATCGTTTTTTTTTCT
> ATGAAATAAAAAGG
> TCACGCATTTAAAAAAAAAAAAAAAA</IUPACna>
>               </Seq-data_iupacna>
>             </Seq-data>
>           </Seq-inst_seq-data>
>         </Seq-inst>
>       </Bioseq_inst>
>     </Bioseq>
>   </Seq-entry_seq>
> </Seq-entry>
> 
> _______________________________________________
> Bioperl-l mailing list
> Bioperl-l@bioperl.org
> http://bioperl.org/mailman/listinfo/bioperl-l