[Bioperl-l] SeqIO issue? EUtilities Cookbook

Sat Mar 27 13:51:14 UTC 2010

Hi Chris,
I also see that the NCBI toolkit has a fair amount of code that deals with ASN.1 
conversion. They even have some precompiled code:

http://www.ncbi.nlm.nih.gov/Web/Newsltr/V14N1/toolkit.html

Thanks for your help,
Phillip
Chris Fields wrote:
> That format is ASN.1, and there isn't a BioPerl parser for the GenBank ASN.1
> format (it tends to be too cumbersome).
>
> However, there is a pure-perl-based one for the EntrezGene ASN.1 format
> (Bio::ASN1::EntrezGene).
>
> chris
>
>
> On Fri, 2010-03-26 at 13:28 -0400, Phillip San Miguel wrote:
>   
>> Ah, yes. That does the trick. Actually, I have already downloaded a few 
>> thousand records in whatever format is returned when 'genbank' 
>> is specified instead of 'gb'. (See below; it begins with 'Seq-entry ::= 
>> seq {'.) Any idea what format that is and how to convert it to something 
>> SeqIO can use?
>>
>> If not, I can just pull them all down again by sending about 200 gi's 
>> per request. That should not offend the genbank gods...
>>
>> Thanks for your help,
>> Phillip
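
The re-download in batches of about 200 gi's per request could be sketched roughly as below. This is plain Perl only; the helper name batch_ids is made up for illustration, the ids are stand-ins, and each batch would still have to be passed as -id to its own efetch request (not shown here).

```perl
use strict;
use warnings;

# Split a list of gi numbers into array refs of at most $size ids each.
sub batch_ids {
    my ($size, @ids) = @_;
    die "batch size must be positive\n" unless $size > 0;
    my @batches;
    push @batches, [ splice(@ids, 0, $size) ] while @ids;
    return @batches;
}

# Example with stand-in ids 1..450 (not real gi numbers)
my @batches = batch_ids(200, 1 .. 450);
printf "batch %d: %d ids\n", $_ + 1, scalar @{ $batches[$_] }
    for 0 .. $#batches;
```

With 450 ids this gives two batches of 200 and one of 50.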
>>
>> Chris Fields wrote:
>>     
>>> Change the rettype from 'genbank' to 'gb' or 'gbwithparts' (use the latter if you always want a full nucleotide sequence instead of possibly getting contig files).  'genbank' used to be an alias for 'gb' but apparently no longer is; that appears to be a change on NCBI's end.
>>>
>>> Also, note that the email is now required (you'll get a warning about this with code from SVN).  I'll update the wiki to reflect both.
>>>
>>> chris
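
A minimal sketch of the change described above, not verified against NCBI's current servers: the cookbook script with -rettype switched to 'gb' and an -email supplied. The helper names efetch_args and fetch_and_parse and the example address are placeholders for illustration, and the BioPerl modules are loaded only when a fetch is actually requested.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Build the efetch arguments. 'gb' requests the GenBank flat file;
# 'gbwithparts' could be substituted to always get the full sequence.
sub efetch_args {
    my ($email, @ids) = @_;
    return (
        -eutil   => 'efetch',
        -db      => 'nucleotide',
        -rettype => 'gb',
        -email   => $email,
        -id      => [@ids],
    );
}

# Download the records to $file, then parse them with Bio::SeqIO.
sub fetch_and_parse {
    my ($file, $email, @ids) = @_;
    require Bio::DB::EUtilities;
    require Bio::SeqIO;

    my $factory = Bio::DB::EUtilities->new(efetch_args($email, @ids));

    # dump HTTP::Response content to a file (not retained in memory)
    $factory->get_Response(-file => $file);

    my $seqin = Bio::SeqIO->new(-file => $file, -format => 'genbank');
    while (my $seq = $seqin->next_seq) {
        print $seq->display_id, "\t", $seq->length, "\n";
    }
    return;
}

# Usage: script.pl <email address> 228534658 [more gi numbers ...]
fetch_and_parse('myseqs.gb', @ARGV) if @ARGV >= 2;
```

Run without arguments, the script does nothing, so the network request only happens when an address and at least one gi are given on the command line.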
>>>
>>> On Mar 26, 2010, at 10:52 AM, Phillip San Miguel wrote:
>>>
>>>   
>>>       
>>>> Could someone tell me what I am doing wrong? This seems simple, but I have not been able to get it to work.
>>>>
>>>> I am trying to use the code provided at:
>>>>
>>>> http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook#Retrieve_raw_data_records_from_GenBank.2C_save_raw_data_to_file.2C_then_parse_via_Bio::SeqIO
>>>>
>>>> and modified to request gi228534658
>>>>
>>>> EUtilities downloads a record from GenBank, and SeqIO appears to parse it but does not return anything.
>>>>
>>>> Nothing is printed when I run the following script on a Solaris box running perl 5.10.0 and bioperl 1.6.1:
>>>>
>>>> #!/usr/bin/perl
>>>> use strict;
>>>> use warnings;
>>>>
>>>> use Bio::SeqIO;
>>>> use Bio::DB::EUtilities;
>>>>
>>>> my @ids;
>>>> push @ids, '228534658';
>>>> my $factory = Bio::DB::EUtilities->new(
>>>>                       -eutil => 'efetch',
>>>>                       -db => 'nucleotide',
>>>>                       -rettype => 'genbank',
>>>>                       -id => \@ids);
>>>>
>>>> my $file = 'myseqs.gb';
>>>>
>>>> # dump HTTP::Response content to a file (not retained in memory)
>>>> $factory->get_Response(-file => $file);
>>>>
>>>> my $seqin = Bio::SeqIO->new(-file => $file,
>>>>                          -format => 'genbank');
>>>>
>>>> while (my $seq = $seqin->next_seq) {
>>>>  print "I see a sequence\n";
>>>>  print $seq->species();
>>>> }
>>>>
>>>>
>>>> "myseqs.gb" does have content:
>>>>
>>>> Seq-entry ::= seq {
>>>> id {
>>>>  general {
>>>>    db "gpid:36555" ,
>>>>    tag
>>>>      str "contig49313" } ,
>>>>  genbank {
>>>>    accession "EZ113652" ,
>>>>    version 1 } ,
>>>>  gi 228534658 } ,
>>>> descr {
>>>>  title "TSA: Zea mays contig49313, mRNA sequence." ,
>>>>  source {
>>>>    genome genomic ,
>>>>    org {
>>>>      taxname "Zea mays" ,
>>>>      db {
>>>>        {
>>>>          db "taxon" ,
>>>>          tag
>>>>            id 4577 } } ,
>>>>      orgname {
>>>>        name
>>>>          binomial {
>>>>            genus "Zea" ,
>>>>            species "mays" } ,
>>>>        lineage "Eukaryota; Viridiplantae; Streptophyta; Embryophyta;
>>>> Tracheophyta; Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae;
>>>> PACCAD clade; Panicoideae; Andropogoneae; Zea" ,
>>>>        gcode 1 ,
>>>>        mgcode 1 ,
>>>>        div "PLN" } } } ,
>>>>  molinfo {
>>>>    biomol mRNA ,
>>>>    tech tsa } ,
>>>>  pub {
>>>>    pub {
>>>>      article {
>>>>        title {
>>>>          name "Deep sampling of the Palomero maize transcriptome by a high
>>>> throughput strategy of pyrosequencing." } ,
>>>>        authors {
>>>>          names
>>>>            std {
>>>>              {
>>>>                name
>>>>                  name {
>>>>                    last "Vega-Arreguin" ,
>>>>                    initials "J.C." } } ,
>>>>              {
>>>>                name
>>>>                  name {
>>>>                    last "Ibarra-Laclette" ,
>>>>                    initials "E." } } ,
>>>>              {
>>>>                name
>>>>                  name {
>>>>                    last "Jimenez-Moraila" ,
>>>>                    initials "B." } } ,
>>>>              {
>>>>                name
>>>>                  name {
>>>>                    last "Martinez" ,
>>>>                    initials "O." } } ,
>>>>              {
>>>>                name
>>>>                  name {
>>>>                    last "Vielle-Calzada" ,
>>>>                    initials "J.P." } } ,
>>>>              {
>>>>                name
>>>>                  name {
>>>>                    last "Herrera-Estrella" ,
>>>>                    initials "L." } } ,
>>>>              {
>>>>                name
>>>>                  name {
>>>>                    last "Herrera-Estrella" ,
>>>>                    initials "A." } } } } ,
>>>>        from
>>>>          journal {
>>>>            title {
>>>>              iso-jta "BMC Genomics" ,
>>>>              ml-jta "BMC Genomics" ,
>>>>              issn "1471-2164" ,
>>>>              name "BMC genomics" } ,
>>>>            imp {
>>>>              date
>>>>                std {
>>>>                  year 2009 ,
>>>>                  month 7 ,
>>>>                  day 6 } ,
>>>>              volume "10" ,
>>>>              issue "1" ,
>>>>              pages "299" ,
>>>>              language "ENG" ,
>>>>              pubstatus aheadofprint ,
>>>>              history {
>>>>                {
>>>>                  pubstatus received ,
>>>>                  date
>>>>                    std {
>>>>                      year 2008 ,
>>>>                      month 12 ,
>>>>                      day 2 } } ,
>>>>                {
>>>>                  pubstatus accepted ,
>>>>                  date
>>>>                    std {
>>>>                      year 2009 ,
>>>>                      month 7 ,
>>>>                      day 6 } } ,
>>>>                {
>>>>                  pubstatus aheadofprint ,
>>>>                  date
>>>>                    std {
>>>>                      year 2009 ,
>>>>                      month 7 ,
>>>>                      day 6 } } ,
>>>>                {
>>>>                  pubstatus other ,
>>>>                  date
>>>>                    std {
>>>>                      year 2009 ,
>>>>                      month 7 ,
>>>>                      day 8 ,
>>>>                      hour 9 ,
>>>>                      minute 0 } } ,
>>>>                {
>>>>                  pubstatus pubmed ,
>>>>                  date
>>>>                    std {
>>>>                      year 2009 ,
>>>>                      month 7 ,
>>>>                      day 8 ,
>>>>                      hour 9 ,
>>>>                      minute 0 } } ,
>>>>                {
>>>>                  pubstatus medline ,
>>>>                  date
>>>>                    std {
>>>>                      year 2009 ,
>>>>                      month 7 ,
>>>>                      day 8 ,
>>>>                      hour 9 ,
>>>>                      minute 0 } } } } } ,
>>>>        ids {
>>>>          pii "1471-2164-10-299" ,
>>>>          doi "10.1186/1471-2164-10-299" ,
>>>>          pubmed 19580677 } } ,
>>>>      pmid 19580677 } } ,
>>>>  pub {
>>>>    pub {
>>>>      sub {
>>>>        authors {
>>>>          names
>>>>            std {
>>>>              {
>>>>                name
>>>>                  name {
>>>>                    last "Vega-Arreguin" ,
>>>>                    first "Julio" ,
>>>>                    initials "J.C." } } ,
>>>>              {
>>>>                name
>>>>                  name {
>>>>                    last "Ibarra-Laclette" ,
>>>>                    first "Enrique" ,
>>>>                    initials "E." } } ,
>>>>              {
>>>>                name
>>>>                  name {
>>>>                    last "Jimenez-Moraila" ,
>>>>                    first "Beatriz" ,
>>>>                    initials "B." } } ,
>>>>              {
>>>>                name
>>>>                  name {
>>>>                    last "Martinez" ,
>>>>                    first "Octavio" ,
>>>>                    initials "O." } } ,
>>>>              {
>>>>                name
>>>>                  name {
>>>>                    last "Vielle-Calzada" ,
>>>>                    first "Jean" ,
>>>>                    initials "J.Philippe." } } ,
>>>>              {
>>>>                name
>>>>                  name {
>>>>                    last "Herrera-Estrella" ,
>>>>                    first "Luis" ,
>>>>                    initials "L." } } ,
>>>>              {
>>>>                name
>>>>                  name {
>>>>                    last "Herrera-Estrella" ,
>>>>                    first "Alfredo" ,
>>>>                    initials "A." } } } ,
>>>>          affil
>>>>            std {
>>>>              affil "Laboratorio Nacional de Genomica para la Biodiversidad" ,
>>>>              div "Cinvestav Campus Guanajuato" ,
>>>>              city "Irapuato" ,
>>>>              sub "Guanajuato" ,
>>>>              country "Mexico" ,
>>>>              street "Km 9.6 Libramiento Norte, Carretera Irapuato-Leon" ,
>>>>              postal-code "36821" } } ,
>>>>        medium other ,
>>>>        date
>>>>          std {
>>>>            year 2009 ,
>>>>            month 3 ,
>>>>            day 23 } } } } ,
>>>>  user {
>>>>    type
>>>>      str "GenomeProjectsDB" ,
>>>>    data {
>>>>      {
>>>>        label
>>>>          str "ProjectID" ,
>>>>        data
>>>>          int 36555 } ,
>>>>      {
>>>>        label
>>>>          str "ParentID" ,
>>>>        data
>>>>          int 0 } } } ,
>>>>  create-date
>>>>    std {
>>>>      year 2009 ,
>>>>      month 5 ,
>>>>      day 5 } ,
>>>>  update-date
>>>>    std {
>>>>      year 2009 ,
>>>>      month 7 ,
>>>>      day 14 } } ,
>>>> inst {
>>>>  repr raw ,
>>>>  mol rna ,
>>>>  length 450 ,
>>>>  seq-data
>>>>    ncbi2na '77499DA7905DD417DCB7F1D538536238E08229108D89A87E2CDA6282DA3AD02
>>>> 0524AE9C0D4154576794E0420BFA8E351A9ED347A504D3B6FE927E94E475EB17A52427227B820A
>>>> A21086117F7597EFB837ED2FB463AEF9F9E774052FD00FA0C1C803A521131212AFFB00D11CDD63
>>>> 760CFF0'H } }
>>>>
>>>>
>>>> Maybe I am using the wrong format? This looks more like ASN.1 than GenBank format to me.
>>>>
>>>> Phillip
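
One way to catch this before handing a file to SeqIO is to look at its first non-blank line. The following is a rough sketch, assuming that an ASN.1 text record starts with 'Seq-entry ::=' as in the dump above and that a GenBank flat file starts with 'LOCUS'. guess_format is a made-up helper name, not part of BioPerl.

```perl
use strict;
use warnings;

# Return 'asn1', 'genbank', or 'unknown' based on the first non-blank line.
sub guess_format {
    my ($fh) = @_;
    while (my $line = <$fh>) {
        next unless $line =~ /\S/;
        return 'asn1'    if $line =~ /^\s*Seq-entry\s*::=/;
        return 'genbank' if $line =~ /^LOCUS\s/;
        return 'unknown';
    }
    return 'unknown';    # empty input
}

# Check the downloaded file, if it is present
my $file = 'myseqs.gb';
if (-e $file) {
    open my $fh, '<', $file or die "Cannot open $file: $!\n";
    print "$file looks like: ", guess_format($fh), "\n";
    close $fh;
}
```

Only the first non-blank line is inspected, so a file that mixes formats would be reported by whatever it starts with.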
>>>> _______________________________________________
>>>> Bioperl-l mailing list
>>>> Bioperl-l at lists.open-bio.org
>>>> http://lists.open-bio.org/mailman/listinfo/bioperl-l