[Bioperl-l] getting pubmed id from genbank files

Hilmar Lapp hlapp at gmx.net
Tue Jul 26 17:42:08 EDT 2005


Right - but don't tell only me :-)

On Jul 26, 2005, at 1:29 PM, Barry Moore wrote:

> Then would it be safe to assume that in the case of multi-line JOURNAL
> entries, all lines following the initial tagged JOURNAL line would be
> untagged?  If so, the regex could probably be made a bit safer.
>
> Barry
>
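A minimal sketch of the stricter continuation rule proposed above, assuming untagged continuation lines always start with at least 12 whitespace characters. The sub name journal_location and the sample record are made up for illustration (the JOURNAL value is split over two lines on purpose); this is not the code in genbank.pm.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the JOURNAL value from a list of record
# lines, treating only untagged lines (12 or more leading whitespace
# characters, an assumption) as continuations.
sub journal_location {
    my @lines = @_;
    my @loc;
    while (defined(my $line = shift @lines)) {
        next unless $line =~ /^\s{2}JOURNAL\s+(.*)/;
        push @loc, $1;
        # Stop at the first tagged line such as '   PUBMED'.
        while (@lines && $lines[0] =~ /^\s{12}(.*)/) {
            push @loc, $1;
            shift @lines;
        }
        last;
    }
    return join ' ', @loc;
}

# Illustrative record fragment, not copied from a real entry.
my @record = (
    "  JOURNAL   Genetics 166 (3), 1419-1436",
    "            (2004)",
    "   PUBMED   15082560",
);
print journal_location(@record), "\n";
```

With this rule the PUBMED line ends the loop and is left for whatever code handles that tag.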
> -----Original Message-----
> From: Hilmar Lapp [mailto:hlapp at gnf.org]
> Sent: Tuesday, July 26, 2005 2:09 PM
> To: Barry Moore
> Cc: bioperl-l; n.haigh at sheffield.ac.uk
> Subject: Re: [Bioperl-l] getting pubmed id from genbank files
>
> There are indeed JOURNAL entries spanning multiple lines; the parser
> was once unable to deal with this and was subsequently fixed ... as we
> see this introduced other problems ...
>
> On Jul 26, 2005, at 1:07 PM, Barry Moore wrote:
>
>> Nathan-
>>
>> That sounds like you are using bioperl 1.4?  The error is in
>> Bio/SeqIO/genbank.pm and was fixed by Jason in cvs version 1.102 of
>> that file.  However, the current code still looks a bit odd to me.
>> Starting at line 1068 of the current cvs version (1.119) of
>> genbank.pm we have:
>>
>> 1068  if (/^\s{2}JOURNAL\s+(.*)/o) {
>> 1069     push(@loc, $1);
>> 1070     while ( defined($_ = $self->_readline) ) {
>> 1071           # we only match when there are at least 4 spaces
>> 1072           # there is probably a better way to match this
>> 1073           # as it assumes that the describing tag is short enough
>> 1074           /^\s{4,}(.*)/o && do { push(@loc, $1);
>> 1075           next;
>> 1076     };
>> 1077     last;
>> 1078  }
>> 1079  $ref->location(join(' ', @loc));
>>
>> This is all dealing with parsing the JOURNAL line, which is handled
>> fine by lines 1068-69.  The while loop at 1070 looks at successive
>> lines to find something to add to the JOURNAL line.  The regex at
>> line 1074 used to read /^\s{3,}(.*)/o, which would not match if the
>> next line after JOURNAL began with '  MEDLINE', but would match
>> '   PUBMED' (Nathan's situation), causing that line to be added to
>> the JOURNAL line.  Is there ever a JOURNAL entry with more than one
>> line?  If so, shouldn't the following lines always be untagged and
>> thus indented 12, making the regex /^\s{12}(.*)/o safer?  The current
>> situation would add any line to the JOURNAL line if its tag is
>> shorter than 6 characters, and I don't think that's what we want.
>>
>> Barry
>>
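To make the comparison above concrete, here is a small sketch that runs the three patterns from the thread (the old /^\s{3,}/, the current /^\s{4,}/, and the proposed /^\s{12}/) against sample lines. The helper name captured, the hash keys, and the MEDLINE number are invented for illustration; the indentation of the sample lines follows what is described in this thread.

```perl
use strict;
use warnings;

# The three continuation patterns discussed in the thread.
my %pattern = (
    old     => qr/^\s{3,}(.*)/,    # before the fix
    current => qr/^\s{4,}(.*)/,    # line 1074 as quoted
    strict  => qr/^\s{12}(.*)/,    # proposed: untagged lines only
);

# Hypothetical helper: text that would be appended to JOURNAL,
# or undef if the line is not treated as a continuation.
sub captured {
    my ($name, $line) = @_;
    return $line =~ $pattern{$name} ? $1 : undef;
}

my @sample = (
    "  MEDLINE   00000000",    # made-up ID, two-space indent
    "   PUBMED   15082560",    # three-space indent
    "            (2004)",      # untagged continuation line
);

for my $line (@sample) {
    for my $name (qw(old current strict)) {
        my $got = captured($name, $line);
        printf "%-8s %-24s %s\n", $name, "'$line'",
            defined $got ? "appended: $got" : "not appended";
    }
}
```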
>> -----Original Message-----
>> From: bioperl-l-bounces at portal.open-bio.org
>> [mailto:bioperl-l-bounces at portal.open-bio.org] On Behalf Of Hilmar Lapp
>> Sent: Tuesday, July 26, 2005 11:05 AM
>> To: n.haigh at sheffield.ac.uk
>> Cc: 'bioperl-l'
>> Subject: Re: [Bioperl-l] getting pubmed id from genbank files
>>
>>
>> On Jul 26, 2005, at 7:49 AM, Nathan Haigh wrote:
>>
>>> -- snip --
>>> $VAR1 = bless( {
>>>        'authors' => 'Clauss,M.J. and Mitchell-Olds,T.',
>>>        'location' => 'Genetics 166 (3), 1419-1436 (2004) PUBMED 15082560',
>>>        'title' => 'Functional divergence in tandemly duplicated Arabidopsis thaliana trypsin inhibitor genes',
>>>        'tagname' => 'reference'
>>>      }, 'Bio::Annotation::Reference' );
>>> -- snip --
>>
>> This is odd. The PUBMED line should not be concatenated with the
>> JOURNAL line. I wonder where this happens and why. Can you download
>> the record from NCBI (using the web interface, format 'GenBank',
>> 'Send all to file') and then parse it with Bio::SeqIO? If it works
>> then the problem must be in the code that deals with the
>> HTTP response.
>>
>> 	-hilmar
>>
>>
>>>
>>> -----Original Message-----
>>> From: Jason Stajich [mailto:jason.stajich at duke.edu]
>>> Sent: 26 July 2005 15:28
>>> To: Bioperl-l at portal.open-bio.org
>>> Cc: Nathan Haigh
>>> Subject: [Bioperl-l] getting pubmed id from genbank files
>>>
>>>
>>>
>>> Here is part of the synopsis in Bio::Seq:
>>>
>>>      foreach my $ref ( $ann->get_Annotations('reference') ) {
>>>          print "Reference ",$ref->title,"\n";
>>>      }
>>>
>>>   so do $ref->pubmed instead of $ref->title.
>>>
>>>
>>> -jason
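A minimal sketch of the suggestion above, assuming BioPerl is installed and the GenBank records are already saved to a local file. The sub name pubmed_ids and the command-line handling are illustrative; Bio::SeqIO, get_Annotations('reference') and the pubmed method are the calls named in this thread. As reported elsewhere in the thread, with bioperl 1.4 the PubMed ID may end up inside the location string, in which case pubmed would not return it.

```perl
use strict;
use warnings;

# Hypothetical helper: return [display_id, pubmed_id] pairs for every
# reference in a GenBank flat file that carries a PubMed ID.
sub pubmed_ids {
    my ($file) = @_;
    require Bio::SeqIO;    # loaded lazily; BioPerl must be installed
    my $in = Bio::SeqIO->new(-file => $file, -format => 'genbank');
    my @found;
    while (my $seq = $in->next_seq) {
        for my $ref ($seq->annotation->get_Annotations('reference')) {
            my $pmid = $ref->pubmed;
            push @found, [$seq->display_id, $pmid]
                if defined $pmid && length $pmid;
        }
    }
    return @found;
}

# Usage: perl script.pl records.gb
if (@ARGV) {
    printf "%s\t%s\n", @$_ for pubmed_ids($ARGV[0]);
}
```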
>>>> On Jul 26, 2005, at 6:02 AM, Nathan Haigh wrote:
>>>>
>>>>> I want to be able to supply a list of GIs, retrieve the GenBank
>>>>> files, and parse out the PubMed IDs.
>>>>>
>>>>> I know I can do the first steps of retrieving the GenBank files
>>>>> directly, but how do I get the PubMed IDs? I've been playing around
>>>>> with things and haven't yet found out if this can be done.
>>>>>
>>>>> Cheers,
>>>>> Nathan
>>>>>
>>>>> ----------------------------------
>>>>> Nathan Haigh
>>>>> Bioinformatics PostDoctoral Research Associate
>>>>>
>>>>> Room B2 211
>>>>> Department of Animal and Plant Sciences
>>>>> University of Sheffield
>>>>> Western Bank
>>>>> Sheffield
>>>>> S10 2TN
>>>>>
>>>>> Tel: +44 (0)114 22 20112
>>>>> Mob: +44 (0)7742 533 569
>>>>> Fax: +44 (0)114 22 20002
>>>>>
>>>>> _______________________________________________
>>>>> Bioperl-l mailing list
>>>>> Bioperl-l at portal.open-bio.org
>>>>> http://portal.open-bio.org/mailman/listinfo/bioperl-l
>>>>>
>>>> --
>>>> Jason Stajich
>>>> http://www.duke.edu/~jes12
>>>> jason.stajich -at- duke.edu
>>>>
>>>>
>> -- 
>> -------------------------------------------------------------
>> Hilmar Lapp                            email: lapp at gnf.org
>> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
>> -------------------------------------------------------------
>>
>>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------



