[Biopython] SeqIO.parse for imgt
Liu, Chang
cliu32 at wustl.edu
Fri Nov 11 17:00:47 UTC 2016
Thank you very much, James!
Hi, Peter, here you go - thank you in advance for updating the 'imgt' parser. I really appreciate it. Please let me know if I can be of any assistance!
Chang
-----Original Message-----
From: James Robinson [mailto:jrobinso at ebi.ac.uk]
Sent: Friday, November 11, 2016 10:54 AM
To: Liu, Chang <cliu32 at wustl.edu>
Cc: p.j.a.cock at googlemail.com
Subject: Re: [IPD #99553] hla.dat file and biopython, follow up
Hi,
The key change post 3.16 is the addition of an SV value to the ID line, which should make the format more similar to the ENA style.
ID HLA00001 standard; DNA; HUM; 3503 BP.
becomes
ID HLA00001; SV 1; standard; DNA; HUM; 3503 BP.
We have also added the SV value as a line in the file;
SV HLA00001.1
this is added between the AC and DT lines.
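
To illustrate the two ID layouts above, here is a minimal Python sketch that accepts either one. The helper name parse_id_line and the returned dictionary keys are hypothetical and are not part of Biopython; whitespace after the ID tag is matched loosely, since the exact column spacing in the file is an assumption here.

```python
import re

# Old layout (3.16.0 and earlier): ID HLA00001 standard; DNA; HUM; 3503 BP.
# New layout (later releases):     ID HLA00001; SV 1; standard; DNA; HUM; 3503 BP.
ID_LINE = re.compile(
    r"^ID\s+(?P<accession>[^\s;]+)"      # accession, e.g. HLA00001
    r"(?:;\s*SV\s+(?P<version>\d+);)?"   # optional "; SV n;" (new layout only)
    r"\s+(?P<rest>.+?)\.?\s*$"           # remaining ';'-separated fields
)


def parse_id_line(line):
    """Parse an hla.dat ID line in either layout (hypothetical helper)."""
    match = ID_LINE.match(line)
    if match is None:
        raise ValueError("Did not recognise the ID line layout:\n" + line)
    fields = [field.strip() for field in match.group("rest").split(";")]
    version = match.group("version")
    return {
        "accession": match.group("accession"),
        # None for the old layout, which carries no SV value
        "version": int(version) if version is not None else None,
        # e.g. ["standard", "DNA", "HUM"]
        "fields": fields[:-1],
        # the last field is "<n> BP"
        "length": int(fields[-1].split()[0]),
    }
```

Making the SV part optional keeps one code path for both old and new files, rather than switching on the release number.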
The other change is the removal of the third DT line. We previously had three lines, but have reduced this to two;
DT 01-AUG-1989 (Rel. 1.0.0, Created, Version 1)
DT 16-DEC-1998 (Rel. 1.0.0, Sequence Updated, Version 1)
DT 14-APR-2014 (Rel. 3.16.0, Current Release, Version 1)
becomes
DT 01-AUG-1989 (Rel. 1.0.0, Created, Version 1)
DT 14-OCT-2016 (Rel. 3.26.0, Last Updated, Version 1)
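
As a rough sketch of reading the DT lines shown above without depending on how many there are, the following Python parses each line on its own. The helper name parse_dt_line and the dictionary keys are hypothetical; the month table is written out by hand so the result does not depend on the locale.

```python
import re
from datetime import date

# Month abbreviations as they appear in the DT lines (upper case).
MONTHS = {
    name: number
    for number, name in enumerate(
        "JAN FEB MAR APR MAY JUN JUL AUG SEP OCT NOV DEC".split(), start=1
    )
}

DT_LINE = re.compile(
    r"^DT\s+(?P<day>\d{2})-(?P<month>[A-Z]{3})-(?P<year>\d{4})\s+"
    r"\(Rel\.\s*(?P<release>[\d.]+),\s*(?P<event>[^,]+),"
    r"\s*Version\s+(?P<version>\d+)\)\s*$"
)


def parse_dt_line(line):
    """Parse one DT line into its parts (hypothetical helper)."""
    match = DT_LINE.match(line)
    if match is None:
        raise ValueError("Did not recognise the DT line layout:\n" + line)
    return {
        "date": date(
            int(match.group("year")),
            MONTHS[match.group("month")],
            int(match.group("day")),
        ),
        "release": match.group("release"),
        # "Created", "Sequence Updated", "Current Release" or "Last Updated"
        "event": match.group("event").strip(),
        "version": int(match.group("version")),
    }
```

Because the event text is captured rather than matched against a fixed list, the same function handles both the three-line and the two-line form.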
In addition the text within the CC lines has changed from;
CC --------------------------------------------------------------------------
CC Copyrighted by the IMGT/HLA Database, Distributed under the Creative
CC Commons Attribution-NoDerivs License, see;
CC http://www.ebi.ac.uk/imgt/hla/licence.html for further details.
CC --------------------------------------------------------------------------
to
CC --------------------------------------------------------------------------
CC IPD-IMGT/HLA Release Version 3.26.0
CC --------------------------------------------------------------------------
CC Copyrighted by the IPD-IMGT/HLA Database, Distributed under the Creative
CC Commons Attribution-NoDerivs License, see;
CC http://www.ebi.ac.uk/ipd/imgt/hla/licence.html for further details.
CC --------------------------------------------------------------------------
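
The release line in the new CC block could also be used by a script to tell which layout a file follows. The sketch below is only an illustration: find_release is a hypothetical helper, and it assumes the release line reads exactly as in the 3.26.0 example above. Files that lack the line (such as the older CC block) simply return None.

```python
import re

RELEASE_LINE = re.compile(
    r"^CC\s+IPD-IMGT/HLA Release Version\s+(?P<release>\d+(?:\.\d+)*)\s*$"
)


def find_release(lines):
    """Return the release as a tuple of ints, or None if no release line is found."""
    for line in lines:
        match = RELEASE_LINE.match(line)
        if match is not None:
            return tuple(int(part) for part in match.group("release").split("."))
    return None
```

Returning a tuple such as (3, 26, 0) allows plain comparisons like find_release(lines) > (3, 16, 0).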
Thanks
James
> On 11 Nov 2016, at 10:25, Liu, Chang <cliu32 at wustl.edu> wrote:
>
> Hi, James,
> I am trying to follow up on #99553. Is there documentation of the changes in the format of the hla.dat file, especially from version 3.16.0 to later versions?
> The biopython team can use this information to update the parser for this file. Thank you for your help!
> Chang
>
> -----Original Message-----
> From: Liu, Chang
> Sent: Tuesday, November 08, 2016 10:12 AM
> To: 'James Robinson' <jrobinso at ebi.ac.uk>
> Subject: RE: [IPD #99553] hla.dat file and biopython
>
> Hi, James,
> Thanks for the reply. Let me copy you the emails between me and Peter Cock, one of the authors of Biopython (see below).
> For versions after 3.16.0, the ID looks like this:
> ID HLA00001; SV 1; standard; DNA; HUM; 3503 BP.
> For versions of 3.16.0 and before, the ID looks like this:
> ID HLA00001 standard; DNA; HUM; 3503 BP.
> This is the only difference that I know of. I told Peter that I am requesting more info from ebi, so that Peter is informed about all the changes made to hla.dat after 3.16.0. Peter can modify the parser code accordingly to accommodate all the changes. Let me know if you have any additional questions. Thanks in advance.
> Chang
>
>
> -----Original Message-----
> From: Peter Cock [mailto:p.j.a.cock at googlemail.com]
> Sent: Friday, November 04, 2016 11:18 AM
> To: Liu, Chang <cliu32 at wustl.edu>
> Cc: biopython at mailman.open-bio.org
> Subject: Re: [Biopython] SeqIO.parse for imgt
>
> Hello Chang,
>
> It looks like the IMGT file format has changed slightly, and someone may need to modify the parser code to cope with this.
>
> As you said, I could parse this file fine with the current version of Biopython:
>
> $ curl -L -O https://github.com/ANHIG/IMGTHLA/raw/3160/hla.dat
>
> $ python
> Python 2.7.10 (default, Oct 23 2015, 19:19:21)
> [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>>>> from Bio import SeqIO
>>>> for r in SeqIO.parse("hla.dat", "imgt"): print(r.id)
> ...
> HLA00001
> HLA02169
> HLA01244
>
> ...
> HLA02801
> HLA02802
> HLA02803
>
> I can confirm the latest file is a problem:
>
> $ curl -L -O https://github.com/ANHIG/IMGTHLA/raw/Latest/hla.dat
>
> $ python
> Python 2.7.10 (default, Oct 23 2015, 19:19:21)
> [GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
>>>> for r in SeqIO.parse("hla.dat", "imgt"): print(r.id)
> ...
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/Library/Python/2.7/site-packages/Bio/SeqIO/__init__.py", line 600, in parse
>     for r in i:
>   File "/Library/Python/2.7/site-packages/Bio/GenBank/Scanner.py", line 479, in parse_records
>     record = self.parse(handle, do_features)
>   File "/Library/Python/2.7/site-packages/Bio/GenBank/Scanner.py", line 463, in parse
>     if self.feed(handle, consumer, do_features):
>   File "/Library/Python/2.7/site-packages/Bio/GenBank/Scanner.py", line 430, in feed
>     self._feed_first_line(consumer, self.line)
>   File "/Library/Python/2.7/site-packages/Bio/GenBank/Scanner.py", line 633, in _feed_first_line
>     raise ValueError('Did not recognise the ID line layout:\n' + line)
> ValueError: Did not recognise the ID line layout:
> ID HLA00001; SV 1; standard; DNA; HUM; 3503 BP.
>
> Technically, the Biopython changes are most likely to be in Bio/GenBank/Scanner.py to class _ImgtScanner, although if recent EMBL format files have also changed we may just need to update class EmblScanner only. Specifically I would think EMBL method _feed_first_line needs updating, or a new IMGT specific _feed_first_line needs defining.
>
> I'm not familiar with IPD-IMGT/HLA, so if you have any more information about their release 3.16.0 and what was changed, it would be very helpful, especially if this is linked to EMBL changes.
>
> Thanks,
>
> Peter
>
> On Fri, Nov 4, 2016 at 3:30 PM, Liu, Chang <cliu32 at wustl.edu> wrote:
>> Hi, everyone,
>>
>> I am new to this mail list, so please bear with my ignorance.
>>
>> I am using SeqIO to parse the hla.dat file from the IMGT/HLA database
>> (https://github.com/ANHIG/IMGTHLA/tree/3160):
>>
>> from Bio import SeqIO
>>
>> handle = 'hla.dat'
>>
>> records = SeqIO.parse(handle, 'imgt')
>>
>> The code only works for files up to version 3.16.0, but not any data
>> files after that. The following was raised:
>>
>> ValueError: Did not recognise the ID line layout:
>>
>> ID HLA00001; SV 1; standard; DNA; HUM; 3503 BP.
>>
>> Apparently the format has changed in the data file, which looks like
>> this for the ID line up to and including 3.16.0:
>>
>> ID HLA00001 standard; DNA; HUM; 3503 BP.
>>
>> Could someone tell me how the module can be updated to parse current
>> and future data files? Thank you so much!!
>>
>> Chang
>>
>>
>>
>>
>>
>> ________________________________
>>
>> The materials in this message are private and may contain Protected
>> Healthcare Information or other information of a sensitive nature. If
>> you are not the intended recipient, be advised that any unauthorized
>> use, disclosure, copying or the taking of any action in reliance on
>> the contents of this information is strictly prohibited. If you have
>> received this email in error, please immediately notify the sender via telephone or return mail.
>>
>>
>> _______________________________________________
>> Biopython mailing list - Biopython at mailman.open-bio.org
>> http://mailman.open-bio.org/mailman/listinfo/biopython
>
> -----Original Message-----
> From: James Robinson [mailto:jrobinso at ebi.ac.uk]
> Sent: Tuesday, November 08, 2016 9:45 AM
> To: Liu, Chang <cliu32 at wustl.edu>
> Subject: Re: [IPD #99553] hla.dat file and biopython
>
> Dear Chang Liu
>
> Thank you for contacting us about Biopython. The files are designed to be EMBL-ENA like, so I would expect parsing to work, but there may be issues if we add new fields. The releases you detail are around 2 years old. If you have more details on the error, I can see what this is.
>
> Regards
>
> James Robinson
>
>> On 4 Nov 2016, at 11:24, The RT System itself via RT <ipd at ebi.ac.uk> wrote:
>>
>> Fri Nov 04 16:24:36 2016: Request 99553 was acted upon.
>> Transaction: Queue changed from support to IPD by RT_System
>> Queue: IPD
>> Subject: hla.dat file and biopython
>> Owner: Nobody
>> Requestors: cliu32 at wustl.edu
>> Status: new
>> Ticket <URL: https://helpdesk.ebi.ac.uk/Ticket/Display.html?id=99553 >
>>
>>
>> This ticket has been moved to your queue by RT_System. The original
>> request:
>>
>> User's email address: cliu32 at wustl.edu
>> User's name: chang liu
>> Feedback topic: IMGT/HLA HELP: Other
>> Referrer URL: http://www.ebi.ac.uk/ipd/imgt/hla/docs/manual.html
>> User's IP address: 128.252.119.43
>> Web browser used: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:49.0) Gecko/20100101 Firefox/49.0
>> Message sent: Friday 4 November 2016, 16:24
>>
>> Message content:
>> ------------------------
>>
>> Hi, there seems to be a recent change to the hla.dat file, which I believe happened between 3.16.0 and 3.17.0, that prevents Biopython's SeqIO from parsing the data file properly. Could someone provide a list of changes so that I can pass it on to the Biopython team? Thank you so much!!!
>>
>> Chang Liu, MD, PhD
>> Assistant Professor | Division of Laboratory and Genomic Medicine,
>> Department of Pathology & Immunology, Washington University School of
>> Medicine | 660 South Euclid Avenue, Campus Box 8118, St Louis MO,
>> 63110. | Office: 314-747-5773. Pager: 314-508-7862. Email:
>> cliu32 at wustl.edu
>>
>
>