[Biopython] SeqIO.parse for imgt

Liu, Chang cliu32 at wustl.edu
Fri Nov 4 16:37:38 UTC 2016


Hi, Peter,
Thank you for the quick response. I have sent a message to EMBL-EBI requesting a list of changes. I hope this can be fixed with a minor tweak, and I will keep you posted when I hear back.
Best regards,
Chang

Chang Liu, MD, PhD
Assistant Professor | Division of Laboratory and Genomic Medicine, Department of Pathology & Immunology, Washington University School of Medicine | 660 South Euclid Avenue, Campus Box 8118, St Louis MO, 63110. | Office: 314-747-5773. Pager: 314-508-7862. Email: cliu32 at wustl.edu


-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock at googlemail.com]
Sent: Friday, November 04, 2016 11:18 AM
To: Liu, Chang <cliu32 at wustl.edu>
Cc: biopython at mailman.open-bio.org
Subject: Re: [Biopython] SeqIO.parse for imgt

Hello Chang,

It looks like the IMGT file format has changed slightly, and someone may need to modify the parser code to cope with this.

As you said, I could parse this file fine with the current version of Biopython:

$ curl -L -O https://github.com/ANHIG/IMGTHLA/raw/3160/hla.dat

$ python
Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> for r in SeqIO.parse("hla.dat", "imgt"): print(r.id)
...
HLA00001
HLA02169
HLA01244

...
HLA02801
HLA02802
HLA02803

I can confirm the latest file is a problem:

$ curl -L -O https://github.com/ANHIG/IMGTHLA/raw/Latest/hla.dat

$ python
Python 2.7.10 (default, Oct 23 2015, 19:19:21)
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.0.59.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from Bio import SeqIO
>>> for r in SeqIO.parse("hla.dat", "imgt"): print(r.id)
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Python/2.7/site-packages/Bio/SeqIO/__init__.py", line 600, in parse
    for r in i:
  File "/Library/Python/2.7/site-packages/Bio/GenBank/Scanner.py", line 479, in parse_records
    record = self.parse(handle, do_features)
  File "/Library/Python/2.7/site-packages/Bio/GenBank/Scanner.py", line 463, in parse
    if self.feed(handle, consumer, do_features):
  File "/Library/Python/2.7/site-packages/Bio/GenBank/Scanner.py", line 430, in feed
    self._feed_first_line(consumer, self.line)
  File "/Library/Python/2.7/site-packages/Bio/GenBank/Scanner.py", line 633, in _feed_first_line
    raise ValueError('Did not recognise the ID line layout:\n' + line)
ValueError: Did not recognise the ID line layout:
ID   HLA00001; SV 1; standard; DNA; HUM; 3503 BP.
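
Until the parser itself is updated, one possible stopgap is to rewrite the new-style ID lines into the older layout before parsing. The following is only a sketch under assumptions: it assumes Python 3, it assumes the "; SV n;" field is the only difference between the two layouts (which is all this thread has shown so far), and the helper names downgrade_id_lines and as_old_style_handle are made up for illustration. They are not part of Biopython.

```python
import io
import re

# Matches the new-style ID line reported in this thread, e.g.
# "ID   HLA00001; SV 1; standard; DNA; HUM; 3503 BP."
_NEW_ID = re.compile(r"^ID   (?P<acc>[^\s;]+); SV \d+; (?P<rest>.*)$")


def downgrade_id_lines(lines):
    """Yield lines with new-style ID lines rewritten to the older layout.

    Hypothetical helper: the "SV n" field is simply discarded.
    """
    for line in lines:
        match = _NEW_ID.match(line.rstrip("\n"))
        if match:
            # Drop the "SV n" field and restore the three-space separator.
            yield "ID   %s   %s\n" % (match.group("acc"), match.group("rest"))
        else:
            yield line


def as_old_style_handle(lines):
    """Return an in-memory text handle holding the rewritten lines.

    Hypothetical helper: this holds the whole file in memory, so a
    temporary file may be a better choice for large inputs.
    """
    return io.StringIO("".join(downgrade_id_lines(lines)))
```

The handle returned by as_old_style_handle(open("hla.dat")) could then be passed to SeqIO.parse(..., "imgt") in place of the filename. Note that the sequence version is lost this way, so this is a workaround rather than a fix.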

Technically, the Biopython changes will most likely be needed in the _ImgtScanner class in Bio/GenBank/Scanner.py, although if recent EMBL format files have changed in the same way we may only need to update the EmblScanner class. Specifically, I would think the EMBL method _feed_first_line needs updating, or a new IMGT-specific _feed_first_line needs defining.
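
To make the idea concrete, here is a rough standalone sketch of an ID line parser that accepts both layouts seen in this thread. It is not the Biopython code and does not use the scanner or consumer classes. The function name parse_id_line and the field names (data_class, molecule, division) are illustrative guesses, and the regular expressions are based only on the two example lines quoted here, so other records may need looser patterns.

```python
import re

# New layout: "ID   HLA00001; SV 1; standard; DNA; HUM; 3503 BP."
_NEW_LAYOUT = re.compile(
    r"^ID   (?P<accession>[^\s;]+); SV (?P<version>\d+); "
    r"(?P<data_class>[^;]+); (?P<molecule>[^;]+); (?P<division>[^;]+); "
    r"(?P<length>\d+) BP\.$"
)
# Old layout: "ID   HLA00001   standard; DNA; HUM; 3503 BP."
_OLD_LAYOUT = re.compile(
    r"^ID   (?P<accession>[^\s;]+)\s+"
    r"(?P<data_class>[^;]+); (?P<molecule>[^;]+); (?P<division>[^;]+); "
    r"(?P<length>\d+) BP\.$"
)


def parse_id_line(line):
    """Return a dict of ID line fields for either layout (hypothetical helper)."""
    line = line.rstrip()
    for pattern in (_NEW_LAYOUT, _OLD_LAYOUT):
        match = pattern.match(line)
        if match:
            fields = match.groupdict()
            fields["length"] = int(fields["length"])
            # The old layout has no sequence version field, so report None.
            version = fields.get("version")
            fields["version"] = int(version) if version is not None else None
            return fields
    # Same message as the existing parser, so failures look familiar.
    raise ValueError("Did not recognise the ID line layout:\n" + line)
```

Trying the new layout first and falling back to the old one keeps older releases such as 3.16.0 working while accepting the latest file.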

I'm not familiar with IPD-IMGT/HLA, so if you have any more information about their release 3.16.0 and what was changed, it would be very helpful, especially if this is linked to EMBL changes.

Thanks,

Peter

On Fri, Nov 4, 2016 at 3:30 PM, Liu, Chang <cliu32 at wustl.edu> wrote:
> Hi, everyone,
>
> I am new to this mailing list, so please bear with my ignorance.
>
> I am using SeqIO to parse the hla.dat file from the IMGT/HLA database
> (https://github.com/ANHIG/IMGTHLA/tree/3160):
>
> from Bio import SeqIO
>
> handle = 'hla.dat'
>
> records = SeqIO.parse(handle, 'imgt')
>
> The code works for files up to version 3.16.0, but not for any data
> files released after that. The following error was raised:
>
> ValueError: Did not recognise the ID line layout:
>
> ID   HLA00001; SV 1; standard; DNA; HUM; 3503 BP.
>
> Apparently the format of the ID line has changed in the data file. Up
> to 3.16.0 it looked like this:
>
> ID   HLA00001   standard; DNA; HUM; 3503 BP.
>
> Could someone tell me how the module can be updated to parse current
> and future data files? Thank you so much!
>
> Chang
>
> ________________________________
>
> The materials in this message are private and may contain Protected
> Healthcare Information or other information of a sensitive nature. If
> you are not the intended recipient, be advised that any unauthorized
> use, disclosure, copying or the taking of any action in reliance on
> the contents of this information is strictly prohibited. If you have
> received this email in error, please immediately notify the sender via telephone or return mail.
>
>
> _______________________________________________
> Biopython mailing list  -  Biopython at mailman.open-bio.org
> http://mailman.open-bio.org/mailman/listinfo/biopython



