[Biopython] SeqIO.parse for imgt

Fri Nov 4 17:21:12 UTC 2016

Perfect! Thank you so much!!

-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock at googlemail.com]
Sent: Friday, November 04, 2016 12:12 PM
To: Liu, Chang <cliu32 at wustl.edu>
Cc: biopython at mailman.open-bio.org
Subject: Re: [Biopython] SeqIO.parse for imgt

On Fri, Nov 4, 2016 at 5:03 PM, Liu, Chang <cliu32 at wustl.edu> wrote:
> ID before 3.16.0 has only three semicolons, which was compatible with
> the 'imgt' parser at the time.
> ID   HLA02803   standard; DNA; HUM; 1883 BP.

Yes, the EMBL/IMGT header line has been through various changes.
The current version in the hla.dat file is not one we've seen before, though it looks similar to recent EMBL files with one field missing.

> One additional question I have is regarding the features:
>...
>
> The features in the original file look like this:
> FT   source          1..1883
> FT                   /organism="Homo sapiens"
> FT                   /mol_type="genomic DNA"
> FT                   /db_xref="taxon:9606"
> FT                   /ethnic="Caucasoid"
> FT                   /cell_line="QBL"
> FT   CDS             join(499..570,702..950,1189..1384)
> FT                   /codon_start=1
> FT                   /gene="HLA-V"
> FT                   /allele="HLA-V*01:01:01:03"
> FT                   /product="MHC Class I HLA-V*01:01:01:03 sequence"
> FT   UTR             1..498
> FT   exon            499..570
> FT                   /number="1"
> FT   intron          571..701
> FT                   /number="1"
> FT   exon            702..950
> FT                   /number="2"
> FT   intron          951..1188
> FT                   /number="2"
> FT   exon            1189..1384
> FT                   /number="3"
> FT   UTR             1385..1883
>
> My understanding is that the exon number was not captured in the
> features after parsing. Is this correct? The exon number is very
> important for downstream applications, because many analyses will need
> to extract exons 2 and 3 for class I HLA genes. If exons are not
> labeled in features, I wouldn't know which exons to keep. Could this
> information be retained after parsing? Thank you for your help!!!
> Chang

It is recorded. Try this for a more detailed output from Biopython:

for f in record.features:
    print(f.qualifiers)

In theory there could be multiple entries for each qualifier key, so each value in the dictionary is a list. For the exon number you'd want f.qualifiers["number"][0]
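
To pick out specific exons by their number, something along these lines might work. This is only a rough sketch: the helper name exons_by_number is made up for illustration, and the SimpleNamespace objects are hypothetical stand-ins for parsed features (with plain strings in place of real location objects) so the snippet runs without the hla.dat file. It assumes the number qualifier values come back as strings such as "2".

```python
from types import SimpleNamespace


def exons_by_number(features):
    """Map exon number (as a string) to its feature, skipping non-exons."""
    exons = {}
    for f in features:
        if f.type != "exon":
            continue
        # Each qualifier value is a list, so take the first entry.
        numbers = f.qualifiers.get("number", [])
        if numbers:
            exons[numbers[0]] = f
    return exons


# Hypothetical stand-ins for the features of the HLA-V entry quoted above.
features = [
    SimpleNamespace(type="UTR", qualifiers={}, location="1..498"),
    SimpleNamespace(type="exon", qualifiers={"number": ["1"]}, location="499..570"),
    SimpleNamespace(type="intron", qualifiers={"number": ["1"]}, location="571..701"),
    SimpleNamespace(type="exon", qualifiers={"number": ["2"]}, location="702..950"),
    SimpleNamespace(type="intron", qualifiers={"number": ["2"]}, location="951..1188"),
    SimpleNamespace(type="exon", qualifiers={"number": ["3"]}, location="1189..1384"),
]

exons = exons_by_number(features)
for n in ("2", "3"):
    print("exon", n, exons[n].location)
```

With a real parsed record you would pass record.features instead of the stand-in list, and exons["2"].extract(record.seq) should then give the sequence of exon 2.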

Peter
