[Biopython] SeqIO.parse for imgt

Fri Nov 4 17:11:44 UTC 2016

On Fri, Nov 4, 2016 at 5:03 PM, Liu, Chang <cliu32 at wustl.edu> wrote:
> ID before 3.16.0 has only three semicolons, which was compatible
> with the 'imgt' parser at the time.
> ID   HLA02803   standard; DNA; HUM; 1883 BP.

Yes, the EMBL/IMGT header line has been through various changes.
The current version in the hla.dat file is not one we've seen before,
though it looks similar to recent EMBL files with one field missing.

> One additional question I have is regarding the features:
>...
>
> The features in the original file look like this:
> FT   source          1..1883
> FT                   /organism="Homo sapiens"
> FT                   /mol_type="genomic DNA"
> FT                   /db_xref="taxon:9606"
> FT                   /ethnic="Caucasoid"
> FT                   /cell_line="QBL"
> FT   CDS             join(499..570,702..950,1189..1384)
> FT                   /codon_start=1
> FT                   /gene="HLA-V"
> FT                   /allele="HLA-V*01:01:01:03"
> FT                   /product="MHC Class I HLA-V*01:01:01:03 sequence"
> FT   UTR             1..498
> FT   exon            499..570
> FT                   /number="1"
> FT   intron          571..701
> FT                   /number="1"
> FT   exon            702..950
> FT                   /number="2"
> FT   intron          951..1188
> FT                   /number="2"
> FT   exon            1189..1384
> FT                   /number="3"
> FT   UTR             1385..1883
>
> My understanding is that the exon number was not captured in the features
> after parsing. Is this correct? The exon numbers are very important for
> downstream applications, because many analyses will need to extract
> exon 2 and 3 for class I HLA genes. If exons are not labeled in features,
> I wouldn't know which exons to keep. Could this information be retained
> after parsing? Thank you for your help!!!
> Chang

It is recorded. Try this for a more detailed output from Biopython:

for f in record.features:
    print(f.qualifiers)

In theory there could be multiple entries for each qualifier key, so the
dictionary gives you a list. You'd want f.qualifiers["number"][0] for
the exon and intron features.
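
To pick out exons 2 and 3, something along these lines might work. This
is only a sketch: exons_by_number and print_exons are names made up for
this example, and the default path "hla.dat" is an assumption about
where your file lives. It relies on each feature having .type and
.qualifiers, and skips features without a "number" qualifier (such as
the UTR entries in your file).

```python
def exons_by_number(features):
    """Map exon number (as a string, e.g. "2") to the matching feature."""
    exons = {}
    for f in features:
        if f.type != "exon":
            continue
        # Qualifier values are lists; some features have no "number" at all.
        numbers = f.qualifiers.get("number", [])
        if numbers:
            # Strip quotes defensively in case they survive parsing.
            exons[numbers[0].strip('"')] = f
    return exons


def print_exons(path="hla.dat", wanted=("2", "3")):
    """Print the sequence of the wanted exons for every record in the file."""
    # Imported here so exons_by_number itself has no Biopython dependency.
    from Bio import SeqIO

    for record in SeqIO.parse(path, "imgt"):
        exons = exons_by_number(record.features)
        for number in wanted:
            if number in exons:
                # extract() pulls the feature's region out of the parent sequence.
                print(record.id, number, exons[number].extract(record.seq))
```

Calling print_exons() would then go through the whole file; the keys are
strings, so ask for "2" rather than 2.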

Peter