[Biopython] SeqIO.parse for imgt

Liu, Chang cliu32 at wustl.edu
Fri Nov 4 17:03:00 UTC 2016


Before release 3.16.0, the ID line had only three semicolons, which was compatible with the 'imgt' parser at the time:
ID   HLA02803   standard; DNA; HUM; 1883 BP.

One additional question I have is regarding the features:
>>> record.features
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1883), strand=1), type='source'),
 SeqFeature(CompoundLocation([FeatureLocation(ExactPosition(498), ExactPosition(570), strand=1), FeatureLocation(ExactPosition(701), ExactPosition(950), strand=1), FeatureLocation(ExactPosition(1188), ExactPosition(1384), strand=1)], 'join'), type='CDS', location_operator='join'),
 SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(498), strand=1), type='UTR'),
 SeqFeature(FeatureLocation(ExactPosition(498), ExactPosition(570), strand=1), type='exon'),
 SeqFeature(FeatureLocation(ExactPosition(570), ExactPosition(701), strand=1), type='intron'),
 SeqFeature(FeatureLocation(ExactPosition(701), ExactPosition(950), strand=1), type='exon'),
 SeqFeature(FeatureLocation(ExactPosition(950), ExactPosition(1188), strand=1), type='intron'),
 SeqFeature(FeatureLocation(ExactPosition(1188), ExactPosition(1384), strand=1), type='exon'),
 SeqFeature(FeatureLocation(ExactPosition(1384), ExactPosition(1883), strand=1), type='UTR')]
The features in the original file look like this:
FT   source          1..1883
FT                   /organism="Homo sapiens"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:9606"
FT                   /ethnic="Caucasoid"
FT                   /cell_line="QBL"
FT   CDS             join(499..570,702..950,1189..1384)
FT                   /codon_start=1
FT                   /gene="HLA-V"
FT                   /allele="HLA-V*01:01:01:03"
FT                   /product="MHC Class I HLA-V*01:01:01:03 sequence"
FT   UTR             1..498
FT   exon            499..570
FT                   /number="1"
FT   intron          571..701
FT                   /number="1"
FT   exon            702..950
FT                   /number="2"
FT   intron          951..1188
FT                   /number="2"
FT   exon            1189..1384
FT                   /number="3"
FT   UTR             1385..1883

My understanding is that the exon number was not captured in the features after parsing. Is this correct? The exon numbers are very important for downstream applications, because many analyses need to extract exons 2 and 3 for class I HLA genes. If the exons are not labeled in the features, I won't know which exons to keep. Could this information be retained after parsing? Thank you for your help!
Chang
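
The exon-numbering question above can be illustrated without Biopython. The following is only a minimal sketch using the standard library: `parse_simple_features` is a hypothetical helper (not part of Biopython), it handles only plain `start..end` locations, and it skips anything else such as `join(...)`. It shows how the `/number` qualifier can be kept with each feature, and how the 1-based inclusive coordinates in the file map onto the 0-based positions shown in the parsed output (499..570 becomes 498 to 570). In Biopython itself, qualifiers are normally held in each feature's `qualifiers` attribute and are not displayed by the feature's repr, so it is worth checking `feature.qualifiers` on a parsed record before concluding the exon numbers were lost.

```python
import re

# Feature-table lines copied from the record above (source and CDS omitted).
FT_TEXT = """\
FT   UTR             1..498
FT   exon            499..570
FT                   /number="1"
FT   intron          571..701
FT                   /number="1"
FT   exon            702..950
FT                   /number="2"
FT   intron          951..1188
FT                   /number="2"
FT   exon            1189..1384
FT                   /number="3"
FT   UTR             1385..1883
"""


def parse_simple_features(text):
    """Hypothetical helper: parse FT lines that use plain 'start..end' locations."""
    features = []
    current = None
    for line in text.splitlines():
        if not line.startswith("FT"):
            continue
        body = line[2:].strip()
        if body.startswith("/"):
            # Qualifier line: attach it to the feature that precedes it.
            if current is not None:
                key, _, value = body[1:].partition("=")
                current["qualifiers"][key] = value.strip('"')
            continue
        match = re.fullmatch(r"(\S+)\s+(\d+)\.\.(\d+)", body)
        if match:
            ftype, start, end = match.groups()
            # Convert 1-based inclusive coordinates to 0-based, end-exclusive.
            current = {"type": ftype, "start": int(start) - 1,
                       "end": int(end), "qualifiers": {}}
            features.append(current)
        else:
            # Unsupported location (e.g. join(...)): skip it and its qualifiers.
            current = None
    return features


features = parse_simple_features(FT_TEXT)

# Keep only exons 2 and 3, as needed for class I HLA genes.
wanted = [f for f in features
          if f["type"] == "exon" and f["qualifiers"].get("number") in {"2", "3"}]

for f in wanted:
    print(f["qualifiers"]["number"], f["start"], f["end"])
```

The two selected exons come out at positions 701 to 950 and 1188 to 1384, matching the exon locations in the parsed feature list above.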

-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock at googlemail.com]
Sent: Friday, November 04, 2016 11:54 AM
To: Liu, Chang <cliu32 at wustl.edu>
Cc: biopython at mailman.open-bio.org
Subject: Re: [Biopython] SeqIO.parse for imgt

This does look like a small change. We would expect six semi-colons in the ID line, e.g.

ID X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
ID CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.

0. Primary accession number
1. Sequence version number
2. Topology: 'circular' or 'linear'
3. Molecule type (e.g. 'genomic DNA')
4. Data class (e.g. 'STD')
5. Taxonomic division (e.g. 'PRO')
6. Sequence length (e.g. '4639675 BP.')

However, the hla.dat file uses only five semi-colons. It looks like the data class has been removed, leaving just HUM (human) as the taxonomic division, e.g.

ID   HLA00001; SV 1; standard; DNA; HUM; 3503 BP.
ID   HLA02169; SV 1; standard; DNA; HUM; 3291 BP.

Peter
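
The semicolon counts discussed above can be checked with a few lines of standard-library Python. This is only a sketch: `summarize_id_line` and the layout labels are hypothetical names chosen for illustration, not Biopython functions, and the mapping from semicolon count to layout is an assumption based solely on the example ID lines in this thread.

```python
# Assumed mapping from semicolon count to ID-line layout (labels are illustrative).
LAYOUTS = {
    6: "EMBL style (seven fields)",
    5: "hla.dat, newer style (data class dropped)",
    3: "hla.dat, older style (before 3.16.0)",
}


def summarize_id_line(line):
    """Hypothetical helper: report semicolon count, accession and length."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line: %r" % line)
    # Drop the 'ID' tag and the trailing full stop, then split on semicolons.
    fields = [part.strip() for part in line[2:].strip().rstrip(".").split(";")]
    semicolons = line.count(";")
    return {
        "semicolons": semicolons,
        "layout": LAYOUTS.get(semicolons, "unknown"),
        # The accession is the first token of the first field in all three layouts.
        "accession": fields[0].split()[0],
        # The last field looks like '1859 BP'.
        "length": int(fields[-1].split()[0]),
    }


examples = [
    "ID X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.",
    "ID   HLA00001; SV 1; standard; DNA; HUM; 3503 BP.",
    "ID   HLA02803   standard; DNA; HUM; 1883 BP.",
]

summaries = [summarize_id_line(line) for line in examples]
for summary in summaries:
    print(summary)
```

Dispatching on the semicolon count like this is one possible way for a parser to tolerate all three layouts; whether the 'imgt' parser should take that approach is a separate design question.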


On Fri, Nov 4, 2016 at 4:37 PM, Liu, Chang <cliu32 at wustl.edu> wrote:
> Hi, Peter,
> Thank you for the quick response. I have sent a message to EMBL-EBI
> requesting a list of changes. I hope this can be fixed with a minor
> tweak. I will keep you posted when I hear back.
> Best regards,
> Chang
