[Biopython] SeqIO.parse for imgt
Liu, Chang
cliu32 at wustl.edu
Fri Nov 4 17:03:00 UTC 2016
Before 3.16.0, the ID line had only three semicolons, which was compatible with the 'imgt' parser at the time:
ID HLA02803 standard; DNA; HUM; 1883 BP.
One additional question I have is regarding the features:
>>> record.features
[SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1883), strand=1), type='source'), SeqFeature(CompoundLocation([FeatureLocation(ExactPosition(498), ExactPosition(570), strand=1), FeatureLocation(ExactPosition(701), ExactPosition(950), strand=1), FeatureLocation(ExactPosition(1188), ExactPosition(1384), strand=1)], 'join'), type='CDS', location_operator='join'), SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(498), strand=1), type='UTR'), SeqFeature(FeatureLocation(ExactPosition(498), ExactPosition(570), strand=1), type='exon'), SeqFeature(FeatureLocation(ExactPosition(570), ExactPosition(701), strand=1), type='intron'), SeqFeature(FeatureLocation(ExactPosition(701), ExactPosition(950), strand=1), type='exon'), SeqFeature(FeatureLocation(ExactPosition(950), ExactPosition(1188), strand=1), type='intron'), SeqFeature(FeatureLocation(ExactPosition(1188), ExactPosition(1384), strand=1), type='exon'), SeqFeature(FeatureLocation(ExactPosition(1384), ExactPosition(1883), strand=1), type='UTR')]
The features in the original file look like this:
FT   source          1..1883
FT                   /organism="Homo sapiens"
FT                   /mol_type="genomic DNA"
FT                   /db_xref="taxon:9606"
FT                   /ethnic="Caucasoid"
FT                   /cell_line="QBL"
FT   CDS             join(499..570,702..950,1189..1384)
FT                   /codon_start=1
FT                   /gene="HLA-V"
FT                   /allele="HLA-V*01:01:01:03"
FT                   /product="MHC Class I HLA-V*01:01:01:03 sequence"
FT   UTR             1..498
FT   exon            499..570
FT                   /number="1"
FT   intron          571..701
FT                   /number="1"
FT   exon            702..950
FT                   /number="2"
FT   intron          951..1188
FT                   /number="2"
FT   exon            1189..1384
FT                   /number="3"
FT   UTR             1385..1883
My understanding is that the exon numbers were not captured in the features after parsing. Is this correct? The exon numbers are very important for downstream applications, because many analyses need to extract exons 2 and 3 for class I HLA genes. If the exons are not labeled in the features, I won't know which exons to keep. Could this information be retained after parsing? Thank you for your help!
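To make the request concrete, here is a minimal standard-library sketch of the behaviour I am after: each /number qualifier stays attached to the feature line above it, so exons can be selected by number. The helper names `parse_features` and `exons_by_number` are hypothetical and not part of Biopython, and the sketch only handles simple start..end locations (no join(...)). As far as I know, Biopython stores such qualifiers in each feature's `qualifiers` dictionary, which the repr above does not display, but I have not confirmed that for the 'imgt' parser.

```python
# Abbreviated feature-table lines like the ones in the record above.
FT_LINES = """\
FT exon 499..570
FT /number="1"
FT intron 571..701
FT /number="1"
FT exon 702..950
FT /number="2"
FT intron 951..1188
FT /number="2"
FT exon 1189..1384
FT /number="3"
""".splitlines()


def parse_features(lines):
    """Return a list of dicts with keys 'type', 'start', 'end', 'qualifiers'.

    Hypothetical helper: handles only simple 'start..end' locations.
    Coordinates are kept 1-based and inclusive, exactly as in the file.
    """
    features = []
    for line in lines:
        body = line[2:].strip()  # drop the leading 'FT' line code
        if body.startswith("/"):
            # Qualifier line: attach it to the most recent feature.
            key, _, value = body[1:].partition("=")
            features[-1]["qualifiers"][key] = value.strip('"')
        else:
            # Feature line: a key followed by a start..end location.
            ftype, location = body.split()
            start, end = (int(x) for x in location.split(".."))
            features.append(
                {"type": ftype, "start": start, "end": end, "qualifiers": {}}
            )
    return features


def exons_by_number(features, wanted):
    """Map exon number -> (start, end) for the requested exon numbers."""
    return {
        f["qualifiers"]["number"]: (f["start"], f["end"])
        for f in features
        if f["type"] == "exon" and f["qualifiers"].get("number") in wanted
    }


features = parse_features(FT_LINES)
selected = exons_by_number(features, {"2", "3"})
print(selected)
```

Note that the file coordinates are 1-based and inclusive, while the FeatureLocation values printed above start one lower (498 rather than 499).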
Chang
-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock at googlemail.com]
Sent: Friday, November 04, 2016 11:54 AM
To: Liu, Chang <cliu32 at wustl.edu>
Cc: biopython at mailman.open-bio.org
Subject: Re: [Biopython] SeqIO.parse for imgt
This does look like a small change. We would expect six semi-colons in the ID line, e.g.
ID X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.
ID CD789012; SV 4; linear; genomic DNA; HTG; MAM; 500 BP.
0. Primary accession number
1. Sequence version number
2. Topology: 'circular' or 'linear'
3. Molecule type (e.g. 'genomic DNA')
4. Data class (e.g. 'STD')
5. Taxonomic division (e.g. 'PRO')
6. Sequence length (e.g. '4639675 BP.')
However, the hla.dat file uses only five semi-colons. It looks like the data class has been removed, leaving just HUM (human) as the taxonomic division, e.g.
ID HLA00001; SV 1; standard; DNA; HUM; 3503 BP.
ID HLA02169; SV 1; standard; DNA; HUM; 3291 BP.
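The difference between the layouts can be checked with a few lines of plain Python. This is only a rough sketch for comparing the ID lines quoted in this thread, not the code Biopython uses internally, and `split_id_line` is a hypothetical helper name.

```python
def split_id_line(line):
    """Split an 'ID ...' line into its semicolon-separated fields."""
    if not line.startswith("ID"):
        raise ValueError("not an ID line: %r" % line)
    body = line[2:].strip()  # drop the leading 'ID' line code
    return [field.strip() for field in body.split(";")]


# ID lines quoted in this thread: the expected layout, the current
# hla.dat layout, and the older hla.dat layout.
EXAMPLES = [
    "ID X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP.",
    "ID HLA00001; SV 1; standard; DNA; HUM; 3503 BP.",
    "ID HLA02803 standard; DNA; HUM; 1883 BP.",
]

for line in EXAMPLES:
    fields = split_id_line(line)
    # N semicolons always give N + 1 fields.
    print(line.count(";"), "semi-colons ->", fields)
```

Counting the semicolons first would let a parser decide which layout it is looking at before assigning meaning to each field.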
Peter
On Fri, Nov 4, 2016 at 4:37 PM, Liu, Chang <cliu32 at wustl.edu> wrote:
> Hi, Peter,
> Thank you for the quick response. I have sent a message to EMBL-EBI
> requesting a list of changes. I hope this can be fixed with a minor
> tweak. I will keep you posted when I hear back.
> Best regards,
> Chang