[EMBOSS] EMBOSS seqret : IntelliGenetics and new DOS lines
Peter
biopython at maubp.freeserve.co.uk
Mon Jul 20 15:41:43 UTC 2009
Hi all,
I've just updated my Mac to EMBOSS 6.1.0, and have found an
issue with seqret conversion of IntelliGenetics files. After some
digging, I think this problem relates to having DOS new lines in
a file on Unix (in my case, Mac OS X).
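To check which newline convention a file actually uses before blaming the parser, something like the following can help. This is only a minimal sketch in Python, not part of EMBOSS; the function names are my own invention for illustration.

```python
def count_newline_styles(data: bytes) -> dict:
    """Count CRLF (DOS/Windows), bare LF (Unix) and bare CR line endings."""
    crlf = data.count(b"\r\n")
    return {
        "crlf": crlf,
        # Every CRLF also contains one LF and one CR, so subtract those.
        "lf": data.count(b"\n") - crlf,
        "cr": data.count(b"\r") - crlf,
    }


def count_file_newline_styles(path: str) -> dict:
    """Same as above, but read the bytes from a file (binary mode, no translation)."""
    with open(path, "rb") as handle:
        return count_newline_styles(handle.read())
```

For example, `count_file_newline_styles("emboss_ig.txt")` should report only "lf" endings for the Unix copy of the example file, and only "crlf" endings for a DOS/Windows copy.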
For illustration, I'm using the example file from the EMBOSS
website, saved to disk (using Unix new lines on a Mac):
http://emboss.sourceforge.net/docs/themes/seqformats/ig
Using EMBOSS 6.0.1, there was a problem:
$ embossversion
Writes the current EMBOSS version number to a file
6.0.1
$ seqret -sequence emboss_ig.txt -sformat ig -osformat fasta -auto -filter
>HSFAU
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaaH-sapiensfaugenebasesH
SFAUctaccattttccctctcgattctatatgtacactcgggacaagttctcctgatcga
aaacggcaaaactaaggccccaagtaggaatgccttagttttcggggttaacaatgatta
acactgagcctcacacccacgcgatgccctcagctcctcgctcagcgctctcaccaacag
ccgtagcccgcagccccgctggacaccggttctccatccccgcagcgtagcccggaacat
ggtagctgccatctttacctgctacgccagccttctgtgcgcgcaactgtctggtcccgc
cccgtcctgcgcgagctgctgcccaggcaggttcgccggtgcgagcgtaaaggggcggag
ctaggactgccttgggcggtacaaatagcagggaaccgcgcggtcgctcagcagtgacgt
gacacgcagcccacggtctgtactgacgcgccctcgcttcttcctctttctcgactccat
cttcgcggtagctgggaccgccgttcaggtaagaatggggccttggctggatccgaaggg
cttgtagcaggttggctgcggggtcagaaggcgcggggggaaccgaagaacggggcctgc
tccgtggccctgctccagtccctatccgaactccttgggaggcactggccttccgcacgt
gagccgccgcgaccaccatcccgtcgcgatcgtttctggaccgctttccactcccaaatc
tcctttatcccagagcatttcttggcttctcttacaagccgtcttttctttactcagtcg
ccaatatgcagctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccagg
aaacggtcgcccagatcaaggtaaggctgcttggtgcgccctgggttccattttcttgtg
ctcttcactctcgcggcccgagggaacgcttacgagccttatctttccctgtaggctcat
gtagcctcactggagggcattgccccggaagatcaagtcgtgctcctggcaggcgcgccc
ctggaggatgaggccactctgggccagtgcggggtggaggccctgactaccctggaagta
gcaggccgcatgcttggaggtgagtgagagaggaatgttctttgaagtaccggtaagcgt
ctagtgagtgtggggtgcatagtcctgacagctgagtgtcacacctatggtaatagagta
cttctcactgtcttcagttcagagtgattcttcctgtttacatccctcatgttgaacaca
gacgtccatgggagactgagccagagtgtagttgtatttcagtcacatcacgagatccta
gtctggttatcagcttccacactaaaaattaggtcagaccaggccccaaagtgctctata
aattagaagctggaagatcctgaaatgaaacttaagatttcaaggtcaaatatctgcaac
tttgttctcattacctattgggcgcagcttctctttaaaggcttgaattgagaaaagagg
ggttctgctgggtggcaccttcttgctcttacctgctggtgccttcctttcccactacag
gtaaagtccatggttccctggcccgtgctggaaaagtgagaggtcagactcctaaggtga
gtgagagtattagtggtcatggtgttaggactttttttcctttcacagctaaaccaagtc
cctgggctcttactcggtttgccttctccctccctggagatgagcctgagggaagggatg
ctaggtgtggaagacaggaaccagggcctgattaaccttcccttctccaggtggccaaac
aggagaagaagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcgct
ttgtcaacgttgtgcccacctttggcaagaagaagggccccaatgccaactcttaagtct
tttgtaattctggctttctctaataaaaaagccacttagttcagtcatcgcattgtttca
tctttacttgcaaggcctcagggagaggtgtgcttctcgg
i.e. The two sequences have been merged into one, with the
description and name of the second sequence embedded in the
sequence data (the "H-sapiensfaugenebasesHSFAU" text above).
Using EMBOSS 6.1.0, the following now works:
$ embossversion
Reports the current EMBOSS version number
6.1.0
$ seqret -sequence emboss_ig.txt -sformat ig -osformat fasta -auto -filter
>HSFAU H.sapiens fau mRNA, 518 bases
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
>HSFAU1 H.sapiens fau 1 gene, 2016 bases
ctaccattttccctctcgattctatatgtacactcgggacaagttctcctgatcgaaaac
ggcaaaactaaggccccaagtaggaatgccttagttttcggggttaacaatgattaacac
tgagcctcacacccacgcgatgccctcagctcctcgctcagcgctctcaccaacagccgt
agcccgcagccccgctggacaccggttctccatccccgcagcgtagcccggaacatggta
gctgccatctttacctgctacgccagccttctgtgcgcgcaactgtctggtcccgccccg
tcctgcgcgagctgctgcccaggcaggttcgccggtgcgagcgtaaaggggcggagctag
gactgccttgggcggtacaaatagcagggaaccgcgcggtcgctcagcagtgacgtgaca
cgcagcccacggtctgtactgacgcgccctcgcttcttcctctttctcgactccatcttc
gcggtagctgggaccgccgttcaggtaagaatggggccttggctggatccgaagggcttg
tagcaggttggctgcggggtcagaaggcgcggggggaaccgaagaacggggcctgctccg
tggccctgctccagtccctatccgaactccttgggaggcactggccttccgcacgtgagc
cgccgcgaccaccatcccgtcgcgatcgtttctggaccgctttccactcccaaatctcct
ttatcccagagcatttcttggcttctcttacaagccgtcttttctttactcagtcgccaa
tatgcagctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaac
ggtcgcccagatcaaggtaaggctgcttggtgcgccctgggttccattttcttgtgctct
tcactctcgcggcccgagggaacgcttacgagccttatctttccctgtaggctcatgtag
cctcactggagggcattgccccggaagatcaagtcgtgctcctggcaggcgcgcccctgg
aggatgaggccactctgggccagtgcggggtggaggccctgactaccctggaagtagcag
gccgcatgcttggaggtgagtgagagaggaatgttctttgaagtaccggtaagcgtctag
tgagtgtggggtgcatagtcctgacagctgagtgtcacacctatggtaatagagtacttc
tcactgtcttcagttcagagtgattcttcctgtttacatccctcatgttgaacacagacg
tccatgggagactgagccagagtgtagttgtatttcagtcacatcacgagatcctagtct
ggttatcagcttccacactaaaaattaggtcagaccaggccccaaagtgctctataaatt
agaagctggaagatcctgaaatgaaacttaagatttcaaggtcaaatatctgcaactttg
ttctcattacctattgggcgcagcttctctttaaaggcttgaattgagaaaagaggggtt
ctgctgggtggcaccttcttgctcttacctgctggtgccttcctttcccactacaggtaa
agtccatggttccctggcccgtgctggaaaagtgagaggtcagactcctaaggtgagtga
gagtattagtggtcatggtgttaggactttttttcctttcacagctaaaccaagtccctg
ggctcttactcggtttgccttctccctccctggagatgagcctgagggaagggatgctag
gtgtggaagacaggaaccagggcctgattaaccttcccttctccaggtggccaaacagga
gaagaagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgt
caacgttgtgcccacctttggcaagaagaagggccccaatgccaactcttaagtcttttg
taattctggctttctctaataaaaaagccacttagttcagtcatcgcattgtttcatctt
tacttgcaaggcctcagggagaggtgtgcttctcgg
i.e. There was a problem with this example file in EMBOSS 6.0.1,
but things look fine in EMBOSS 6.1.0. Great :)
However, if we now convert this input file to use DOS/Windows
newlines, and repeat the test (on Mac OS X, so Unix):
$ embossversion
Reports the current EMBOSS version number
6.1.0
$ seqret -sequence emboss_ig.txt -sformat ig -osformat fasta -auto -filter
H.sapiens fau mRNA, 518 bases
ttcctctttctcgactccatcttcgcggtagctgggaccgccgttcagtcgccaatatgc
agctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaacggtcg
cccagatcaaggctcatgtagcctcactggagggcattgccccggaagatcaagtcgtgc
tcctggcaggcgcgcccctggaggatgaggccactctgggccagtgcggggtggaggccc
tgactaccctggaagtagcaggccgcatgcttggaggtaaagttcatggttccctggccc
gtgctggaaaagtgagaggtcagactcctaaggtggccaaacaggagaagaagaagaaga
agacaggtcgggctaagcggcggatgcagtacaaccggcgctttgtcaacgttgtgccca
cctttggcaagaagaagggccccaatgccaactcttaagtcttttgtaattctggctttc
tctaataaaaaagccacttagttcagtcaaaaaaaaaa
H.sapiens fau 1 gene, 2016 bases
ctaccattttccctctcgattctatatgtacactcgggacaagttctcctgatcgaaaac
ggcaaaactaaggccccaagtaggaatgccttagttttcggggttaacaatgattaacac
tgagcctcacacccacgcgatgccctcagctcctcgctcagcgctctcaccaacagccgt
agcccgcagccccgctggacaccggttctccatccccgcagcgtagcccggaacatggta
gctgccatctttacctgctacgccagccttctgtgcgcgcaactgtctggtcccgccccg
tcctgcgcgagctgctgcccaggcaggttcgccggtgcgagcgtaaaggggcggagctag
gactgccttgggcggtacaaatagcagggaaccgcgcggtcgctcagcagtgacgtgaca
cgcagcccacggtctgtactgacgcgccctcgcttcttcctctttctcgactccatcttc
gcggtagctgggaccgccgttcaggtaagaatggggccttggctggatccgaagggcttg
tagcaggttggctgcggggtcagaaggcgcggggggaaccgaagaacggggcctgctccg
tggccctgctccagtccctatccgaactccttgggaggcactggccttccgcacgtgagc
cgccgcgaccaccatcccgtcgcgatcgtttctggaccgctttccactcccaaatctcct
ttatcccagagcatttcttggcttctcttacaagccgtcttttctttactcagtcgccaa
tatgcagctctttgtccgcgcccaggagctacacaccttcgaggtgaccggccaggaaac
ggtcgcccagatcaaggtaaggctgcttggtgcgccctgggttccattttcttgtgctct
tcactctcgcggcccgagggaacgcttacgagccttatctttccctgtaggctcatgtag
cctcactggagggcattgccccggaagatcaagtcgtgctcctggcaggcgcgcccctgg
aggatgaggccactctgggccagtgcggggtggaggccctgactaccctggaagtagcag
gccgcatgcttggaggtgagtgagagaggaatgttctttgaagtaccggtaagcgtctag
tgagtgtggggtgcatagtcctgacagctgagtgtcacacctatggtaatagagtacttc
tcactgtcttcagttcagagtgattcttcctgtttacatccctcatgttgaacacagacg
tccatgggagactgagccagagtgtagttgtatttcagtcacatcacgagatcctagtct
ggttatcagcttccacactaaaaattaggtcagaccaggccccaaagtgctctataaatt
agaagctggaagatcctgaaatgaaacttaagatttcaaggtcaaatatctgcaactttg
ttctcattacctattgggcgcagcttctctttaaaggcttgaattgagaaaagaggggtt
ctgctgggtggcaccttcttgctcttacctgctggtgccttcctttcccactacaggtaa
agtccatggttccctggcccgtgctggaaaagtgagaggtcagactcctaaggtgagtga
gagtattagtggtcatggtgttaggactttttttcctttcacagctaaaccaagtccctg
ggctcttactcggtttgccttctccctccctggagatgagcctgagggaagggatgctag
gtgtggaagacaggaaccagggcctgattaaccttcccttctccaggtggccaaacagga
gaagaagaagaagaagacaggtcgggctaagcggcggatgcagtacaaccggcgctttgt
caacgttgtgcccacctttggcaagaagaagggccccaatgccaactcttaagtcttttg
taattctggctttctctaataaaaaagccacttagttcagtcatcgcattgtttcatctt
tacttgcaaggcctcagggagaggtgtgcttctcgg
i.e. As displayed, the ">" and the identifier are missing from
every FASTA header line (compare with the Unix newline output
from EMBOSS 6.1.0 above).
So, it looks like EMBOSS 6.1.0 fixed one problem with
IntelliGenetics files, but there is still an issue when the
input uses DOS/Windows newlines.
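For anyone wanting to reproduce this, the DOS/Windows copy of the input file can be made with a few lines of Python. This is a rough sketch rather than the exact method I used; the function names and the output filename in the usage note are just illustrative. Converting back to Unix newlines before calling seqret might also serve as a workaround, although I have only described the failure here, not tested the workaround.

```python
def to_unix_newlines(data: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


def to_dos_newlines(data: bytes) -> bytes:
    """Convert any line endings to CRLF (normalise first to avoid CRCRLF)."""
    return to_unix_newlines(data).replace(b"\n", b"\r\n")


def convert_file(src: str, dst: str, converter) -> None:
    """Read src as raw bytes, apply converter, and write the result to dst."""
    with open(src, "rb") as handle:
        data = handle.read()
    with open(dst, "wb") as handle:
        handle.write(converter(data))
```

Usage would be along the lines of `convert_file("emboss_ig.txt", "emboss_ig_dos.txt", to_dos_newlines)`, followed by the same seqret command on the new file.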
Peter C.
P.S. Should I have reported this possible bug via SourceForge?
P.P.S. Back in 2006, I reported a similar data corruption
issue when reading Stockholm/Pfam files with DOS newlines
(SourceForge Bug #1588956, long since fixed). It seems to
me that EMBOSS would benefit from explicit testing of all
the file formats using DOS/Windows newlines when run on
Unix, and vice versa. Does that sound feasible, or just
hopelessly ambitious?
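To make the testing idea a bit more concrete, here is a minimal sketch in Python of what such a check could look like. It is not based on the actual EMBOSS test suite; all the names are hypothetical, and the two toy "converters" only stand in for a real parser. In practice `convert` could wrap a call to seqret with the command line shown above.

```python
def newline_variants(data: bytes) -> dict:
    """Return the same content with Unix (LF) and DOS (CRLF) newlines."""
    unix = data.replace(b"\r\n", b"\n")
    return {"unix": unix, "dos": unix.replace(b"\n", b"\r\n")}


def check_newline_independence(convert, data: bytes) -> list:
    """Return the names of newline variants whose converted output
    differs from the output for the Unix variant (empty list = pass)."""
    variants = newline_variants(data)
    expected = convert(variants["unix"])
    return [name for name, payload in variants.items()
            if convert(payload) != expected]


def naive_lines(data: bytes) -> list:
    """Toy parser: splits on LF only, so a stray CR stays on each line."""
    return [line for line in data.split(b"\n") if line]


def robust_lines(data: bytes) -> list:
    """Toy parser: bytes.splitlines() handles LF, CR and CRLF alike."""
    return data.splitlines()


sample = b"HSFAU\nacgt\n"
print(check_newline_independence(naive_lines, sample))
print(check_newline_independence(robust_lines, sample))
```

The naive parser is flagged for the "dos" variant while the robust one passes; running the same check over one example file per supported format would be the "explicit testing" I have in mind.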