[Biojava-dev] Biojava cant parse Uniprot Q5ZT67
Andreas Prlic
andreas at sdsc.edu
Fri Jun 14 14:30:34 UTC 2013
Yes please, it would be great if you could commit the patch (on GitHub).
One thought: if this is run on Windows, shouldn't it also
check for "\r\n"?
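
To illustrate the concern, here is a minimal sketch of a helper that collapses Unix ("\n"), Windows ("\r\n") and bare "\r" line breaks into a single space. The class name LineBreaks and the method collapse are hypothetical and not part of BioJava; whether the parser ever sees "\r\n" at this point depends on how the file is read, which is an assumption here.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of BioJava: collapses any embedded line
// break, together with the whitespace around it, into one space.
final class LineBreaks {
    // One or more \r or \n characters, with optional surrounding whitespace.
    private static final Pattern LINE_BREAK = Pattern.compile("\\s*[\\r\\n]+\\s*");

    static String collapse(String value) {
        return LINE_BREAK.matcher(value).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // The strain name from Q5ZT67, split as in the OS lines of the record.
        String unix = "strain Philadelphia 1 /\nATCC 33152 / DSM 7513";
        String windows = "strain Philadelphia 1 /\r\nATCC 33152 / DSM 7513";
        System.out.println(collapse(unix));
        System.out.println(collapse(windows));
    }
}
```

With this approach both variants of the name come out as one line, so the check in SimpleNCBITaxonName would no longer trip on either.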
Thanks
Andreas
On Fri, Jun 14, 2013 at 6:23 AM, Simon Foote <simon.foote at nrc-cnrc.gc.ca> wrote:
> I had similar issues when parsing some bacterial sequences.
>
> I made the change in the org.biojavax.bio.seq.io.UniProtFormat file at
> line 317 and now it works fine.
>
> } else if (sectionKey.equals(SOURCE_TAG)) {
>     // use SOURCE_TAG and TAXON_TAG values
>     String sciname = null;
>     String comname = null;
>     List synonym = new ArrayList();
>     int taxid = 0;
>     for (int i = 0; i < section.size(); i++) {
>         String tag = ((String[])section.get(i))[0];
>         // line 317:
>         String value = ((String[])section.get(i))[1].trim();
>         // Replace any newlines with spaces
>         value = value.replace("\n", " ");
>
> I can commit the change if you like.
>
> Cheers,
> Simon
>
>
> On 06/13/2013 04:49 PM, Spencer Bliven wrote:
>
>> What if we just strip out the newline characters in OS records? That seems
>> better than ignoring them or throwing an exception.
>>
>>
>> On Thu, Jun 13, 2013 at 4:22 AM, <chris.morris at stfc.ac.uk> wrote:
>>
>>> Hi,
>>>
>>> BioJava1.8.2 is unable to parse:
>>> http://www.uniprot.org/uniprot/Q5ZT67.txt
>>>
>>> It reports:
>>>
>>> NCBI taxonomy names cannot embed new lines - at:23, in name: <strain
>>> Philadelphia 1 / ATCC 33152 / DSM 7513>
>>> because of these lines:
>>>
>>> OS Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 /
>>> OS ATCC 33152 / DSM 7513).
>>>
>>> It seems to me that the record is mistaken.
>>>
>>> If not, biojava needs a fix. The fix would be to replace:
>>>
>>> public SimpleNCBITaxonName(String nameClass, String name) {
>>>     if (nameClass==null) throw new IllegalArgumentException("Name class cannot be null");
>>>     if (name==null) throw new IllegalArgumentException("Name cannot be null");
>>>     if (name.indexOf('\n') >= 0) throw new IllegalArgumentException("NCBI taxonomy names cannot embed new lines - at:"+name.indexOf('\n')+", in name: <"+name+">");
>>>     this.nameClass = nameClass;
>>>     this.name = name;
>>> }
>>>
>>> With:
>>>
>>> public SimpleNCBITaxonName(String nameClass, String name) {
>>>     if (nameClass==null) throw new IllegalArgumentException("Name class cannot be null");
>>>     if (name==null) throw new IllegalArgumentException("Name cannot be null");
>>>     this.nameClass = nameClass;
>>>     this.name = name.replaceAll("\\n", " ");
>>> }
>>>
>>> Regards,
>>> Chris Morris
>>>
>>> -----Original Message-----
>>> From: Morris, Chris (STFC,DL,SC)
>>> Sent: 13 June 2013 12:14
>>> To: 'Nikos Pinotsis'
>>> Subject: RE: error: Cannot recognise format of the record, please refer to the help pages
>>>
>>> Hi Nikos,
>>>
>>> Thank you for this important defect report.
>>>
>>> The library that PiMS uses to process Uniprot files reports this problem:
>>>
>>> NCBI taxonomy names cannot embed new lines - at:23, in name: <strain
>>> Philadelphia 1 / ATCC 33152 / DSM 7513>
>>>
>>> In this part of the Uniprot record:
>>>
>>> GN Name=legC7; OrderedLocusNames=lpg2298;
>>> OS Legionella pneumophila subsp. pneumophila (strain Philadelphia 1 /
>>> OS ATCC 33152 / DSM 7513).
>>> OC Bacteria; Proteobacteria; Gammaproteobacteria; Legionellales;
>>> OC Legionellaceae; Legionella.
>>>
>>> I will release PiMS4.4 next week, and I will include a workaround in it.
>>> I will also report the problem to Uniprot.
>>>
>>> Meanwhile, if you use a reference to the gene instead:
>>> GenBank YP_007567339.1
>>> Then PiMS does upload the sequences successfully.
>>>
>>> Regards,
>>> Chris
>>>
>>> -----Original Message-----
>>> From: owner-pims-defects at dlmail2.dl.ac.uk [mailto:owner-pims-defects at dlmail2.dl.ac.uk]
>>> On Behalf Of Nikos Pinotsis
>>> Sent: 12 June 2013 19:24
>>> To: pims-defects
>>> Subject: error: Cannot recognise format of the record, please refer to the help pages
>>>
>>> Hi ,
>>>
>>> I am using PiMS at the http://pims.structuralbiology.eu:8080 site and
>>> I am trying to download the target Q5ZT67_LEGPH or Q5ZT67 from several
>>> databases; however, I always get the same error, that the format of
>>> the record is not recognised. Can you suggest a solution?
>>>
>>> thanks
>>> Nikos
>>>
>>> --
>>> Dr. Nikos Pinotsis
>>> Professor Gabriel Waksman's Group
>>> Crystallography , Birkbeck College
>>> University of London
>>> Malet Street
>>> London WC1E 7HX, UK
>>> T: +44 (0)207 631 6827
>>> F: +44 (0)207 631 6803
>>> M: +44 (0)792 384 3593
>>>
>>>
>>> _______________________________________________
>>> biojava-dev mailing list
>>> biojava-dev at lists.open-bio.org
>>> http://lists.open-bio.org/mailman/listinfo/biojava-dev
>>>
>>
>
> --
> Bioinformatics Specialist
> National Research Council of Canada | Conseil national de recherches Canada
> Government of Canada | Gouvernement du Canada
> 100 Sussex Dr, Ottawa, Canada K1A 0R6
> Telephone | Téléphone 613-990-3600 / Facsimile | Télécopieur 613-952-9092
>
>
>