[Biojava-l] Bad PDB files and batch processing with PDBFileReader

Thu Oct 28 23:45:05 UTC 2010

It's not a big deal - after all if you use CA only, chains with no
CA's aren't important, and the error messages aren't that long.  But
I'm going to switch anyway...
I'm getting the dreaded "can't read line length in file" error while
trying to check out biojava-live/trunk, though.

-da
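
As a rough illustration of the CA-only point above (chains without any CA atoms can simply be dropped), here is a minimal sketch. It uses a stand-in SimpleChain class and a CaFilter helper, both hypothetical and not BioJava types. The chain IDs mirror the 1a02 case from the quoted messages; the atom counts are made up for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for a parsed chain; NOT a BioJava class.
class SimpleChain {
    final String id;
    final int caAtomCount; // number of C-alpha atoms found for this chain

    SimpleChain(String id, int caAtomCount) {
        this.id = id;
        this.caAtomCount = caAtomCount;
    }
}

class CaFilter {
    // Returns the IDs of chains that have at least one CA atom, in input order.
    static List<String> chainsWithCa(List<SimpleChain> chains) {
        List<String> kept = new ArrayList<String>();
        for (SimpleChain chain : chains) {
            if (chain.caAtomCount > 0) {
                kept.add(chain.id);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SimpleChain> chains = Arrays.asList(
                new SimpleChain("A", 0),   // e.g. a DNA chain: no CA atoms
                new SimpleChain("B", 0),
                new SimpleChain("F", 3));  // illustrative count only
        System.out.println(chainsWithCa(chains));
    }
}
```

Filtering on the parsed result like this keeps the batch loop independent of how the parser reports CA-less chains.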

On Thu, Oct 28, 2010 at 10:28, Andreas Prlic <andreas at sdsc.edu> wrote:
> Hi Daniel,
>
> I just checked, this is a bug which is already resolved in 3.0... If
> it is an issue for you, you might want to upgrade... (should be very
> easy, if you start using Maven ...)
>
> Thanks,
> Andreas
>
> On Wed, Oct 27, 2010 at 9:04 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>> I'm using 1.7, partially because my distro had a package for it and
>> partially because I was initially using the online Javadoc a lot.
>> PDB ID 1a02 with CA only parses but gives 4 StructureExceptions; I've
>> pasted them below.  Chain A exists in the PDB but is DNA; polypeptide
>> chain F appears to parse correctly.
>>
>> -da
>>
>> org.biojava.bio.structure.StructureException: could not find chain A
>>        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>        at fragalign.pair.getStructs(pair.java:42)
>>        at fragalign.Main.main(Main.java:40)
>> org.biojava.bio.structure.StructureException: could not find chain B
>>        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:217)
>>        at org.biojava.bio.structure.StructureImpl.findChain(StructureImpl.java:223)
>>        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2303)
>>        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>        at fragalign.pair.getStructs(pair.java:42)
>>        at fragalign.Main.main(Main.java:40)
>> org.biojava.bio.structure.StructureException: did not find chain with chainId >A<
>>        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>        at fragalign.pair.getStructs(pair.java:42)
>>        at fragalign.Main.main(Main.java:40)
>> org.biojava.bio.structure.StructureException: did not find chain with chainId >B<
>>        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:541)
>>        at org.biojava.bio.structure.StructureImpl.getChainByPDB(StructureImpl.java:548)
>>        at org.biojava.bio.structure.io.PDBFileParser.linkChains2Compound(PDBFileParser.java:2340)
>>        at org.biojava.bio.structure.io.PDBFileParser.triggerEndFileChecks(PDBFileParser.java:2210)
>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:2107)
>>        at org.biojava.bio.structure.io.PDBFileParser.parsePDBFile(PDBFileParser.java:1963)
>>        at org.biojava.bio.structure.io.PDBFileReader.getStructureById(PDBFileReader.java:452)
>>        at fragalign.pair.getStructs(pair.java:42)
>>        at fragalign.Main.main(Main.java:40)
>>
>>
>> On Wed, Oct 27, 2010 at 17:47, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>> I assume AtomCache is a new class in BioJava3?
>>>
>>> yes it is... http://biojava.org/wiki/BioJava:CookBook:PDB:read3.0
>>>
>>>>
>>>> I must give you my embarrassed apology...after a bunch of testing I
>>>> finally figured out that I had misunderstood where the Parser's error
>>>> handling returns control and started going after the wrong exceptions.
>>>> It does look like, if setParseCAOnly is true, the reader throws an
>>>> exception on chains with no CA's instead of just skipping them, though
>>>> the other chains are still parsed into the structure.
>>>
>>> This sounds like there might be a problem with CA only... Do you have
>>> an example ID? Also, are you on BioJava 1.7 or 3.0?
>>>
>>> Andreas
>>>
>>>
>>>
>>>>
>>>> -da
>>>>
>>>> On Tue, Oct 26, 2010 at 22:19, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>> Hi Daniel,
>>>>>
>>>>> PDB files are better nowadays, due to remediation, however there are
>>>>> still issues..
>>>>>
>>>>> It sounds like you just want to figure out how to do the try/catch
>>>>> block properly. You could do something like this:
>>>>>
>>>>>                boolean splitFileOrganisation = true;
>>>>>                AtomCache cache = new AtomCache("/path/to/your/installation/", splitFileOrganisation);
>>>>>
>>>>>                String[] pdbIDs = new String[]{"4hhb", "1cdg","5pti","1gav", "WRONGID" };
>>>>>
>>>>>                for (String pdbID : pdbIDs){
>>>>>
>>>>>                        try {
>>>>>                                Structure s = cache.getStructure(pdbID);
>>>>>                                if ( s == null) {
>>>>>                                        System.out.println("could not find structure " + pdbID);
>>>>>                                        continue;
>>>>>                                }
>>>>>                                // do something with the structure - your inner loop
>>>>>                                System.out.println(s);
>>>>>
>>>>>                        } catch (Exception e){
>>>>>                                // something crazy happened...
>>>>>                                System.err.println("Can't load structure " + pdbID + " reason: " + e.getMessage());
>>>>>                                e.printStackTrace();
>>>>>                        }
>>>>>                }
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Oct 26, 2010 at 9:59 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>> Glad to hear it; who doesn't like support or clean interfaces?  No
>>>>>> offense intended, by the way, with respect to PDB errors - obviously
>>>>>> the PDB is an indispensable resource for all protein scientists.
>>>>>>
>>>>>> I am looking at many (fixed-length) pieces of protein chains and doin'
>>>>>> stuff with 'em.  My current code has a pair of nested while loops; the
>>>>>> outer iterates over PDB entries (locally rsync'd copy), parsing them
>>>>>> and the inner iterates over the pieces from each.  When
>>>>>> StructureExceptions come out of my PDBFileReader object I want to
>>>>>> continue the outer loop, moving on to the next set of files without
>>>>>> executing any of the code that depends on correct StructureImpl
>>>>>> objects from the reader (database updates, the inner loop).
>>>>>> Since the reader's methods have their own try-catch blocks, a thrown
>>>>>> StructureException is stopped there and never reaches my own error
>>>>>> handling.  I just need to know when those errors occur so I can skip
>>>>>> those proteins - I am presuming that the correct entries will outweigh
>>>>>> the problem ones by a significant factor and the overall data won't be
>>>>>> seriously impacted.
>>>>>>
>>>>>> -da
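
The batch structure described above (an outer loop over entries, an inner loop over fixed-length pieces, and a skip whenever an entry fails) might look roughly like the sketch below. EntryLoader is a hypothetical interface standing in for the real parsing step, and each entry is reduced to a plain string of residues purely for illustration; nothing here is BioJava API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FragmentBatch {
    // Hypothetical loader: stands in for the real parsing step and may throw.
    interface EntryLoader {
        String loadSequence(String entryId) throws Exception;
    }

    // Collects every window of fragmentLength residues from each loadable entry.
    // IDs of entries that fail to load, or are too short, are added to skipped.
    static List<String> collectFragments(List<String> entryIds, EntryLoader loader,
                                         int fragmentLength, List<String> skipped) {
        List<String> fragments = new ArrayList<String>();
        for (String entryId : entryIds) {                 // outer loop: entries
            String sequence;
            try {
                sequence = loader.loadSequence(entryId);
            } catch (Exception e) {
                skipped.add(entryId);                     // remember the failure
                continue;                                 // never reach the inner loop
            }
            if (sequence == null || sequence.length() < fragmentLength) {
                skipped.add(entryId);
                continue;
            }
            // inner loop: fixed-length pieces
            for (int i = 0; i + fragmentLength <= sequence.length(); i++) {
                fragments.add(sequence.substring(i, i + fragmentLength));
            }
        }
        return fragments;
    }

    public static void main(String[] args) {
        EntryLoader loader = new EntryLoader() {
            public String loadSequence(String entryId) throws Exception {
                if (entryId.equals("bad")) {
                    throw new Exception("could not parse " + entryId);
                }
                return "ACDEF"; // made-up residues for illustration
            }
        };
        List<String> skipped = new ArrayList<String>();
        List<String> fragments =
                collectFragments(Arrays.asList("ok", "bad"), loader, 3, skipped);
        System.out.println(fragments + " skipped=" + skipped);
    }
}
```

Keeping the load step in its own try block means database updates and the inner loop only ever see entries that loaded cleanly. This only works if the loader actually lets its exceptions propagate.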
>>>>>>
>>>>>> On Tue, Oct 26, 2010 at 21:11, Andreas Prlic <andreas at sdsc.edu> wrote:
>>>>>>> Hi Daniel,
>>>>>>>
>>>>>>> can you explain a bit more what you are doing, in particular what
>>>>>>> errors you would like to deal with on your end?  You should not need
>>>>>>> to worry too much about exception handling. Are there any special
>>>>>>> cases you are interested in?  In this case we should support you with
>>>>>>> a clean interface rather than exception handling from your end...
>>>>>>>
>>>>>>> Andreas
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 26, 2010 at 8:54 PM, Daniel Asarnow <dasarnow at gmail.com> wrote:
>>>>>>>> Hi all,
>>>>>>>> Let me first say thanks to all the BioJava community members for
>>>>>>>> delivering such a useful set of libraries, and that I'm still a newbie
>>>>>>>> when it comes to BioJava (and Java) so forgive me if my question is
>>>>>>>> too trivial.
>>>>>>>>
>>>>>>>> I am doing work on lots (at least thousands) of PDB files from RCSB.
>>>>>>>> As is commonly known, these are often rife with errors which can lead
>>>>>>>> to exceptions during parsing with PDBFileParser.  Because
>>>>>>>> PDBFileParser's methods contain their own try-catch blocks, exception
>>>>>>>> propagation stops there and my code proceeds blindly along regardless
>>>>>>>> of any error checking I do.  I would like to catch the exceptions up
>>>>>>>> in my code where the parser is called, so that I can branch to a
>>>>>>>> continue statement and have my batch processing loops move on to the
>>>>>>>> next file.
>>>>>>>> Should I edit out the try-catch blocks and compile my own version of
>>>>>>>> the library?  Or should I test the returned StructureImpl objects for
>>>>>>>> possession of the fields in question?  In that case, I'm not sure
>>>>>>>> which properties will give the most general success information...and
>>>>>>>> I'd rather not have to check for /every/ property being correct.
>>>>>>>>
>>>>>>>> If there is some great way to check if an exception was caught down a
>>>>>>>> series of nested method calls, please hit me over the head with it.
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>> -da
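
To make the problem above concrete: when a method catches an exception internally, a try/catch around the call never fires, so the only signal left is the returned object. The sketch below uses a hypothetical quietParse method (not BioJava's PDBFileParser) that swallows its own errors, plus a requireChains check that turns a missing chain back into an exception the batch loop can catch. Checking expected chain IDs is just one assumed option; which properties make a good general check for real structures is left open.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SwallowedErrors {
    // Hypothetical parser that handles its own errors, so callers never see them.
    static List<String> quietParse(List<String> declaredChains, List<String> readableChains) {
        List<String> parsed = new ArrayList<String>();
        for (String id : declaredChains) {
            try {
                if (!readableChains.contains(id)) {
                    throw new IllegalStateException("could not find chain " + id);
                }
                parsed.add(id);
            } catch (IllegalStateException e) {
                // Swallowed here: the error is printed but never propagates.
                System.err.println(e.getMessage());
            }
        }
        return parsed;
    }

    // Re-raises the problem in the caller by validating the returned result.
    static void requireChains(List<String> parsed, List<String> expected) {
        for (String id : expected) {
            if (!parsed.contains(id)) {
                throw new IllegalStateException("missing chain " + id);
            }
        }
    }

    public static void main(String[] args) {
        List<String> declared = Arrays.asList("A", "B", "F");
        List<String> parsed = quietParse(declared, Arrays.asList("F"));
        try {
            requireChains(parsed, declared);
        } catch (IllegalStateException e) {
            // In a batch loop this is where a continue statement would go.
            System.out.println("skipping entry: " + e.getMessage());
        }
    }
}
```

A check of this kind avoids recompiling the library with its try-catch blocks removed, at the cost of deciding up front what a valid result looks like.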
>>>>>>>> _______________________________________________
>>>>>>>> Biojava-l mailing list  -  Biojava-l at lists.open-bio.org
>>>>>>>> http://lists.open-bio.org/mailman/listinfo/biojava-l
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>
>>>
>>>
>>> --
>>> -----------------------------------------------------------------------
>>> Dr. Andreas Prlic
>>> Senior Scientist, RCSB PDB Protein Data Bank
>>> University of California, San Diego
>>> (+1) 858.246.0526
>>> -----------------------------------------------------------------------
>>>
>>