[Biojava-dev] Chain name vs id

Mon Oct 3 19:45:56 UTC 2016

Hi Spencer

Some answers below

> I have questions about the new chain id/name system post-#469.
>
> - Are all Chains guaranteed to have an ID defined? What about a name?
>

Yes, they should always have an id and a name.

> - Are IDs set for structures loaded from PDB?
>

Yes. PDB files do not contain ids, so the parser assigns them as it reads
the file, following the same rules and conventions as mmCIF files (a unique
id for every distinct polymer/non-polymer molecule in the AU). The results
will not always be perfect, e.g. for files without TER records the
separation of non-polymer from polymer chains won't work properly, and the
non-polymer molecules would then not be assigned distinct ids.

> - Are IDs guaranteed to be unique within a Structure?
>

Yes, they are unique within the asymmetric unit (which is what a Structure
is). If the Structure represents a bioassembly then they are still unique
because the symmetry mates get new ids of the form
<original chain id>_<operator id>.
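
As a rough illustration of that id format, here is a minimal Java sketch.
MateChainId and its methods are hypothetical names made up for this
example, not BioJava API, and it assumes that the operator id itself
contains no underscore (the id is split on the last one).

```java
// Hypothetical helper, not part of BioJava: composes and splits ids of the
// form <original chain id>_<operator id>.
// Assumption: the operator id contains no '_' character.
final class MateChainId {
    private MateChainId() {}

    /** Builds the id of a symmetry mate from its two parts. */
    static String compose(String originalChainId, String operatorId) {
        return originalChainId + "_" + operatorId;
    }

    /** Returns the original chain id; ids without '_' are returned as-is. */
    static String originalChainId(String chainId) {
        int sep = chainId.lastIndexOf('_');
        return sep < 0 ? chainId : chainId.substring(0, sep);
    }

    /** Returns the operator id, or null when the id carries none. */
    static String operatorId(String chainId) {
        int sep = chainId.lastIndexOf('_');
        return sep < 0 ? null : chainId.substring(sep + 1);
    }

    public static void main(String[] args) {
        String mate = compose("A", "2");
        System.out.println(mate + ": chain " + originalChainId(mate)
                + ", operator " + operatorId(mate));
    }
}
```

Because each mate carries its operator id, two copies of the same original
chain never end up with the same id.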

> - Are all groups within a Chain object guaranteed to have the same ID?
>

Yes, that's exactly the new definition of Chain in BioJava 5. All groups
within a chain should have the same id (and also the same name).

Chain <-> id is a 1:1 relationship, whilst Chain <-> name is a many:1
relationship.

> - How are cases where the id and name differ mapped to chain objects? Is a
> new chain object created for every tuple (id,name) that has groups defined
> in the file?
>

The id is the primary key of the Chains within a Structure as explained
above. Name is only a secondary identifier that may or may not coincide
with the id.
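
To make the primary/secondary key distinction concrete, here is a small
Java sketch. SketchChain and SketchStructure are hypothetical stand-ins
invented for this example, not the BioJava classes, and the ids and names
in main are made up: the id maps to exactly one chain, while a name can map
to several.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; not BioJava classes.
final class SketchChain {
    final String id;    // primary key, unique within the structure
    final String name;  // secondary identifier, may be shared
    SketchChain(String id, String name) { this.id = id; this.name = name; }
}

final class SketchStructure {
    private final Map<String, SketchChain> byId = new LinkedHashMap<>();
    private final Map<String, List<SketchChain>> byName = new LinkedHashMap<>();

    /** Adds a chain, rejecting a second chain with an id already in use. */
    void addChain(SketchChain chain) {
        if (byId.containsKey(chain.id)) {
            throw new IllegalArgumentException("Duplicate chain id: " + chain.id);
        }
        byId.put(chain.id, chain);
        byName.computeIfAbsent(chain.name, k -> new ArrayList<>()).add(chain);
    }

    /** Chain <-> id is 1:1, so a lookup by id yields one chain or null. */
    SketchChain chainById(String id) {
        return byId.get(id);
    }

    /** Chain <-> name is many:1, so a lookup by name yields a list. */
    List<SketchChain> chainsByName(String name) {
        List<SketchChain> found = byName.get(name);
        return found == null
                ? Collections.<SketchChain>emptyList()
                : Collections.unmodifiableList(found);
    }

    public static void main(String[] args) {
        SketchStructure s = new SketchStructure();
        s.addChain(new SketchChain("A", "A")); // e.g. a polymer
        s.addChain(new SketchChain("E", "A")); // e.g. a ligand annotated under name A
        System.out.println("chains named A: " + s.chainsByName("A").size());
    }
}
```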

> - Should the chain selection syntax (e.g. "4hhb.A") refer to ID or name?
> Should it be specifically restricted to polymer chains (with ligands
> automatically added from all chains based on proximity)?
>

This is open for discussion. There are two possibilities:

1) It refers to name: the most backwards-compatible option. A selection
would then pull both polymers and the non-polymers (ligands) associated
with them, as annotated in the file. In this option, adding ligands based
on proximity would be confusing in my opinion.

2) It refers to id: not backwards compatible, but less ambiguous. A
selection then refers strictly to a single molecule (be it polymer or
non-polymer). We could then have an extra switch in the syntax to also pull
non-polymers by proximity. Pulling by proximity will not always give the
same selection as the one annotated in the file through names (e.g. you
might pull symmetry mates from the next cell too). A disadvantage of this
option is that some other databases (e.g. SCOP) use names and not ids, so
we'd need to convert between them.
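
The difference between the two options can be sketched in Java as follows.
ChainSelection, Mode and the {id, name} pairs are hypothetical and only
illustrate the two semantics; this is not how BioJava resolves selections,
and the proximity switch is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; the names below are illustrative, not BioJava API.
final class ChainSelection {
    enum Mode { BY_NAME, BY_ID }

    private ChainSelection() {}

    /** Splits a selection such as "4hhb.A" into {entry, chain token}. */
    static String[] parse(String selection) {
        int dot = selection.indexOf('.');
        if (dot <= 0 || dot == selection.length() - 1) {
            throw new IllegalArgumentException(
                    "Expected <entry>.<chain>: " + selection);
        }
        return new String[] {
            selection.substring(0, dot), selection.substring(dot + 1)
        };
    }

    /**
     * Returns the ids of the chains matched by the token.
     * Each element of chains is a pair {id, name}.
     */
    static List<String> select(String[][] chains, String token, Mode mode) {
        List<String> ids = new ArrayList<>();
        for (String[] chain : chains) {
            String key = (mode == Mode.BY_ID) ? chain[0] : chain[1];
            if (key.equals(token)) {
                ids.add(chain[0]);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up chains: polymer A, polymer B, and a ligand with id E
        // that the file annotates under name A.
        String[][] chains = { {"A", "A"}, {"B", "B"}, {"E", "A"} };
        String token = parse("4hhb.A")[1];
        System.out.println("by name: " + select(chains, token, Mode.BY_NAME));
        System.out.println("by id:   " + select(chains, token, Mode.BY_ID));
    }
}
```

With these made-up chains, selecting by name returns both the polymer and
the ligand annotated under it, while selecting by id returns a single
molecule.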

Jose