[Biojava-dev] Chain name vs id

Tue Oct 4 10:42:13 UTC 2016

Thanks for the clear description, Jose. My main fear was that some
structures might combine several names under the same id, giving a
many:many relationship that would require some group-level storage. This
seems not to be the case, at least for wwPDB structures. For instance,
waters do not get unified into a single id, but remain as separate chains
based on their nearest polymer. I guess one could construct a valid cif
file that would disobey this rule, but in that case they would just have to
deal with a few incorrect names.
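
As a rough sketch of that check: the snippet below scans per-group (id, name) labels and reports any id that appears with more than one name. It is plain Java and does not use the BioJava API; `Label` and `idsWithSeveralNames` are made-up names for illustration, and the sample labels are invented rather than taken from a real entry.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class ChainIdNameCheck {
    /** Hypothetical per-group label: chain id plus chain name. */
    static final class Label {
        final String id;
        final String name;
        Label(String id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** Returns the ids that occur with more than one name (empty if Chain <-> name is many:1). */
    static Set<String> idsWithSeveralNames(List<Label> labels) {
        // Collect the distinct names seen for each id.
        Map<String, Set<String>> namesById = new TreeMap<>();
        for (Label l : labels) {
            namesById.computeIfAbsent(l.id, k -> new TreeSet<>()).add(l.name);
        }
        // Any id with two or more names breaks the expected relationship.
        Set<String> offenders = new TreeSet<>();
        for (Map.Entry<String, Set<String>> e : namesById.entrySet()) {
            if (e.getValue().size() > 1) {
                offenders.add(e.getKey());
            }
        }
        return offenders;
    }

    public static void main(String[] args) {
        // Invented example: two polymers (A, B), a ligand and a water chain named A,
        // and a water chain named B. Several ids share a name, but no id has two names.
        List<Label> labels = Arrays.asList(
                new Label("A", "A"), new Label("B", "B"),
                new Label("C", "A"), new Label("D", "A"), new Label("E", "B"));
        System.out.println(idsWithSeveralNames(labels)); // prints []
    }
}
```

An empty result means each id can carry a single name, so no group-level storage of names would be needed.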

I do think that the current selection syntax should stay backwards
compatible. Even RCSB is displaying chain names rather than switching to
ids. I agree that pulling ligands by proximity is a bit confusing, but this
is the most consistent way I can see to deal with them if we want to
support ligands with residue selections. With this approach, "1QTY.X"
refers to the polymer components with name X (mapping to a unique Chain
object), plus any ligands that form contacts to that polymer. I would lean
towards dropping waters from substructures completely (otherwise many
waters will be included with multiple polymers), although in principle they
could be treated the same way as ligands.
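
A minimal sketch of that selection rule, to make the behaviour concrete. It is plain Java, not the BioJava API: `ChainInfo`, `Kind` and `select` are hypothetical names, and the polymer contacts of each chain are assumed to be precomputed (a real implementation would derive them from coordinates). The sample chains are invented.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class NameSelectionSketch {
    enum Kind { POLYMER, LIGAND, WATER }

    /** Hypothetical chain summary: id, name, kind, and ids of polymer chains it contacts. */
    static final class ChainInfo {
        final String id;
        final String name;
        final Kind kind;
        final Set<String> contacts; // assumed precomputed
        ChainInfo(String id, String name, Kind kind, String... contacts) {
            this.id = id;
            this.name = name;
            this.kind = kind;
            this.contacts = new LinkedHashSet<>(Arrays.asList(contacts));
        }
    }

    /** Returns ids of the polymer chain(s) with the given name plus ligands contacting them. */
    static List<String> select(List<ChainInfo> chains, String name) {
        // Step 1: the polymer component(s) carrying the requested name.
        Set<String> polymerIds = new LinkedHashSet<>();
        for (ChainInfo c : chains) {
            if (c.kind == Kind.POLYMER && c.name.equals(name)) {
                polymerIds.add(c.id);
            }
        }
        // Step 2: ligands in contact with those polymers; waters are dropped entirely.
        List<String> selected = new ArrayList<>(polymerIds);
        for (ChainInfo c : chains) {
            if (c.kind == Kind.LIGAND && !Collections.disjoint(c.contacts, polymerIds)) {
                selected.add(c.id);
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<ChainInfo> chains = Arrays.asList(
                new ChainInfo("A", "A", Kind.POLYMER),
                new ChainInfo("B", "B", Kind.POLYMER),
                new ChainInfo("C", "A", Kind.LIGAND, "A"),
                new ChainInfo("D", "B", Kind.LIGAND, "A", "B"), // interface ligand
                new ChainInfo("E", "A", Kind.WATER, "A"));
        System.out.println(select(chains, "A")); // prints [A, C, D]
    }
}
```

Note that the interface ligand D is named B in this made-up file but is still pulled in with polymer A, which is where proximity and file annotation can disagree.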

I think we're going to define or reuse a more powerful selection language
soon, so we can break backwards compatibility at that point.

-S

On Mon, Oct 3, 2016 at 9:45 PM, Jose Duarte <jose.duarte at rcsb.org> wrote:

> Hi Spencer
>
> Some answers below
>
>
>> I have questions about the new chain id/name system post-#469.
>>
>> - Are all Chains guaranteed to have an ID defined? What about a name?
>>
>
> Yes, they should always have an id and a name.
>
>
>
>> - Are IDs set for structures loaded from PDB?
>>
>
> Yes. The ids don't exist in PDB files but the parser goes through the file
> assigning ids with the same rules and conventions as the mmCIF files use
> (unique ids for every distinct polymer/non-polymer molecule in the AU). The
> results will not always be perfect, e.g. for files without TER records the
> separation of non-polymer from polymer chains won't work properly, and then
> no distinct id would be assigned to them.
>
>
>
>> - Are IDs guaranteed to be unique within a Structure?
>>
>
> Yes, they are unique within the asymmetric unit (which is what a Structure
> is). If the Structure represents a bioassembly then they are still unique
> because the symmetry mates get new ids: <original chain id>_<operator id>
>
>
>
>> - Are all groups within a Chain object guaranteed to have the same ID?
>>
>
> Yes, that's exactly the new definition of Chain in biojava 5. All groups
> within a chain should have the same id (and also the same name).
>
> Chain <-> id is a 1:1 relationship, whilst Chain <-> name is a many:1
> relationship.
>
>
>
>> - How are cases where the id and name differ mapped to chain objects? Is
>> a new chain object created for every tuple (id,name) that has groups
>> defined in the file?
>>
>
> The id is the primary key of the Chains within a Structure as explained
> above. Name is only a secondary identifier that may or may not coincide
> with the id.
>
>
>
>> - Should the chain selection syntax (e.g. "4hhb.A") refer to ID or name?
>> Should it be specifically restricted to polymer chains (with ligands
>> automatically added from all chains based on proximity)?
>>
>
> This is something that can be discussed. The 2 possibilities:
>
> 1) It refers to name: most backwards compatible. A selection would then
> pull both polymers and the non-polymers (ligands) associated with them, as
> annotated in the file. In this option, adding ligands based on proximity
> would be confusing in my opinion.
>
> 2) It refers to id: not backwards compatible, but less ambiguous. A
> selection then refers strictly to a single molecule (be it polymer or
> non-polymer). We could then have an extra switch in the syntax to also pull
> non-polymers by proximity. Pulling by proximity will not always result in
> the same selection as what's annotated in the file by using names (e.g. you
> might pull symmetry mates from the next cell too). A disadvantage of this
> option is that some other databases (e.g. SCOP) use names and not ids, thus
> we'd need to convert between them.
>
> Jose
>
>