Semantic diff
Spec version: 1.0.0-draft (see Overview)
This document defines the semantic diff model: how two versions of a design data dataset are compared to produce a structured change report with awareness of renames, deprecations, and property-level changes.
Token identity
A token identity determines whether a token in the old dataset and a token in the new dataset refer to the same logical token.
NORMATIVE: Implementations MUST use the following matching rules, in order:
- UUID match — If a token in the old dataset and a token in the new dataset share the same
uuidvalue, they are the same token regardless of name. - Name-object equivalence — When a UUID match is not found for a token — because the old token, the new token, or both lack a
uuid, or because no counterpart with the matchinguuidexists in the other dataset — two tokens are the same if theirnameobjects are deeply equal (all fields present with identical values). - Replacement link — When passes 1 and 2 leave an old token unpaired and it carries a
replaced_byfield whose UUID matches an unpaired new token, the pair is established. This enables diff classification as renamed for tokens that were deprecated with a machine-readable replacement pointer.
NORMATIVE: UUID matching MUST take precedence over name-object equivalence, which MUST take precedence over replacement link matching. A UUID match always identifies the token pair, even if name objects differ (which constitutes a rename).
RATIONALE: UUID-based identity allows tokens to be renamed without breaking continuity tracking. Name-object equivalence is a fallback for legacy datasets that predate UUID adoption. Replacement link matching is a tertiary fallback for deprecated tokens that carry an explicit replaced_by UUID pointing to their successor.
Change taxonomy
A semantic diff classifies every token into exactly one of six categories:
| Category | Definition |
|---|---|
| renamed | Token exists in both datasets (matched by identity) but the name has changed. |
| deprecated | Token exists only in the new dataset (unmatched) and carries a deprecated field. |
| reverted | Token existed with deprecated in the old dataset and no longer carries it in the new dataset. |
| added | Token exists only in the new dataset and is not renamed, deprecated, or pre-existing. |
| deleted | Token exists only in the old dataset and is not the source of a rename. |
| updated | Token exists in both datasets (matched by identity) with the same name but changed properties. |
NORMATIVE: Every token that appears in the old dataset, the new dataset, or both MUST be classified into exactly one category. Categories are mutually exclusive.
Category partitioning
NORMATIVE: Implementations MUST resolve categories in the following order to ensure mutual exclusivity:
- Renamed — Identify all identity-matched pairs where the name has changed. These tokens are removed from further classification as added or deleted.
- Deprecated — Among remaining unmatched new tokens, identify those carrying a
deprecatedfield. These are removed from the "added" pool. - Reverted — Among identity-matched pairs, identify tokens where
deprecatedwas present in the old version and absent in the new. These are removed from the "updated" pool. - Added — Remaining unmatched new tokens that are not renamed, deprecated, or pre-existing in the old dataset.
- Deleted — Remaining unmatched old tokens that are not the source of a rename.
- Updated — Remaining identity-matched pairs with unchanged names but differing properties.
RATIONALE: This ordering mirrors the pipeline in existing tooling and ensures that a renamed token does not also appear as "added" + "deleted", a deprecated token does not appear as "added", and so forth. A matched token that newly gains a deprecated field is classified as updated — the deprecation surfaces as a property-level change in the updated sub-categories, not as a new-token deprecation.
Deprecation normalization
NORMATIVE: When a token uses the legacy sets structure and all set entries carry deprecated: true, the token MUST be treated as deprecated at the token level for diff classification purposes, even if the top-level token object does not carry deprecated.
RATIONALE: Set-level deprecation that covers all variants is semantically equivalent to token-level deprecation. Normalizing this prevents diff noise from implementation-level differences in where the deprecated flag is placed.
Property-level changes
For tokens classified as updated (or renamed with additional property changes), a semantic diff SHOULD produce property-level change records.
Change record format
Each property-level change is described by:
| Field | Type | Description |
|---|---|---|
path |
string | Dot-separated path from the token root (e.g. value, name.colorScheme, sets.light.value). |
new_value |
any | The value in the new dataset. Present for additions and updates. |
original_value |
any | The value in the old dataset. Present for deletions and updates. |
Property change sub-categories
| Sub-category | Condition |
|---|---|
| added-properties | Property exists in new but not in old. |
| deleted-properties | Property exists in old but not in new. |
| updated-properties | Property exists in both but with different values. |
NORMATIVE: Property comparison MUST be recursive: nested objects are traversed and changes reported at the leaf level with full dot-separated paths.
NORMATIVE: Property comparison MUST use deep equality for values. Two values are equal if their JSON serializations are identical.
Output ordering
RECOMMENDED: Diff output SHOULD be deterministic. Within each category, tokens SHOULD be sorted by their canonical name (or new name for renames) in lexicographic order.
RATIONALE: Deterministic output makes diff reports suitable for snapshot testing and human review.
Cross-format compatibility
NORMATIVE: A conforming diff engine MUST accept both legacy format (JSON object maps) and cascade format (JSON arrays with .tokens.json extension) as inputs for either the old or new dataset, including mixed-format comparisons (e.g. legacy old, cascade new).
RATIONALE: The diff operates on the token graph abstraction, which normalizes both formats into the same TokenRecord structure. This enables diffing across format migrations without special-case handling.