Goodhart Risk

Archive registry entry

Goodhart_risk measures the likelihood that a proxy becomes a target and then stops faithfully representing the underlying coherence condition it was meant to measure.

draftid: diagnostic-goodhart-risk
version: 0.1.0
updated: 2026-05-31

1) Diagnostic Identity

Diagnostic Name: Goodhart Risk

Short Name / Symbol: Goodhart_risk

Diagnostic Class: Proxy Failure / Metric Capture / Optimization Risk / Φ–O Divergence / Regime Diagnostic

Primary Function: Estimate the risk that a proxy, metric, target, reward signal, score, KPI, benchmark, classification, narrative, or optimization objective will detach from the real coherence it was meant to represent.

Primary Use: Determine whether the system is optimizing the measurement of success rather than the actual condition, function, repair, or coherence the measurement was intended to track.

Core Risk if Ignored: The system may improve metrics while degrading real coherence, creating pseudo-success, hidden debt, affected-node burden, gaming, legitimacy shock, and eventual collapse of trust in the metric system.

Core Risk if Overtrusted: If the diagnostic itself is overtrusted, metrics, benchmarks, summaries, standards, and quantitative proxies may be rejected too quickly, even when they remain useful, reality-linked, auditable, and appropriately bounded.


2) Mechanical Definition

Goodhart_risk measures the likelihood that a proxy becomes a target and then stops faithfully representing the underlying coherence condition it was meant to measure.

Goodhart_risk answers:

Are we optimizing the sign of success instead of the thing success was supposed to mean?

A proxy can be:

metric
benchmark
score
KPI
ranking
label
classification
dashboard
compliance indicator
engagement number
safety score
repair-complete status
legitimacy narrative
audit result
performance target

Goodhart risk rises when a system begins optimizing the proxy directly.

The core pattern:

proxy chosen to represent coherence
→ proxy becomes target
→ behavior adapts to improve proxy
→ proxy detaches from coherence
→ Φ rises while O falls

In UTS terms:

Φ↑ while O↓ or H↑ ⇒ Goodhart_risk ↑

Goodhart risk is not “metrics are bad.”

It means the metric must stay subordinate to reality contact, affected-node validation, auditability, repair outcomes, and coherence indicators.
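The trend rule above (Φ↑ while O↓ or H↑) can be sketched as a small check. This is an illustration only: the names `trend` and `goodhart_signal`, the use of a last-minus-first difference as the "trend", and the idea that Φ, O, and H are available as numeric series are all assumptions, not part of the specification, which leaves measurement to domain calibration.

```python
from typing import Sequence


def trend(series: Sequence[float]) -> float:
    """Crude trend: last value minus first value (an illustrative assumption)."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    return series[-1] - series[0]


def goodhart_signal(
    phi: Sequence[float], o: Sequence[float], h: Sequence[float]
) -> bool:
    """Return True when Φ rises while O falls or H rises."""
    return trend(phi) > 0 and (trend(o) < 0 or trend(h) > 0)
```

A True result is only a prompt to compare the proxy against reality contact; it is not a verdict that the metric should be abandoned.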


3) What the Diagnostic Measures

Direct Measurement Target

Goodhart_risk measures:

  • proxy-to-reality detachment
  • metric target capture
  • optimization pressure around Φ
  • proxy manipulation
  • proxy overuse
  • metric authority inflation
  • dashboard blindness
  • benchmark overfitting
  • compliance theater
  • repair theater
  • safety theater
  • legitimacy theater
  • selected metric narrowing
  • gaming incentive
  • displacement of real O by measurable Φ
  • whether success signals still track coherence
  • whether proxy improvement creates hidden debt

Indirect / Proxy Signals

Goodhart_risk can be estimated from:

  • metrics improving while affected-node cost rises
  • benchmark improvement without real-world improvement
  • repair-complete status while recurrence continues
  • compliance increase without boundary recovery
  • safety scores improving while stress failures persist
  • response-time metrics improving while quality falls
  • outputs increasing while meaning collapses
  • dashboard health while users report harm
  • teams optimizing what is counted
  • work shifting toward visible metrics
  • hidden labor increasing to maintain score
  • local adaptation being suppressed by standardized targets
  • metric definitions changing to preserve success
  • proxy performance improving under low-stress conditions only
  • narrative becoming tied to metric defense
  • dissenting evidence dismissed because “the numbers are good”

What It Does Not Measure

Goodhart_risk does not directly measure:

  • whether the metric is useless
  • whether measurement is incoherent
  • whether optimization is always bad
  • whether qualitative evidence is always superior
  • whether the system should stop tracking performance
  • whether all proxy improvement is fake
  • whether affected-node reports are automatically complete
  • whether metrics should never become targets
  • whether success cannot be measured
  • whether all standards create distortion

High Goodhart_risk means the proxy is likely becoming detached from the reality it represents.

It does not mean the metric should automatically be abandoned.

Low Goodhart_risk means the proxy remains reality-linked enough for its intended use.

It does not mean the metric can replace direct observation or repair validation.


4) Canonical State Variables Involved

Canonical state vector:

S = {O, H, ε, ι, Au, µᵢ, BΣ, K, R, Φ}

Primary Variables

  • Φ: Goodhart risk centers on fitness proxy distortion
  • O: the real coherence condition the proxy is supposed to track
  • H: hidden debt rises when proxy success hides real degradation
  • ι: inversion risk rises when pseudo-success is mistaken for coherence
  • Au: auditability is needed to test whether the proxy still maps to reality
  • R: restoration may be optimized as status rather than actual recovery

Secondary Variables

  • ε: visible errors may drop because they are hidden, reclassified, or displaced
  • µᵢ: integrity declines when claim, metric, action, and consequence diverge
  • BΣ: boundary costs can be hidden by proxy success
  • K: compatibility may be claimed from shared metrics while real coupling degrades

Variables Commonly Confused With Goodhart_risk

Variable / Diagnostic: Difference from Goodhart_risk

  • Φ − O: Actual proxy-coherence divergence; Goodhart_risk estimates risk of proxy capture and future divergence
  • narrative_metric_gap: Story/evidence divergence; Goodhart risk focuses on proxy optimization detaching from O
  • stress_divergence: Baseline/stress gap; Goodhart risk often appears when metrics fail under stress
  • pseudo_damping_risk: False settling; Goodhart risk may create pseudo-damping through metric recovery
  • affected_node_cost: Local burden; Goodhart risk often hides affected-node cost
  • FI_integrity: Feedback can correct the system; weak FI lets Goodhart drift persist
  • selection_traceability: Trace of why a metric/target was selected; needed to audit Goodhart risk
  • Metric use: Metrics can be healthy; risk arises when the metric becomes detached or over-authoritative

5) Localization Signature

Primary Legibility Layers

  • U4 — Classification / Metrics / Narratives: primary layer where proxies, scores, classifications, dashboards, and success stories form
  • U3 — Execution: where behavior changes to optimize the metric
  • U5 — Coordination / Time: where incentives, reporting cadence, targets, and review cycles shape optimization
  • U6 — Coherence Field: where proxy success either supports or distorts real coherence
  • U7 — Memory / Recurrence: where metric success becomes durable memory, precedent, or canon
  • U8 — Environment / Forcing: where stress reveals whether proxy success generalizes

Primary Leverage Layers

  • U4: recalibrate metric meaning, scope, and classification boundaries
  • U3: inspect behavior induced by the metric
  • U5: change reporting cadence, targets, incentives, and review loops
  • U6: verify coherence field effects beyond proxy performance
  • U7: correct metric-derived memory and success claims
  • U2: constrain harmful optimization incentives

Verification Layers

  • U4: does the metric still mean what it claims?
  • U3: what behavior is the metric causing?
  • U5: does cadence or target pressure distort reality?
  • U6: does O improve with Φ?
  • U7: does recurrence validate metric success?
  • U8: does metric success survive stress?

Common Mislocalizations

  • Treating metric improvement as coherence improvement
  • Treating compliance as repair
  • Treating benchmark success as real-world safety
  • Treating dashboard health as affected-node recovery
  • Treating low reported error as low harm
  • Treating high output as high value
  • Treating fast response as good response
  • Treating proxy failure as data failure only
  • Treating affected-node signal as anecdotal against metrics
  • Treating metric criticism as anti-accountability
  • Treating quantitative precision as truth
  • Treating standardization as coherence

6) Input Requirements

Required Inputs

To estimate Goodhart_risk, the system needs:

  • proxy, metric, benchmark, label, or target being evaluated
  • intended real-world referent
  • current Φ behavior
  • current O indicators
  • affected variables in S
  • optimization pressure around the proxy
  • how the proxy influences behavior
  • affected-node feedback
  • hidden debt indicators
  • stress behavior
  • recurrence data
  • metric lineage
  • selection rationale for the proxy
  • feedback pathways that can challenge the proxy
  • whether the proxy is used for consequence, reward, status, or closure

Optional Inputs

These improve precision:

  • metric history
  • benchmark design
  • gaming evidence
  • incentive map
  • dashboard data
  • field outcomes
  • audit reports
  • affected-node cost data
  • narrative_metric_gap
  • stress-test results
  • edge-case performance
  • public/private metric comparison
  • false positive / false negative analysis
  • alternative metric set
  • proxy retirement criteria
  • metric revision history
  • metric-to-repair linkage
  • external validation
  • recurrence after metric success

Missing Input Behavior

If Goodhart_risk inputs are missing:

  • If O indicators are missing, do not infer coherence from Φ
  • If affected-node feedback is missing, proxy success is under-validated
  • If metric lineage is missing, proxy meaning may be stale
  • If optimization pressure is unknown, Goodhart risk may be underestimated
  • If stress data is missing, proxy success is baseline-only
  • If hidden debt indicators are missing, metric success may hide H
  • If FI is weak, the proxy may be unfalsifiable
  • If consequence use is unknown, metric authority may be underestimated

Default missing-input posture:

treat proxy success as provisional → compare Φ to O/H/affected-node state → stress-test and audit incentive effects
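The missing-input rules above lend themselves to a lookup table. The sketch below is one hypothetical encoding: the key names and the function `caveats_for` are invented for illustration, and the "FI is weak" rule is left out because it describes a weak input rather than a missing one.

```python
from typing import Dict, Iterable, List

# Caveat text is copied from the rules above; the key names are assumptions.
MISSING_INPUT_CAVEATS: Dict[str, str] = {
    "o_indicators": "do not infer coherence from Φ",
    "affected_node_feedback": "proxy success is under-validated",
    "metric_lineage": "proxy meaning may be stale",
    "optimization_pressure": "Goodhart risk may be underestimated",
    "stress_data": "proxy success is baseline-only",
    "hidden_debt_indicators": "metric success may hide H",
    "consequence_use": "metric authority may be underestimated",
}


def caveats_for(available: Iterable[str]) -> List[str]:
    """List the caveats that apply given the inputs actually available."""
    have = set(available)
    return [
        f"{name}: {note}"
        for name, note in MISSING_INPUT_CAVEATS.items()
        if name not in have
    ]
```

Any non-empty result corresponds to the default posture: treat proxy success as provisional until the gap is closed.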

7) Diagnostic States / Ranges

These ranges are qualitative and should be domain-calibrated.

Healthy / Coherence-Supporting Range

Proxy remains useful, bounded, audited, and reality-linked.

Signals:

  • Φ tracks O reasonably well
  • metric scope is explicit
  • affected-node feedback supports metric interpretation
  • hidden debt does not rise under metric success
  • stress tests validate proxy meaning
  • feedback can challenge the metric
  • metric does not dominate all selection
  • incentives do not encourage gaming
  • recurrence declines when metric improves
  • U7 memory preserves proxy limits

Recommended posture:

continue metric use
preserve scope notes
monitor Φ−O
audit incentives
validate through recurrence and stress

Watch Range

Proxy is still useful but beginning to gain too much authority or lose context.

Signals:

  • metric becomes central in decisions
  • teams begin optimizing the number
  • affected-node feedback is mixed
  • metric improvement outpaces qualitative improvement
  • hidden debt is uncertain
  • stress behavior is not fully tested
  • narrative depends heavily on metric success
  • alternative evidence is underweighted
  • proxy scope is often forgotten

Recommended posture:

restate metric scope
add O and affected-node checks
review incentives
reduce metric monoculture
avoid metric-only closure

Degraded Range

Proxy is detaching from real coherence and shaping behavior toward metric success.

Signals:

  • Φ rises while O stagnates or falls
  • affected-node cost rises under metric improvement
  • recurrence continues after metric success
  • hidden debt accumulates
  • gaming or metric optimization appears
  • metric criticism is dismissed
  • benchmark success fails in reality
  • compliance improves but repair does not
  • local adaptation is suppressed by target
  • narrative defends the metric more than reality

Recommended posture:

activate Ξ
pause metric-based closure
audit Φ−O
repair metric design and incentives
restore affected-node validation

Contraindicated:

scaling from metric success
public certainty from proxy
repair-complete claims
punitive enforcement of target
canonizing the metric
automation based only on proxy

Critical / Collapse-Prone Range

Proxy has become an inversion engine; the system optimizes success signs while real coherence degrades.

Signals:

  • proxy success requires hiding or exporting cost
  • metric is immune to challenge
  • O is deteriorating while Φ remains high
  • affected nodes reject metric reality
  • official memory stores metric success as real success
  • hidden debt becomes active failure
  • stress reveals benchmark overfit
  • legitimacy shock follows metric exposure
  • system cannot abandon metric without destabilizing narrative
  • gaming becomes the real operating system

Recommended posture:

stop proxy-dependent actuation
preserve evidence
quarantine metric authority
rebuild O indicators
repair affected-node burden
correct U7 success memory
redesign or retire proxy

False Positive Risk

Goodhart_risk may appear high when:

  • metric improvement genuinely reflects O improvement
  • affected-node feedback has not yet caught up
  • early metric discipline is needed to stabilize chaos
  • temporary target focus supports repair
  • stress testing is pending but not failed
  • metric criticism reflects poor understanding of scope
  • proxy is one bounded input among many
  • metric appears central because it is currently the most auditable signal

False Negative Risk

Goodhart_risk may appear low when:

  • O is not measured
  • affected-node cost is hidden
  • proxy gaming is normalized
  • metric has strong legitimacy narrative
  • stress tests are too narrow
  • dissent has exited
  • metric scope is forgotten
  • hidden labor maintains metric success
  • recurrence window is too short
  • dashboard health masks boundary strain

8) Leading Indicators

Rising Goodhart_risk appears early as:

  • people ask “what counts?” more than “what helps?”
  • metric becomes the decision language
  • proxy scope notes disappear
  • teams optimize visible indicators
  • affected-node feedback is called anecdotal
  • edge cases are excluded because they hurt scores
  • performance improves while trust does not
  • compliance rises while repair stagnates
  • metric exceptions become common
  • local adaptations are discouraged
  • hidden labor increases to meet target
  • narrative becomes metric-defensive
  • recurrence is explained away despite score improvement
  • alternative measures are treated as threats

9) Lagging Indicators

Goodhart failure has already accumulated debt when:

  • metric success is exposed as false
  • benchmark performance fails in the real world
  • affected nodes reject official scores
  • hidden debt surfaces after long metric improvement
  • gaming becomes public
  • external audit contradicts dashboard
  • legitimacy shock occurs
  • system must abandon or redesign metric
  • memory correction is required
  • performance incentives are blamed for harm
  • real repair is delayed by metric defense
  • O must be rebuilt after Φ collapse

10) Interpretation Rules

How to Read Goodhart_risk

Goodhart_risk should be read as:

risk that proxy optimization is replacing reality-contact

It is not a rejection of measurement.

A system may have:

  • high Φ and high O — healthy proxy use
  • high Φ and low O — Goodhart pattern
  • low Φ and high O — metric mismatch
  • low Φ and low O — poor performance and poor coherence
  • high metric accuracy at low stress but low accuracy under stress
  • useful local metric that fails when scaled
  • metric that works until tied to reward or consequence
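The first four readings in the list above form a simple two-by-two grid. The sketch below encodes only that grid; the function name `read_phi_o` and the reduction of Φ and O to booleans are assumptions, since what counts as "high" has to be calibrated per domain.

```python
def read_phi_o(phi_high: bool, o_high: bool) -> str:
    """Map a (Φ, O) pair to the reading given in the list above."""
    if phi_high and o_high:
        return "healthy proxy use"
    if phi_high:
        return "Goodhart pattern"
    if o_high:
        return "metric mismatch"
    return "poor performance and poor coherence"
```

The remaining readings (stress-dependent accuracy, scale-dependent validity, consequence-dependent validity) need more than two inputs and are deliberately not covered here.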

What Changes Its Meaning

Goodhart_risk changes meaning under:

  • high Φ pressure
  • high narrative_metric_gap
  • weak FI_integrity
  • low Au_eff
  • high affected_node_cost
  • high pseudo_damping_risk
  • high stress_divergence
  • high recovery_asymmetry
  • low variance_preserved
  • high innovation_exit
  • low truth_tolerance
  • high immunity_index
  • low MS_symmetry_index
  • high automation
  • high consequence severity

Context Modifiers

High Φ pressure: metric becomes target.

High narrative gap: story may defend proxy success.

Weak FI: feedback cannot falsify metric.

Low Au_eff: proxy lineage cannot be audited.

High affected-node cost: metric may be exporting burden.

High pseudo-damping: metric recovery may be false calm.

High stress divergence: metric works only under baseline conditions.

Low variance preserved: metric may have narrowed adaptation.

High automation: proxy logic can scale rapidly.

Domain Calibration Notes

Goodhart_risk should be calibrated by domain:

  • in engineering: uptime metrics, incident counts, story points, test coverage, latency targets, deployment frequency
  • in AI: benchmark scores, safety labels, refusal rates, helpfulness ratings, eval pass rates, memory confidence
  • in institutions: case closure rates, satisfaction scores, compliance counts, productivity KPIs, audit scores
  • in governance: enforcement stats, service metrics, public approval, deficit targets, crime numbers, wait-time averages
  • in relationships: visible harmony, response frequency, apology count, conflict reduction, agreement language
  • in archives: page counts, canon count, glossary completion, link volume, formatting consistency, reader engagement

11) Operator Sequencing Implications

If Goodhart_risk Is Low

Allowed with ordinary gate checks:

  • Γ can use metric as one selection input
  • Π can constrain around proxy with scope notes
  • Τ can plan using metric trend
  • ℛ can use metric as repair evidence with validation
  • U7 can store metric outcomes with provenance
  • Δ can stress-test metric reliability
  • public reporting can include metric with limits

Recommended:

Φ signal → O/H/affected-node check → Γ bounded selection → U7 metric memory with scope

If Goodhart_risk Is High

Recommended:

pause proxy-based closure → compare Φ to O/H/affected-node cost → audit incentives → redesign or bound metric

Or:

reduce metric authority → restore direct feedback and reality-contact → retest under stress and recurrence

Avoid or delay:

  • scaling from metric success
  • repair-complete claims
  • public certainty
  • automation based only on proxy
  • punitive enforcement of metric
  • canonizing the target
  • suppressing metric criticism
  • selecting from metric-only evidence

Operators Indicated Under High Goodhart Risk

  • Ξ: detect proxy inversion
  • Au: audit metric lineage and incentive effects
  • FI: restore feedback that can falsify the proxy
  • Μ: reinterpret metric scope
  • Γ: reselect success criteria
  • Π: constrain metric authority
  • ℛ: repair burden caused by proxy optimization
  • Θ: damp certainty in metric success

Operators Contraindicated Under High Goodhart Risk

  • Γ hard selection from proxy: selects distorted target
  • Π irreversible metric constraint: encodes proxy failure
  • ⊗ deep coupling around shared metric: propagates Goodhart dynamics
  • ⊕ composition: embeds proxy into identity/canon
  • Τ acceleration: scales metric distortion
  • Σ escalation: sacralizes proxy
  • ✕ force: enforces proxy at cost of O

12) Gate Implications

Gates Strengthened By Reliable Goodhart_risk

  • Au-Actuation: metric lineage and scope are traceable
  • FI-Gate: feedback can falsify proxy success
  • High Risk Gate: blocks high-risk binding from proxy-only evidence
  • MS-Gate: checks who benefits or carries cost under metric optimization
  • ☷ᵢ: ensures metrics do not override principle constraints

Gates Weakened If Goodhart_risk Is Poorly Known

If Goodhart risk is unknown:

  • Au may trace metric but not meaning
  • FI may not challenge proxy success
  • High Risk Gate may bind classifications from metric-only evidence
  • MS may miss affected-node burden
  • ☷ᵢ may be reduced to measurable compliance
  • Π may overconstrain toward the target
  • Γ may select the option that improves Φ but harms O
  • ℛ may repair the dashboard instead of the system

Gate Outcomes Affected

High Goodhart_risk should push gates toward:

  • Pause metric-based closure
  • Require Φ/O comparison
  • Require affected-node validation
  • Require hidden-debt audit
  • Require incentive audit
  • Require stress test
  • Deny proxy-only claims
  • Deny automated consequence from proxy alone
  • for high-impact actuation based primarily on a metric that may be detached from O

13) Scaling Behavior

Goodhart_risk becomes more dangerous under scale because proxies become standardized, automated, rewarded, and defended.

As systems scale:

  • metrics gain authority
  • incentives align around measurable targets
  • local nuance is compressed
  • edge cases are excluded
  • gaming becomes systematic
  • dashboards replace direct observation
  • affected-node signal is filtered
  • proxy success becomes narrative legitimacy
  • metric definitions harden
  • automation propagates proxy logic
  • hidden labor supports metrics
  • metric criticism becomes costly
  • Φ becomes identity or canon
  • metric drift becomes difficult to reverse

Scaling Risks

  • metric monoculture
  • proxy inversion
  • benchmark overfitting
  • compliance theater
  • safety theater
  • repair theater
  • affected-node cost export
  • local adaptation loss
  • innovation exit
  • hidden debt accumulation
  • legitimacy shock
  • automation of proxy error
  • false success memory
  • metric immunity
  • O collapse under Φ success

Scaling Requirements

To scale metrics safely, systems need:

  • metric lineage
  • scope notes
  • O indicators
  • hidden debt indicators
  • affected-node validation
  • stress tests
  • feedback correction
  • anti-gaming audits
  • incentive audits
  • metric diversity
  • qualitative review
  • edge-case inclusion
  • proxy retirement rules
  • metric revision pathways
  • public/private metric comparison
  • U7 memory of metric limits

Scaling Rule

Proxy authority must scale only with evidence that the proxy continues to track O under stress, recurrence, and affected-node validation.

Sanity constraint:

Φ authority ↑ without O validation ⇒ Goodhart_risk ↑

If a proxy gains authority without direct coherence validation, risk rises.

Second constraint:

Φ↑ + H↑ ⇒ pseudo-success risk ↑

If the metric improves while hidden debt increases, success is likely false.

Third constraint:

shared_metric + high Φ−O ⇒ systemic Goodhart risk ↑

If many nodes share a detached proxy, distortion can propagate system-wide.
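The three constraints above can be read as independent flags. The following is a minimal sketch under stated assumptions: the `ProxyObservation` record, its boolean fields, and the function `scaling_flags` are hypothetical, and each field stands in for a judgment that would need domain-calibrated evidence.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProxyObservation:
    """Hypothetical summary of one proxy's current situation."""
    phi_authority_rising: bool  # the proxy is gaining decision authority
    o_validated: bool           # O has been validated directly
    phi_rising: bool            # the metric itself is improving
    h_rising: bool              # hidden debt is increasing
    shared_metric: bool         # many nodes share this proxy
    phi_o_gap_high: bool        # Φ−O divergence is high


def scaling_flags(obs: ProxyObservation) -> List[str]:
    """Return the constraint consequences triggered by an observation."""
    flags: List[str] = []
    if obs.phi_authority_rising and not obs.o_validated:
        flags.append("Goodhart_risk ↑")
    if obs.phi_rising and obs.h_rising:
        flags.append("pseudo-success risk ↑")
    if obs.shared_metric and obs.phi_o_gap_high:
        flags.append("systemic Goodhart risk ↑")
    return flags
```

Keeping the three checks separate mirrors the text: any one of them can fire without the others.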


14) Interaction / Coupling Behavior

Goodhart_risk reveals whether a coupling is organized around real coherence or shared proxy performance.

What It Reveals About Coupling

  • whether nodes coordinate around metric rather than reality
  • whether one node’s burden funds another’s score
  • whether shared metrics hide local harm
  • whether feedback can challenge proxy alignment
  • whether compatibility is measured or experienced
  • whether repair is done or merely counted
  • whether affected-node cost is excluded
  • whether shared targets propagate distortion

What It Reveals About Boundary Integrity

Metrics can cross boundaries faster than meaning.

When Goodhart_risk is high:

  • local boundaries may be overrun by metric goals
  • refusal may be treated as noncompliance
  • affected-node cost may be ignored
  • BΣ may degrade under target pressure
  • boundary repair may be counted without landing
  • metric authority may override consent or fit

What It Reveals About Compatibility

Compatibility requires proxy humility.

A coupling may be unsafe if:

the shared metric improves only because one node absorbs hidden cost

or:

the relation looks compatible on the dashboard but not in lived operation

Healthy compatibility uses metrics as signals, not sovereign truth.

Relevant Interface Acts

  • ↺ Reflection: compare metric story to lived effect
  • ⇩ Relaxation: reduce target pressure
  • ⊘ Attenuation: reduce coupling around a distorted metric
  • ⊙ Alignment: clarify what the metric is and is not
  • →? Invitation: invite affected-node validation
  • ⚕︎ Restorative Override: requires post-action Φ/O audit
  • ✕ Force: high risk when used to enforce proxy compliance

15) Failure Modes Detected

Primary Failure Modes

Goodhart_risk detects or predicts:

  • proxy inversion
  • metric capture
  • dashboard blindness
  • benchmark overfitting
  • compliance theater
  • repair theater
  • safety theater
  • legitimacy theater
  • affected-node cost export
  • hidden labor growth
  • local adaptation suppression
  • innovation exit
  • metric immunity
  • false success memory
  • proxy-based classification error
  • O collapse under Φ success
  • proxy-driven hidden debt

Composite Regimes Where Goodhart_risk Matters

  • Goodhart Collapse: direct regime
  • Pseudo-Coherent Basin: metric success stabilizes hidden debt
  • Repair Theater: repair metric replaces repair
  • Mission Lock: metric preserves trajectory
  • Taboo Lock: metric cannot be questioned
  • Extraction Regime: metric success hides exported cost
  • Coercive Fusion: one node is forced to serve another’s score
  • Crisis Loop: metric recovery hides recurring origin failure
  • LOS: latent operations maintain formal metric success

16) Accountability & Reintegration Implications

If Goodhart_risk Was Ignored

Likely consequences:

  • metrics improved while coherence degraded
  • affected nodes carried hidden cost
  • repair was counted but not completed
  • hidden labor increased
  • local adaptation was suppressed
  • innovation exited
  • official memory stored false success
  • legitimacy shock followed exposure
  • selected path optimized Φ over O
  • system had to rebuild trust in measurement

Accountability questions:

  • What was the proxy supposed to measure?
  • When did it become the target?
  • Did O improve with Φ?
  • Did H rise under metric success?
  • Who carried cost of improving the metric?
  • Was affected-node feedback included?
  • Did stress tests validate the metric?
  • Was the metric gamed?
  • Did repair land or only score improve?
  • Did the metric become immune to challenge?
  • Was U7 memory corrected after metric failure?

If Goodhart_risk Was Misread

Possible misread forms:

  • useful metric treated as corrupt
  • legitimate target discipline mistaken for proxy capture
  • early improvement dismissed before validation
  • qualitative discomfort treated as superior to data by default
  • metric revision mistaken for manipulation
  • bounded proxy use mistaken for totalizing proxy use
  • failing metric mistaken for failing reality
  • high O / low Φ state misread because metric is outdated
  • metric criticism used to avoid accountability

Required Restoration

When Goodhart failure is found:

identify intended referent
→ compare Φ to O/H/affected-node cost
→ audit incentives and gaming
→ reduce proxy authority
→ redesign metric set
→ repair hidden burden
→ correct U7 success memory
→ validate under stress and recurrence

If proxy optimization burdened some nodes more than others, MS-Gate should review who gained score, who carried cost, and who received repair.
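The restoration sequence above is ordered, so a tracker can simply surface the earliest step not yet completed. This sketch assumes steps are recorded by their exact text; `RESTORATION_STEPS` and `next_step` are illustrative names, not part of the specification.

```python
from typing import List, Optional, Set

# Step text copied from the restoration sequence above, in order.
RESTORATION_STEPS: List[str] = [
    "identify intended referent",
    "compare Φ to O/H/affected-node cost",
    "audit incentives and gaming",
    "reduce proxy authority",
    "redesign metric set",
    "repair hidden burden",
    "correct U7 success memory",
    "validate under stress and recurrence",
]


def next_step(done: Set[str]) -> Optional[str]:
    """Return the earliest step not yet done, or None when all are done."""
    for step in RESTORATION_STEPS:
        if step not in done:
            return step
    return None
```

Returning the earliest open step, rather than the first step after the latest completed one, keeps a skipped step visible even when later steps have already been marked done.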


17) Cross-Domain Examples

Technical / Engineering

A team optimizes deployment frequency. Releases increase, but incidents, rework, and user disruption also increase.

Diagnostic implication: deployment count became a proxy target detached from real delivery coherence.

Operator sequence: Φ/O audit → quality and incident metrics → affected-user validation → release process repair.


Institutional / Governance

A department optimizes case closure rate. Cases close faster, but unresolved harm and repeat complaints rise.

Diagnostic implication: closure metric replaced real remedy.

Operator sequence: recurrence audit → affected-node validation → closure criteria redesign → repair backlog review.


AI / Algorithmic

A model improves benchmark scores but performs worse on messy real user contexts.

Diagnostic implication: benchmark proxy became overfit.

Operator sequence: stress eval expansion → user-context validation → metric set redesign → U7 eval memory correction.


Interaction / Relational

A relationship uses “we are not fighting anymore” as the success metric, but truth and boundary repair are suppressed.

Diagnostic implication: low conflict became a proxy for repair.

Operator sequence: pseudo-damping review → truth tolerance repair → boundary validation → recurrence check.


Archive / Framework Design

The archive tracks the number of completed spec sheets, but glossary consistency and cross-link quality lag.

Diagnostic implication: completion count is outpacing real archive coherence.

Operator sequence: Φ/O archive audit → glossary/cross-link repair → status criteria revision → U7 version update.


18) Test Protocols

1. Referent Test

What reality is the metric supposed to represent?

Failure signal: no one can name the real referent.


2. Φ/O Test

Does proxy improvement correspond to coherence improvement?

Failure signal: Φ rises while O stagnates or falls.


3. Hidden Debt Test

Does H increase under metric success?

Failure signal: success creates deferred cost.


4. Affected-Node Test

Do affected nodes experience the metric improvement as real improvement?

Failure signal: dashboard improves while affected nodes worsen.


5. Incentive Test

What behavior does the metric reward?

Failure signal: rewarded behavior differs from coherent behavior.


6. Gaming Test

Can the metric be improved without improving reality?

Failure signal: easy gaming path exists.


7. Stress Test

Does the metric hold under stress or only benchmark conditions?

Failure signal: proxy fails under real load or edge cases.


8. Recurrence Test

Does recurrence decline when the metric improves?

Failure signal: metric improves but same issue returns.


9. Scope Test

Is the metric being used beyond its valid range?

Failure signal: local proxy becomes global truth.


10. Correction Test

Can feedback challenge the metric?

Failure signal: metric becomes immune to contradiction.
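The ten protocols above can be kept as a checklist that reports which failure signals were observed and which tests have not been run. The encoding below is a sketch: the `review` function and the convention of recording each result as True (passed), False (failure signal seen), or None / absent (not run) are assumptions made for illustration.

```python
from typing import Dict, List, Optional, Tuple

# (test name, failure signal) pairs copied from the protocols above.
PROTOCOLS: List[Tuple[str, str]] = [
    ("Referent", "no one can name the real referent"),
    ("Φ/O", "Φ rises while O stagnates or falls"),
    ("Hidden Debt", "success creates deferred cost"),
    ("Affected-Node", "dashboard improves while affected nodes worsen"),
    ("Incentive", "rewarded behavior differs from coherent behavior"),
    ("Gaming", "easy gaming path exists"),
    ("Stress", "proxy fails under real load or edge cases"),
    ("Recurrence", "metric improves but same issue returns"),
    ("Scope", "local proxy becomes global truth"),
    ("Correction", "metric becomes immune to contradiction"),
]


def review(results: Dict[str, Optional[bool]]) -> Tuple[List[str], List[str]]:
    """Split the protocols into observed failures and tests not yet run."""
    failures: List[str] = []
    not_run: List[str] = []
    for name, signal in PROTOCOLS:
        outcome = results.get(name)
        if outcome is None:
            not_run.append(name)  # unknown is reported, not assumed to pass
        elif not outcome:
            failures.append(f"{name}: {signal}")
    return failures, not_run
```

Reporting unrun tests separately follows the missing-input posture earlier in this entry: an untested proxy is provisional, not cleared.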


19) Anti-Patterns

  • Metric as reality
  • Score as coherence
  • Compliance as repair
  • Closure rate as restoration
  • Benchmark as safety
  • Low complaints as satisfaction
  • Low conflict as trust
  • Output as value
  • Speed as quality
  • Precision as truth
  • Dashboard as affected-node state
  • Metric improvement as legitimacy
  • Proxy criticism as anti-accountability
  • Gaming as efficiency
  • Hidden labor as productivity
  • Local adaptation as metric violation
  • Edge case as nuisance
  • Metric immunity
  • Public score as memory
  • Φ success as O success

20) Spec Validation Check

  • Is this truly a diagnostic, not an operator? Yes.
  • Does it measure state, capacity, risk, or response rather than act directly? Yes.
  • Does it map to S? Yes.
  • Are U-layers specified? Yes.
  • Are leading and lagging indicators separated? Yes.
  • Are interpretation risks defined? Yes.
  • Are operator sequencing implications clear? Yes.
  • Are gate implications clear? Yes.
  • Are scaling risks included? Yes.
  • Are interaction implications included? Yes.
  • Does it avoid new primitives? Yes.

Condensed Archive Summary

Goodhart_risk is the diagnostic estimate of whether a proxy, metric, target, score, benchmark, classification, dashboard, or optimization objective is becoming detached from the real coherence condition it was meant to represent. It does not reject metrics; it checks whether Φ still tracks O. High Goodhart_risk indicates risk of proxy inversion, metric capture, benchmark overfitting, dashboard blindness, compliance theater, repair theater, affected-node cost export, hidden labor growth, local adaptation suppression, innovation exit, metric immunity, false success memory, and O collapse under Φ success. Under high Goodhart risk, the system should pause proxy-based closure, compare Φ to O/H/affected-node state, audit incentives and gaming, reduce proxy authority, restore FI/Au, redesign metric sets, repair hidden burden, correct U7 success memory, and validate under stress and recurrence before scaling, automation, public certainty, or high-impact action.