Why Some Work Must Remain Unscored
Scoring trades context for comparability. For some work, that context is the whole point—and a score does not measure it so much as change what it is allowed to be.
Key takeaways
- Every score trades context for comparability; that trade is ruinous when the context is the point.
- Assessing work means judging it; scoring it means ranking it—these are not the same act.
- Uniform scoring looks fair while quietly penalising work done under constraint.
- At scale, scoring stops evaluating work and starts governing it, especially when the scorers sit outside the domain.
- Mature organisations choose what to leave unscored and hold it to account another way.
Scoring Is Not a Neutral Act
A security team is handed two findings. The first carries a CVSS score of 9.8—critical, red, top of the queue. The second scores 6.1—medium, amber, somewhere below the fold. The team works the 9.8 first, because that is what the number instructs. What the number does not say is that the 9.8 sits on an isolated internal system with no route to anything that matters, while the 6.1 sits on an internet-facing service holding customer records. The real priority is the reverse of the reported one. Nothing about CVSS is broken; it did exactly what it was built to do. The error was treating its output as a judgment about risk rather than as one input into that judgment.
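The scenario above can be sketched in a few lines. This is a minimal illustration, not a prioritisation method: the `Finding` class, its `internet_facing` and `holds_customer_records` fields, and the two finding names are hypothetical, invented here only to show what a sort on base score alone cannot see.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """Hypothetical record: a base score plus the context the score omits."""
    name: str
    cvss_base: float
    internet_facing: bool
    holds_customer_records: bool


findings = [
    Finding("isolated-internal-host", 9.8,
            internet_facing=False, holds_customer_records=False),
    Finding("public-customer-service", 6.1,
            internet_facing=True, holds_customer_records=True),
]

# Queue order as the number instructs: highest base score first.
by_score = sorted(findings, key=lambda f: f.cvss_base, reverse=True)

# Findings whose context suggests the queue position deserves a human look.
# This flags items for review; it does not produce a replacement ranking.
needs_review = [
    f for f in by_score if f.internet_facing and f.holds_customer_records
]

score_order = [f.name for f in by_score]
review_names = [f.name for f in needs_review]
print(score_order)
print(review_names)
```

The sort puts the isolated host first, and only the context fields reveal that the second item is the one a reviewer should look at. The flag is deliberately not a new score: it hands the question back to a person.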
Scoring is a powerful abstraction. It compresses complexity into comparability, which is precisely why organisations reach for it: a number can be ranked, a rank can be queued, and a queue can be managed at a scale no deliberation could match. Within limits this is genuinely useful—CVSS helps triage thousands of vulnerabilities, a risk matrix gives a board a shared shorthand, an OKR turns a vague ambition into something trackable. The trouble starts when the abstraction is mistaken for the thing itself, and scoring is treated as a neutral act. It is not. Every score is a trade: context surrendered in exchange for comparability. That trade is fine when the context being lost is cheap. It is ruinous when the context is the whole point.
Every scoring system encodes a theory of what matters, whether or not anyone wrote it down. CVSS decides that a vulnerability’s intrinsic characteristics matter more than its setting, unless someone deliberately applies the environmental metrics most teams skip. A five-by-five risk matrix decides that likelihood and impact can be reduced to ordinal bands and multiplied—an operation risk specialists have long criticised as mathematically incoherent. An OKR graded at 0.7 decides that a particular slice of measurable activity stands in for progress. None of these assumptions is unreasonable; all of them are choices, and all of them privilege what can be measured consistently over what has to be understood in place.
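The objection to multiplying ordinal bands can be made concrete with a small sketch. The band edges, the two example risks, and the figures attached to them are assumptions chosen for illustration; no real matrix or standard is being reproduced. Expected loss appears only as one possible quantitative reading of the same inputs, not as the correct answer.

```python
import bisect
from typing import Sequence

# Hypothetical band edges: a value above an edge moves up one band (1..5).
LIKELIHOOD_EDGES = [0.01, 0.05, 0.20, 0.50]               # annual probability
IMPACT_EDGES = [10_000, 100_000, 1_000_000, 10_000_000]   # loss, currency units


def band(value: float, edges: Sequence[float]) -> int:
    """Map a quantity to an ordinal band from 1 to 5."""
    return bisect.bisect_left(edges, value) + 1


def matrix_score(probability: float, loss: float) -> int:
    """Five-by-five matrix cell value: likelihood band times impact band."""
    return band(probability, LIKELIHOOD_EDGES) * band(loss, IMPACT_EDGES)


# Two hypothetical risks as (annual probability, loss if it happens).
risk_a = (0.21, 1_100_000)   # just over the edges of two bands
risk_b = (0.45, 900_000)     # just under the edges of two bands

score_a = matrix_score(*risk_a)
score_b = matrix_score(*risk_b)

expected_a = risk_a[0] * risk_a[1]
expected_b = risk_b[0] * risk_b[1]

matrix_prefers_a = score_a > score_b
expected_loss_prefers_b = expected_b > expected_a
print(score_a, score_b, matrix_prefers_a, expected_loss_prefers_b)
```

Under these assumed edges the matrix ranks risk A above risk B, while the underlying figures point the other way. Banding discards the distance between a value and its band edge, and multiplying band numbers treats ordinal labels as if they were quantities.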
This is where a particular kind of work runs into trouble. Some work is not waiting to be scored in order to become complete—it is complete precisely because it resists reduction. An architectural assessment, a security threat model, a risk finding written for a regulator, a record of why a design decision was made and what was traded away: these exist to establish what is sound, what remains uncertain, and who is accountable. Their value lives in the context a score is built to discard. To assess such work is to exercise judgment on it; to score it is something narrower and lossier—to rank it against other things on a single scale. The first is essential. The second, applied here, quietly destroys what it claims to measure.
And scoring destroys it not by force but by accommodation. Faced with a scoring regime, work adapts to survive it. A threat model is trimmed to the findings a scanner recognises. A nuanced risk is rounded to the nearest band so it fits the heat map. A design rationale is rewritten to emphasise whatever the rubric rewards. None of this is dishonest; all of it is rational, given what gets seen and counted. But over time the work is reshaped in the image of the metric, and what cannot be represented numerically (the caveat, the judgment call, the deliberately unresolved question) begins to disappear, not because it stopped mattering but because it no longer has anywhere to go. Some work must remain unscored because scoring does not measure it. Scoring changes what the work is allowed to be.
What Scoring Erases
Scoring systems are superb at comparison and poor at responsibility. The moment work is translated into a number or a rank, nuance is compressed and context is stripped away—and that is an acceptable loss only when the thing being measured can afford to lose its context. A great deal of technical work cannot. What scoring erases first is intent: it treats outputs as interchangeable units rather than as situated decisions made by someone for a reason. It collapses difference into variance, discarding the why behind a distinction and keeping only its magnitude. Under that compression, careful restraint and plain underperformance look identical, because the thing that separates them—the reasoning—is exactly what the score throws away.
In security, risk, and governance work, this erasure has teeth. A CVSS base score erases reachability and blast radius. A security rating from a firm like BitSight or SecurityScorecard, assembled from externally observable signals, erases everything happening inside the perimeter that those signals cannot see—compensating controls, network segmentation, the actual sensitivity of what sits behind a flagged port. A risk matrix erases the difference between a well-characterised risk and an educated guess, because both arrive as the same coloured cell. In each case the number is confident and the context is gone, and confidence without context is precisely the condition in which expensive mistakes get made with conviction.
Once erased, these qualities are hard to put back, and the attempt to recover them does its own damage. Work gets rewritten to reintroduce whatever the score demands. Caveats swell into paragraphs of justification. The language turns defensive. The assessment starts arguing for itself instead of standing on its evidence, and in doing so it changes posture—from a piece of work that takes responsibility for a judgment to a piece of work that is lobbying for a grade. That shift, from accountability to persuasion, is subtle and corrosive. An assessment busy defending its score has stopped doing the thing it existed to do.
This is the heart of why some work cannot be meaningfully scored: the act of scoring removes the very qualities that gave it value. You can assess it—judge it, weigh it, hold its author to account for it—without ever reducing it to a number. The reduction is not a more rigorous form of assessment. It is a different and lesser thing wearing assessment’s clothes.
The Illusion of Fairness
Scoring is most often defended in the language of fairness. Apply the same criteria to everyone, with the same instrument, and the playing field is level—or so the argument runs. The appeal is real, and the impartiality is real too, as far as it goes. But it goes only as far as the assumption underneath it: that everything entering the system is alike enough to be measured the same way. Where that assumption fails, uniformity stops producing fairness and starts manufacturing bias—bias that is harder to challenge precisely because it looks so even-handed.
Uniform scoring advantages work that fits the instrument and penalises work that does not, regardless of underlying quality. A vulnerability programme run on raw CVSS scores will pour effort into high-scoring findings on low-value assets while a cleverly chained set of medium findings on a critical system waits its turn, because the instrument rewards the score and not the scenario. A team operating under genuine regulatory or security constraints, forced to move carefully, document heavily, and refuse shortcuts, will score worse on a velocity-style metric than a team with nothing to lose, and the metric will read that constraint as underperformance. The score is not measuring quality. It is measuring fit with the instrument.
What has happened is that fairness has quietly shifted from substantive to procedural. Everyone is treated equally; no one is treated appropriately. The system can point to its consistency as proof of its justice while producing outcomes that are systematically skewed against exactly the work that most warrants care—the work done under constraint, with precision, for stakes that do not fit the rubric. For anything grounded in resilience, compliance, or long-term stewardship, this is not a marginal distortion. It is a standing tax on doing the difficult thing properly.
Real fairness, in evaluation, cannot be achieved by ignoring difference; it requires recognising it. Sometimes that means scoring different work differently. Sometimes it means accepting that two things are not comparable and declining to rank them at all. The mark of a fair system is not that it scores everything the same way. It is that it knows when not to compare.
When Scoring Becomes Governance
At a certain scale, scoring systems stop merely evaluating work and begin governing it. The transition is rarely declared, but its effects are everywhere once you look. When a score decides what gets funded, what gets attention, and what gets escalated, it is no longer describing reality—it is shaping it. Work that scores poorly comes to seem less legitimate regardless of its actual role, and over time the scorecard stops reflecting the organisation’s priorities and starts dictating them.
This is most acute when the people doing the scoring sit outside the domain being scored. External cybersecurity ratings are the clearest case: firms such as BitSight and SecurityScorecard assign an organisation a grade from outside the firewall, and those grades increasingly gate real decisions—whether a supplier clears third-party risk review, what a cyber-insurance premium costs, whether a deal proceeds. Security is then being governed, in part, by criteria set by a third party working from incomplete information, with little visibility into the context that would change the verdict. Application portfolio scoring does something similar from the inside: a model like Gartner’s TIME framework sorts systems into tolerate, invest, migrate, or eliminate, and a low score can mark a system for decommissioning before anyone has asked whether the score understood what the system actually does.
Governance by score has a particular character: it bypasses deliberation. It replaces the question “why does this matter?” with the question “did this pass?”—and those are not the same question. A threshold can tell you whether a number cleared a line. It cannot tell you whether the line was in the right place, whether the thing being measured was the thing that mattered, or whether the cost of compliance exceeded the risk it addressed. Accountability narrows to compliance; judgment is replaced by a cut-off; and the system loses the ability to explain itself, because explanation was never something a score was built to provide.
When evaluation hardens into governance like this, the useful question is no longer how to improve the score. It is whether the score should be governing at all—and, for some work, the honest answer is that it should not.
Choosing What Not to Measure
Mature organisations are defined as much by what they refuse to measure as by what they track. This is not a failure of capability or a shortage of instruments; it is discernment—the recognition that some forms of value degrade the moment they are abstracted, and that the responsible move is to leave them intact rather than force them onto a scale that will distort them. Restraint, here, is the sophisticated position, not the lazy one. It is harder to defend a deliberate gap in the metrics than to fill it with a number, which is exactly why the gap tends to need defending.
Choosing not to score is not the same as choosing not to account. The accountability does not disappear; it moves. Instead of a rank, it rests on evidence, peer review, transparency of reasoning, and a durable record of what was decided and why. A security threat model is held to account by review from people who can follow its logic, not by a single composite score. A regulatory finding is answerable through its evidence trail, not its position in a league table. This is more demanding than scoring, because it cannot be delegated to an instrument—it requires people who understand the domain and are willing to put their judgment on the line. That is the cost of keeping the work intact, and for work that matters it is worth paying.
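One way to picture accountability that rests on a record rather than a rank is a structure with no score field at all. The sketch below is hypothetical throughout: the `AssessmentRecord` class, its field names, the `is_answerable` check, and the example values are illustrative assumptions, not a reference to any existing tool or standard.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AssessmentRecord:
    """Hypothetical record of a judgment; deliberately has no score or rank."""
    author: str
    subject: str
    decision: str
    rationale: str
    evidence: List[str] = field(default_factory=list)
    open_questions: List[str] = field(default_factory=list)
    reviewers: List[str] = field(default_factory=list)

    def is_answerable(self) -> bool:
        """True when someone owns the judgment and others can follow it."""
        return bool(
            self.author and self.rationale and self.evidence and self.reviewers
        )


reviewed = AssessmentRecord(
    author="a.example",
    subject="threat model for a payments service",
    decision="accept residual risk on one legacy endpoint",
    rationale="endpoint is segmented and scheduled for retirement",
    evidence=["network diagram", "retirement plan"],
    open_questions=["does segmentation hold during failover?"],
    reviewers=["b.example"],
)

unreviewed = AssessmentRecord(
    author="a.example",
    subject="threat model for a payments service",
    decision="accept residual risk on one legacy endpoint",
    rationale="endpoint is segmented and scheduled for retirement",
    evidence=["network diagram"],
)

print(reviewed.is_answerable(), unreviewed.is_answerable())
```

The check asks whether a judgment has an owner, a rationale, evidence, and reviewers. It says nothing about how the work compares with anything else, and the `open_questions` field gives the unresolved question somewhere to live.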
As evaluative systems grow more capable and more pervasive—as more of the estate becomes legible to scanners, ratings, and dashboards—this capacity for restraint becomes more important, not less. The easier it gets to score everything, the more tempting it becomes to score things that should not be scored, and the more valuable the discipline to stop. Without it, scoring regimes overreach by default, and the organisation slowly confuses reliability with legibility: it comes to trust what it can measure and to neglect what it cannot, which is a poor map of where its real risks live.
Some work must remain unscored because its purpose was never to be better than the alternatives. Its purpose is to be dependable, precise, and answerable over time—to be trusted, not ranked. Recognising this is not resistance to evaluation; it is what makes serious evaluation possible at all, because a system that scores everything has lost the ability to tell the difference between what should be compared and what should simply be held to account. Knowing which is which is the whole discipline.
What cannot be reduced still needs to be held.