Stochastic Solutions

Errors of Interpretation I:
Errors of Formulation

Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.

— John Tukey, The Future of Data Analysis,
The Annals of Mathematical Statistics, 33 (1), pp. 1–67, 1962.

Errors of Formulation

An error of formulation is a mistake in the approach to solving an analytical problem. Such errors are not always clear-cut: only the simplest problems have unambiguously correct and incorrect approaches. But the difference between formulations of a problem is often the difference between a useful analysis and one that is substandard or even harmful.

The TDDA Book has various checklists that can help with avoiding errors of interpretation when formulating an analysis, notably the Errors of Interpretation (EOI) checklist.

Here are some examples of how we have helped produce better formulations of problems for different clients, at Stochastic Solutions and in previous companies.

Uplift Modelling: Modelling the Change in Outcome, rather than the Outcome

While at Quadstone, we saw many clients trying to optimize retention activity by modelling customers' propensity to leave, then targeting interventions at the customers most at risk, or at the high-value customers most at risk. This apparently reasonable approach turns out to be not merely suboptimal but often harmful: it frequently drives some customers away, and sometimes drives away more customers than it saves.

The reason is that customers likely to leave usually have good reasons: poor service, overcharging, lack of affinity with the provider etc. Attempts to woo them, particularly with aggressive interventions, often trigger the very attrition and comparison shopping they are designed to avoid.

We developed Uplift Modelling, which models the difference in outcome between a group treated with the intervention under consideration and a control group. The binary version of the outcome is shown below (the “Fundamental Campaign Segmentation”), highlighting the true target segment (the Persuadables, who stay if treated but leave otherwise), the Sleeping Dogs (who are driven to go by the retention activity) and the two groups who are unaffected by the retention action.

A 2x2 “Boston Box” showing the Fundamental Campaign Segmentation. The columns show whether a customer will leave if not treated: no on the left, yes on the right. The rows show whether a customer will leave if treated: no at the bottom, yes at the top. Bottom left (NO, NO; do not leave in either case) is labelled ‘SURE THINGS’ (no colour). Top right (YES, YES; leave in either case) is labelled ‘LOST CAUSES’. Bottom right (leave if not treated but stay if treated) is labelled ‘PERSUADABLES’ and coloured green. Top left (stay if not treated; leave if treated) is labelled ‘SLEEPING DOGS’ and coloured red.

This better formulation can do more than increase ROI on retention activity: it can turn that ROI from negative to positive.
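As a minimal sketch of the idea (not the method used at Quadstone or Stochastic Solutions), uplift for a segment can be estimated as the retention rate among treated customers minus the retention rate among controls. The segment names, the counts and the helper `make_group` are hypothetical, invented purely for illustration.

```python
from collections import defaultdict


def uplift_by_segment(records):
    """Return {segment: treated retention rate - control retention rate}.

    Each record is (segment, treated, retained), with treated and retained
    as booleans. Segments lacking treated or control customers are skipped.
    """
    # counts[segment][treated] = [number retained, number of customers]
    counts = defaultdict(lambda: {True: [0, 0], False: [0, 0]})
    for segment, treated, retained in records:
        cell = counts[segment][treated]
        cell[0] += int(retained)
        cell[1] += 1

    uplift = {}
    for segment, cells in counts.items():
        (kept_t, n_t), (kept_c, n_c) = cells[True], cells[False]
        if n_t and n_c:
            uplift[segment] = kept_t / n_t - kept_c / n_c
    return uplift


def make_group(segment, treated, kept, total):
    """Hypothetical helper: `total` customers, of whom `kept` are retained."""
    return [(segment, treated, i < kept) for i in range(total)]


# Made-up data: the high-risk segment retains 60% untreated but 40% treated.
records = (
    make_group("high_risk", True, 4, 10)
    + make_group("high_risk", False, 6, 10)
    + make_group("medium_risk", True, 9, 10)
    + make_group("medium_risk", False, 7, 10)
)
uplift = uplift_by_segment(records)
for segment, value in uplift.items():
    print(f"{segment}: {value:+.2f}")
```

In this made-up data the segment with the highest propensity to leave has negative uplift, so a model of propensity to leave would target exactly the customers the intervention tends to drive away.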

See the page on Uplift Modelling for more details.

Pareto Optimization: Surfacing Trade-offs rather than Disguising them with a single Numeraire

Traditional optimization approaches assume that all solutions to a problem can be ranked from best to worst using a single quality measure. In reality, solutions perform differently across several performance measures that are not directly comparable. For example, different sources of power generation have different costs, planetary impacts, availabilities, resilience, and ramp times. Similarly, different transport solutions have different costs, speeds, variances, planetary impacts, and implications for independence.
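One way to surface such trade-offs, sketched here under simplifying assumptions, is to keep every option that no other option beats on all measures at once (the Pareto front) rather than ranking by a single score. The option names and figures below are hypothetical, and every measure is assumed to be "lower is better".

```python
def dominates(a, b):
    """True if a is no worse than b on every measure and better on at least one.

    Measures are tuples in the same order; lower values are better.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(options):
    """Return the names of options that no other option dominates."""
    return [
        name
        for name, scores in options.items()
        if not any(dominates(other, scores) for other in options.values())
    ]


# Hypothetical options scored as (cost, emissions, ramp time), arbitrary units.
options = {
    "option_a": (50, 10, 5),
    "option_b": (40, 30, 1),
    "option_c": (60, 30, 6),  # worse than or equal to option_a on everything
    "option_d": (45, 5, 8),
}
front = pareto_front(options)
print(front)
```

Only `option_c` is discarded here. Choosing among the remaining three means weighing cost against emissions and ramp time, and that choice stays visible instead of being hidden inside conversion factors.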

The traditional approach to handling this is to convert all the performance measures to a single numeraire or scale, invariably money. Thus, a price is put on pollution, on a human life, on a health impact, on a journey time, and so forth; this is the favoured approach of cost-benefit analysis. There are several problems with this:

Some examples of how this multicriterion thinking has been used by us are:

Clustering Considered Harmful

Cluster analysis is a powerful method for finding clumps in multidimensional data when these exist, but is a weak method compared to supervised learning if there is a goal beyond simply producing some segmentation of a dataset.

Just as non-commensurate quality measures lead to a false appearance of objectivity in optimization, performing cluster analysis on non-commensurate dimensions such as income, marital status, geography, and house price leads to different groups according to how the variables are scaled, normalized, and converted. It is easy to fix a scale in a pseudo-scientific way (such as using z-scores), but this solves only the superficial technical problem (How do we get these variables onto a common scale?) rather than the underlying problem that the variables are non-commensurate and cannot, in general, meaningfully be made equivalent.
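The scale dependence is easy to see in a toy case. Distance-based clustering groups points that are close together, so the sketch below simply asks which of two customers is nearest to a third, first with income in pounds and then with the same income in thousands of pounds. The customers and their values are hypothetical.

```python
import math


def nearest(target, others):
    """Return the name of the point in `others` closest to `target` (Euclidean)."""
    return min(others, key=lambda name: math.dist(target, others[name]))


# Hypothetical customers as (income, age), with income in pounds.
p_pounds = (30_000, 25)
others_pounds = {"Q": (31_000, 60), "R": (35_000, 26)}

# The same customers with income expressed in thousands of pounds.
p_thousands = (p_pounds[0] / 1000, p_pounds[1])
others_thousands = {
    name: (income / 1000, age) for name, (income, age) in others_pounds.items()
}

nearest_in_pounds = nearest(p_pounds, others_pounds)
nearest_in_thousands = nearest(p_thousands, others_thousands)
print(nearest_in_pounds, nearest_in_thousands)
```

With income in pounds, the 35-year age gap is negligible and Q is the nearest neighbour. With income in thousands, age dominates and R is nearest. Nothing about the customers has changed, only the unit, and z-scores are just one more such choice of unit.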

If there is a specific business goal for a segmentation, a supervised learning method such as a decision tree will create segments directly related to that business goal, by setting an appropriate target variable, whereas clustering will simply create clusters based on the scale factors chosen. We have worked with multiple B2C organizations that built segmentations with clustering that were ineffective, and helped them move to more useful segmentations built by hand or with supervised methods.
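To illustrate how a target variable drives a supervised segmentation, here is a deliberately simplified sketch: a single split (a "decision stump") chosen to maximize the gap in mean target between its two sides. Real decision-tree software typically uses impurity measures and many splits, so this stands in for the idea only. The variable names and data are hypothetical.

```python
def best_threshold(values, targets):
    """Return (threshold, gap) for the split `value <= threshold` whose two
    sides differ most in mean target. Returns (None, 0.0) if no split helps.
    """
    pairs = sorted(zip(values, targets))
    best = (None, 0.0)
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between two equal values
        left = [t for _, t in pairs[:i]]
        right = [t for _, t in pairs[i:]]
        gap = abs(sum(left) / len(left) - sum(right) / len(right))
        if gap > best[1]:
            # Place the threshold midway between the neighbouring values.
            best = ((pairs[i - 1][0] + pairs[i][0]) / 2, gap)
    return best


# Hypothetical data: tenure in months and whether each customer responded.
tenure_months = [1, 3, 5, 20, 30, 40]
responded = [1, 1, 1, 0, 0, 0]

threshold, gap = best_threshold(tenure_months, responded)
print(threshold, gap)
```

The split lands where the target changes, whatever units tenure is measured in. Changing the target variable changes the segments, which is the sense in which the segmentation is tied to the business goal.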

Read more about why we consider clustering harmful on the Scientific Marketer blog or in the book.

Work With Stochastic Solutions

We can’t guarantee we will never misinterpret anything, or produce a bad formulation, but we have a track record of deep and successful thinking about better problem formulations and a set of tools, experiences, war stories and checklists to help make these sorts of problems less likely. Try us.

Company number SC329851. Registered office: 16 Summerside Street, Edinburgh, EH6 4NU.
Copyright © Stochastic Solutions Limited 2007–2026.