Test-driven data analysis (TDDA) is an approach to improving the correctness and robustness of analytical processes by transferring the ideas of test-driven development from the arena of software development to the domain of data analysis, extending and adjusting them where appropriate.
TDDA is primarily a methodology that can be implemented in many different ways, but good tool support can facilitate and drive its uptake. Stochastic Solutions provides an open-source (MIT-licensed) Python module, tdda, for this purpose.
Reference Tests. Reproducible research emphasises the need to capture executable analytical processes and their inputs so that others can reproduce and verify them. Reference tests build on these ideas by also capturing expected outputs and a verification procedure (a “diff” tool) for validating that the output is as expected. The tdda Python module supports reference testing through comparisons of complex outputs, with the ability to exclude parts that are expected to vary, and through regeneration of verified reference outputs.
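As a minimal sketch of how this looks with the unittest-flavoured API in tdda.referencetest (the analytical function, reference file name and data directory here are illustrative assumptions, not part of the module):

```python
from tdda.referencetest import ReferenceTestCase


def generate_report():
    # Stand-in for the real analytical process under test.
    return 'Region,Revenue\nNorth,1000\nSouth,1250\n'


class TestAnalysis(ReferenceTestCase):
    def test_report_output(self):
        actual = generate_report()
        # Compare the generated output against a previously verified
        # reference file; differences are reported diff-style.
        self.assertStringCorrect(actual, 'expected_report.csv')


# Directory in which the reference files are stored (assumed layout).
TestAnalysis.set_default_data_location('testdata')

if __name__ == '__main__':
    ReferenceTestCase.main()
```

When a legitimate change to the output has been verified, the reference files can be regenerated by re-running the tests in the module's write mode (e.g. with --write-all) rather than being edited by hand.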
Constraint Discovery & Verification. There are often things we know should be true of input, output and intermediate datasets, that can be expressed as constraints—allowed ranges of values, uniqueness and existence constraints, allowability of nulls etc. The Python tdda module not only verifies constraints, but generates them from example datasets, thus significantly reducing the effort needed to capture and maintain constraints as processes are used and evolve. Constraints can be thought of as (unit) tests for data.
Getting data analysis right is hard. In addition to all the ordinary problems of software development, with data analysis we often face other challenges, including:
- poorly specified analytical goals
- problematical input data: poor specification, missing values, incorrect linkage, outliers, data corruption
- the possibility of misapplying methods
- problems with interpreting input data and results
- changes in the distributions of inputs, invalidating previous analytical choices.