Testing Data & Data Processes with AI & Python
Half-day Training • Edinburgh • 20th March 2019
Location: BMA Scotland, 14 Queen Street, Edinburgh, EH2 1LL, Scotland.
Tickets: £25 + VAT
DataFest 2019 brings together local and international talent,
industry, academia and enthusiasts who all share at least one interest
— data! With a desire across sectors to succeed at Data Driven
Innovation, how can we be sure that our data — our raw material — is
as good as it should be?
This training brings the ideas and benefits of test-driven development
to the arena of data analysis. Using the open-source Python TDDA
library (test-driven data analysis), we'll work with data in CSV files,
Pandas DataFrames, and relational databases.
Part 1: Testing Data Processes and Pipelines
Introduction to reference tests and how these can be written for
various kinds of analytical processes over different data
types. Topics will include:
- Motivation for and introduction to testing
- Special considerations for testing analytical software and processes
- Testing and regenerating complex and partially variable outputs, and supporting diff tools.
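As a rough illustration of the reference-test idea for partially variable output, the sketch below compares a freshly generated report against a stored reference while skipping a line that is expected to change on every run. It uses only the Python standard library and is not the tdda API; the names make_report, normalise, VARIABLE_LINE and REFERENCE are hypothetical and exist only for this example.

```python
import re
import unittest

# Hypothetical pattern for a line that legitimately differs between runs
# (here, a timestamp header). Assumption made for this sketch only.
VARIABLE_LINE = re.compile(r"^Generated at: ")


def normalise(text, ignore=VARIABLE_LINE):
    """Return the lines of text, dropping lines expected to vary between runs."""
    return [line for line in text.splitlines() if not ignore.match(line)]


def make_report(values, timestamp):
    """Toy analytical process: summarise a list of numbers as text."""
    return "\n".join([
        f"Generated at: {timestamp}",
        f"count: {len(values)}",
        f"mean: {sum(values) / len(values):.2f}",
    ])


# Stored reference output, checked once by hand and then kept with the tests.
REFERENCE = """Generated at: 2019-01-01 00:00:00
count: 4
mean: 2.50"""


class TestReport(unittest.TestCase):
    def test_report_matches_reference(self):
        actual = make_report([1, 2, 3, 4], "2019-03-20 14:00:00")
        # Only the stable lines are compared; the timestamp line is ignored.
        self.assertEqual(normalise(actual), normalise(REFERENCE))
```

Saved in a file, a test like this could be run with python -m unittest. When the process changes intentionally, the reference text is regenerated and reviewed rather than edited by hand.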
Part 2: Generating Constraints from Data with AI and Using Them to Detect Bad Data
Using constraints to verify data, including:
- identification of unexpected changes, outliers, duplicates, missing and disallowed values
- advanced string verification, including automatic generation of regular expressions to characterise patterns in text data using rexpy.
Crucially, we will show not only how constraints can be used to detect
change and problems in data, but also how those constraints can be
automatically generated using AI methods in the tdda library.
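To make the idea of generated constraints concrete, here is a minimal sketch in plain Python that infers a few simple constraints from known-good records and then checks new records against them. It is a simplified stand-in, not the tdda library's own discovery or verification code; the functions discover and verify, the constraint names, and the sample records are all assumptions made for illustration.

```python
def discover(rows):
    """Infer simple per-field constraints from example records (a list of dicts)."""
    constraints = {}
    fields = sorted({key for row in rows for key in row})
    for field in fields:
        values = [row.get(field) for row in rows]
        present = [v for v in values if v is not None]
        c = {
            # Nulls are allowed only if the example data contained some.
            "allow_null": len(present) < len(values),
            # Uniqueness is required only if the example data was unique.
            "no_duplicates": len(set(present)) == len(present),
        }
        if present and all(isinstance(v, (int, float)) for v in present):
            c["min"], c["max"] = min(present), max(present)
        constraints[field] = c
    return constraints


def verify(rows, constraints):
    """Return (row_index, field, problem) for every violated constraint."""
    failures = []
    seen = {field: set() for field in constraints}
    for i, row in enumerate(rows):
        for field, c in constraints.items():
            v = row.get(field)
            if v is None:
                if not c["allow_null"]:
                    failures.append((i, field, "missing value"))
                continue
            if "min" in c:
                if not isinstance(v, (int, float)):
                    failures.append((i, field, "wrong type"))
                elif not c["min"] <= v <= c["max"]:
                    failures.append((i, field, "out of range"))
            if c["no_duplicates"]:
                if v in seen[field]:
                    failures.append((i, field, "duplicate"))
                seen[field].add(v)
    return failures


# Hypothetical known-good data used to generate the constraints.
good = [{"id": 1, "age": 34}, {"id": 2, "age": 51}, {"id": 3, "age": 29}]
constraints = discover(good)

# Hypothetical new data containing a duplicate, an outlier and a missing value.
new = [{"id": 1, "age": 40}, {"id": 1, "age": 130}, {"id": 2, "age": None}]
problems = verify(new, constraints)
```

Constraints inferred from a small sample tend to be too tight (here, ages would have to be unique and lie between 29 and 51), so in practice generated constraints are usually reviewed and relaxed before being relied on.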
The methods and tools are applicable to structured data and data
pipelines using any software, not just Python.
WHO IS THIS TRAINING FOR?
The course is primarily aimed at practising data scientists with some
familiarity with Python, or programmers coming to data
science. Previous experience of testing and Pandas will be
advantageous but is not required.
Although the specific library used
is written in Python, the data testing is almost entirely language-neutral,
and even data processes written in other languages can be tested
from within a Python test script.
Non-programmers with an interest in
QA for data and data processes will also benefit from some of the
overview material, and are welcome to attend, but may need more help
with the hands-on parts of the course.
PREREQUISITES
It is essential that attendees bring a laptop (Mac, Linux or Windows)
with a working Python environment that has Pandas and NumPy installed,
as well as the TDDA library (tdda; available with pip from PyPI, and in
source form on GitHub).
Detailed instructions on system configuration will be supplied to
registered attendees before the session, as well as instructions on
how to test the installation.
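Ahead of those instructions, a quick informal check is to confirm that the three packages can be found by the Python interpreter you plan to use. The sketch below is only an assumption about what a basic check could look like, not the official installation test; the helper name missing_packages is hypothetical.

```python
import importlib.util

# Import names of the packages listed in the prerequisites.
REQUIRED = ("pandas", "numpy", "tdda")


def missing_packages(names=REQUIRED):
    """Return the names that cannot be located in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_packages()
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages were found.")
```

This only confirms that the packages can be located, not that they work correctly, so the supplied installation test should still be run.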
Help will be available at the venue in the 30 minutes prior to the
start of the workshop (from 13:30) for anyone unable to configure
their environment.