When and How to Refactor Data Products

Removing accidental complexity from Machine Learning infrastructures
Posted by Dante Pawlow

on March 27, 2024 · 6 mins read

Introduction

Data Products are typically evolved rather than designed: they are the result of exploratory experimentation by Data Scientists that eventually leads to a breakthrough. Much like with evolved organisms, this process produces vestigial structures, redundancies, inefficiencies and weird looking appendages.

While this organic growth is the natural output of the exploratory phase, the selection pressures acting on your Data Product change suddenly when it’s deployed into a production environment. This shift in environment must be accompanied by a redesign so the product can withstand large volumes of data, or it will inevitably collapse.

In this article, we explore when (and if) refactors are necessary and which are the best strategies to approach them.

Symptoms

Embarking on a sizable refactor can be a daunting task (or even a fool’s errand), so it’s not recommended unless it’s absolutely necessary. The first thing you need to do is evaluate the symptoms your system is experiencing and take stock of your tech debt.

The usual symptoms a Data Product experiences are:

  • Lack of visibility into the process
  • Difficulty in implementing new features
  • Slow experiment iteration rate
  • Multiple and disjointed data sources
  • Multiple unrecoverable failure points, which produce dirty writes
  • Unpredictable, untrustworthy and non-replicable results
  • Slow data processing
  • Low visibility of the data as it goes through the pipeline

Some of these symptoms by themselves have relatively easy fixes, but having several of them simultaneously points to a larger problem with the design: accidental complexity.

Accidental Complexity is a concept coined by Frederick P. Brooks, Jr.1, referring to the unnecessary complexity a piece of software accrues that does not help solve the problem at hand.

"From the complexity comes the difficulty of communication among team members, which leads to product flaws, cost overruns, schedule delays. 
From the complexity comes the difficulty of enumerating, much less understanding, all the possible states of the program, and from that comes the unreliability. 
From the complexity of the functions comes the difficulty of invoking those functions, which makes programs hard to use.
From complexity of structure comes the difficulty of extending programs to new functions without creating side effects."

This accidental complexity is the cause of a feeling all too familiar to programmers: the inability to reason and communicate about a system. It impedes the mental representation needed to conceptualize distributed systems and leads to fixation on smaller, easier-to-grasp problems, such as implementation details.

"Not only technical problems but management problems as well come from the complexity. This complexity makes overview hard, thus impeding conceptual integrity. It makes it hard to find and control all the loose ends. It creates the tremendous learning and understanding burden that makes personnel turnover a disaster."

This also creates a vicious cycle: because the software is difficult to understand, changes are layered on top of it rather than integrated into it, which further increases complexity. A refactor is needed to eliminate accidental complexity, and it should redesign the system in such a way as to minimize complexity creep in the future.

Strategy

Refactoring a piece of software that’s being actively maintained can be tricky, so we recommend the following strategy.

Understanding

The first step is to perform a materialist analysis on the system:

What does it purport to do versus what does it actually do?

This question is deceptively difficult to answer, given everything we have said about accidental complexity, but following it through will make plain the core contradictions within our system, from which all the other symptoms originate.

To start unravelling this problem, we need to look into the inputs and outputs of our system and work from there. It’s a tedious but useful exercise that will reveal the paths data takes through our system: where they cross, where they tangle, and which ones never meet (a minimal sketch of this kind of inventory appears below).


Multiple and disjointed inputs and outputs tend to signify accidental complexity
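
A practical way to start this exercise is to write down an explicit inventory of what each step reads and writes. Below is a minimal sketch of the idea; the step and table names are purely illustrative, not taken from any real project:

```python
# Hypothetical inventory of what each pipeline step reads and writes.
pipeline_io = {
    "ingest_sales":   {"inputs": ["crm_export.csv", "erp_dump.parquet"], "outputs": ["raw_sales"]},
    "ingest_events":  {"inputs": ["clickstream_logs"], "outputs": ["raw_events"]},
    "build_features": {"inputs": ["raw_sales"], "outputs": ["features"]},
    "train_model":    {"inputs": ["features", "labels"], "outputs": ["model_artifact"]},
}

produced = {out for step in pipeline_io.values() for out in step["outputs"]}
consumed = {inp for step in pipeline_io.values() for inp in step["inputs"]}

# Outputs that nobody reads and inputs that nothing produces are good
# starting points for spotting vestigial or disjointed data paths.
print("produced but never consumed:", produced - consumed)
print("consumed but not produced by any step:", consumed - produced)
```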

Another good way to understand the ins and outs of the system and its pain points is to try to implement a small change and wrestle with the ripple effects: is the change limited to a single function or class, or does it have multiple effects downstream? This is especially useful if you’re not familiar with the project.

Without proper interfaces, a simple change can have cascading effects

Incremental Improvements

We have to keep in mind that this is a production system that still needs to be maintained, so we can’t simply start the refactor from scratch in a different repository and let the previous version rot away while we take a few months to finish the new one. We therefore need to strategically pick which parts to refactor first, in incremental iterations, keeping as much of the original code as we can salvage.

A good place to start is to consolidate all inputs into a “build dataset” module, and all the different outputs into a single module at the end. This will naturally provide starting and ending points for the flow of data, even though there’ll still be much plumbing to be done in the middle (a minimal sketch follows below).

Well defined, incremental refactoring allows for new development in between improvements
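
As a concrete example of that first increment, a consolidated “build dataset” entry point might look something like the sketch below. The function name, paths and join keys are placeholders; the point is that downstream code depends on this single module instead of reading sources ad hoc:

```python
import pandas as pd

def build_dataset(sales_path: str, events_path: str) -> pd.DataFrame:
    """Single entry point that consolidates every raw input into one dataset.

    Paths and join keys are hypothetical; the idea is that all source-specific
    reading and cleanup happens here, not scattered across the pipeline.
    """
    sales = pd.read_parquet(sales_path)
    events = pd.read_parquet(events_path)

    # Source-specific cleanup, centralized in one place.
    sales = sales.rename(columns=str.lower).drop_duplicates(subset="order_id")
    events = events.rename(columns=str.lower)

    return sales.merge(events, on="customer_id", how="left")
```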

Solutions

Now that we have a good understanding of the system and an incremental strategy to follow, we can start to implement our solutions.

Untangle

There’s a tendency in data products toward modules and functions with multiple responsibilities, stemming from the fact that exploratory code evolves by adding modules wherever they are needed, without emphasis on an overall design.

It’s useful, at this stage, to think about the three basic operations in Data Engineering: Extract, Transform, Load. The first step in untangling highly coupled modules is to break them down into (and group them by) these three basic operations.


Simply reorganizing the order of operations can enhance understanding

We can approach this process with a divide and conquer strategy: start at the function and class level and slowly work our way up to reorganizing the entire design in this manner. As we progress, we’ll start to see opportunities to remove duplicated or pointless work.
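
As a minimal, hypothetical illustration of this regrouping, the sketch below splits work that a single exploratory function might have done in one pass into the three basic operations. Column names and paths are assumptions for the example:

```python
import pandas as pd

def extract(orders_path: str) -> pd.DataFrame:
    # Extract: only reads raw data, no business logic.
    return pd.read_parquet(orders_path)

def transform(orders: pd.DataFrame) -> pd.DataFrame:
    # Transform: a pure function over dataframes, easy to unit test in isolation.
    orders = orders.dropna(subset=["amount"])
    orders["amount_usd"] = orders["amount"] * orders["fx_rate"]
    return orders.groupby("customer_id", as_index=False)["amount_usd"].sum()

def load(result: pd.DataFrame, output_path: str) -> None:
    # Load: only persists results, knows nothing about how they were computed.
    result.to_parquet(output_path, index=False)

# The pipeline becomes an explicit composition instead of one tangled function:
# load(transform(extract("orders.parquet")), "revenue_by_customer.parquet")
```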

Modularize

Now that our code is more organized, we can begin to conceptually separate it into modules. Modules should follow the single responsibility principle and be as clear and concise as possible about the work they perform, which will allow us to reuse them when scaling.

Data products usually grow piece by piece, driven by the demands of the business. This leads to modules that are closely linked to business logic, making them too specific or giving them too many responsibilities. Either way, this renders them hard to reuse and, consequently, increases accidental complexity and reduces scalability.

With this in mind, we should feel free to reorganize the existing modules into new, cleaner ones, freeing ourselves from the existing artificial restrictions in the process. We’ll likely end up with smaller classes and functions than before, and they will be more reusable (the sketch below illustrates the contrast).

Having modules does not necessarily a good modularization make
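
To make the contrast concrete, here is a small, hypothetical sketch of the same work before and after modularization; the column names and the business case are invented for the example:

```python
import pandas as pd

# Too specific: tied to a single business case, hard to reuse anywhere else.
def q4_retail_revenue_report(df: pd.DataFrame) -> pd.DataFrame:
    df = df[(df["quarter"] == "Q4") & (df["segment"] == "retail")]
    return df.groupby("store_id", as_index=False)["revenue"].sum()

# Single responsibility and parameterized: the business logic becomes configuration.
def aggregate(df: pd.DataFrame, filters: dict, by: str, metric: str) -> pd.DataFrame:
    for column, value in filters.items():
        df = df[df[column] == value]
    return df.groupby(by, as_index=False)[metric].sum()

# The original report is now just one call among many possible ones:
# report = aggregate(df, {"quarter": "Q4", "segment": "retail"}, by="store_id", metric="revenue")
```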

The process of modularization will also enhance, through abstraction, the way we think about our code. We can now rearrange whole modules without risk, and it’s even likely that we’ll find that previously tightly coupled processes can be completely separated into independent tasks.

Design Interfaces

Once we have every module conceptually separated, we can start planning how to separate the pipeline into orchestrated smaller tasks. To achieve this, we need to define interfaces for each task and strictly follow them.

In data pipelines, we like to think of the data itself as the interface between tasks, with both sides following the strict structure and constraints we impose on that data. With this in mind, we usually define intermediary tables as interfaces between tasks, and structured dataframes as interfaces between the modules of a task (a minimal sketch follows the list of benefits below).

Independent tasks need to persist data somewhere

There are several benefits of having intermediary tables as strong interfaces:

  • They enforce a schema between modules.
  • They provide checkpoints between processing steps, in case there’s a need to retry.
  • They provide visibility into the pipeline and can be used to monitor data quality.
  • They reduce resource consumption by reusing already-built datasets.
  • They provide a dataset and output history to reproduce experiments or train new models.
  • Tasks become plug-and-play as long as they adhere to the interface.

Having tables as interfaces unlocks a lot of potential in the design
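
The sketch below shows what such an interface can look like in practice. The table location, column names and dtypes are assumptions for the example, and a dedicated validation library could be used instead of the hand-rolled check:

```python
import pandas as pd

# The contract for the intermediary "features" table: column names and dtypes.
FEATURES_SCHEMA = {"customer_id": "int64", "total_spend": "float64", "n_orders": "int64"}

def validate(df: pd.DataFrame) -> None:
    # Fail fast if the data does not respect the agreed-upon schema.
    missing = set(FEATURES_SCHEMA) - set(df.columns)
    if missing:
        raise ValueError(f"intermediary table is missing columns: {missing}")
    for column, dtype in FEATURES_SCHEMA.items():
        if str(df[column].dtype) != dtype:
            raise TypeError(f"{column} should be {dtype}, got {df[column].dtype}")

def write_features(features: pd.DataFrame, path: str) -> None:
    # The producing task persists its output as the intermediary table.
    validate(features)
    features.to_parquet(path, index=False)

def read_features(path: str) -> pd.DataFrame:
    # The consuming task re-validates, so either side can change independently
    # as long as both keep respecting the contract.
    features = pd.read_parquet(path)
    validate(features)
    return features
```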

Optimize

Optimization in Big Data systems rarely comes from getting the most performance out of any given line of code but rather from avoiding duplicate work, choosing the right tools for each task and minimizing data transfer over the network.

This can only be fully achieved after applying the previous solutions, as the modularity and understanding gained will allow us to rearrange the flow of data to our advantage, and having tables as interfaces gives us an array of tools to choose from when interacting with the data (a minimal example appears below).

Example of a refactored design showcasing different technologies for each task
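
For instance, because intermediary tables are usually stored in a columnar format, a consuming task can pull only the columns and rows it needs rather than moving the whole dataset over the network. A minimal sketch, assuming a parquet intermediary table with hypothetical column names:

```python
import pandas as pd

# Read only what this task needs from the intermediary table:
# column pruning plus predicate pushdown with the pyarrow engine.
features = pd.read_parquet(
    "intermediary/features.parquet",          # hypothetical location
    engine="pyarrow",
    columns=["customer_id", "total_spend"],   # only the columns this task uses
    filters=[("country", "=", "AR")],         # only the rows this task uses
)
```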

Conclusions

Refactoring a Data Product is not an easy task, but it can be greatly rewarding when done correctly. Ideally, however, these design considerations should be taken into account before deploying the Data Science MVP into production, as they are easier to implement before that stage.

Footnotes

  1. Frederick P. Brooks, Jr., "No Silver Bullet -- Essence and Accident in Software Engineering"