Harnessing Gen AI's Power for Robust Production Systems

Setting up a Framework for GenAI

Posted by Pablo Lorenzatto

on November 27, 2023 · 4 mins read

Introduction

By now, you've probably noticed a considerable buzz around GenAI. It might seem like the possibilities are endless, and that's because they are. GenAI is paving the way for a wide range of applications.

We don't call ourselves #DataNerds for nothing, which is why over the past year, we've been nerding out, experimenting with different aspects of GenAI, each yielding varying levels of success.

Our main takeaway? Most GenAI capabilities can't be readily employed in client-facing or production-ready one-shot applications. Their outputs aren't quite there yet; they can be unpredictable, error-prone, yield invalid results, or fail to precisely meet desired criteria.

But don’t worry! There are a number of strategies to produce valuable results ready for production environments.

So Gen AI Isn't Perfect (just yet): What Can We Do About It?

Picture an application in which the model must produce outputs at scale—images, document summaries, and answers to customer questions via a chatbot UI. It's safe to assume that, no matter what we do, some samples will be unusable for our intended goal.

So, is this a showstopper? And what can we do about it?

Well, it depends, mostly on the impact of a faulty output that goes unnoticed. The greater the impact of the error, the more effort we should put into reducing its frequency.

Here’s an example: if we are using a bot for social media PR, a “bot blunder” might severely affect the company’s reputation. In this context, one would consider adding checks, carried out by humans or by other means, to reach the targeted error level. The drawback? This approach increases costs.

So is this a showstopper? To ensure an application is viable, both the frequency of errors and their associated costs should be within a reasonable limit.

Decoding GenAI Success: Defining Optimal Acceptance Rates

Great, so the frequency of errors and associated costs should be within a limit. How do we define that limit? What are our acceptance criteria? As with any other engineering problem, the first step is understanding what the solution should look like.

Our solution is to understand the acceptance rate of the system’s output as a continuum that enables different use cases. When it comes to implementing GenAI-based features, we identify two distinct thresholds of acceptance rates:

Opt-in: This is the minimum acceptance rate required to open the feature to users, allowing them to either accept or reject it. If the acceptance rate falls below this threshold, the results may be considered unusable.
Opt-out: At this level, the number of erroneous results is low enough that the feature should be enabled by default, and users who do not wish to use it should proactively disable it.
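To make the two thresholds concrete, here is a minimal sketch of how a measured acceptance rate could be mapped to a rollout mode. The function name `rollout_mode`, the "not ready" label, and the threshold values in the usage line are assumptions made for illustration; the right values depend on the application.

```python
def rollout_mode(acceptance_rate: float, opt_in: float, opt_out: float) -> str:
    """Map a measured acceptance rate (0..1) to a rollout mode.

    opt_in and opt_out are the two thresholds described above;
    this helper and its labels are illustrative assumptions.
    """
    if not 0.0 <= opt_in <= opt_out <= 1.0:
        raise ValueError("expected 0 <= opt_in <= opt_out <= 1")
    if acceptance_rate >= opt_out:
        return "opt-out"    # enabled by default; users may disable it
    if acceptance_rate >= opt_in:
        return "opt-in"     # offered to users, who accept or reject it
    return "not ready"      # below the minimum; results considered unusable


# Hypothetical thresholds and measurement, for illustration only.
print(rollout_mode(0.90, opt_in=0.85, opt_out=0.98))
```

Keeping the thresholds as explicit parameters makes it easy to revisit them when the UX or the definition of a valid result changes.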

Note that what constitutes a valid result is highly dependent on the application and UX being designed. In some cases, reworking the initial problem side by side with the client can be the key to discovering successful solutions.

The Opt-In / Opt-Out Framework In Action

Let's walk through what implementing an opt-in / opt-out framework to find workable solutions within a business and feature design context might actually look like. Keep in mind that the listed steps are examples and may vary based on individual cases.

Imagine a Gen AI system used to generate product descriptions for a website. In this scenario, we'll set the acceptance threshold at 85% for the opt-in case and 98% for the opt-out case.

Let’s go through the steps to get to the Opt-in threshold:

As a starting point, we consider only one output text, and we deem it acceptable for only 68% of the input products.
We notice that if we adapt the UX to let the user select one of 3 candidate descriptions, the acceptance rate improves by 14 percentage points, to 82%.
With some prompt engineering we gain an additional 5 points, which brings us to 87%, above the opt-in threshold.

Now, the opt-out threshold:

We can identify bad outputs, filter them out, and replace them by sampling new ones from the generative process. Let's say that this yields an 8-point improvement, bringing us to 95%.
At this point, if we run out of automated improvements, closing the remaining 3 points to reach the opt-out level might require adding a human moderation filter to the process. Note that this might not be feasible in many applications.
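The walkthrough above can be tallied with a few lines of code. This is only a sketch: it assumes the quoted improvements are additive percentage points, which is the reading under which the example lands above the opt-in threshold, and the step labels are paraphrased.

```python
OPT_IN, OPT_OUT = 85, 98  # thresholds from the example, in percent

# (step, gain in percentage points); the first entry is the baseline.
steps = [
    ("single output text (baseline)", 68),
    ("let the user pick among 3 candidates", 14),
    ("prompt engineering", 5),
    ("filter bad outputs and resample", 8),
]


def walk(steps):
    """Return the cumulative acceptance rate after each step."""
    rate = 0
    history = []
    for name, gain in steps:
        rate += gain
        history.append((name, rate))
    return history


history = walk(steps)
for name, rate in history:
    if rate >= OPT_OUT:
        status = "opt-out"
    elif rate >= OPT_IN:
        status = "opt-in"
    else:
        status = "below opt-in"
    print(f"{rate:3d}%  {status:12s}  after: {name}")

# Gap that human moderation (or another improvement) would have to close.
remaining = OPT_OUT - history[-1][1]
print(f"Still needed to reach opt-out: {remaining} points")
```

In practice the gains would be measured on an evaluation set after each change rather than assumed up front, and they rarely add up this cleanly.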

The Human Validation: Quantitatively Assessing Costs and Benefits

For some applications where “real” samples also go through moderation (such as images on a website), human validation is already part of the existing pipelines. Still, there is a balancing act between the cost of letting mistakes slip through and the cost of having actual people check every submission.

In more quantitative terms, we can think of a simplified model for the net income per unit based on the following variables:

Cg: Cost of generating an output.
Ch: Cost of having a person validate the mentioned output.
Ce: Average cost of an error that reaches the user, whether or not a person reviewed the output.
Pg: Probability that the system generates an erroneous output.
Ph: Probability that the system generates an erroneous output that still gets through validation by a person.
R: Average revenue generated by having a correct output.

Now let’s define the net income:

$Net Income = Total Revenue - Total Cost$

In the case of an end-to-end system without human intervention, we have:

$Total Revenue = R * (1 - Pg)$

$Total Cost = Cg + Ce * Pg$

$Net Income = R - Cg - Pg * (Ce + R)$

With human intervention, we have:

$Total Revenue = R * (1 - Ph)$

$Total Cost = Cg + Ch + Ce * Ph$

$Net Income = R - Cg - Ch - Ph * (Ce + R)$

By comparing these two scenarios, it’s possible to make a rough estimate of the benefit of adding human validation.
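As a sketch of that comparison, the snippet below computes net income per unit as total revenue minus total cost for both scenarios. Subtracting one result from the other shows that human validation pays off when Ch < (Pg - Ph) * (Ce + R). All numeric values are hypothetical and chosen only to make the arithmetic easy to follow; the function name `net_income` is also an assumption.

```python
def net_income(R, Cg, Ce, p_error, Ch=0.0):
    """Net income per unit: total revenue minus total cost.

    p_error is Pg without human validation, or Ph with it.
    Ch defaults to 0 for the fully automated case.
    """
    revenue = R * (1 - p_error)
    cost = Cg + Ch + Ce * p_error
    return revenue - cost


# Hypothetical per-unit values, for illustration only.
R, Cg, Ch, Ce = 1.00, 0.02, 0.10, 5.00
Pg, Ph = 0.05, 0.01

automated = net_income(R, Cg, Ce, Pg)
validated = net_income(R, Cg, Ce, Ph, Ch=Ch)
gain = validated - automated

# Largest validation cost per unit at which human review still breaks even.
break_even_Ch = (Pg - Ph) * (Ce + R)

print(f"automated: {automated:.2f}")
print(f"validated: {validated:.2f}")
print(f"gain from validation: {gain:.2f}")
print(f"break-even Ch: {break_even_Ch:.2f}")
```

With these made-up numbers, validation is worthwhile because Ch is below the break-even value; with a cheaper error or a less effective reviewer the conclusion flips.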

Some caveats:

Errors vary in severity; certain errors, like generating offensive or illegal content through an image generator or producing false information on sensitive topics via a text generation system, can be more costly.
In tasks like automation, incorporating human validation may defeat the purpose. What’s the point of a shopping assistant if every output requires manual checks before reaching the user?
The same applies to low-latency use cases, where real-time human validation is infeasible.

Wrapping Up

We've tackled some of the challenges of integrating GenAI into production environments, shining the spotlight on the tricky terrain of error management and acceptance rates. Remember! It’s crucial to strike a balance between automation and manual checks.

If you're interested in GenAI (and if you've made it this far, we're guessing you are), stay tuned for the next entry in our new series: “Exploring Gen AI's real-world applications”. We'll be focusing on practical cases in image and text generation in upcoming posts. We'll also discuss various sampling heuristics aimed at minimizing the chances of bad samples reaching users.