When designing production-grade implementations of language models, several problems and their solution patterns frequently arise. As language modeling technology advances rapidly, keeping up with bleeding-edge developments is essential to continue providing value to our customers. Regardless of the use case, whether it’s customer service assistants, workload automation agents, or textual advertising optimization, the following key concepts serve as guidelines for leading a successful project.
#1 Design state
The Transformer neural network model has no notion of state persisted between inference rounds. The perceived memory of models like GPT-3.5 or others is mostly a byproduct of the application of the self-attention mechanism over the contents of the input context. We used to call this feature the ‘ghost state’ because it is flickery and lacks structure. The ghost state is not reliable enough to provide state keeping for our purposes, so it is necessary to encapsulate the model within a larger construct, which we refer to as the ‘external state.’ How is it designed?
Conversational state machines: We develop an in-memory structure based on state machines to model the steps and conversational flow of each of the assistant or agent functions.
Backend calls: As the conversation progresses, and depending on the current state, the language model will generate specific backend calls to update the external state.
Agent prompting patterns: Advanced prompting techniques, such as ReAct and Plan&Solve, can be employed in complex tasks to circumvent the rigidity of the state machine design. For example, they can extend the repertoire of backend calls in a flexible manner through the composition of more primitive backend calls.
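The external state described above can be sketched as a small state machine wrapped around the stateless model. The states, transitions, and the order-status scenario below are hypothetical, for illustration only:

```python
from dataclasses import dataclass, field
from enum import Enum, auto


class State(Enum):
    GREETING = auto()
    COLLECT_ORDER_ID = auto()
    LOOKUP = auto()
    DONE = auto()


# Allowed transitions for a hypothetical order-status assistant.
TRANSITIONS = {
    State.GREETING: {State.COLLECT_ORDER_ID},
    State.COLLECT_ORDER_ID: {State.LOOKUP},
    State.LOOKUP: {State.COLLECT_ORDER_ID, State.DONE},
    State.DONE: set(),
}


@dataclass
class Conversation:
    """External state wrapped around the stateless model."""
    state: State = State.GREETING
    slots: dict = field(default_factory=dict)  # e.g. collected order IDs

    def advance(self, next_state: State) -> None:
        """Move the conversation forward, rejecting illegal transitions."""
        if next_state not in TRANSITIONS[self.state]:
            raise ValueError(f"illegal transition {self.state} -> {next_state}")
        self.state = next_state
```

In a real system, the backend calls generated by the model would drive `advance`, so any illegal transition is caught before it corrupts the conversational flow.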
By utilizing these design patterns, it is possible to develop a predictable and reliable implementation with an error rate on par with that of a human operator.
#2 Filter
When developing systems involving natural language, it is easy to fall into the trap of confirmation bias. This bias leads developers to believe that all possible or likely inputs have already been handled, which is usually far from the truth. Therefore, to create a robust solution, controls over the textual sequence are necessary to reduce noise and steer the conversation in the desired direction. What controls can be implemented?
Input filter: This task involves filtering out textual sequences or terms from the input that may contain malicious text (such as prompt injection attacks) or input with information mostly unrelated to the function of the assistant or agent.
Output filter: This task involves filtering out textual sequences or terms from the output that may contain inappropriate content that could pose a litigation risk for the company or disclose sensitive information.
Guidance techniques: Tools like NeMo Guardrails, MS Guidance, and the use of techniques such as stop sequences, token healing, guided decoding, regex pattern guides, conditional beam search, and others allow developers to impose rigid constraints on the content or structure of the model’s output. The drawback of using such tools is that they often require multiple calls to the language model with incremental context inputs to shape the output, which, in turn, increases costs and latency.
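A minimal sketch of the input and output filters, assuming simple regex deny-lists; a production system would combine these with tuned classifiers, and every pattern below is illustrative:

```python
import re

# Hypothetical deny-list for prompt-injection attempts in user input.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all|previous) instructions", re.I),
    re.compile(r"reveal your system prompt", re.I),
]

# Hypothetical sensitive-data patterns to redact from model output.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),  # SSN-like sequences
]


def filter_input(text: str) -> bool:
    """Return True if the user input is safe to forward to the model."""
    return not any(p.search(text) for p in INJECTION_PATTERNS)


def filter_output(text: str) -> str:
    """Redact sensitive spans before the reply reaches the user."""
    for p in PII_PATTERNS:
        text = p.sub("[REDACTED]", text)
    return text
```

The input filter rejects the whole message, while the output filter redacts in place; which behavior is appropriate depends on the assistant’s function and the litigation risk involved.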
The use of filtering enables a safe exchange of curated information between the model and the users, relieving the client from the risk associated with the generation of inappropriate content and also sparing the implementation from having to handle arbitrary user inputs.
#3 Optimize
Each call to a language model incurs a cost and adds latency. The client wants to reduce costs, while the end user wants quick responses. Keeping both variables within acceptable bounds is therefore in the client’s best interest and often determines the project’s viability. Using open-source models is one option, and sometimes the only one, to lower cost and latency for some or all of the implementation’s language tasks. Typically, hosted offerings such as ChatGPT or Amazon Bedrock are cost-effective and produce higher-quality output for complex or loosely defined tasks that require a large model to work well. Self-served models, on the other hand, may be a better fit for narrower tasks that smaller language models can handle, or for tasks involving confidential information. What steps can be taken to optimize these variables?
Prompt condensation: There is good and bad prompt engineering. Revising prompts for conciseness can significantly reduce both cost and latency. Furthermore, the entire system should be designed to keep prompts short through task decomposition.
Inference services and frameworks: Many service providers, such as HuggingFace and MosaicML, offer inference-as-a-service for customized models. When running open-source language models, various inference frameworks are available, each with its own trade-offs that should be carefully analyzed to select the most suitable one for the task at hand. A notable example is the combination of NVIDIA TensorRT and Triton.
Batching: When running open-source language models, the proper selection of batch sizes for each task, or even a dynamic batch strategy, should be carefully considered to strike the best balance between cost and latency. Generally, increasing batch size enhances model throughput but also leads to increased latency.
Caching: We are discussing two distinct types of caches here. Semantic caching, applicable to any language model, caches prompt responses and returns them when a semantically similar prompt is presented. This cache can be particularly valuable for specific tasks like Q&A over knowledge bases. The other type is the KV cache, which is essential for self-served models: it stores the K and V tensors of already-processed tokens so they are not recomputed at every decoding step, avoiding quadratic recomputation costs. KV caching is a straightforward concept but is typically implemented at the model level, and not every inference framework supports it for every language model. Running large contexts without KV caching often becomes prohibitive in both cost and latency.
Optimized inference: Specific techniques like quantization, network graph optimization, 2:4 sparsity optimization, and others can yield substantial improvements in latency and throughput, depending on the model and the type of workload.
Embedding fallbacks: The use of a Transformers language model is not always necessary for every aspect of the implementation, as some tasks can be restructured into an embeddings problem, which is much faster and cost-effective to handle.
Generation heuristics: The use of appropriate generation heuristics for each task is important to strike a balance between performance and the desired quality of output.
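As one example from the list above, a semantic cache can be sketched in a few lines. The `embed` function here is a hashing stand-in for a real sentence-encoder model, and the similarity threshold is an assumed parameter that would need tuning per task:

```python
from typing import Optional

import numpy as np


def embed(text: str) -> np.ndarray:
    """Stand-in embedding: a real system would call a sentence-encoder model."""
    vec = np.zeros(256)
    for tok in text.lower().split():
        vec[hash(tok) % 256] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec


class SemanticCache:
    """Return a cached answer when a new prompt is similar enough to an old one."""

    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries = []  # list of (embedding, cached answer) pairs

    def get(self, prompt: str) -> Optional[str]:
        q = embed(prompt)
        for vec, answer in self.entries:
            # Cosine similarity: vectors are already unit-normalized.
            if float(q @ vec) >= self.threshold:
                return answer
        return None

    def put(self, prompt: str, answer: str) -> None:
        self.entries.append((embed(prompt), answer))
```

A production version would use an approximate nearest-neighbor index instead of a linear scan, but the cache-hit logic is the same: check similarity before paying for a model call.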
A successful optimization stage following these guidelines leads to a production-grade implementation because it can be adjusted to stay within the bounds of acceptable cost and latency.
#4 Fine-tune
Fine-tuning preexisting models is one of the secret sauces of deep learning to produce high-quality implementations that don’t leave system users frustrated due to accuracy issues. Language models provide us with the possibility of using them without training, even for zero-shot tasks, but this doesn’t mean that fine-tuning is unnecessary or unhelpful. What are the key considerations to keep in mind?
Smaller models: As mentioned earlier, small language models are highly valuable for addressing simpler tasks with improved cost and latency. Smaller models typically require fine-tuning to function effectively.
High-quality data: For any type of neural network, the creation of very high-quality training data is essential for successful fine-tuning, leading to optimal accuracy. In the case of language models, how can we generate data for fine-tuning?
External corpus: Whether open or closed-source, preexisting corpora of specific types of text can be valuable for fine-tuning, depending on the task.
Real usage data & RLHF (Reinforcement Learning from Human Feedback): RLHF involves soliciting human feedback to score the model’s outputs and then applying reinforcement learning. Real usage data is also valuable, where successful interactions of the assistant or agent are identified and subsequently used for fine-tuning the model. Depending on the context, unsuccessful interactions may also be utilized.
Manual creation: Generating text inputs resembling real-use cases through manual testing or manually created datasets based on rules is typically a useful input for fine-tuning. However, the time required for this approach should be carefully considered. The use of scripting or programmatic generation may also fall into this category.
Generative AI: Using generative AI to fine-tune generative AI may sound like a concept from Inception (and risks reinforcing the model’s own biases), but applied judiciously this method can produce a high-quality, curated dataset.
*-of-thought: While not a fine-tuning procedure in itself, the use of *-of-thought style input incorporated within the fine-tuning data for solving certain tasks is often a way to enhance model performance.
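To illustrate the data-creation side, the sketch below serializes one logged interaction into a fine-tuning record with a *-of-thought style target. The `prompt`/`completion` field names and the `Thought:`/`Answer:` convention are assumptions for illustration, not any specific provider’s schema:

```python
import json


def to_finetune_record(user_msg: str, reasoning: str, answer: str) -> str:
    """Serialize one training example as a JSONL line.

    The chain-of-thought is kept inside the completion so the model
    learns to reason before answering (a common convention, assumed here).
    """
    target = f"Thought: {reasoning}\nAnswer: {answer}"
    return json.dumps({"prompt": user_msg, "completion": target})


record = to_finetune_record(
    "Can I return an opened item?",
    "Policy allows returns within 30 days if the item is undamaged.",
    "Yes, within 30 days as long as the item is undamaged.",
)
```

Records like this can come from any of the sources above: external corpora, curated real interactions, manual creation, or generative AI, as long as each one passes a quality review before entering the training set.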
A successful implementation of fine-tuned models results in significantly higher response accuracy and an improved user experience, helping to meet SLAs.
#5 Test
Testing any type of software is essential before bringing it into production to meet SLAs, and it should also continue throughout development. Since user interactions with language models are inherently flexible and non-deterministic, once we’ve ruled out manual testing, which does not scale, our only option is to test the system with a generative AI agent that can dynamically shape and adapt its interactions based on the system’s responses. It is an intriguing piece of software because, in a way, it serves as a counterpart to the implementation, growing and extending its capabilities to serve multiple purposes. Here are some key points to consider:
Test for function: We aim to define testing flows or sequences that encompass all the functions of the assistant or agent.
Manufacture entropy: While using a high temperature when running models does result in a wider range of outputs, it often isn’t sufficient. A more effective approach is to employ specific prompting procedures for entropy generation, which can achieve a much higher level of entropy in the interactions of the testing agent, better simulating real use cases.
Mimic user behavior: Some of the conducted tests should be capable of emulating user behavior based on collected data, especially in the case of assistants. Fine-tuning models is highly useful for replicating such behaviors.
Continuous testing: Most language models from retail providers undergo periodic adjustments and fine-tuning. Consequently, even when the language models run with zero temperature, the outputs of the model given the same input context may experience “time drift.” Because of this, it becomes necessary to continuously test the system to receive early notifications of potential issues arising from this drift in responses. A robust log collection strategy facilitates quick hotfixes and root cause analysis to generate better and more stable prompts that suffer less from drift.
Test for training data generation: The inputs and outputs of model testing serve as an excellent source of training data for fine-tuning self-serve language models.
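The testing-agent ideas above can be sketched as a harness that manufactures entropy through persona and goal sampling rather than temperature alone. Both the tester and the system under test are stand-in callables here; in practice each would wrap a language-model call, and the persona and goal pools are illustrative:

```python
import random

# Hypothetical pools used to manufacture entropy deliberately,
# instead of relying on sampling temperature alone.
PERSONAS = ["impatient customer", "non-native speaker", "confused first-time user"]
GOALS = ["track an order", "request a refund", "update a shipping address"]


def make_test_prompt(rng: random.Random) -> str:
    """Compose an instruction for the tester agent from sampled traits."""
    persona = rng.choice(PERSONAS)
    goal = rng.choice(GOALS)
    return f"Persona: {persona}. Goal: {goal}. Interact with the assistant."


def run_test_episode(system_under_test, tester, turns: int = 4) -> list:
    """Alternate tester and system turns; the transcript doubles as
    candidate training data for fine-tuning."""
    transcript = []
    message = tester("start")
    for _ in range(turns):
        reply = system_under_test(message)
        transcript.extend([message, reply])
        message = tester(reply)
    return transcript
```

Seeding the random generator makes failing episodes reproducible, which helps the root-cause analysis and hotfix loop described above.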
By employing all these strategies, the client can be confident that the provided implementations consistently maintain high quality, while developers have an invaluable tool for testing functions and fine-tuning models.
Generative AI for text is the new reality of productivity enhancement for companies and individuals. It can do it faster, it can do it cheaper and, sometimes, it can do it even better. At Mutt Data, we are dedicated to developing top-quality, cost-effective generative language solutions to power the workforce of the future. We will analyze your company’s business goals to determine the best technical strategy for your specific use case, providing you with a clear roadmap and a set of achievable milestones. To learn more, get in touch with one of our sales representatives at email@example.com, or explore our sales booklet and blog for additional information.