Select Page
AI Testing

Comprehensive LLM Software Testing Guide

Unsure of how to do LLM Software Testing? Here's a complete guide that will help you get started on testing AI and LLM solutions.

Blue And White Bold Tech Youtube Thumbnail (2)

Large Language Model (LLM) software testing requires a different approach compared to conventional mobile, web, and API testing. This is due to the fact that the output of such LLM or AI applications is unpredictable. A simple example is that even if you give the same prompt twice, you will receive unique outputs from the LLM model. We faced similar challenges when we ventured into GenAI development. So based on our experience of testing the AI applications we developed and other LLM testing projects we have worked on, we were able to develop a strategy for testing AI and LLM solutions. So in this blog, we will be helping you get a comprehensive understanding of LLM software testing.

LLM Software Testing Approach

By identifying the quality problems associated with LLMs, you can effectively strategize your LLM software testing approach. So let’s start by comprehending the prevalent LLM quality and safety concerns and learn how to find them with LLM quality checks.

Hallucination

As the word suggests, Hallucination is when your LLM application starts providing irrelevant or nonsensical responses. It is in reference to how humans hallucinate and see things that do not exist in real life and think them to be real.

Example:

Prompt: How many people are living on the planet Mars?

Response: 50 million people are living on Mars.

How to Detect Hallucinations?

Given that the LLM can hallucinate in multiple ways for different prompts, detecting these hallucinations is a huge challenge that we have to overcome during LLM software testing. We recommend using the following methods,

Check Prompt-Response Relevance – Checking the relevance between a given prompt and response can assist in recognizing hallucinations. We can use the BLEU scoreBLEU scoreMeasures how closely a generated text matches reference texts by comparing short sequences of words and BERT scoreBERT scoreAssesses how similar a generated text is to reference texts by comparing their meanings using BERT language model embeddings to check the relevance between prompt and LLM response.

  • BLEU score is calculated with exact matching by utilizing the Python Evaluate library. The score ranges from 0 to 1 and a higher score indicates a greater similarity between your prompt and response.
  • BERT score is calculated with semantic matching and it is a powerful evaluation metric to measure text similarity.

Check Against Multiple Responses – We can check the accuracy of the actual response by comparing it to various randomly generated responses for a given prompt. We can use Sentence Embedding Cosine Distance & LLM Self-evaluation to check the similarity.

Testing Approach

  1. Shift Left Testing – Before deploying your LLM application, evaluate your model or RAG implementation thoroughly
  2. Shift Right Testing – Check BERT score for production prompts and responses

Prompt Injections

Jailbreak – Jailbreak is a direct prompt injection method to get your LLM to ignore the established safeguards that tell the system what not to do. Let’s say a malicious user asks a restricted question in the Base64 formatBase64 formatIt is a way of encoding binary data into a text format using a set of 64 different ASCII characters , your LLM application should not answer the question. Security experts have already identified various Jailbreaking methods in commonly used LLMs. So it is important to analyze such methods and ensure your LLM system is not affected by them.

Indirect Injection

  • Hidden prompts are often added by attackers in your original prompt.
  • Attackers intentionally make the model to get data from unreliable sources. Once training data is incorrect, the response from LLM will also be incorrect.

Refusals – Let’s say your LLM model refuses to answer for a valid prompt, it could be because the prompt might be modified before sending it to LLM.

How to prevent Prompt Injection?

  • Ensure your training data doesn’t have sensitive information
  • Ensure your model doesn’t get data from unreliable external sources
  • Perform all the security checks for LLM APIs
  • Check substrings like (Sorry, I can’t, I am not allowed) in response to detect refusals
  • Check response sentiment to detect refusals

RAG Injection

RAG is an AI framework that can effectively retrieve and incorporate outside information with the prompt provided to LLM. This allows the model to generate an accurate response when contextual cues are given by the user. The outside or external information is usually retrieved and stored in a vector database.

If poisoned data is obtained from an external source, how will LLM respond? Clearly, your model will start producing hallucinated responses. This phenomenon in LLM software testing is referred to as RAG injection.

Data Leakage

Data Leakage occurs when confidential or personal information is exposed either through a Prompt or LLM response.

Data Leak from Prompt – Let’s assume a user mentions their credit card number or password in their prompt. In that case, the LLM application must identify this information to be confidential even before it sends the request to the model for processing.

Data Leak from Response – Let’s take a Healthcare LLM application as an example here. Even if a user asks for medical records, the model should never disclose sensitive patient information or personal data. The same applies to other types of LLM applications as well.

How to prevent Data Leakage?

  • Ensure training data doesn’t store any personal or confidential information.
  • Use Regex to check all the incoming prompts or outgoing responses for Personal Identifiable Information.(PII)

Grounding Issues

Grounding is a method for tailoring your LLM to a particular domain, persona, or use case. We can cover this in our LLM software testing approach through prompt instructions. When an LLM is limited to a specific domain, all of its responses must fall within that domain. So manual testers have a vital responsibility here in identifying any LLM grounding problems.

Testing Approach

  • Ask multiple questions that are not relevant to the Grounding instructions.
  • Add an active response monitoring mechanism in Production to check the Groundedness score.

Token Usage

There are numerous LLM APIs in the market that charge a fee for the tokens generated from the prompts. Let’s say your LLM application is generating more tokens after a new deployment, this will result in a surge in the monthly billing for API usage.

The pricing of LLM products for many companies is typically determined by Token consumption and other resources utilized. If you don’t calculate & monitor token usage, your LLM product will not make the expected revenue from it.

Testing Approach

  • Monitor token usage and the monthly cost constantly.
  • Ensure the response limit is working as expected before each deployment.
  • Always look for optimizing token usage.

General LLM Software Testing Tips

For effective LLM software testing, there are several key steps that should be followed. The first step is to clearly define the objectives and requirements of your application. This will provide a clear roadmap for testing and help determine what aspects need to be focused on during the testing process

Moreover, continuous integration (CI) plays an important role in ensuring a smooth development workflow by constantly integrating new code into the existing codebase while running automated tests simultaneously. This helps catch any issues early on before they pile up into bigger problems.

It is crucial to have a dedicated team responsible for monitoring and managing quality assurance throughout the entire development cycle. A competent team will ensure effective communication between developers and testers resulting in timely identification and resolution of any issues found during testing.

Conclusion:

LLM software testing may seem like a daunting and time-consuming process, but it is an essential step in delivering a high-quality product to end-users. By following the steps outlined above, you can ensure that your LLM application is thoroughly tested and ready for success in the market. As it is an evolving technology, there will be rapid advancements in the way we approach LLM application testing. So make sure to keep updating your approach by keeping yourself updated. Also, make sure to keep an eye out on this space for more informative content.

Comments(0)

Submit a Comment

Your email address will not be published. Required fields are marked *

Talk to our Experts

Amazing clients who
trust us


poloatto
ABB
polaris
ooredo
stryker
mobility