Need to Evaluate your LLM? Check Sagacify’s Library!



If you stayed abreast of technological advancements last year, you undoubtedly witnessed the transformative impact of Large Language Models (LLMs) like GPT on Natural Language Processing. These models significantly enhanced text-generation tasks, delivering extraordinary results and ushering in a new era for Generative AI. Developers and companies can now leverage pre-trained LLMs to streamline internal processes or enhance customer experiences without extensive retraining or the need for large datasets.


However, despite the apparent simplicity of utilizing LLMs for various tasks such as text generation, summarization, and Q&A, evaluating the generated text presents unique challenges from both business and academic perspectives.


1. Lack of labeled data and precise target objectives:


Traditional Machine Learning requires labeled data for each input to assess and refine results according to the desired objective. Conversely, tasks like text generation and Q&A often lack a single correct answer, and their objectives are not always straightforward. For example, a company developing an AI assistant for Q&A needs to assess factual consistency while also considering factors such as style and politeness, which is difficult because judging these qualities is inherently subjective.


2. Dependency on the Prompt:


The performance of an LLM is closely linked to the quality of the prompt, regardless of the availability of reference data. This underscores the importance of creating and evaluating the prompt as a critical task for achieving the desired output.


3. Perplexity is not enough:


Perplexity is a well-known metric for evaluating language models. It quantifies the average surprise, or uncertainty, of a model when predicting the next word in a sequence. Lower perplexity generally indicates higher model confidence in predicting tokens, but low perplexity does not necessarily indicate superior performance. A model with low perplexity might still be overly confident in incorrect predictions, which highlights the limitation of using perplexity alone as an evaluation metric for language models.
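
As a rough illustration of the definition above, perplexity can be computed as the exponential of the average negative log-probability that the model assigned to each token that actually occurred. The sketch below assumes those per-token probabilities are already available. The function name and the numbers are made up for illustration and do not come from any real model or library.

```python
import math


def perplexity(token_probs):
    """Perplexity from the probabilities a model assigned to the observed tokens."""
    if not token_probs:
        raise ValueError("token_probs must not be empty")
    # Average negative log-likelihood per token.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical per-token probabilities (illustrative values only).
uniform_guess = [0.25, 0.25, 0.25, 0.25]  # model hesitates between 4 options
confident = [0.9, 0.8, 0.95, 0.85]        # model is confident at every step

print(perplexity(uniform_guess))  # close to 4: as uncertain as a 4-way guess
print(perplexity(confident))      # lower: the model is less "surprised"
```

Note that the score only reflects how much probability the model put on the observed tokens. It says nothing about style, politeness, or whether a freely generated answer is factually correct.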


4. Absence of advanced metrics:

Evaluating an LLM's performance involves assessing factors such as language fluency, politeness, coherence, contextual understanding, factual consistency, the capacity to produce relevant and meaningful responses, and more. However, standard assessment techniques (such as BLEU, ROUGE, etc.) provide no dedicated tools for measuring these qualities or the diversity of the generated responses.
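
One way to see this limitation is with a simplified word-overlap score in the spirit of ROUGE-1. The sketch below is not the official ROUGE implementation. The function `unigram_f1` and the example sentences are assumptions made up for illustration.

```python
from collections import Counter


def unigram_f1(reference, candidate):
    """F1 over overlapping words (a simplified, ROUGE-1-like score)."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Multiset intersection: each word counts at most as often as in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


reference = "the meeting was postponed until next week"
paraphrase = "they delayed the session by seven days"       # same meaning
near_copy = "the meeting was postponed until next month"    # different meaning

print(round(unigram_f1(reference, paraphrase), 3))  # low score
print(round(unigram_f1(reference, near_copy), 3))   # high score
```

The faithful paraphrase scores far lower than the near copy that changes the meaning, because the score only counts shared words.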

Depending on the task at hand, you might want to use the following types of evaluation metrics:

  • Embedding-based metrics: These metrics leverage the power of embeddings, which map high-dimensional data (such as text) into a relatively low-dimensional vector space, typically learned from language models. By capturing contextual information, they can be used to compare reference text with text generated by language models, taking into account both semantics and grammar. Examples of embedding-based metrics include BERTScore and MAUVE.
  • Language-model-based metrics: These metrics utilize the capabilities of pre-trained or fine-tuned language models to assess the quality of generated text in downstream tasks. Examples include BLEURT (for machine translation) and Q-squared (which evaluates factual consistency using question generation and question answering).

  • LLM-based metrics: These metrics leverage the advanced capabilities of pre-trained or fine-tuned large language models (LLMs) to evaluate text quality. Trained on vast amounts of data, these models possess state-of-the-art contextual and semantic understanding. LLMs can assess generated text on nuanced criteria such as fluency, politeness, and factual consistency. Examples include SelfCheckGPT, G-Eval, and GPTScore.

Sagacify's library, written in Python and built on top of the HuggingFace Transformers library, offers users a range of metrics and assessment techniques (including the ones mentioned above) to evaluate the quality of text generated by Large Language Models. The evaluation process is highly customizable, allowing users to choose from a predefined set of metrics or define custom ones. Users are free to apply the selected metrics individually or collectively, depending on the availability of reference data and on the specific task and objective to be evaluated.
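
To make the embedding-based idea from the list above more concrete, here is a minimal sketch of greedy token matching by cosine similarity, loosely in the spirit of BERTScore. It is not the library's API and not the real BERTScore algorithm. The hand-written 3-D vectors in `TOY_EMBEDDINGS` are hypothetical stand-ins for contextual embeddings that a language model would normally produce.

```python
import math

# Hypothetical toy "embeddings", hand-picked for illustration only.
TOY_EMBEDDINGS = {
    "postponed": (0.9, 0.1, 0.0),
    "delayed": (0.85, 0.2, 0.0),
    "meeting": (0.1, 0.9, 0.0),
    "session": (0.2, 0.85, 0.0),
    "banana": (0.0, 0.05, 0.95),
    "fell": (0.05, 0.0, 0.9),
}


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_f1(reference_tokens, candidate_tokens):
    """Match each token to its most similar counterpart and combine into F1."""
    ref = [TOY_EMBEDDINGS[t] for t in reference_tokens]
    cand = [TOY_EMBEDDINGS[t] for t in candidate_tokens]
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in cand) for r in ref) / len(ref)
    # Precision: how well each candidate token is supported by the reference.
    precision = sum(max(cosine(c, r) for r in ref) for c in cand) / len(cand)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reference = ["meeting", "postponed"]
print(greedy_match_f1(reference, ["session", "delayed"]))  # high: paraphrase
print(greedy_match_f1(reference, ["banana", "fell"]))      # low: unrelated
```

Unlike word-overlap scores, this kind of metric rewards a paraphrase that shares no words with the reference, provided the embeddings place related words close together.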


The source code and a comprehensive tutorial on installing and using the library can be found at . We are excited to share this library with the open-source community. Your feedback and contributions are highly welcome as we strive to enhance and refine the tool for collective benefit. Let's join forces in shaping this new era of AI together! 😄