Q&A on Large Language Models - Part I


Advantages and Disadvantages of Using Open-source LLMs vs Proprietary LLMs

Advantages of open-source models: more flexibility and control, since you can fine-tune the model on your own data and supervise the whole process to adapt the model to your needs.
More generally, open-source models are more transparent, as we have access to their architecture and their weights. Also, more information on the dataset used for training the LLM is usually available.
Finally, if the open-source model is hosted on your private cloud, the data is not sent to a third party, which is a requirement when handling highly sensitive data.

Disadvantages of open-source models: there is often a performance gap between open-source models and proprietary ones such as GPT-4. Hopefully, this gap will narrow in the coming years.
Besides, deploying an open-source model on cloud infrastructure requires a carefully considered strategy, and the deployment costs must be monitored.

Advantages of proprietary models: proprietary models usually expose a simple API: you send a prompt and the model's answer is returned, all in fewer than 10 lines of code.
Most importantly, as of today, GPT-4 offers superior performance to open-source models, according to available benchmarks.
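To make the "fewer than 10 lines of code" point concrete, here is a minimal sketch of the common chat-completion request pattern. The function only assembles the request payload in the OpenAI-style format; the model name and the actual HTTP call (left as a comment) are illustrative assumptions, not tied to a specific provider SDK.

```python
def build_chat_request(model: str, prompt: str, temperature: float = 0.0) -> dict:
    """Assemble a chat-completion request in the common OpenAI-style format."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }

# Hypothetical usage; sending the request is provider-specific, e.g.:
# response = provider_client.chat.completions.create(**request)
request = build_chat_request("gpt-4", "Summarize this document in two sentences.")
```

The whole round trip, request construction, API call, and reading the answer, fits comfortably in a few lines, which is the main convenience argument for proprietary models.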

Disadvantages of proprietary models: the data is sent to a third party, which might be problematic for processing highly sensitive data.
Moreover, the provider could update the model without notice, which would force you to revise the prompts you initially designed for it.

How to Evaluate LLMs’ Quality & Capabilities?

Properly assessing an LLM's skills on a particular task is notoriously challenging. To do so, we first need to distinguish between two situations: either a reference (the "perfect" answer) is available to assess the model's response, or it is not.

For the first scenario, two options exist: embedding-based metrics and language-model-based metrics. An embedding is a representation of a text in a vector space. A similarity measure, such as cosine similarity, between the embeddings of the model's prediction and of the reference quantifies how close the model's answer is to the reference.
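The embedding-based comparison can be sketched in a few lines. The `embed_stub` below is a deliberately toy "embedding" (a normalized character-frequency vector) standing in for a real sentence-embedding model; only the cosine-similarity computation itself is the real technique.

```python
import math

def embed_stub(text: str) -> list[float]:
    # Toy stand-in for a sentence-embedding model: normalized
    # character-frequency vector over the lowercase alphabet.
    alphabet = "abcdefghijklmnopqrstuvwxyz"
    counts = [text.lower().count(ch) for ch in alphabet]
    norm = math.sqrt(sum(c * c for c in counts)) or 1.0
    return [c / norm for c in counts]

def cosine_similarity(u: list[float], v: list[float]) -> float:
    # cos(u, v) = (u . v) / (|u| * |v|); 0.0 if either vector is zero.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

reference = embed_stub("The cat sat on the mat")
prediction = embed_stub("A cat is sitting on the mat")
score = cosine_similarity(reference, prediction)  # close to 1.0 means similar
```

In practice you would replace `embed_stub` with a proper embedding model; the scoring step stays the same.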

BLEURT is a language-model-based metric: a model trained on pairs of sentences produces a similarity score between the model's response and the reference. BLEURT can also be fine-tuned to fit your specific problem.

Handling the assessment in the second situation is, of course, more challenging. Usually, a different LLM is used to judge the model that produced the predictions (e.g. GPT-Score, G-Eval, ...). These methods are not without drawbacks, however, such as favoring lengthy responses or exhibiting position bias (preferring the candidate answer presented first).
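The LLM-as-a-judge pattern boils down to building a grading prompt and sending it to a second model. The sketch below shows only the prompt-construction step; the question, answer, and 1-5 scale are illustrative, and the judge call itself is left as a comment since it depends on your provider.

```python
def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble a grading prompt for a judge LLM (illustrative 1-5 scale)."""
    return (
        "You are grading the answer to a question on a scale from 1 to 5.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Reply with a single integer from 1 (poor) to 5 (excellent)."
    )

judge_prompt = build_judge_prompt(
    "What is the capital of France?",
    "The capital of France is Paris.",
)
# grade = call_judge_llm(judge_prompt)  # hypothetical call to the judge model
```

Mitigations for the biases mentioned above typically live in this prompt: fixing an explicit scale, instructing the judge to ignore answer length, and randomizing the order of candidates.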

Finally, one must consider the reproducibility of the evaluation, for instance when comparing two models. To obtain consistent, replicable outcomes, set the LLM's temperature parameter to zero so that the same response is produced for every identical query.
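The effect of temperature zero can be illustrated with a toy next-token choice. In the limit of temperature going to zero, decoding becomes greedy (always pick the highest-probability token), which is deterministic, whereas sampling at higher temperatures varies across runs. The vocabulary and probabilities below are made up for the illustration.

```python
import random

# Toy next-token distribution (made-up values for illustration).
vocab = ["Paris", "London", "Rome"]
probs = [0.7, 0.2, 0.1]

def greedy(vocab: list[str], probs: list[float]) -> str:
    # Temperature -> 0 limit: always pick the argmax token (deterministic).
    return vocab[max(range(len(probs)), key=probs.__getitem__)]

def sample(vocab: list[str], probs: list[float], rng: random.Random) -> str:
    # Temperature > 0: tokens are drawn at random, so outputs can differ per run.
    return rng.choices(vocab, weights=probs, k=1)[0]

# Greedy decoding returns the same token on every call.
assert all(greedy(vocab, probs) == "Paris" for _ in range(5))
```

Note that in real deployments temperature zero greatly improves reproducibility, though some hosted models can still show minor run-to-run variation.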