Active learning is a branch of machine learning in which a model in production selects, on its own, which samples should be sent for labeling.
Suppose you have a model detecting defects on a production line where the ratio of defects is 1/100. You can't ask for every piece that goes through the model to be labeled, because that would mean labeling thousands of pieces. But you also can't ask for, say, a random hundred pieces a day to be labeled, because on average only one of them would be a defect. How would you select the most interesting pieces to have labeled, based only on the output of your model?
Active learning has been around for years. In this article, I will introduce general active learning strategies, then present the challenges of adapting them to deep learning, as well as specialized techniques used in deep active learning.
In stream-based selective sampling, the model decides on a case-by-case basis whether an incoming data instance is interesting, based on its informativeness. The informativeness is determined by the query strategy, which will be detailed later.
Pool-based sampling is the most common strategy. The model applies an informativeness query to a large pool of unlabeled samples and selects the most appropriate ones. Contrary to the stream-based approach, this lets us select a fixed number of instances to send for labeling and guarantees that they are the most informative ones.
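To make this concrete, here is a minimal sketch of a pool-based loop. The `train`, `score_informativeness`, and `oracle_label` helpers are hypothetical placeholders standing in for your training routine, query strategy, and annotation step; they are not from any specific library.

```python
import numpy as np

def pool_based_loop(model, labeled, unlabeled, rounds, batch_size):
    """Minimal pool-based active learning loop (sketch)."""
    for _ in range(rounds):
        model = train(model, labeled)                     # retrain on current labels
        scores = score_informativeness(model, unlabeled)  # e.g. uncertainty per sample
        picked = set(np.argsort(scores)[-batch_size:])    # most informative samples
        labeled += oracle_label([unlabeled[i] for i in picked])  # send to annotators
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in picked]
    return model
```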
This strategy, known as membership query synthesis, does not apply to all scenarios. Here, the model generates a new instance based on the real distribution of the data. In our example, the model would create a piece closely resembling real ones, but with a different angle of rotation or a small part missing. This strategy is reserved for problems where generating convincing data is easy.
Active learning differentiates itself from passive learning by its ability to select data instances based on their informativeness. There are different ways to quantify the informativeness of a data instance, corresponding to different query types. Here are the four most common ones.
The most common query framework is uncertainty sampling, in which the model queries the data instances it is least certain about. The simplest variant is the least-confidence approach: for example, in binary classification, the model would choose examples for which it outputs a probability around 0.5.
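As a minimal sketch, least confidence can be computed directly from the softmax outputs (assuming a NumPy array of per-class probabilities):

```python
import numpy as np

def least_confidence(probs):
    """probs: (n_samples, n_classes) softmax outputs.
    Higher score = less confident model = more informative sample."""
    return 1.0 - probs.max(axis=1)

# A 50/50 prediction is maximally uncertain; a 99/1 prediction is not.
probs = np.array([[0.50, 0.50],
                  [0.99, 0.01]])
print(least_confidence(probs))  # [0.5  0.01]
```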
The query-by-committee (QBC) strategy involves creating a committee of models trained on the current labeled data. All the models in the committee vote on each data instance. The most informative instance is the one on which the committee disagrees the most.
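One common way to measure that disagreement is vote entropy. Here is a hedged sketch, assuming each committee member outputs a hard class prediction:

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """votes: (n_models, n_samples) predicted class per committee member.
    Higher entropy = more disagreement = more informative sample."""
    n_models = votes.shape[0]
    scores = []
    for sample_votes in votes.T:
        p = np.bincount(sample_votes, minlength=n_classes) / n_models
        p = p[p > 0]
        scores.append(-(p * np.log(p)).sum())
    return np.array(scores)

# Three models, two samples: unanimous vote vs. three-way split.
votes = np.array([[0, 0],
                  [0, 1],
                  [0, 2]])
print(vote_entropy(votes, n_classes=3))  # [0.0, ~1.10] -> the split sample wins
```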
Instead of focusing on the model's predictions, we focus on getting the most diverse examples. Some form of clustering is performed on the data, and instances are selected across clusters to ensure the model's capacity to generalize to the entire distribution of data instances.
In this framework, we select the instances that would impact the model the most if we knew their label. An example of this family is the "expected gradient length" (1). The authors approximate the gradient for an instance over the possible labels for that instance. The larger the expected gradient, the larger the possible change to the model's weights, and the better the example.
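A hedged PyTorch sketch of the idea: since the true label is unknown, the gradient norm obtained for each candidate label is weighted by the probability the model assigns to that label. This is a simplified per-instance version, not the exact formulation from (1).

```python
import torch
import torch.nn.functional as F

def expected_gradient_length(model, x, n_classes):
    """EGL score for one instance x of shape (1, n_features) (sketch)."""
    with torch.no_grad():
        probs = F.softmax(model(x), dim=-1).squeeze(0)
    score = 0.0
    for y in range(n_classes):
        model.zero_grad()
        loss = F.cross_entropy(model(x), torch.tensor([y]))
        loss.backward()  # gradient as if y were the true label
        grad_norm = torch.sqrt(sum((p.grad ** 2).sum()
                                   for p in model.parameters() if p.grad is not None))
        score += probs[y].item() * grad_norm.item()
    return score  # larger expected gradient = more informative instance
```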
Active learning in deep learning comes with a series of challenges.
Firstly, the uncertainty sampling strategy mentioned above doesn't work well with deep learning. In a classification task, the final output of the model passes through a softmax. Although this output looks like a probability, experiments have shown that it tends to be overconfident, making it an unreliable measure of confidence.
Secondly, deep learning is very data-hungry, whereas active learning relies on a small amount of labeled data to update the model. Feeding a deep learning algorithm one instance at a time is a suboptimal training strategy.
Thirdly, most AL algorithms focus on training classifiers on top of a fixed feature representation. In deep learning, feature learning and task training (classification, segmentation, object detection, etc.) are done jointly. Training the two separately can lead to divergence.
Batch-mode deep active learning (BMDAL) uses batch-based sample querying instead of individual querying. Traditional AL, as presented above, queries on a one-by-one basis, which implies retraining the model many times for only a few extra data instances. In BMDAL, we score a batch of data instead of a single data point.
A naive approach would be to score all samples and then select the most informative ones. However, this method doesn't stop the model from selecting similar samples. The result is a batch carrying much of the same information, and given the high cost of labeling data and retraining large deep learning models, that is not something we can afford.
The solution is to turn to hybrid strategies that combine uncertainty and diversity in our queries, accounting for both the information content of individual samples and the diversity of the batch.
The biggest danger of using only uncertainty sampling is losing diversity. The model may struggle much more with one class than the others, so the active learning process only selects examples from that class. As described in the section above, this can lead to suboptimal batch selection, but not only that. It has been shown (2) that on datasets as simple as MNIST, relying only on uncertainty-based acquisition functions can lead to mode collapse. In that paper, the model had trouble recognizing the handwritten "8". The uncertainty query largely selected examples of "8", so that after a few rounds of acquisition the dataset contained more than three times as many "8"s as any other digit. The model then learns to predict mostly "8". This is "mode collapse".
To answer this problem, we turn to hybrid queries joining uncertainty and diversity. The general strategy is to cluster the data points based on an embedding (3) or on the gradient related to their prediction (4), and to select in each cluster the top-k most interesting data points with an uncertainty function, as sketched below.
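Here is a minimal sketch of this idea, simplified from the diverse mini-batch approach of (3) (which uses informativeness-weighted k-means rather than the plain k-means shown here):

```python
import numpy as np
from sklearn.cluster import KMeans

def hybrid_query(embeddings, uncertainty, batch_size):
    """embeddings: (n_samples, dim) features of the unlabeled pool.
    uncertainty: (n_samples,) uncertainty score per sample (higher = better).
    Cluster the pool, then take the most uncertain sample of each cluster,
    so the batch is informative AND spread across the data distribution."""
    clusters = KMeans(n_clusters=batch_size, n_init=10).fit_predict(embeddings)
    picked = []
    for c in range(batch_size):
        members = np.where(clusters == c)[0]
        picked.append(members[np.argmax(uncertainty[members])])
    return np.array(picked)
```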
In density-based strategies, we focus on a "core set": a subset of the dataset capable of representing the distribution of the feature space of the entire dataset. An example is the core-set approach (5), where we select data instances so as to minimize the largest distance between any data instance in the unlabeled pool and its closest data instance in the training set.
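This minimax objective is NP-hard in general, so the paper relies on a greedy approximation. A minimal NumPy sketch of that greedy k-center selection:

```python
import numpy as np

def kcenter_greedy(pool, labeled, budget):
    """pool: (n, d) unlabeled features; labeled: (m, d) labeled features.
    Repeatedly pick the pool point farthest from its nearest already-covered
    point, shrinking the worst-case coverage distance of the whole dataset."""
    # distance from each pool point to its closest labeled point
    min_dist = np.linalg.norm(pool[:, None, :] - labeled[None, :, :], axis=2).min(axis=1)
    picked = []
    for _ in range(budget):
        idx = int(np.argmax(min_dist))  # the most poorly covered point
        picked.append(idx)
        # picking idx covers every point near it: update coverage distances
        min_dist = np.minimum(min_dist, np.linalg.norm(pool - pool[idx], axis=1))
    return picked
```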
To avoid the possible divergence caused by active learning focusing only on the final decision, instead of jointly on feature learning and the decision as deep learning does, the "Learning loss for active learning" paper (6) came up with a merged pipeline. The authors design a loss module that predicts the loss value for a data instance without knowing the true loss function or seeing the ground truth. This is done by extracting values from intermediate layers. During the active learning rounds, the images with the highest predicted loss are selected.
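A hedged PyTorch sketch of such a loss-prediction module follows. The layer sizes and the number of tapped feature maps are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class LossPredictionModule(nn.Module):
    """Pools features from several intermediate layers of the target model,
    projects each to a small vector, and regresses a scalar predicted loss."""
    def __init__(self, feature_channels=(64, 128, 256), hidden=128):
        super().__init__()
        self.pools = nn.ModuleList(nn.AdaptiveAvgPool2d(1) for _ in feature_channels)
        self.fcs = nn.ModuleList(nn.Linear(c, hidden) for c in feature_channels)
        self.out = nn.Linear(hidden * len(feature_channels), 1)

    def forward(self, features):
        # features: list of intermediate maps, each of shape (batch, C_i, H_i, W_i)
        parts = [torch.relu(fc(pool(f).flatten(1)))
                 for f, pool, fc in zip(features, self.pools, self.fcs)]
        return self.out(torch.cat(parts, dim=1)).squeeze(1)  # predicted loss per image
```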
In this article, you have seen an overview of what active learning is and how it is used in traditional machine learning. You have learned about the challenges that arise when trying to combine deep learning and active learning, such as unreliable confidence estimates, the need for large quantities of data, and the artificial separation between feature learning and decision layers that active learning introduces. This article presented a few interesting approaches to solving some of these challenges.
With the rising demand for data, and more specifically labeled data, there is no doubt that deep active learning is a promising field that will likely expand in the coming years.
By Heloïse Baudoin - Machine Learning Engineer @ Sagacify
1. Settles, Burr, Mark Craven, and Soumya Ray. "Multiple-instance active learning." Advances in neural information processing systems 20 (2007).
2. Pop, Remus, and Patric Fulop. "Deep ensemble bayesian active learning: Addressing the mode collapse issue in monte carlo dropout via ensembles." arXiv preprint arXiv:1811.03897 (2018).
3. Zhdanov, Fedor. "Diverse mini-batch active learning." arXiv preprint arXiv:1901.05954 (2019).
4. Ash, Jordan T., et al. "Deep batch active learning by diverse, uncertain gradient lower bounds." arXiv preprint arXiv:1906.03671 (2019).
5. Sener, Ozan, and Silvio Savarese. "Active learning for convolutional neural networks: A core-set approach." arXiv preprint arXiv:1708.00489 (2017).
6. Yoo, Donggeun, and In So Kweon. "Learning loss for active learning." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019.