It’s widely known that one of the biggest barriers to training a predictive model with Deep Learning is the availability of a large, properly labeled dataset, as labeling it usually takes a lot of time.

So we have two high-cost tasks right here (getting the data and labeling it), and that’s before getting to the fancy stage of model training, hyperparameter tuning and such. A huge number of projects never see the light at the end of the tunnel because there’s simply not enough data to train an accurate predictive model, even with a great team of experienced people who know how to run highly complex deep learning architectures.

I guess it’s clear by now to the people reading this how important it is to have a good amount of labeled data (if you have been working in this field you already know and suffer from this, unless you’re at Google or a similar company with access to a “little” more data to play with). That’s why I have a big interest in thinking about creative ways to get the needed labeled data without spending a lot of time on it. I’m from Argentina, so I’m used to working with very limited resources all the time; maybe that’s why I have a special interest in this kind of topic.

There are a lot of people more intelligent than me who keep moving the field forward, sharing their approaches in the process. I’m going to mention several techniques that help you get more labeled data. Some of them may be obvious to a lot of people, but I think you may find something helpful if you keep reading.


Can I get the data by simply extracting it from the web?

For example, if you want to build a predictive model that identifies damaged homes from images and no dataset is available for the task, you can try to “scrape” one from the web. Obviously, if the model you want to build depends on data generated within your organization, it’s better to use other methods, but if you don’t have that kind of constraint, maybe you’re onto something.

There are several ways to get data from the web. The most common is to program a scraper (for example, with a scraping framework like Scrapy), but you can also hire a company to do it or pay for a specialized online service (scrapinghub comes to mind) to avoid programming the scraper yourself.
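
To give an idea of what the core of a scraper does, here is a minimal sketch that pulls image URLs out of a page using only the Python standard library. The HTML snippet, the example.com URLs and the `extract_image_urls` helper are made up for illustration; a real scraper would first download the page, and a framework like Scrapy handles crawling, retries and politeness for you.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collects the absolute URL of every <img> tag that has a src attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(html, base_url):
    """Hypothetical helper: return all image URLs found in an HTML string."""
    parser = ImageLinkParser(base_url)
    parser.feed(html)
    return parser.image_urls


# Made-up page content standing in for a downloaded listing page.
sample_html = """
<html><body>
  <img src="/photos/damaged-roof.jpg" alt="damaged roof">
  <img src="https://example.com/photos/cracked-wall.jpg">
  <img alt="no source attribute">
</body></html>
"""

urls = extract_image_urls(sample_html, "https://example.com/listing/1")
for url in urls:
    print(url)
```

The image without a `src` is skipped, and the relative path is turned into a full URL you could download later. Before scraping a real site, check its terms of use.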

The great people behind the online course “Practical Deep Learning for Coders” use, in one of their lessons, a very simple script to extract images from Google Images just by copy-pasting a little fragment of code into the browser’s console. You can do that with almost no programming experience.

Data augmentation

Take a look at the following images (credits to fastai):

Do you see a cute dog in each picture? OK, if you, human (sorry robot, this post is not for you), see a dog in each of those pictures, then it’s very likely that a trained model will identify the repeated patterns related to this particular dog. The first image (top left) is the original, tagged as a “great pyrenees dog” by a human, but the rest of the pictures were “generated” from the first one by changing the zoom, lighting, orientation or another property of the image. This is awesome: you get a basically free way to give the model more labeled data to train on, so it’s really worth focusing on.
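
Here is a rough sketch of the idea in plain Python, treating an image as a small grid of grayscale pixel values (0 to 255). The function names and the tiny 4x4 “image” are my own illustration, not fastai’s implementation; in a real project you would use the transforms that ship with your deep learning library.

```python
def flip_horizontal(image):
    """Mirror the image left to right (a change of orientation)."""
    return [row[::-1] for row in image]


def adjust_brightness(image, factor):
    """Scale every pixel, clamping to the valid 0..255 range (a change of lighting)."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def center_crop(image, margin):
    """Drop `margin` pixels from every side (a crude zoom)."""
    return [row[margin:len(row) - margin] for row in image[margin:len(image) - margin]]


def augment(image, label):
    """Return several (image, label) pairs that all reuse the one human-made label."""
    variants = [
        image,
        flip_horizontal(image),
        adjust_brightness(image, 1.5),
        center_crop(image, 1),
    ]
    return [(variant, label) for variant in variants]


# Toy 4x4 image standing in for the labeled dog photo.
original = [
    [10, 20, 30, 40],
    [50, 60, 70, 80],
    [90, 100, 110, 120],
    [130, 140, 150, 200],
]
augmented = augment(original, "great pyrenees")
print(len(augmented), "training samples from 1 labeled image")
```

One labeled image became four training samples, and no human had to tag the extra three.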

Weakly supervised learning

Say you have to classify whether there’s cancer or not from histopathology image slides (very high resolution), but the labels are too general: you only know whether there’s cancer somewhere on the whole slide. That’s a bummer, because most current models for this task are trained on a more granular dataset, labeled by patches (small square regions of the image). But you can imagine that the cost of having pathologists tag every patch of every giant image is pretty big.

So we have labeled data, but only for part of the dataset, or maybe a lot of labeled samples from one category but none from the other (binary classes). Don’t just give up, maybe you could use something called Weakly Supervised Learning.

If a whole slide is labeled with the tag “no cancer” we have a pretty informative tag: we know that there are no signs of cancer anywhere in the image. But if the slide is tagged “cancer” we don’t really know where in the image the indicators of cancer are present. Weakly Supervised Learning is a way to use this “partial” information and build a combined set of different models to get good performance from the system as a whole.

It won’t be as precise as having a good labeled dataset, but if we have a big “partial” dataset available, we can end up getting similar precision.
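
One common way to formalize the slide example is multiple-instance learning, where a slide is treated as a “bag” of patches. That framing is my assumption here, not the only way to do weak supervision. The sketch below assumes some patch classifier already produces a cancer score per patch (the scores are made-up numbers) and shows how a single slide-level label can be turned into patch-level training targets.

```python
def slide_prediction(patch_scores):
    """A slide is as suspicious as its most suspicious patch."""
    return max(patch_scores)


def patch_training_targets(patch_scores, slide_label, top_k=1):
    """Turn one slide-level label into (patch_index, target) training pairs.

    - "no cancer" slide (label 0): every patch is a reliable negative.
    - "cancer" slide (label 1): we only trust the top_k highest scoring
      patches as positives and leave the other patches out of training.
    """
    if slide_label == 0:
        return [(index, 0) for index in range(len(patch_scores))]
    ranked = sorted(
        range(len(patch_scores)),
        key=lambda index: patch_scores[index],
        reverse=True,
    )
    return [(index, 1) for index in ranked[:top_k]]


# Hypothetical scores from a patch classifier for a 3-patch slide.
scores = [0.1, 0.8, 0.3]
print(slide_prediction(scores))
print(patch_training_targets(scores, slide_label=1))
print(patch_training_targets(scores, slide_label=0))
```

In a training loop you would alternate between scoring patches with the current model and retraining it on the targets produced this way.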

Transfer Learning

The learning process of a Deep Learning model goes through several sequential layers, and each layer learns more complex patterns as we move along them. The first layers usually learn to identify basic shapes like lines or rounded squares, and the last layers learn to identify dog shapes and human eyes. The issue with this beautiful and amazing process is that it requires a lot, well not a lot, a huuuuge amount of correctly labeled data to work properly.

Nowadays some of these datasets are available online for every human (with access to the web) in this crazy world, ImageNet being one of the most important and widely used because it contains labeled images covering a lot of different topics. Transfer Learning basically allows us to take a general Deep Learning model trained on such a dataset, one that has learned dog faces, car shapes and that kind of general thing, and then reuse its layers for our own specific problem. Yes, you still have to get a labeled dataset for your use case, but by using Transfer Learning you’ll probably end up needing far fewer samples.
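
The mechanics can be shown with a deliberately tiny toy in plain Python. The `frozen_features` function is a stand-in I made up for the pretrained layers (in real life those would come from a network trained on something like ImageNet), and only the small final layer is trained on our four labeled samples.

```python
import math


def frozen_features(x):
    """Stand-in for pretrained layers: a fixed mapping that is never updated."""
    a, b = x
    return [a + b, a - b, 1.0]  # the constant 1.0 acts as a bias input


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_head(samples, labels, epochs=200, lr=0.1):
    """Train only the new final layer (logistic regression) on frozen features."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            feats = frozen_features(x)
            pred = sigmoid(sum(w * f for w, f in zip(weights, feats)))
            error = pred - y
            # Gradient step on the head only; frozen_features stays untouched.
            weights = [w - lr * error * f for w, f in zip(weights, feats)]
    return weights


def predict(weights, x):
    feats = frozen_features(x)
    score = sigmoid(sum(w * f for w, f in zip(weights, feats)))
    return 1 if score >= 0.5 else 0


# Tiny task-specific dataset: label 1 when the first value exceeds the second.
samples = [(2.0, 1.0), (3.0, 0.5), (1.0, 2.0), (0.5, 3.0)]
labels = [1, 1, 0, 0]
head = train_head(samples, labels)
print(predict(head, (3.0, 1.0)), predict(head, (1.0, 3.0)))
```

Four samples are enough here only because the frozen features already expose something useful for the task, which is exactly the bet you make with Transfer Learning.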

Thank you Transfer Learning, we are not worthy.

Man in the middle, UX and Active Learning

How many predictions will the model need to make per hour? What error threshold can the system tolerate and still be functional? Do the model’s predictions have to be processed in real time?

Those are some of the questions to ask whenever a new AI-related project comes your way, before you decide how to approach it. A lot of projects don’t really need real-time prediction processing, so one thing you can do is include a human in the loop. Humans can really help to quickly fix some of the errors the model makes (at least in the predictions with the highest error probability), acting as gatekeepers before the predictions make it to the final user. And if you make the model actively learn from these mistakes, then you’re producing a Perfect Circle (hi Maynard!) of organic improvement of the model. Feeding those corrections back into training is what you can call Active Learning, a task that could have a high impact if you integrate it into the learning process.
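
A simple way to decide which predictions the human should look at is to rank them by how unsure the model is. The sketch below assumes a binary classifier that outputs a probability per item; the item names, the probabilities and the `review_budget` parameter are all hypothetical.

```python
def uncertainty(probability):
    """0.0 when the model is sure (p near 0 or 1), 1.0 when it is a coin flip (p = 0.5)."""
    return 1.0 - abs(probability - 0.5) * 2.0


def route_predictions(predictions, review_budget):
    """Send the least certain predictions to a human and auto-accept the rest."""
    ranked = sorted(predictions, key=lambda item: uncertainty(item[1]), reverse=True)
    return ranked[:review_budget], ranked[review_budget:]


# Hypothetical (item, predicted probability of the positive class) pairs.
predictions = [("img_a", 0.97), ("img_b", 0.52), ("img_c", 0.10), ("img_d", 0.61)]
to_review, auto_accepted = route_predictions(predictions, review_budget=2)

print("human checks:", [name for name, _ in to_review])
print("auto-accepted:", [name for name, _ in auto_accepted])
```

Whatever the human corrects goes back into the training set as new labeled samples, and the model gets retrained on it from time to time.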

All of this can be insanely better if you pair it with a great User eXperience. Many projects leave the UX for the final stages, and I think that is a huge mistake, because a great UX will allow the humans in the loop to “tag” or fix the model’s mistakes quickly and with a considerably smaller margin of error. I can’t write about UX applied to AI without recommending the People + AI Research from Google. Besides the great blog posts with different use cases, they have an amazing guidebook to start with if you’re interested.

Self supervised learning

Can we find a way to label data using the information already encoded inside it? It’s not magic, it’s self supervised learning.

Let’s use an example related to text. Normally, if we wanted to use Deep Learning to classify movie reviews as positive or negative, we’d first have to get a large enough dataset of labeled samples, but with Self Supervised Learning we can decrease the size of the dataset needed to train a proper predictive model.

So how does this work? It works by using a huge and general text dataset, open and free to use if possible, and it doesn’t have to be labeled in a particular way. Where can we find such a dataset?

Yes, yes, Wikipedia! We can use this giant source of text data to train a model and make it learn several bits of wisdom about language, like predicting the next word in a sentence. For example, when you read the sentence “Here I am at the computer writing an [MASK]” you can infer that the “masked” or hidden word could be one of the following: article, recipe, billboard…

To train a model to learn that, you don’t need labeled sentences, because you are simply using what’s already encoded in the text: isn’t this fascinating? This task of “hiding” words and making the model guess them helps create what’s commonly called a “language model”. Once created, this language model can be used in the Transfer Learning stage to provide the first layers of the Deep Learning model, which is then trained on specific data, like the movie reviews I mentioned before. So a big Wikipedia dataset would be kind of an ImageNet for text, but without all the human labeling process.
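
To make the “free labels” point concrete, here is a minimal sketch that turns a raw sentence into training pairs by hiding one word at a time. The `masked_examples` helper and the whitespace tokenization are simplifications of mine; real language models use subword tokenizers and mask words at random.

```python
def masked_examples(sentence, mask_token="[MASK]"):
    """Build (masked sentence, hidden word) pairs from unlabeled text."""
    words = sentence.split()
    examples = []
    for position, word in enumerate(words):
        masked = words[:position] + [mask_token] + words[position + 1:]
        examples.append((" ".join(masked), word))
    return examples


for masked_sentence, target in masked_examples("I am writing an article"):
    print(masked_sentence, "->", target)
```

Every sentence in the corpus produces its own inputs and targets, so nobody has to label anything by hand.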

Multi stage transfer learning

Sometimes you have few labeled samples, but a simple search on your favorite search engine shows that there are a few available datasets closely related to the data you have and the problem you’ve been trying to solve. For example, if I need to build a model to identify several Argentinian cars from images, it’s highly likely that I won’t find such a dataset online. But it’s pretty easy to find labeled datasets of other, more popular car models.

The idea of Multi Stage Transfer Learning is to create intermediate stages between the first layers, created using an ImageNet-like dataset, and the last layers, created from your specific dataset. Those intermediate stages would be trained using “intermediate” datasets: not something as general as ImageNet, but something more related to the problem you’re trying to solve. Following the Argentinian cars example, this one could be a useful dataset for this task.

If you manage to find a way to use this, you’re golden: you suddenly have a lot of different datasets that you could apply to your specific problem to see how it works without it costing you too much. Personally, I’ve never tried this in practice (I hope to do it in the following weeks and share the results here), but I think it’s a really promising alternative to the standard supervised learning way. I’ve been reading some papers like this one and this one that show nice improvements using this technique.

Synthetic Data

I’ve been reading about Synthetic Data generation for a while and it seems to be a promising subfield. It’s about using computer-generated images that mimic real-world ones and combining them in the training set with the goal of improving model performance. For example, if I’d like to build a dataset of chairs, I could use some software to generate synthetic images of different types of chairs.
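
Rendering realistic chairs needs proper 3D software, so as a stand-in here is a toy generator that draws labeled rectangles on a blank grid. The shapes, sizes and function names are made up; the point is only that the label comes for free because the program knows what it drew.

```python
import random


def render_rectangle(size, top, left, height, width):
    """Draw a filled white rectangle on a black size x size grayscale image."""
    image = [[0] * size for _ in range(size)]
    for row in range(top, top + height):
        for col in range(left, left + width):
            image[row][col] = 255
    return image


def synthetic_dataset(n_samples, size=8, seed=0):
    """Generate (image, label) pairs; the label is known by construction."""
    rng = random.Random(seed)  # seeded so the dataset is reproducible
    shapes = {"tall": (4, 2), "wide": (2, 4)}  # label -> (height, width)
    dataset = []
    for _ in range(n_samples):
        label = rng.choice(sorted(shapes))
        height, width = shapes[label]
        top = rng.randint(0, size - height)
        left = rng.randint(0, size - width)
        dataset.append((render_rectangle(size, top, left, height, width), label))
    return dataset


dataset = synthetic_dataset(10)
print(len(dataset), "labeled images generated")
```

Changing position, size or shape gives as many labeled samples as you want, which is the same appeal the chair example has at a much more realistic scale.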

One of the problems with this technique is the difficulty of generating images that look like real-world ones (at pixel level) with software, so it’s something you could use on a subset of problems; it’s not (currently) something that can be used for everything. Another problem is having access to good software packages, but I guess this will become easier in the near future.

Some domains exploring this topic are medicine, interior design and self-driving cars.

As I said earlier, this is not a complete list, only the approaches that I’ve been working with or am on the verge of trying. I hope you find something useful in it, and if you have any questions please let me know, happy to help.

Do you know of other useful approaches not mentioned in this article? Please tell me. I’m really interested in this topic and it’d be really nice to know about them.
