Deep Learning (DL) is increasingly used to build systems for automated decision making. DL is a powerful technology, but it currently has a big limitation: you need a lot of (usually manually) labeled data to train a model.
That’s why, when we want to work with DL, we often end up using giant open datasets (at least for the initial bootstrapping phase), mostly published by big companies or academic institutions. What’s the problem with this? If the data we use is biased in some way, our new model will most likely inherit that bias too.
So when one big dataset is published in the open and suddenly used by a lot of people around the world, we need to take this problem seriously: it can cause real harm in our daily lives, as these machine decision systems are deployed for many different uses.
Here is an example of bias in word embeddings.
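To make the idea concrete, here is a minimal sketch using toy vectors (the values are made up for illustration; real embeddings have hundreds of dimensions and are learned from large corpora). If the training corpus associated "doctor" mostly with men and "nurse" mostly with women, that association shows up in the geometry of the vectors:

```python
import numpy as np

# Hypothetical toy "embeddings" chosen to illustrate the effect.
embeddings = {
    "man":    np.array([0.9, 0.1, 0.0]),
    "woman":  np.array([0.1, 0.9, 0.0]),
    "doctor": np.array([0.8, 0.2, 0.3]),
    "nurse":  np.array([0.2, 0.8, 0.3]),
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A biased corpus leaves "doctor" geometrically closer to "man"
# than to "woman" -- and any downstream model inherits that.
print(cosine(embeddings["doctor"], embeddings["man"]))    # high
print(cosine(embeddings["doctor"], embeddings["woman"]))  # noticeably lower
```

The same comparison on real pretrained embeddings (e.g. word2vec or GloVe vectors) is what the original bias studies measured; the toy numbers above just make the mechanism visible.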