When your data is dirty

The story below is a great illustration of how an AI can get the wrong idea about the problem we're asking it to solve:



Researchers at the University of Tübingen trained a neural network to recognize images, and then asked it to indicate which parts of the images were most important to its decisions. When they asked the neural network to highlight the most important pixels for the tench (a fish species) category, this is what it highlighted:







[Image: pink human fingers on a green background.]



Human fingers on a green background!



Why was it looking for fingers in the photographs when it was supposed to be looking for fish? It turns out that most of the tench images in the dataset are pictures of people holding the fish up as a trophy. The network has no context for what a tench actually is, so it assumes the fingers are part of the fish.
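For readers curious what "highlight the most important pixels" means in practice: one common approach is a gradient-based saliency map, where you backpropagate a class score to the input image and look at which pixels that score is most sensitive to. Below is a minimal sketch of that general idea with an off-the-shelf torchvision classifier; it is not necessarily the exact attribution method the Tübingen researchers used, and the file name "tench.jpg" is just a placeholder.

```python
# Minimal vanilla-gradient saliency sketch (illustrative only; not the exact
# method from the study described above). Assumes torchvision >= 0.13.
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# "tench.jpg" is a placeholder path for whatever image you want to inspect.
img = preprocess(Image.open("tench.jpg").convert("RGB")).unsqueeze(0)
img.requires_grad_(True)

scores = model(img)
scores[0, 0].backward()          # class index 0 is "tench" in ImageNet-1k

# Pixels with large gradient magnitude had the biggest influence on the
# tench score -- in the story above, those turned out to be the fingers.
saliency = img.grad.abs().max(dim=1).values.squeeze()   # 224 x 224 heat map
print(saliency.shape)
```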



The neural network behind the image generation in ArtBreeder (BigGAN) was trained on the same ImageNet dataset, and when you ask it to generate a tench, it does the same thing:





[Image: four generated images of white people holding something green and speckled. In some of them the green thing has a more fish-like texture, but nowhere is there a clear head or tail; it's just a big fish body. The lower fins blend intricately into many pink human fingers.]



The humans come out much more distinct than the fish, and I'm fascinated by the wildly exaggerated human fingers.
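If you want to generate this kind of class-conditional sample yourself, here is roughly how it might look with the pytorch-pretrained-biggan package and its pretrained "biggan-deep-256" weights. Treat it as a sketch under assumptions: that the package and weights are available to download, and that its name lookup (which relies on NLTK's WordNet data) resolves "tench" to ImageNet class 0.

```python
# Rough sketch of class-conditional sampling with a pretrained BigGAN,
# assuming the pytorch-pretrained-biggan package is installed and can
# fetch the "biggan-deep-256" weights.
import torch
from pytorch_pretrained_biggan import (BigGAN, one_hot_from_names,
                                       truncated_noise_sample, save_as_images)

model = BigGAN.from_pretrained("biggan-deep-256")

truncation = 0.4                                        # lower = tamer samples
class_vec = one_hot_from_names(["tench"], batch_size=1)  # ImageNet class 0 (assumed lookup)
noise_vec = truncated_noise_sample(truncation=truncation, batch_size=1)

with torch.no_grad():
    output = model(torch.from_numpy(noise_vec),
                   torch.from_numpy(class_vec),
                   truncation)

# Writes the sample to disk -- expect fish-colored blobs and plenty of fingers.
save_as_images(output)
```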



There are other ImageNet categories with similar problems. Here, for example, is "microphone".





[Image: four images with very dark backgrounds. The top left is roughly microphone-shaped, with a fluffy sound baffle or perhaps a head of gray human hair. The others look more like humans.]



The neural network has picked up on the scene's contrasting lighting and the human figures, but many of the images contain nothing that even remotely resembles a microphone. In many of the training set photos, the microphone is a tiny, easily overlooked part of the image. Similar problems come up with small instruments such as "flute" and "oboe".



In other cases, there are signs that the photographs were simply mislabeled. In these generated images for "football helmet", some clearly depict people who aren't wearing helmets at all, and some look suspiciously like baseball helmets.





[Image: four generated images. The top two are people, neither of whom is wearing a football helmet (although their hair may be a little weird; it's hard to tell, since the rest is so weird too). At the bottom left, a man wears a helmet that looks like a metal baseball. Bottom right... the bottom right is a football helmet crossed with a toothy cartoon fish.]



ImageNet is a really messy dataset. It has a category for the agama lizard, but not for the giraffe. There is no plain "horse" category, but there is "sorrel" (a specific color of horse). "Bicycle-built-for-two" is a category, but "skateboard" is not.





[Image: four images of what are clearly some kind of multi-wheeled, bicycle-like objects. The wheels tend to be bendy, with oddly splitting spokes, and sometimes they come loose entirely. There are people who look like riders, but it's hard to separate them from the bicycles.]
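For anyone who wants to check these coverage gaps themselves, here is a small sketch that scans a published copy of the ImageNet-1k label list for a few everyday words. The URL below (the label file in the pytorch/hub repository) is an assumption; any copy of the 1,000 class names would work the same way.

```python
# Quick check of which everyday things ImageNet-1k actually has labels for.
# The label-file URL is an assumption; substitute any copy of the 1,000 names.
import urllib.request

URL = "https://raw.githubusercontent.com/pytorch/hub/master/imagenet_classes.txt"
labels = urllib.request.urlopen(URL).read().decode("utf-8").lower().splitlines()

for word in ["tench", "agama", "sorrel", "giraffe", "skateboard"]:
    hits = [name for name in labels if word in name]
    print(f"{word}: {hits if hits else 'no such category'}")
```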



The main reason ImageNet is so polluted is that it was collected automatically from the internet. The images were supposed to be filtered by the crowdsourced workers who labeled them, but plenty of oddities leaked through. There are also an awful lot of images and labels that definitely should not have ended up in a general-purpose research dataset, including images that appear to have been used without the consent of the people depicted. After years of widespread use by the AI community, the ImageNet team reportedly removed some of this content. Other problematic datasets, such as those collected from online images without permission or from surveillance footage, have also recently been taken down (others, like Clearview AI's, are still in use).



Just this week, Vinay Prabhu and Abeba Birhane pointed out serious problems with another dataset, 80 Million Tiny Images. Its images were collected and labeled automatically, using another neural network trained on internet text. It may shock you to learn that internet text contains some pretty offensive things. Rather than try to manually filter all 80 million images, MIT CSAIL withdrew the dataset permanently.



This is not just a problem of bad data, but of a system in which major research groups can release datasets with huge problems, such as offensive labels and photos used without consent. As technology ethicist Shannon Vallor put it, "For any institution doing machine learning today, 'we didn't know' is not an excuse but an admission." Like the algorithm that upscaled a pixelated Obama into a white man, ImageNet is a product of a machine learning community with a serious lack of diversity (have you noticed that most of the people generated on this blog are white? Much of Western culture treats white as the default).



It takes a lot of work to build better datasets, and to better understand which datasets should never be built at all. But that work is worth doing.


