Analysis Most neural networks trained for image recognition identify objects by their texture rather than their shape, unlike humans, according to the latest research.
Artificially intelligent software may suck at tasks like reading and writing, but it can be pretty good at recognizing things in images. The latest explosion of excitement around neural-network-based computer vision was sparked in 2012 when the ImageNet Large Scale Visual Recognition Challenge, a competition pitting various image recognition systems against each other, was won by a convolutional neural network (CNN) dubbed AlexNet.
After this, tons of new image-scrutinizing CNN architectures came flooding in, and by 2017 most of them had an accuracy of over 95 per cent in the competition. If you showed them a photo, they would be able to confidently figure out what object or creature is in the snap. Now, it’s easy for developers and companies to just use off-the-shelf models trained on the ImageNet dataset to solve whatever image recognition problem they have, whether it's figuring out which species of animals are in a picture, or identifying items of clothing in a shot.
However, CNNs are also easily fooled by adversarial inputs. Tweak a few pixels here and there in a photograph, and they’ll fail to recognize an object correctly: what was a banana suddenly looks like a toaster to the AI. Heck, even a turtle can be mistaken for a gun. Why’s that? Could it be that machine-learning software focuses too much on texture, allowing changes in patterns in the image to hoodwink the classifier software?
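For the curious, here is roughly what that pixel-tweaking looks like in practice. This is a minimal sketch of one well-known recipe, the fast gradient sign method, using PyTorch and an off-the-shelf ResNet-50; the technique, the banana.jpg file name, and the perturbation size are our illustration, not something taken from the paper discussed below.

```python
# Sketch: a fast-gradient-sign-style adversarial tweak of an image.
# Assumptions: torchvision's pretrained ResNet-50 and a local file
# "banana.jpg" stand in for whatever model and photo you care about.
import torch
import torch.nn.functional as F
from torchvision import models, transforms
from PIL import Image

model = models.resnet50(weights="IMAGENET1K_V1").eval()

# Pretrained torchvision models expect 224x224 inputs normalized with these stats.
mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

img = preprocess(Image.open("banana.jpg").convert("RGB")).unsqueeze(0)
img.requires_grad_(True)

# The model's prediction for the clean image.
logits = model((img - mean) / std)
label = logits.argmax(dim=1)

# One gradient step that nudges every pixel slightly in the direction
# that increases the loss for that label.
loss = F.cross_entropy(logits, label)
loss.backward()
epsilon = 0.01  # per-pixel perturbation size; barely visible to a human
adversarial = (img + epsilon * img.grad.sign()).clamp(0, 1).detach()

print("clean prediction:      ", label.item())
print("adversarial prediction:", model((adversarial - mean) / std).argmax(dim=1).item())
```

The perturbation is tiny per pixel, which is why the doctored image looks unchanged to a human even as the classifier's answer flips.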
A paper submitted to this year’s International Conference on Learning Representations (ICLR) may explain why. Researchers from the University of Tübingen in Germany found that CNNs trained on ImageNet identify objects by their texture rather than shape.
They devised a series of simple tests to study how humans and machines make sense of what they see. In the computer corner, four CNN models: AlexNet, VGG-16, GoogLeNet, and ResNet-50. In the fleshbag corner, 97 people. Everyone, living or electronic, was asked to identify the objects shown in a series of images.
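All four of those models ship pretrained with the torchvision library, so the machine half of the experiment can be sketched in a few lines. This is our rough illustration rather than the team's actual test harness, and cat.jpg is a placeholder file name.

```python
# Sketch: asking the same four ImageNet-pretrained CNNs the paper used
# to label a single image. "cat.jpg" is a placeholder file name.
import torch
from torchvision import models, transforms
from PIL import Image

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

nets = {
    "AlexNet":   models.alexnet(weights="IMAGENET1K_V1"),
    "VGG-16":    models.vgg16(weights="IMAGENET1K_V1"),
    "GoogLeNet": models.googlenet(weights="IMAGENET1K_V1"),
    "ResNet-50": models.resnet50(weights="IMAGENET1K_V1"),
}

batch = preprocess(Image.open("cat.jpg").convert("RGB")).unsqueeze(0)
with torch.no_grad():
    for name, net in nets.items():
        class_id = net.eval()(batch).argmax(dim=1).item()
        print(f"{name}: ImageNet class {class_id}")
```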
Crucially, the images were distorted in different ways to test each viewer's ability to truly comprehend what they were seeing: the pictures were presented as grayscale; or with the object as a black silhouette against a white background; or just the outline of the object; or a close-up of the texture of an object; or with a distorted texture applied to the shape of the object.
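A couple of those distortions are simple enough to knock together with an off-the-shelf image library. The snippet below is just an illustrative sketch using Pillow, with placeholder file names; the silhouette, texture close-up, and texture-swap variants need object masks and style transfer, which we've left out.

```python
# Sketch: two of the simpler distortions (grayscale and outline-only)
# applied with Pillow. "cat.jpg" is a placeholder; the silhouette and
# texture-swap variants need object masks / style transfer, not shown.
from PIL import Image, ImageFilter, ImageOps

original = Image.open("cat.jpg").convert("RGB")

# Grayscale: colour removed, shape and texture intact.
grayscale = ImageOps.grayscale(original)
grayscale.save("cat_grayscale.png")

# Outline-only: edge detection keeps the contour, throws away the texture.
edges = original.convert("L").filter(ImageFilter.FIND_EDGES)
outline = ImageOps.invert(edges)  # dark lines on a white background
outline.save("cat_outline.png")
```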
The results showed that almost all the images that retained the objects' shape and texture were recognized correctly by both humans and the neural networks. But when the test involved changing or removing the texture of the objects, the machines fared much worse: the software couldn't work with the shape of stuff alone.
“These experiments provide behavioral evidence in favor of the texture hypothesis: a cat with an elephant texture is an elephant to CNNs, and still a cat to humans,” the paper stated.
Humans can recognize objects by their overall shape, while machines latch onto smaller details, such as texture. When asked to identify objects with the wrong texture, like the cat with elephant skin, the 97 human participants were accurate 95.9 per cent of the time on average, while the neural networks scored between just 17.2 and 42.9 per cent.
“On a very fundamental level, our work highlights how far current CNNs are from learning the 'true' structure of the world,” Robert Geirhos, coauthor of the paper and a PhD student at the university, told The Register.
“They learn the easiest associations possible, and in many cases this means associating small texture-like bits and pieces of an image with a class label, rather than learning how objects [are typically shaped]. And I think adversarial examples are clearly pointing to the same problem - current CNNs don't learn the 'true' structure of the world.”
The problem may lie in the dataset. ImageNet contains over 14 million images split across many categories, and yet it's not enough – not enough angles and other insights. Software trained from this information can't understand how stuff is actually formed and shaped. The algorithms can tell butterfly species from the patterns on the creatures' wings, but take away that detail, and the code has no idea what it's actually looking at. It's fake smart.
“These datasets may just be too simple: if they can be solved by detecting textures, why bother checking whether the shape matches, too?" said Geirhos.
"For humans, it is hard to imagine recognizing a car by detecting a specific tire pattern that only images from the 'car' category have, but for CNNs this might just be the easiest solution since the shape of an object is much bigger, and changes a lot depending on viewpoint, etc. Ultimately, we may need better datasets that don't allow for this kind of ‘cheating’."
Back to the adversarial question: do these findings of an over-reliance on texture confirm why slightly altered colors and patterns in pictures fool neural networks? That corrupting a section of banana peel makes the code think it's looking at the texture of a shiny metal toaster?
To investigate this, the researchers built Stylized-ImageNet, a new dataset based on ImageNet. They scrubbed the original textures in the images and swapped them with random textures, and then retrained a ResNet-50 model. Interestingly, although the retrained CNN was more robust to image distortions, it still fell victim to adversarial examples. So, no. The answer to our question is no.
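Generating the texture-swapped images is the fiddly part; once they exist, the retraining itself is bog-standard supervised learning. Here's a rough sketch, assuming the stylized pictures already sit in an ImageFolder-style directory; the path, hyperparameters, and schedule are placeholders rather than the researchers' actual settings.

```python
# Sketch: retraining ResNet-50 on a directory of texture-swapped images.
# Assumes the stylized images already exist under
# "stylized-imagenet/train/<class_name>/" (placeholder path); generating
# them via style transfer is not shown here.
import torch
from torch import nn, optim
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

train_set = datasets.ImageFolder("stylized-imagenet/train", transform)
loader = DataLoader(train_set, batch_size=64, shuffle=True, num_workers=4)

model = models.resnet50(weights=None)  # train from scratch on the stylized data
criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

for epoch in range(90):  # placeholder training schedule
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```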
“Even a model trained on Stylized-ImageNet is still susceptible to adversarial examples, so unfortunately a shape bias is not a solution to adversarial examples," Geirhos explained.
"However, current state-of-the-art CNNs are very susceptible to random noise such as rain or snow in the real world, [which is] a problem for autonomous driving. The fact that the shape-based CNN that I trained turned out to be much more robust on nearly all tested sorts of noise seems like a promising result on the way to more robust models.”
The texture versus shape problem may not sound like such a big deal, but it could have far-reaching consequences. Some systems pretrained on ImageNet might not perform so well in other domains, like facial recognition or medical imaging.
In fact, other research has shown it’s pretty easy to evade identification with a pair of specially patterned glasses, or even fake paper ones. ®