Every tech company wants its image generator to be the best. But they all produce oddly similar work
ART | CAROLINE MIMBS NYCE | In mid-August, X launched an AI-image generator, allowing paying subscribers of Elon Musk’s social platform to make their own art. So—naturally—some users appear to have immediately made images of Donald Trump flying a plane toward the World Trade Center; Mickey Mouse wielding an assault rifle, and another of him enjoying a cigarette and some beer on the beach; and so on. Some of the images that people have created using the tool are deeply unsettling; others are just strange, or even kind of funny. They depict wildly different scenarios and characters. But somehow they all kind of look alike, bearing unmistakable hallmarks of AI art that have cropped up in recent years thanks to products such as Midjourney and DALL-E.
Two years into the generative-AI boom, these programs’ creations seem more technically advanced—the Trump image looks better than, say, a similarly distasteful one of SpongeBob SquarePants that Microsoft’s Bing Image Creator generated last October—but they are stuck with a distinct aesthetic. The colors are bright and saturated, the people are beautiful, and the lighting is dramatic. Much of the imagery appears blurred or airbrushed, carefully smoothed like frosting on a wedding cake. At times, the visuals look exaggerated. (And yes, there are frequently errors, such as extra fingers.) A user can get around this algorithmic monotony by using more specific prompts—for example, by typing “a picture of a dog riding a horse in the style of Andy Warhol” rather than just “a picture of a dog riding a horse.” But when a person fails to specify, these tools seem to default to an odd blend of cartoon and dreamscape.
These programs are becoming more common. Google just announced a new AI-image-making app called Pixel Studio that will allow people to make such art on their Pixel phone. The app will come preinstalled on all of the company’s latest devices. Apple will launch Image Playground as part of its Apple Intelligence suite of AI tools later this year. OpenAI now allows ChatGPT users to generate two free images a day from DALL-E 3, its newest text-to-image model. (Previously, a user needed a paid premium plan to access the tool.) And so I wanted to understand: Why does so much AI art look the same?
The AI companies themselves aren’t particularly forthcoming. X sent back a form email in response to a request for comment about its new product and the images its users are creating. Four firms behind popular image generators—OpenAI, Google, Stability AI, and Midjourney—either did not respond or did not provide comment. A Microsoft spokesperson directed me toward some of its prompting guides and referred any technical questions to OpenAI, because Microsoft uses a version of DALL-E in products such as Bing Image Creator.
So I turned to outside experts, who gave me four possible explanations. The first focuses on the data that models are trained on. Text-to-image generators rely on extensive libraries of photos paired with text descriptions, which they then use to create their own original imagery. The tools may inadvertently pick up on any biases in their data sets—whether that’s racial or gender bias, or something as simple as bright colors and good lighting. The internet is filled with decades of filtered and artificially brightened photos, as well as a ton of ethereal illustrations. “We see a lot of fantasy-style art and stock photography, which then trickles into the models themselves,” Ziv Epstein, a scientist at the Stanford Institute for Human-Centered AI, told me. There are also only so many good data sets available for people to use to build image models, Phillip Isola, a professor at the MIT Computer Science & Artificial Intelligence Laboratory, told me, meaning the models might overlap in what they’re trained on. (One popular one, CelebA, features 200,000 labeled photos of celebrities. Another, LAION 5B, is an open-source option featuring 5.8 billion pairs of photos and text.)
The second explanation has to do with the technology itself. Most modern models use a technique called diffusion: During training, “noise” is added to existing images, which are paired with text descriptions. “Think of it as TV static,” Apolinário Passos, a machine-learning art engineer at Hugging Face, a company that makes its own open-source models, told me. The model then is trained to remove this noise, over and over, for tens of thousands, if not millions, of images. The process repeats itself, and the model learns how to de-noise an image. Eventually, it’s able to take this static and create an original image from it. All it needs is a text prompt.
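For readers who want to see the noising-and-de-noising idea in miniature, here is a rough sketch in Python. It assumes one common textbook formulation of diffusion, in which an image is blended with Gaussian static according to a single "keep" fraction; the function names, the five-pixel toy image, and the value 0.5 are all illustrative assumptions, not a description of how any company's product works. In a real system a trained neural network would estimate the noise; here the true noise is handed back in to show that the step can be undone.

```python
import math
import random


def add_noise(image, keep, rng):
    """Blend an image with Gaussian static.

    keep=1.0 leaves the image untouched; values near 0 are almost pure static.
    Returns the noisy image and the noise that was mixed in.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in image]
    noisy = [
        math.sqrt(keep) * pixel + math.sqrt(1.0 - keep) * n
        for pixel, n in zip(image, noise)
    ]
    return noisy, noise


def remove_noise(noisy, noise_estimate, keep):
    """Undo add_noise, given an estimate of the noise (keep must be > 0).

    In a trained model the estimate comes from a neural network;
    this sketch simply reuses the true noise.
    """
    return [
        (pixel - math.sqrt(1.0 - keep) * n) / math.sqrt(keep)
        for pixel, n in zip(noisy, noise_estimate)
    ]


rng = random.Random(0)                  # fixed seed so runs are repeatable
image = [0.0, 0.25, 0.5, 0.75, 1.0]     # toy five-pixel grayscale "image"
noisy, noise = add_noise(image, keep=0.5, rng=rng)
restored = remove_noise(noisy, noise, keep=0.5)
```

Training amounts to showing a model many such pairs of noisy image and mixed-in noise, so that it learns to guess the noise on its own. Generation then starts from pure static and applies those guesses step by step.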
Many companies use this technique. “These models are, I think, all technically quite alike,” Isola said, noting that recent tools are based on the transformer model. Perhaps this technology is biased toward a specific look. Take an example from the not-so-distant past: Five years ago, he explained, image generators tended to create really blurry outputs. Researchers realized that it was the result of a mathematical fluke; the models were essentially averaging all the images they were trained on. Averaging, it turns out, “looks like blur.” It’s possible that, today, something similarly technical is happening with this generation of image models that leads them to plop out the same kind of dramatic, highly stylized imagery—but researchers haven’t quite figured it out yet. Additionally, “most models have an ‘aesthetic’ filter on both the input and output that reject images that don’t meet a certain aesthetic criteria,” Hany Farid, a professor at the UC Berkeley School of Information, told me over email. “This type of filtering on the input and output is almost certainly a big part of why AI-generated images all have a certain ethereal quality.”
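Isola's point that averaging “looks like blur” can be checked with a toy example. The sketch below is an illustration only, not a reconstruction of any actual image generator: it builds four one-row "images," each with a perfectly sharp edge in a slightly different place, and averages them pixel by pixel. The helper names and the eight-pixel width are assumptions made for the demonstration.

```python
def step_edge(width, edge_at):
    """A one-row image that is black (0.0) left of edge_at and white (1.0) after."""
    return [0.0 if i < edge_at else 1.0 for i in range(width)]


def pixelwise_mean(images):
    """Average a list of equally sized images, pixel by pixel."""
    return [sum(column) / len(images) for column in zip(*images)]


# Four sharp images whose edges sit at pixels 2, 3, 4, and 5.
images = [step_edge(8, edge) for edge in (2, 3, 4, 5)]
average = pixelwise_mean(images)

print(average)
# [0.0, 0.0, 0.25, 0.5, 0.75, 1.0, 1.0, 1.0]
```

Every input jumps straight from black to white, but the average climbs gradually through shades of gray. A model that hedges across many plausible sharp images produces that kind of soft ramp, which the eye reads as blur.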
The third theory revolves around the humans who use these tools. Some of these sophisticated models incorporate human feedback; they learn as they go. This could be by taking in a signal, such as which photos are downloaded. Others, Isola explained, have trainers manually rate which photos they like and which ones they don’t. Perhaps this feedback is making its way into the model. If people are downloading art that tends to have really dramatic sunsets and absurdly beautiful oceanscapes, then the tools might be learning that that’s what humans want, and then giving them more of that. Alexandru Costin, a vice president of generative AI at Adobe, and Zeke Koch, a vice president of product management for Adobe Firefly (the company’s AI-image tool), told me in an email that user feedback can indeed be a factor for some AI models—a process called “reinforcement learning from human feedback,” or RLHF. They also pointed to training data as well as assessments performed by human evaluators as influencing factors. “Art generated by AI models sometimes have a distinct look (especially when created using simple prompts),” they said in a statement. “That’s generally caused by a combination of the images used to train the image output and the tastes of those who train or evaluate the images.”
The fourth theory has to do with the creators of these tools. Although representatives for Adobe told me that their company does not do anything to encourage a specific aesthetic, it is possible that other AI makers have picked up on human preference and coded that in—essentially putting their thumb on the scale, telling the models to make more dreamy beach scenes and fairylike women. This could be intentional: If such imagery has a market, maybe companies would begin to converge around it. Or it could be unintentional; companies do lots of manual work in their models to combat bias, for example, and various tweaks favoring one kind of imagery over another could inadvertently result in a particular look.
More than one of these explanations could be true. In fact, that’s probably what’s happening: Experts told me that, most likely, the style we see is caused by multiple factors at once. Ironically, all of these explanations suggest that the uncanny scenes we associate with AI-generated imagery are actually a reflection of our own human preferences, taken to an extreme. No surprise, then, that Facebook is filled with AI-generated slop imagery that earns creators money, that Etsy recently asked users to label products made with AI following a surge of junk listings, and that the arts-and-crafts store Michaels recently got caught selling a canvas featuring an image that was partially generated by AI (the company pulled the product, calling this an “unacceptable error”).
AI imagery is poised to seep even further into everyday life. For now, such art is usually visually distinct enough that people can tell it was made by a machine. But that may change. The technology could get better. Passos told me he sees “an attempt to diverge from” the current aesthetic “on newer models.” Indeed, someday computer-generated art may shed its weird, cartoonish look, and start to slip past us unnoticed. Perhaps then we’ll miss the corny style that was once a dead giveaway.