Why does generative AI get hands and legs so wrong?

In short: AI image generators have so much trouble with hands because hands don’t follow a fixed pattern. Hands can take thousands of different positions, and can be holding or grabbing an endless variety of objects. Add into the mix that a hand can rotate into a variety of angles that make it seem like it has fewer or more fingers, and you have a case that’s nearly impossible for an AI image generator to get right most of the time.

wonderfully messed up hands, generate images with AI — AI has created a new sport: Judo for the multi-armed

By now, we’ve all heard about, or even played with, artificial intelligence (AI) image generation: anything from typing prompts into Bing Image Generator to doing unbelievable fills and corrections in Photoshop with AI. Fills and corrections that previously would have taken hours can now pretty much be done with a click and drag in Photoshop.

But, there’s one thing that’s been joked about when it comes to AI image generation since the very beginning. And despite all the work and mass adoption of these image generators, it’s not completely right yet. Yep, you probably guessed it: human and animal appendages, and more specifically, human hands.

AI image generators can now create photorealistic representations of a wide array of things. Whether it’s lifelike depictions of everyday objects or the wildest and most surreal imaginings, AI consistently delivers. So, why do they struggle so much when it comes to hands, arms and legs? Why is it that AI gives people extra fingers, or turns fingers into some incomprehensible mass of whatever? Why do some AI images come out with extra legs and arms, and look more like something out of a Frankenstein novel?

why does ai mess up hands and fingers — You might wanna count the number of fingers here

It shouldn’t be so hard, it seems. A 5-year-old knows how to draw a correct hand, and attach it at the end of an arm. Even a dumb old video game console can do it: you don’t see extra fingers and arms popping up in Call of Duty or Elder Scrolls… unless there’s some interesting mod at play. So I can spend $500 on a video game console, and it can depict a human better than the most advanced artificial intelligence running on thousands of servers worth millions of dollars?

Someone please explain that.

Why AI messes up hands: it’s all about bad imitation

Have you ever been told not to go around copying and pasting solutions from the internet? Or maybe you’ve gotten that “yeah, you know the stuff, but you have no experience applying it” talk at work. “You’ve gotta think, you can’t just apply the same recipe all over the place.”

too many teeth image generated by AI DALL-E — Good thing they’re straight, imagine the orthodontics bill on that mouth with all those extra incisors.

Well, that’s sort of what’s happening here. The problem with AI is bad imitation, and very limited reasoning. When you generate an image with AI, the AI looks through its catalog of recipes for something that might work, applies it, evaluates it and, if it passes some basic tests, out it goes into the world.

AI generators like DALL-E, Stable Diffusion or Midjourney were initially fed hundreds of millions of images from the real world, with basic descriptions. The images got analyzed and, via a training process a bit too long to describe in this blog post, a bunch of “recipes” for how to represent a whole lot of objects were created.

So when you ask DALL-E (Bing) to draw a basketball, it searches its database for what recipes are associated with “basketball”, throws out a bunch of random splatters of color and noise, and starts refining. It moves pixels here, moves pixels there, and then compares the result to its recipe. If it doesn’t fit, it refines and tries again. Finally, after many cycles of refining, it gets to a point where the “image” is deemed acceptable, and gets sent out.
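To make that refinement loop concrete, here is a toy sketch in Python. It is not how DALL-E or any real generator is implemented: real systems use a trained neural network to decide each correction, while this example assumes the “recipe” is simply a fixed list of target pixel values. The names `refine_from_noise` and `distance`, and the parameters `steps` and `rate`, are made up for illustration.

```python
import random


def refine_from_noise(recipe, steps=50, rate=0.2, seed=0):
    """Start from random noise and nudge every pixel toward the recipe.

    `recipe` is a hypothetical stand-in for what the model learned:
    a flat list of brightness values between 0 and 1.
    """
    rng = random.Random(seed)
    # Step 1: a random splatter of noise, one value per pixel.
    image = [rng.random() for _ in recipe]
    for _ in range(steps):
        # Step 2: move each pixel a fraction of the way toward the recipe.
        image = [px + rate * (target - px) for px, target in zip(image, recipe)]
    return image


def distance(a, b):
    """Largest per-pixel difference between two images."""
    return max(abs(x - y) for x, y in zip(a, b))


basketball_recipe = [0.9, 0.5, 0.1, 0.5, 0.9, 0.5]  # made-up values
result = refine_from_noise(basketball_recipe)
print(distance(result, basketball_recipe) < 0.001)  # prints True
```

Notice that nothing in the loop knows what a basketball is. It only shrinks the gap between the pixels and the pattern.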

a basketball player with 4 hands — AI doesn’t “get” how to put things together, which is why a basketball player can have arms growing out of his knees

See the problem yet? The problem is that nowhere in that process does DALL-E reason what a basketball is, how heavy it is, whether it bounces, what it’s made of, etc. It just draws what its database says is a “basketball”.

That might not be a problem if you’re drawing a basketball, or a nature scene, or a car. But when you’re dealing with something highly functional and very complicated, like a hand, arm or leg, it becomes a problem.

Hands are complicated things

The human hand has 27 bones, 27 joints, 34 muscles, and over 100 ligaments. This makes for an astounding range of movements. Think about just your fingers: each finger can extend, retract, angle, open, close, and assume various positions. Some fingers are tied to each other: move your ring finger, and the pinky moves slightly. There are also limits: you can’t, for instance, bend your fingers back past a certain point, nor can you open them past a specific angle.

Now add into the mix everything your hands can hold on to, touch or grab in the real world: you get a practically infinite number of objects, and of possible positions for each one of those 27 joints in your hand.
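A quick back-of-the-envelope calculation shows how fast the numbers blow up. The Python sketch below assumes, purely for illustration, that each of the 27 joints can sit in just 10 distinct positions and that joints move independently. They don’t, as the ring finger and pinky example shows, so treat this as rough intuition rather than anatomy. The function name is made up.

```python
def count_hand_poses(joints=27, positions_per_joint=10):
    """Count poses if every joint independently picks one of N positions."""
    return positions_per_joint ** joints


poses = count_hand_poses()
print(f"{poses:,} possible poses")
```

That is a 1 followed by 27 zeros, before adding a single object for the hand to hold.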

Sure, drawing something that looks like a hand is not that hard. But drawing something that looks like a working, real-life hand is pretty hard stuff. Artists will study the hands for months or even years in order to get a basic grasp on how to draw them correctly.

sketch of a hand — About right? Yes, because this one’s by Alexandru Petre and not by some AI generator

And with good reason: if something is even a little off, people are bound to notice. Hands are one of those things that just look wrong if a finger is in the wrong position, if the length is slightly off, or if the angle is not spot on.

With AI it’s worse because, if you remember, AI didn’t “study” the hand to learn how it works, what’s connected to what, how everything bends, etc. It just knows that a hand is a series of color patterns, edges, and shading.

weird hands ai generation dall-e midjourney stablediffusion — Wanna really mess up an AI? Ask for hands on a repetitive background, like piano keys

And here’s the kicker: most of the AI’s training is on images where the hands are secondary, and are holding something. That makes it even worse: how does the AI know that a blue line coming from a hand is a pen, and not a blue finger? Or how can an AI tell that the fingers wrapped around an umbrella aren’t short, they’re just curled and not visible? Or for that matter, how does the AI know what part of the image is the hand, and what part is the umbrella?

It can’t. That image with very short fingers is registered as a recipe for “hand”. And so is the image with the pen. The AI learns that under certain conditions, it’s ok to draw a hand with a blue, skinny finger or with short, stubby fingers. And if there’s a picture of a hand from the side, where the fingers are all aligned, the AI might register that there’s a recipe for “hand” where the hand is a big C-shaped blob of skin.

All that incorrect information, and all those visual tricks, will eventually get spit out by the AI in its resulting images, especially when you start asking for stuff that doesn’t have an associated image in the database and the image generator has to start creating stuff on its own.

Perspective and optical illusions

If you want to finish messing up hands in AI image generation, take into account that there’s also the issue of perspective: human hands can do, at the very least, a 180 at the wrist.

That means the hand, along with all its fingers in infinite positions, can be pretty much anywhere in terms of rotation at the wrist. And that’s a huge problem, because it generates a bunch of angles where not all fingers are visible. The AI has lots of images in its database which show 5 fingers, but it also has a lot of images where the hand seems to have 4, 3, or 2 fingers.

common errors in drawing human hands — Common AI mistakes in hands: notice how the more the hands interact, the bigger the errors. Deformed fingers are also common, the AI is probably imitating an object in the hand as if it were a finger.

“Seems” is clear to us humans, because we know that our hands usually have 5 fingers. Unfortunately, an AI doesn’t understand the concept of an optical illusion; it simply registers that hands can be drawn with 2 to 5 fingers. Sure, the AI knows that 2 to 4 fingers are much less frequently drawn than 5, but to the AI, it’s a valid way of solving certain requests the user might make.
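Here is a minimal sketch of that idea in Python. The training counts below are invented for illustration, and a real generator does not store an explicit table of finger counts. The point is only that a system which imitates frequencies, with no rule saying “hands have five fingers”, will sometimes reproduce the rarer cases.

```python
import random
from collections import Counter

# Hypothetical tally: how many fingers were VISIBLE in each training photo
# labelled "hand". The numbers are made up for this example.
visible_fingers_seen = Counter({5: 700, 4: 150, 3: 100, 2: 50})


def draw_hand(rng):
    """Pick a finger count in proportion to how often it was seen."""
    counts = list(visible_fingers_seen.keys())
    weights = list(visible_fingers_seen.values())
    return rng.choices(counts, weights=weights, k=1)[0]


rng = random.Random(42)
generated = Counter(draw_hand(rng) for _ in range(1000))
print(generated.most_common())
```

Five fingers comes out most often, but a good share of the generated “hands” still have fewer, simply because that is what the photos appeared to show.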

a female model generated by ai, with incorrect fingers

And on top of all that, put one hand over the other, or in front of the other, a very common pose in jewelry fashion shots. What do you see? Two hands, one in front of the other. What does the AI see? A distorted hand, with around 10 fingers. Or it might register fingers with more fingers coming out of them, or a hand with fingers coming out the back. All that is possible when you have 5,000 different recipes for hands, and you combine them with no real reasoning of how they’re put together.

So, why doesn’t my game console get hands and arms wrong?

Games and computer animation in general work on a completely different principle from generative AI. They start out with a model, which is a mathematical 3D representation of an object, say for example a hand. The model usually contains each moving part of the hand, and rules for how and when it should move. It also contains rules for wrapping that hand around objects.

A game doesn’t have to guess how big fingers should be, or how far they should bend: the programmers have already given it that information. From there on, the graphics engine takes some pre-built textures and shading rules, and applies them to each frame in the game sequence.
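As a rough illustration of what it means for the programmers to have already given the game that information, here is a small Python sketch of a finger joint with hard-coded limits. The class name, field names, and angle values are hypothetical and not taken from any real game engine.

```python
from dataclasses import dataclass


@dataclass
class FingerJoint:
    """One hinge in a rigged hand model, with limits set in advance."""
    name: str
    min_angle: float  # degrees; how far back the joint may bend
    max_angle: float  # degrees; how far it may curl
    angle: float = 0.0

    def bend_to(self, requested: float) -> float:
        """Move toward the requested angle, but never past the limits."""
        self.angle = max(self.min_angle, min(self.max_angle, requested))
        return self.angle


knuckle = FingerJoint("index_knuckle", min_angle=-20.0, max_angle=90.0)
print(knuckle.bend_to(45.0))   # within limits, so the joint moves to 45.0
print(knuckle.bend_to(200.0))  # too far, so it stops at 90.0
```

Because the limit lives in the model itself, an impossible finger position can never reach the screen, no matter what the game asks for.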

You probably couldn’t do something like that in an AI image generator, since it would be way too complex and time-consuming to build 3D models for everything and then get a 2D image from them. But, who knows, maybe in the future we could see something of the sort at play.

In 3D animation, stuff is already defined and the software just shades and renders. Like in this wireframe by shugs81

But what you could do (and I imagine is currently being done) is create a series of dynamic 3D hand models, and simply feed the resulting images into an AI to improve training. With a 3D model, you can generate hundreds of thousands of poses in differing conditions of light and environment, without the need to go hunting for images all over the internet.
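Here is a hedged sketch, in Python, of what the first half of such a synthetic-data pipeline might look like. Everything in it is an assumption made for illustration: the joint names, the angle limits, the lighting options, and the idea of storing each sample as a dictionary. A real pipeline would go on to render an actual image from a 3D model; this sketch only generates the pose and its labels.

```python
import random

# Hypothetical joint limits in degrees (min, max) for a simplified hand.
JOINT_LIMITS = {
    "thumb": (0.0, 60.0),
    "index": (-20.0, 90.0),
    "middle": (-20.0, 90.0),
    "ring": (-20.0, 90.0),
    "pinky": (-20.0, 90.0),
}
LIGHTING = ["daylight", "studio", "candlelight"]  # made-up options


def make_training_samples(n, seed=0):
    """Generate n random but anatomically bounded poses with exact labels."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        pose = {j: rng.uniform(lo, hi) for j, (lo, hi) in JOINT_LIMITS.items()}
        samples.append({
            "pose": pose,
            "wrist_rotation": rng.uniform(0.0, 180.0),
            "lighting": rng.choice(LIGHTING),
            # The label is known for certain because we built the hand.
            "finger_count": len(JOINT_LIMITS),
        })
    return samples


samples = make_training_samples(1000)
print(len(samples), samples[0]["finger_count"])  # prints: 1000 5
```

The appeal of this approach is the label: however the hand is rotated or lit, every sample is known to have five fingers, which is exactly the information scraped photos can’t guarantee.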

Which explains the problem AI has with hands and legs

The comically wrong arms, legs, hands and other appendages we tend to see in AI-generated art probably come down to how our AI generators are trained, and how complicated these objects really are in real life.

horses with extra legs from generative AI — Look closely and you might find an extra leg

AI databases are way too large for humans to go through and understand, and your average AI image generation database is bound to contain a whole lot of incorrect information. Eventually, you’re bound to run into that age-old information science principle called “garbage in, garbage out”. And you’ll get a hand with insanely long fingers, or an extra finger put into the mix, or maybe you’ll get a horse with 3 or 5 legs.

Will they get better with time? Yeah, they probably will. Especially on the more commercial AIs like Midjourney or DALL-E, the hand and appendage problem is furiously being worked on. Because you can’t have a product going around that makes you the butt of jokes on the internet if you want to be taken seriously.

In fact, in the last 6 to 8 months, DALL-E has made some very impressive advancements in hands and fingers. It still messes up, but not as immensely as before. Same with other generators like Midjourney or Stable Diffusion. So, maybe in a year, hands and appendages will be perfect, and we’ll all look back and laugh at this chapter in generative AI models.