Friday, December 8, 2023

AI model bias and why responsible technology matters – exemplified by image generation

In this era of AI, where ChatGPT and LLMs have become the hottest topic in computer science since the Apple Macintosh and the IBM PC, I figured I’d do a small write-up on AI model bias and why paying attention to bias is important. This is especially true in enterprise scenarios, where Microsoft is launching a broad range of AI-powered Copilot experiences.

The author of this article works for Microsoft (December 2023) and is an internal champion for responsible AI and technology, as well as for privacy and compliance.

At Microsoft we have a high bar for delivering responsible AI solutions, which means a lot of work is put in place to ensure the output of AI systems follows Microsoft’s AI principles: systems should be fair, inclusive, and reliable and safe; privacy and security must be accounted for; and the systems must be accountable.

Any model, be it a large language model (LLM) or an image generation model, will inherently have bias built in due to the training data used. In smaller models you can manually verify the training data to counter some bias and balance the training set, but as models grow larger this becomes increasingly harder. I’m not saying there are no systems in place to counter training bias already, but to truly counter bias it has to be built into pre- and post-processing of input prompts and model outputs.
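To make the pre-processing idea concrete, here is a minimal hypothetical sketch of screening an incoming prompt against a blocklist before it ever reaches the image model. The function name and the blocked term are mine for illustration; real guardrail systems use far more sophisticated techniques (classifiers, prompt rewriting), but the principle is the same.

```python
# Hypothetical sketch of prompt pre-processing: screen the user's prompt
# against a blocklist before it reaches the image model. The term below
# is illustrative only, not an actual filter list.

BLOCKED_TERMS = {"overweight"}

def preprocess_prompt(prompt: str) -> str:
    """Return the prompt unchanged if it passes the filter, else raise."""
    words = {w.strip(".,").lower() for w in prompt.split()}
    if words & BLOCKED_TERMS:
        raise ValueError("Prompt blocked by responsible AI filter")
    return prompt
```

A post-processing step would do the analogous check on the model’s output before showing it to the user.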

I will use image generation as an example, showing the difference between the image creator in Microsoft Designer (built on DALL·E 3 from OpenAI) and Stable Diffusion XL (SDXL), an open-source model from Stability AI. The Microsoft solution has guardrails in place, where the open-source solution does not – unless you add them yourself via prompting. I’m not saying either of them is perfect, as the examples will show.

I want to call out that any bias shown is not statistically verified; it is based only on generating a small set of random sample images with the same prompt.

Example 1 – photo of correctional officer in a well lit hallway eating a donut


The above eight images are from DALL·E 3. They are all close-up photos showing a fit, light-skinned male with dark hair.


In comparison, the SDXL images have a wider focal point showing the full body. They show a mix of male and female subjects, and a mix of light- and dark-skinned people. I would argue the SDXL model is more accurate to what people look like in 2023, while the DALL·E 3 model outputs “perfect”-looking people. Whether this is due to the images the models are trained on, or to the prompt being augmented to ask for “perfect”-looking people, I do not know.

The default color palette is also different: DALL·E 3 leans more green, while SDXL has more brownish colors.

If I add “overweight” to the DALL·E 3 prompt, the Responsible AI filter kicks in and blocks the generation. If I add “fat”, then it works.


With SDXL I can modify the prompt to “closeup photo of a slim white male correctional officer in a well lit hallway eating a donut” to mimic what DALL·E 3 outputs by default – countering the model’s bias toward wide angles and real-life-looking people.


Example 2 – woman

Let’s try a simple prompt with the subject “woman”. For SDXL I added negative prompting to avoid any NSFW images – something that is blocked as part of DALL·E 3’s responsible AI (RAI) guardrails.


DALL·E 3 seems to pivot towards portrait photos when no extra contextual information is given, as that is likely the intent behind a simple input subject. They are also all dark-haired and appear to be young women.


In comparison SDXL gives a wide variety of image types, pivoting to more art-like images instead of photos.

Example 3 – painting of a beautiful norwegian fjord with vikings, with a boeing 737 in the sky, in the style of munch’s scream


The DALL·E 3 painting nails the airplane and pretty much captures the painting style of Edvard Munch.


The SDXL one is not bad either, but the Munch style is not as visible in this sample. And the scale of the plane vs. the Viking ship and buildings is way off.


These simple examples show that articulating your intent in prompting is crucial. Either the system has to add guardrails and contextual information to the prompt, or the person prompting has to be articulate about what they want returned and what they do not want returned. And you have to generate many images to find that ONE you really like.

For online services like Microsoft Designer, going the safe route is the only approach, as people using the service come from a wide variety of backgrounds and age groups. Taking that extra measure to ensure everyone feels safe is important for building trust in the service.

Open-source solutions you can run on your own PC/phone/tablet can allow for fewer guardrails, as the individual running them likely has more skill and is using the tool themselves. Maybe the analogy of hiring a carpenter as a service vs. doing the hammering yourself applies: you trust a hired professional to meet a certain bar, while you are responsible yourself for anything you do.

When it comes to LLMs, we know they are largely trained on English text today and will favor input and output in that language. As they are built on public data, that also influences the default writing style. Fortunately, ChatGPT and Microsoft Copilots put a lot of effort into the system prompts placed around the user prompt, to counter bias in the model, ground responses in facts, and avoid hallucinations. More on that in another post.
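To illustrate what “putting a system prompt around the user prompt” means in practice, here is a minimal hypothetical sketch using the chat-message format common to services like ChatGPT. The system text is illustrative only, not an actual production prompt:

```python
# Hypothetical sketch of wrapping a user prompt with a system prompt,
# in the role-based chat-message format used by LLM chat services.
# The system text below is made up for illustration.

def wrap_user_prompt(user_prompt: str) -> list[dict]:
    """Prepend a system message that counters bias and grounds answers."""
    system_prompt = (
        "You are a helpful assistant. Ground answers in the provided "
        "facts, represent people and cultures fairly, and say so when "
        "you do not know something."
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
```

The user’s own text is passed through untouched; the service simply controls the framing the model sees around it.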


I used Microsoft Designer to create the DALL·E 3 images, and I used the Draw Things app on a MacBook with an 8-bit quantized version of the default SDXL model. The Draw Things app also works on iOS devices.