

What Is Stable Diffusion and How Does It Work?

All of these images were created using the popular AI image generation tool Stable Diffusion.

You might have heard the term Stable Diffusion thrown around in the artificial intelligence space, but do you know what it is and how it works? In this article we dig into how this AI tool creates impressive images from nothing more than a basic text prompt.

Let’s face it: only in the past few years have most of us become aware of AI image generators. They have popped up all over the internet, promising to create stunning images in a matter of seconds. This field is known as AI generative art.

None of these AI tools is more popular or more heavily discussed than Stable Diffusion, a deep-learning text-to-image model released in 2022. Since then the whole industry has boomed. We are now seeing countless fine-tuned versions of Stable Diffusion, some of which focus on photorealism, alongside separate commercial tools such as Midjourney.

Stable Diffusion makes it possible to create images from textual descriptions. You write a text prompt, and the model tries to turn it into an image for you. You might be asking: how is that possible?

How Does Stable Diffusion Work?

Stable Diffusion is a type of latent diffusion model. Rather than working directly on full-size images, these models operate in what is known as a latent space, which is a compressed, lower-dimensional representation of the image data. The diffusion model adds and removes noise in this latent space to generate something that matches the user’s prompt, and a decoder then turns the result back into a full-size image.
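Here is what “low-dimensional” means in practice. In Stable Diffusion 1.x, the autoencoder that maps images into the latent space shrinks each side of the image by a factor of 8 and uses 4 latent channels. A 512×512 RGB image therefore becomes a 64×64×4 latent. The short Python sketch below does that arithmetic. The factor of 8 and the 4 channels are SD 1.x defaults and are assumptions here, because other versions may use different values.

```python
# Latent-size arithmetic, assuming Stable Diffusion 1.x defaults:
# the autoencoder downsamples by 8 per side and outputs 4 latent channels.
DOWNSAMPLE_FACTOR = 8
LATENT_CHANNELS = 4


def latent_shape(width, height):
    """Return (channels, latent_height, latent_width) for an image size."""
    if width % DOWNSAMPLE_FACTOR or height % DOWNSAMPLE_FACTOR:
        raise ValueError("width and height must be multiples of 8")
    return (LATENT_CHANNELS, height // DOWNSAMPLE_FACTOR, width // DOWNSAMPLE_FACTOR)


def compression_ratio(width, height, image_channels=3):
    """How many pixel values each latent value stands in for."""
    c, h, w = latent_shape(width, height)
    return (width * height * image_channels) / (c * h * w)


print(latent_shape(512, 512))       # (4, 64, 64)
print(compression_ratio(512, 512))  # 48.0
```

At this compression ratio, the model denoises about 16 thousand numbers per step instead of roughly 786 thousand pixel values.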

In simple terms, imagine your TV. Do you remember seeing static on the screen, with a picture coming through every now and then? A diffusion model works in a similar way: it slowly removes the static and tunes into the right image. The difference is that Stable Diffusion learned how to do this tuning by training on millions of captioned images.

So What Are Latent Spaces?

In machine learning, a latent space refers to a learned representation of data in a lower-dimensional space. It is also known as an embedding space or feature space. Latent spaces are commonly used in various fields, including computer vision, natural language processing, and generative modeling.

The concept of a latent space arises from the idea that high-dimensional data often exhibit complex and redundant structures, making it challenging to analyze or model directly. By transforming the data into a lower-dimensional latent space, we can capture essential features or patterns while discarding irrelevant or noisy information.

Latent spaces are typically learned through unsupervised or self-supervised learning methods. These methods aim to find a compressed representation of the data that preserves important characteristics. For example, in image processing tasks, a convolutional neural network (CNN) can be trained to encode images into a lower-dimensional latent space while retaining relevant visual information.

Once a latent space is learned, it can be used for various purposes. One common application is dimensionality reduction, where the high-dimensional data is projected onto the learned latent space to simplify analysis or visualization. Latent spaces can also be used for generative modeling, allowing us to generate new samples that resemble the training data. For instance, in a generative adversarial network (GAN), a latent vector is fed into the generator network to produce synthetic samples.
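Here is a deliberately tiny illustration of the idea. It does not show how Stable Diffusion’s encoder actually works. If every 3-D data point lies on one line, then a single number per point (its position along that line) is a perfect 1-D latent representation. The encoder, decoder, and direction vector below are hypothetical and hand-picked. A real model would learn them from data.

```python
import math

# Hypothetical, hand-picked latent direction: a unit vector in 3-D.
# A real model would *learn* its encoder and decoder from data.
DIRECTION = (1 / math.sqrt(3),) * 3


def encode(point):
    """Project a 3-D point onto DIRECTION, giving a 1-D latent value."""
    return sum(p * d for p, d in zip(point, DIRECTION))


def decode(z):
    """Map a 1-D latent value back to a 3-D point on the line."""
    return tuple(z * d for d in DIRECTION)


# "Training data": points that all lie along the same line.
data = [(1.0, 1.0, 1.0), (2.0, 2.0, 2.0), (-0.5, -0.5, -0.5)]

for point in data:
    z = encode(point)  # 3 numbers compressed to 1
    reconstructed = decode(z)
    print(point, "->", round(z, 3), "->", tuple(round(v, 3) for v in reconstructed))

# Generative use: choose a new latent value and decode it into a new
# point that follows the same pattern as the training data.
print(decode(3.0 * math.sqrt(3)))
```

Real image data is far messier than a straight line, which is why real encoders are neural networks rather than a single projection.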

In summary, a latent space is a lower-dimensional representation of data that captures important features or patterns. It enables dimensionality reduction, data analysis, and generative modelling tasks in machine learning.

A more visual example of how latent spaces are used: an image is encoded into a latent space and then reconstructed. (Source: Hackernoon)
An example of the layers used when creating an image. (Source: PyTorch)

Now that you have some understanding of latent spaces and how these models organize their information, it’s time to move on to the actual diffusion model, which is the part of the AI tool that creates the image we see.

What are Diffusion models?

Imagine you want to create a computer program that can generate realistic images, like pictures of cats. Diffusion models are a type of computer program that can help with this.

To understand diffusion models, think of them as a step-by-step process of making a picture better and better. At the beginning, you start with a random, messy image that doesn’t look like a cat at all. This messy image represents the “noise” in the model.

Then, you go through a series of steps called “diffusion steps.” In each step, you make the image a little bit clearer and more cat-like. It’s like you’re gradually removing the noise and bringing out the cat features.

Example of a diffusion model in action. (Source: Nvidia)

You repeat these diffusion steps multiple times, with each step refining the image further. By the end of all the steps, the image looks much more like a realistic cat picture. This final image represents the generated output of the model.

The cool thing about diffusion models is that they learn from real cat pictures how to make the generated images look realistic. During training, noise is added to real images a little at a time, and the model learns to reverse this, typically by predicting the noise that was added. You can think of this as learning an inverse transformation that turns a noisy image into a clearer, more cat-like one.
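For readers who want the maths, here is a sketch of the standard formulation used by DDPM-style models, which latent diffusion builds on. The notation is ours, not this article’s. $x_0$ is a clean image (a clean latent, in Stable Diffusion’s case), $\epsilon$ is Gaussian noise, and $\bar{\alpha}_t$ is a fixed schedule that shrinks toward 0 as the step $t$ grows. $c$ is the text conditioning, and $\epsilon_\theta$ is the neural network being trained. The first line adds noise to a training image in a single jump to step $t$. The second is the training loss, which rewards the network for predicting that noise correctly.

```latex
x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,
\qquad \epsilon \sim \mathcal{N}(0, I)

\mathcal{L} = \mathbb{E}_{x_0,\, \epsilon,\, t}
\left[ \left\lVert \epsilon - \epsilon_\theta(x_t, t, c) \right\rVert^2 \right]
```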

Once the diffusion model is trained, you can use it to generate new cat pictures. All you have to do is start with a random, noisy image and apply the learned denoising steps one after another, from the noisiest step back to the cleanest. This gradually refines the image until it looks like a cat.
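The generation loop can be sketched in a few lines of Python, under some loud assumptions. The “image” is just a list of four numbers, and the noise schedule is the linear schedule from the original DDPM paper. The trained network is replaced by a hypothetical `oracle_predict_noise` function that already knows the clean target. The sketch therefore shows the shape of a deterministic (DDIM-style) reverse loop, not real learning.

```python
import math
import random


def alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta) for a linear beta schedule."""
    values, prod = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / max(num_steps - 1, 1)
        prod *= 1.0 - beta
        values.append(prod)
    return values


# Hypothetical "clean image" that the toy oracle steers toward.
TARGET = [0.9, -0.3, 0.5, 0.1]


def oracle_predict_noise(x_t, a_bar):
    """Stand-in for the trained network: returns the noise that explains
    x_t, assuming the clean image is TARGET."""
    return [(x - math.sqrt(a_bar) * x0) / math.sqrt(1.0 - a_bar)
            for x, x0 in zip(x_t, TARGET)]


def generate(num_steps=1000, seed=0):
    rng = random.Random(seed)
    schedule = alpha_bars(num_steps)
    # Start from (almost) pure Gaussian noise, like TV static.
    x = [rng.gauss(0.0, 1.0) for _ in TARGET]
    for t in reversed(range(num_steps)):
        a_bar = schedule[t]
        a_bar_prev = schedule[t - 1] if t > 0 else 1.0
        eps = oracle_predict_noise(x, a_bar)
        # Estimate the clean image from the current noisy one.
        x0_hat = [(xi - math.sqrt(1.0 - a_bar) * e) / math.sqrt(a_bar)
                  for xi, e in zip(x, eps)]
        # Deterministic DDIM update: re-noise the estimate to the
        # previous, slightly less noisy, level.
        x = [math.sqrt(a_bar_prev) * x0 + math.sqrt(1.0 - a_bar_prev) * e
             for x0, e in zip(x0_hat, eps)]
    return x


print([round(v, 3) for v in generate()])
```

Because the oracle already knows the answer, the very first step recovers `TARGET`. In a real model, the noise predictor is a trained network that only approximates the noise, so the estimate improves gradually over many steps.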

Diffusion models are exciting because they can generate high-quality images that capture the complexity and details of real cat pictures.

Why did we use cats as our example? Well, we recently used a custom LoRA to create the image below, so you could say cats were stuck in our minds. Sorry if you don’t like cats; you can create images of dogs too!

Cat images created with Stable Diffusion using a custom LoRA and model files. Sure, they are not “realistic”, but they have a particular artist’s style.

So Why Use a Text Prompt?

In text-to-image generation, a text prompt serves as a description or a concept for the image that you want the model to generate. For example, you might provide a prompt like “a beach sunset with palm trees and waves crashing.” The text prompt acts as a guide for the model to understand what kind of image you want it to create.

The model processes the text prompt and generates an image that corresponds to the description. It uses its understanding of visual features, textures, and objects learned during training to create a realistic image representation of the given text.

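Different models rely on different techniques to generate the image. Stable Diffusion learns from a large dataset of images paired with text captions. When you enter a prompt, a text encoder (a CLIP-based encoder in Stable Diffusion 1.x and 2.x) turns it into text embeddings. These embeddings guide every denoising step, so the image gradually takes on the features the prompt describes.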
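Here is one concrete way the prompt steers generation. Stable Diffusion pipelines commonly use classifier-free guidance. At each denoising step, the network predicts the noise twice, once with the prompt and once with an empty prompt. The two predictions are then combined so that the result leans harder toward the prompt. The sketch below shows only that combining rule, with made-up numbers standing in for real predictions. A guidance scale of 7.5 is a common choice (it is the default in Hugging Face’s `diffusers` Stable Diffusion pipeline), but it is a tunable setting, not a fixed constant.

```python
def classifier_free_guidance(noise_uncond, noise_cond, guidance_scale=7.5):
    """Combine unconditional and prompt-conditioned noise predictions.

    guided = uncond + scale * (cond - uncond)
    A scale of 1.0 returns the conditioned prediction unchanged; larger
    values push the result further in the direction the prompt suggests.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(noise_uncond, noise_cond)]


# Made-up predictions for three latent values, purely for illustration.
uncond = [0.0, 0.5, -1.0]
cond = [1.0, 0.5, -0.5]
print(classifier_free_guidance(uncond, cond))       # [7.5, 0.5, 2.75]
print(classifier_free_guidance(uncond, cond, 1.0))  # [1.0, 0.5, -0.5]
```

Where the prompt makes no difference (the middle value), guidance changes nothing. Where it does, the difference is amplified.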

The generated image is an interpretation of the text prompt based on the model’s training and its understanding of visual patterns. However, it’s important to note that the output image might not always match your exact expectations, as it depends on the capabilities and limitations of the specific model being used.

If you are struggling to create the perfect image with Stable Diffusion, why not try our Stable Diffusion Cheat Sheet? It’s perfect for beginners.

Issues With Stable Diffusion

Stable Diffusion is great: it works well and creates some really stunning images. Heck, you can even create realistic people if you know the right prompt. One of its main issues, though, is that diffusion models can be slow and demanding to run, because every image is produced through many denoising steps.

When creating images with Stable Diffusion, we have a wide range of options to choose from. We can select the image dimensions, and once our image is created we have the option to generate a high-resolution version too. We could create an image that is 512×512, 1024×1024, or even larger if our GPU can handle it.

If you run Stable Diffusion locally on Windows or a Mac, you are heavily reliant on your GPU. At least 8 GB of VRAM is recommended, and an NVIDIA graphics card gives you the best chance of creating images without any hiccups. Graphics cards are quite expensive, and so are modern gaming systems, so all of this cost has to be taken into account before you start using Stable Diffusion.
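A quick back-of-the-envelope calculation shows why VRAM matters. The original Stable Diffusion v1 release (CompVis) describes the model as having an 860M-parameter U-Net and a 123M-parameter text encoder. The smaller image autoencoder is left out here. The sketch multiplies parameter counts by bytes per parameter. It only counts the model weights, so treat the result as a lower bound. Real usage also includes intermediate activations, the autoencoder, and framework overhead.

```python
def weights_gigabytes(num_params, bytes_per_param):
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9


# Parameter counts stated for Stable Diffusion v1: U-Net + text encoder.
params = 860e6 + 123e6

print(round(weights_gigabytes(params, 4), 2))  # float32 weights: 3.93
print(round(weights_gigabytes(params, 2), 2))  # float16 weights: 1.97
```

The weights alone fit within 8 GB. Much of the remaining memory goes to intermediate activations, which grow with image resolution, and that is where cards with more VRAM help.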

However, you do have other options. You can use DreamStudio, Stability AI’s web app for Stable Diffusion, which is fairly cheap; we did a breakdown of its prices here. But even then you are paying per image generation, so you are basically paying to use someone else’s GPUs.

The Possibilities of Using Stable Diffusion

When it comes to Stable Diffusion, the possibilities are endless. You are only limited by your own imagination and, of course, by your system specifications. 😅 There are thousands of AI models available online to help you create stunning AI art. If that’s not enough, there are LoRA files: small add-on files that fine-tune a model toward a particular style or subject.
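LoRA (Low-Rank Adaptation) files can be small because they do not store a whole new weight matrix. Instead, they store two thin matrices whose product is added to an existing weight: W' = W + (alpha / r) · B · A. Here B has shape d×r, A has shape r×k, and the rank r is small. The Python sketch below counts parameters for one hypothetical 768×768 layer at rank 4. It then applies such an update to a tiny matrix. The layer size, rank, and scaling value are illustrative assumptions, not values from any particular LoRA file.

```python
def lora_param_counts(d, k, rank):
    """Parameters in a full d x k update versus a rank-r LoRA update."""
    full = d * k
    lora = rank * (d + k)  # B is d x rank, A is rank x k
    return full, lora


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][m] * b[m][j] for m in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def apply_lora(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * B @ A without modifying W."""
    delta = matmul(b, a)
    scale = alpha / rank
    return [[w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
            for i in range(len(w))]


full, lora = lora_param_counts(768, 768, 4)
print(full, lora, full // lora)  # 589824 6144 96

# Tiny example: a 2x2 weight with a rank-1 update.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]  # 2 x 1
A = [[0.5, 0.5]]    # 1 x 2
print(apply_lora(W, B, A, alpha=1.0, rank=1))  # [[1.5, 0.5], [1.0, 2.0]]
```

In this hypothetical layer, the LoRA update needs 96 times fewer numbers than a full replacement. That is why a LoRA file is usually tiny compared with a full model checkpoint.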

So if you have something in mind, chances are someone has already created a file for you to use. Or why not learn how to create your very own models? Now you know what Stable Diffusion is and how it works!

Do you use Stable Diffusion? Would you like to? Why not leave your comments below?
