
Stable Diffusion: Captioning for Training Data Sets

I drew from various resources, from books and articles to courses and datasets, to convey my experiences, insights, and strategies. I experimented with different methods and techniques for captioning images effectively for Stable Diffusion training.

I discovered that for smaller projects, manual captioning is superior to automated captioning. However, manual captioning is a labor-intensive process, so it is not suitable for large-scale projects with numerous images.

AI is rapidly evolving, so this blog is not static; it is a reflection of how I learn and improve, constantly updated with the latest developments.

I primarily use LoRA training, but the information I share here can benefit you with other training methods as well. As a disclaimer, I am a learner and an explorer of fine-tuning Stable Diffusion models. I use my blog as a chronicle of my findings, and I will update this guide as I learn.

[Reference Guide]

Stable Diffusion: Captioning for Training Data Sets

When preparing images for machine learning, the resolution and aspect ratio are important considerations:

  • High Resolution:

    Aim for the highest resolution possible within a 512×512 pixel dimension. High resolution ensures better image clarity and detail, which is crucial for accurate image analysis and processing.

  • Aspect Ratio Flexibility:

    The aspect ratio, which is the ratio of the width to the height of an image, should not be a primary concern. Various aspect ratios can be accommodated using a technique called bucketing. This method groups images of similar aspect ratios together, allowing for efficient processing of different shaped images without distorting their original proportions.

  • Varied Image Formats:

    Acceptable image formats can range from square to landscape orientations. The flexibility in formats allows for a diverse range of images to be processed and analyzed, catering to different requirements and use cases.

  • High-Quality Input for High-Quality Output:

    The foundational principle of dataset preparation is that the quality of your input directly influences the quality of your output. In essence, if you use high-quality data for training, the model is more likely to produce high-quality results. This means selecting images or data that are clear, accurate, and representative of the subjects or styles you want the model to learn.

  • The Importance of Quantity and Variety:

    While quality is paramount, the quantity and variety of the data also play an important role. A larger and more diverse dataset enables the model to learn from a broader range of examples. This diversity can include variations in lighting, angles, backgrounds, and other contextual elements for images. The more varied your dataset, the more robust and flexible the trained model will be in handling different scenarios or tasks.

  • Quality Over Quantity:

    When faced with the choice between a large number of lower-quality images and fewer high-quality ones, opt for quality. A dataset with fewer but higher-quality images is generally more beneficial than a larger set of poorer quality. Quality images provide clearer and more accurate information for the model to learn from, leading to better performance and reliability.

  • Upscaling as a Last Resort:

    If your dataset includes images that are of lower resolution or quality, upscaling can be considered, but it should be a last resort. Upscaling is the process of increasing the resolution of an image using algorithms. It can introduce artifacts or distortions, which might negatively impact the learning process.

    If upscaling is necessary, use an advanced tool such as LDSR (Latent Diffusion Super Resolution) via Automatic1111, which is designed to enhance image quality with minimal loss of detail. However, even the best upscaling methods can’t fully substitute for the inherent quality of high-resolution, original images. Paid upscalers, such as the Topaz AI and Magnific AI tools named in the checklist at the end of this guide, are also worth trying, but they carry a hefty price tag, so use them only as a last resort.
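To make the bucketing idea from the list above more concrete, here is a minimal sketch. It is not the algorithm of any particular trainer: the function names, the 64-pixel step, the side limits, and the 512×512 area budget are all assumptions chosen for illustration.

```python
def make_buckets(max_area=512 * 512, step=64, min_side=256, max_side=1024):
    """Enumerate (width, height) buckets whose sides are multiples of `step`
    and whose area stays within `max_area`. All defaults are assumptions."""
    buckets = set()
    for width in range(min_side, max_side + 1, step):
        # Largest height (a multiple of `step`) that keeps the area in budget.
        height = (max_area // width) // step * step
        height = min(height, max_side)
        if height >= min_side:
            buckets.add((width, height))
            buckets.add((height, width))  # mirrored bucket for the other orientation
    return sorted(buckets)


def assign_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))


buckets = make_buckets()
print(assign_bucket(1024, 1024, buckets))  # (512, 512)
print(assign_bucket(1920, 1080, buckets))  # (640, 384)
```

Each image is then resized to its bucket instead of being cropped or stretched to a square, which is why the aspect ratio of the source images does not need to be uniform.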

Based on a combination of personal experiments, insights from previous work with different datasets, and learnings from subject-matter papers and successful methodologies, here are some general notes and recommendations for captioning in image training:

  • Don’t rely on Automated Captioning Tools for Now:

    While automated captioning tools like BLIP and Deepbooru are innovative and show promise, they may not yet be fully reliable for high-quality captioning tasks. Their current limitations could affect the efficacy of your training process. I use them as a starting point and edit as needed.

  • Challenges with Automated Tools:

    These automated systems have been observed to produce mistakes and repetitive captions, which require significant effort to correct. Issues such as struggling with context understanding and assessing the relative importance of elements in an image are common.

  • Manual Captioning is More Efficient Currently:

    Given the time and effort needed to rectify the errors made by automated tools, it’s generally faster and more effective to manually create captions. This approach allows for greater accuracy and context-specific nuances that automated systems might miss or misinterpret.

  • Focus on Quality and Relevance:

    When manually captioning, prioritize the quality and relevance of the captions to the specific training purpose. Ensure that the captions accurately describe the significant elements of the images and are aligned with the specific learning objectives of the model.

  • Stay Updated on Tool Developments:

    While manual captioning is recommended for now, the landscape of AI and machine learning tools is rapidly evolving. Keep abreast of developments in automated captioning technologies, as they may become more sophisticated and useful in the future.

Although I recommend against relying on automated captioning, I find it useful as a starting point that I then edit. With the BLIP captioning tool, I can add a prefix to each caption, which saves time and keeps my caption structure consistent.
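If your captions were generated without a prefix, the same effect can be approximated afterwards. The sketch below does not call BLIP; it only post-processes caption strings, and the function name `add_prefix` is a hypothetical helper for this example.

```python
def add_prefix(caption, prefix):
    """Prepend `prefix` to a caption unless the caption already starts with it."""
    caption = caption.strip()
    if caption.lower().startswith(prefix.lower()):
        return caption
    return f"{prefix} {caption}"


# An auto-generated draft caption, edited only by adding a consistent opening.
print(add_prefix("a woman standing on a beach", "photograph of"))
# photograph of a woman standing on a beach
```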

These guidelines are designed to assist in creating a more effective and efficient process for captioning, which is a critical component in the training of image-based models. By focusing on manual, high-quality captioning, you can enhance the model’s learning accuracy and overall performance. When captioning images for training purposes, it’s essential to align the style of your captions with how you typically prompt.

If your prompts are usually verbose, detailed, or specific, your captions should reflect that. Conversely, if you prefer short, concise prompts, your captions should be similar. This consistency ensures that the model better understands and interprets the visual content in relation to the text.

The principle behind this is that the training process benefits from uniformity in language style, aiding the model in making more accurate associations between images and textual descriptions.

  • Consistency in Style:

    Caption images in the same style as your prompts. If your prompts are verbose and detailed, your captions should be similarly descriptive.

  • Relation Between Captioning and Prompting:

    Understand that captioning and prompting are interconnected. The way you prompt directly influences how you should caption images.

  • Self-awareness in Prompting Style:

    Be aware of your natural prompting style. Do you prefer lengthy, detailed sentences or short, concise ones?

  • Matching Caption Verbosity to Prompting:

    Align the verbosity of your captions with your prompting style. If your prompts are short and to the point, so should your captions be.

  • Reflect Prompting Nuances in Captions:

    If your prompts are typically nuanced or vague, incorporate similar nuances in your captions.

  • Leverage Task-Aware Captioning:

    Research suggests that caption styles influence the effectiveness of models in tasks. So, tailoring captions to be task-aware can improve model performance.

  • Adapt Captions to Prompt Requirements:

    For specific tasks, adjust your captions to meet the requirements of the prompt, enhancing the relevance and effectiveness of the training.

  • Continuous Learning and Adaptation:

    Stay updated with evolving captioning techniques and adapt your strategy as new research and tools become available.

When captioning images for a dataset, adopting a structured approach can streamline the process and potentially enhance the learning efficiency of the model. This structured approach means applying a consistent format or template when describing each image. The idea is that consistency in the structure of your captions can help the model recognize patterns and learn more effectively.

Here are the key points to consider:

  • Consistent Structure for Each Concept:

    Apply a uniform structure in your captions for each specific concept or category of images, such as photographs or illustrations.

  • Ease of Process:

    Using a set structure simplifies the captioning process for you and can potentially aid the learning process of the model, although this is based on intuition rather than proven evidence.

  • Avoid Mixing Structures:

    Within a single dataset, try to maintain the same structural format for captions. Different structures for different types of images (e.g., photographs vs. illustrations) should be kept separate and not mixed.

  • Example of a Structure:

    The specific structure you use can vary, but it should consistently apply to all images within a category.

In summary, a consistent, structured approach to captioning can potentially make your dataset more effective for training purposes and simplify the captioning process.

Captions in image training can be thought of as variables in your prompts: the elements you describe in a caption are the ones you will be able to vary in the model’s output. When captioning, detail the elements that you want to be able to change in the model’s generated responses.

For example, if you’re training the model to recognize faces and want the flexibility to change hair color, your captions should explicitly mention the hair color in each image. This makes “hair color” a variable. On the other hand, for characteristics that are essential to the concept (like a specific facial feature), avoid making them variables in your captions. For instance, if teaching a model about a specific face, avoid changing descriptions of distinct facial features like nose size, as it would compromise the consistency of that face’s representation.

Here are the key points to consider:

  • Describe changeable aspects in detail:

    Example: If different hairstyles are desired, specify “short blonde hair” or “long brown hair” in captions.

  • Maintain Core Characteristics:

    Avoid varying important features of the main concept. Example: If focusing on a specific face, don’t vary the nose size in captions to maintain the uniqueness of the face.

  • Contextual Captions:

    Use captions for context but focus on the primary concept. Example: While teaching about a specific dog breed, mention “brown dog” but don’t vary breed-specific features in the captions.

  • Detailing Variables:

    Suppose you’re training a model on images of rooms but want to change furniture styles. Caption each image with specific furniture details like “Victorian sofa” or “modern armchair”. This makes furniture style a variable while keeping the concept of ‘room’ constant.

  • Maintaining Core Concepts:

    If your focus is on teaching the model about a specific type of car, avoid varying essential features like the make or model in your captions. For instance, if it’s a ‘Ford Mustang’, don’t caption some images as ‘Chevrolet Camaro’. You can, however, vary the color or setting descriptions like “red Ford Mustang in the city” or “blue Ford Mustang in the countryside”.

These examples help illustrate how to strategically use captioning to control the variables and constants in model training.
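One way to check a dataset against this rule is to scan each caption for terms that should stay constant and terms that should act as variables. The following is a rough sketch using plain substring matching; the function name and the term lists are hypothetical and would need to be adapted to your own concept.

```python
def audit_caption(caption, core_terms, variable_terms):
    """Return (leaked_core, described_variables) for one caption.

    core_terms: phrases that define the concept and should NOT be captioned.
    variable_terms: phrases you want to keep adjustable and SHOULD caption.
    """
    text = caption.lower()
    leaked = [t for t in core_terms if t.lower() in text]
    described = [t for t in variable_terms if t.lower() in text]
    return leaked, described


caption = "photo of ohwx man, short blonde hair, large nose, blue shirt"
leaked, described = audit_caption(
    caption,
    core_terms=["large nose"],
    variable_terms=["blonde hair", "brown hair", "blue shirt"],
)
print(leaked)     # ['large nose']
print(described)  # ['blonde hair', 'blue shirt']
```

A non-empty `leaked` list flags a caption that turns a core feature into a variable and is worth editing by hand.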

Using Generic Class Tags to Bias Learning: Applying broad class tags (like “man” or “car”) can bias the model towards associating all instances of that class with the characteristics present in your training data. This can be useful if you want the model to generalize a certain feature across the entire class.

Specific Tags for Targeted Learning: To teach the model specific concepts without influencing the entire class, use unique or specific tags. This approach ensures that the model associates these tags with the particular characteristics of your training images, without generalizing these characteristics to the entire class.

Here are the key points to consider:

  • Broad Class Tagging:

    If you’re training a model on birds and want it to generalize the concept of “bird” with certain traits (like being small and colorful), use the tag “bird” across various images showing small, colorful birds.

  • Specific Tagging:

    If you want the model to recognize a specific type of bird, like a “sparrow,” without influencing its general understanding of all birds, use a specific tag like “citySparrow.” This way, the model learns specific attributes of “citySparrow” without applying these attributes to the broader class of “bird.”

It’s imperative to maintain consistency in captions throughout your dataset. This consistency is key in effectively training models like Stable Diffusion (SD). Consistent captions ensure that the model accurately learns and recognizes concepts with the same terminology, avoiding confusion from varied phrasings.

Example: If teaching the concept of a “smiling person”, consistently use this exact term across all relevant images. Avoid alternating with phrases like “person smiling” or “grinning person” to maintain clarity and ease of learning for the model.

Here are the key points to consider:

  • Uniformity in Terminology:

    Make sure each concept is described with the same term or phrase across all captions to facilitate straightforward learning by the model.

  • Avoid Mixed Terminology:

    Using different terms for the same concept (like “soaring bird” and “bird in flight”) can confuse the model. Stick to one descriptive phrase per concept.

  • Assistance from Software:

    Caption editors with search and find-and-replace features can help you spot and correct inconsistent phrasing across a dataset.
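The consistency rule above can also be enforced with a small script. This is a minimal sketch, assuming you keep a hand-written mapping from phrasing variants to the one phrase you settled on; the mapping contents and the name `normalize_caption` are illustrative only.

```python
import re

# Hypothetical mapping: variant phrasing -> the single phrase used in the dataset.
CANONICAL = {
    "person smiling": "smiling person",
    "grinning person": "smiling person",
    "bird in flight": "soaring bird",
}


def normalize_caption(caption, canonical=CANONICAL):
    """Replace every known variant with its preferred phrase (case-insensitive)."""
    for variant, preferred in canonical.items():
        pattern = re.compile(r"\b" + re.escape(variant) + r"\b", re.IGNORECASE)
        caption = pattern.sub(preferred, caption)
    return caption


print(normalize_caption("photo of a grinning person, bird in flight"))
# photo of a smiling person, soaring bird
```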

When captioning for models like Stable Diffusion, it’s important to avoid repetitive phrasing. Repetition can inadvertently increase the significance or ‘weight’ of certain words in the model’s learning process. For instance, repeatedly using the word “background” in various tags (like “simple background,” “white background,” “lamp in background”) can overemphasize this aspect in the model’s learning.

Instead, it’s more effective to combine these descriptions into a single, more comprehensive phrase, such as “simple white background with a lamp.” This approach reduces unnecessary repetition and helps maintain balance in the model’s focus during training.

Instead of separately tagging “sunny sky,” “blue sky,” and “cloudless sky,” combine them into one tag like “clear blue sunny sky” to avoid repetition.


  • Minimize the use of repetitive words in captions to prevent skewed learning emphasis.

  • Combine similar descriptions into a single, comprehensive phrase for efficiency.
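To spot repetition before training, you can count how many comma-separated tags in a caption share the same word. The sketch below is one simple way to do that; the stopword list and the function name are assumptions for the example.

```python
from collections import Counter

# Small, hypothetical stopword list so filler words are not reported.
STOPWORDS = {"a", "an", "the", "in", "on", "of", "with", "and"}


def repeated_words(caption, min_count=2):
    """Return words that appear in at least `min_count` comma-separated tags."""
    counts = Counter()
    for tag in caption.split(","):
        words = {w for w in tag.lower().split() if w not in STOPWORDS}
        counts.update(words)  # each word is counted once per tag
    return {word: n for word, n in counts.items() if n >= min_count}


print(repeated_words("simple background, white background, lamp in background"))
# {'background': 3}
print(repeated_words("simple white background with a lamp"))
# {}
```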

In captioning for AI models, the order of words significantly impacts tag weighting, emphasizing the need for structured captioning. This structure aids in maintaining consistent emphasis across different images, enhancing the model’s learning process. Familiarity with this structure also speeds up the captioning workflow. Furthermore, leveraging the model’s existing knowledge is beneficial. Use commonly understood and effective words in your prompts, as obscure terms might not be as effectively recognized by the model.

For instance, instead of using a rare term like “mordacious,” use “sarcastic” if it aligns better with the model’s known vocabulary. This approach ensures the model’s current understanding is effectively utilized.

For example, suppose you’re training a model to recognize different types of trees. Instead of using complex botanical terms like “Quercus rubra” (which might not be in the model’s existing vocabulary), you could use more commonly known terms like “red oak.” This way, you’re leveraging the model’s existing knowledge of basic tree types while still introducing specific species. The model, already familiar with the concept of “oak,” can more easily adapt to recognize “red oak” as a specific variation.

Here are the key points to consider:

  • Structured Captioning: Maintain a consistent word order in captions for uniformity.

  • Model Comprehension: Use familiar terms for better recognition by the model.

  • Detailed Example: Teach “Quercus rubra” by captioning it as “red oak” for easier model recognition.

  • Knowledge Utilization: Incorporate words that the model is already likely to know.

This guide primarily focuses on captioning individuals or characters, so it may be slightly less relevant for abstract themes like mystical landscapes. Nonetheless, it could offer some creative ideas, and it will help you understand the foundation of structuring captions.

It’s important to note that this isn’t the definitive or only method for effective captioning.

This is my personal approach, which has proven effective in my projects. I’m sharing my techniques and rationale, and you’re welcome to adapt these ideas as you see fit. I learned this method from a Reddit post [Source] and added my own research to it. Over time I’ve evolved this structure, and I sometimes simplify it depending on the subject.

General Format


This is where I would stick a rare token (e.g. “ohwx”) that I want heavily associated with the concept I am training, or anything that is both important to the training and uniform across the dataset. You can use the subject’s name or a general class word.
Example: man, woman, anime (or the person or subject you’re trying to train, like Uncle Larry or Aunt Selma), etc.
Category or Viewpoint:

Broad descriptions of the image to supply context in “layers”:

  • Identifying the Image’s Nature

    What is it? First, determine the basic form of the image. Is it a photograph, an artistic illustration, a simple drawing, a detailed portrait, a computer-generated render, or an anime-style image? These categories help to initially define the image.
    Examples: photograph, illustration, drawing, portrait, render, anime.

  • Subject of the Image

    “Of a…”
    Examples: woman, man, mountain, trees, forest, fantasy scene, cityscape.

  • Detailing the Type of Subject

    What type of X is it (X = the choice above)?
    Examples: full body, close up, cowboy shot, cropped, filtered, black and white, landscape, 80s style.

  • Perspective of the Subject

    What perspective is X from?
    Examples: from above, from below, from front, from behind, from side, forced perspective, tilt-shifted, depth of field.

Action Prompts

This section is about capturing the activities or states of the main subject in your image. You should use verbs to vividly describe what the subject is doing or experiencing. The idea is to include as many action words as necessary to paint a clear picture.

  • Detailing the Subject’s Actions

  • Turning Actions into Variables

  • Examples with Different Subjects

Examples of Action Prompts:

  • Using a person as an example: standing, sitting, leaning, arms above head, walking, running, jumping, one arm up, one leg out, elbows bent, posing, kneeling, stretching, arms in front, knee bent, lying down, looking away, looking up, looking at viewer

  • Using a flower as an example: wilting, growing, blooming, decaying, blossoming

Subject Description

This part focuses on describing the subject in detail. The aim is to highlight features and aspects of the subject without directly addressing the primary concept you’re trying to convey. Think of these details as variables in your description, adding depth and specificity.

  • Describing a Person: white hat, blue shirt, silver necklace, sunglasses, pink shoes, blonde hair, silver bracelet, green jacket, large backpack

  • Describing a Flower: pink petals, green leaves, tall, straight, thorny, round leaves

Distinctive Features

Highlighting Additional Features. This section is dedicated to capturing those unique or supplementary aspects of an image that don’t quite fit into the categories of ‘main subject’ or ‘background’, but are still significant.

  • Purpose of Notable Details

  • Using This Section Effectively

Examples of Notable Details

  • In a beach scene, instead of just mentioning the presence of an umbrella, you could specify its appearance as a “yellow and blue striped umbrella partially open in the foreground.” This adds a layer of detail that enriches the overall image.

  • In a portrait, go beyond basic descriptions and include actions like “he is holding a cellphone to his ear”. This not only describes a physical object but also hints at an activity or behavior, making the image more dynamic and relatable.

Setting and Environment

  • Describing the Setting and Environment:

    This section focuses on detailing the background or location in your image. It’s important to provide a comprehensive description of what’s happening behind the main subject. To do this effectively, you can use a multi-layered approach.

  • Basic Environment Description:

    Start with a broad outline of the setting. For instance, in a beach scene, you would begin by mentioning general elements such as being outdoors, the presence of a beach, sand, water, the shoreline, and perhaps the time of day, like sunset. This first layer sets the stage and provides context.

  • Adding Specific Details:

    Next, dive into more specific aspects of the environment. Continuing with the beach example, you might describe the ocean dynamics like small waves, note objects or activities such as ships at sea, a sandcastle, or towels laid out on the sand. This layer brings more life and activity to the scene.

  • Detailed Description of Elements:

    Finally, focus on the finer details that make the scene unique. You could describe the color and design of the ships as red and white, mention that the sandcastle has a moat around it, and specify the color pattern of the towels, like red with yellow stripes. This final layer enriches the scene with vivid, memorable details.

Examples with 3 layers separated by commas:

  1. Outdoors, beach, sand, water, shore, sunset
  2. Small waves, ships out at sea, sandcastle, towels
  3. The ships are red and white, the sandcastle has a moat around it, the towels are red with yellow stripes

Loose Associations

Incorporate Broad Impressions. Finally, capture the more intangible or abstract associations connected with the image. These are typically spontaneous thoughts or emotions evoked by the image, or overarching themes that it seems to convey.

  • Nature of Loose Associations:

    Think of this as the space for any subjective impressions or feelings that the image inspires in you. These can range from emotional responses to broader conceptual ideas. The key is that these elements should be present, at least implicitly, in the image.

  • Placement of Associations in Captions:

    It’s important to remember that this section is for more general, nuanced associations. If the image strongly embodies a particular emotion or concept, you might consider placing that nearer to the beginning of your caption for greater emphasis.

  • Examples of Broad Impressions:

    You might note feelings like happiness, sadness, a sense of joy, hope, loneliness, or a somber mood. These don’t describe physical aspects of the image but rather the emotional or thematic undertones it presents.
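Putting the sections of this format together, a caption can be assembled in a fixed order so that every image in the dataset follows the same structure. The helper below is only a sketch of that idea: the function name, the parameter names, and the choice to join everything with commas are assumptions, not part of any tool.

```python
def build_caption(image_type, subject, global_tag="", subject_type="",
                  perspective="", actions=(), description=(), notable=(),
                  setting=(), associations=()):
    """Join the caption sections in one fixed order, skipping empty ones."""
    parts = [
        global_tag,                       # rare token or subject name
        f"{image_type} of a {subject}",   # what it is, "of a ..."
        subject_type,                     # e.g. full body, close up
        perspective,                      # e.g. from front, from above
        *actions,                         # action words
        *description,                     # subject description (variables)
        *notable,                         # distinctive features
        *setting,                         # setting and environment layers
        *associations,                    # loose associations
    ]
    return ", ".join(p.strip() for p in parts if p and p.strip())


caption = build_caption(
    image_type="photograph",
    subject="woman",
    global_tag="ohwx",
    subject_type="full body",
    perspective="from front",
    actions=["standing", "looking at viewer"],
    description=["white hat", "blue shirt"],
    notable=["yellow and blue striped umbrella partially open in the foreground"],
    setting=["outdoors", "beach", "sand", "sunset"],
    associations=["happiness"],
)
print(caption)
```

Because the order is fixed in one place, moving a section (for example, putting a strong emotion nearer the start) is a single change that applies to every caption you generate afterwards.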

Now that we’ve covered the rules of proper captioning for datasets, I have prepared some links below to help you decide which method is best for you. In them, I walk you through how to create datasets using the Kohya_ss GUI’s built-in captioning utilities.

The four options for creating datasets with the Kohya_ss GUI are:

  • Basic Captioning:

    This option writes the same caption text, with an optional prefix and postfix, to a caption file for every image in a folder. No captioning model is involved.

  • BLIP Captioning:

    This option allows you to generate captions for multiple images using a pre-trained model called BLIP (Bootstrapping Language-Image Pre-training). BLIP is a vision-language model that describes each image in a natural-language sentence.

  • WD14 Captioning:

    This option allows you to generate captions for multiple images using the WD14 tagger, a pre-trained tagging model named after Waifu Diffusion 1.4. Instead of sentences, it outputs comma-separated, booru-style tags, which makes it best suited to anime and illustration datasets.

  • Manual Captioning:

    This option allows you to manually write captions for multiple images without using any pre-trained model. You can use any text editor or word processor to write your captions in plain text format.

Each option has its advantages and disadvantages. None is superior in every situation; which one suits you best depends on your personal preference and goal.
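For the manual route, the end result is usually one caption text file per image. Here is a minimal sketch of writing those files with the standard library. The layout (a text file sharing the image’s base name) and the `.txt` extension are assumptions, so check what your trainer expects; `write_captions` is a hypothetical helper.

```python
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg", ".webp"}


def write_captions(image_dir, captions, extension=".txt"):
    """Write one caption file next to each image.

    captions: dict mapping image file name -> caption text.
    Returns the names of images that have no caption yet.
    """
    missing = []
    # sorted() materializes the listing before any new files are written.
    for image_path in sorted(Path(image_dir).iterdir()):
        if image_path.suffix.lower() not in IMAGE_EXTENSIONS:
            continue
        caption = captions.get(image_path.name)
        if caption is None:
            missing.append(image_path.name)
            continue
        caption_path = image_path.with_suffix(extension)
        caption_path.write_text(caption.strip() + "\n", encoding="utf-8")
    return missing
```

Calling `write_captions("dataset/", {"img001.png": "ohwx, photograph of a woman"})` would create `dataset/img001.txt` and return the names of any images in the folder that still need a caption. The folder and file names here are placeholders.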

BLIP Captioning: A Guide for Creating Captions and Datasets for Stable Diffusion

Discover the power of BLIP Captioning in Kohya_ss GUI! Learn how to generate high-quality captions for images and fine-tune models with this tutorial.

This was a simple introduction to captioning for datasets.
Below is a checklist of the main points we covered, so you can easily review them anytime.

  • Image Resolution and Aspect Ratio

    • Aim for the highest possible resolution fitting 512×512 pixel dimensions.
    • Aspect ratio is not a concern; use bucketing to handle various aspect ratios.
    • Acceptable image formats include square, landscape, etc.
  • Image Quality

    • Prefer sharper images without artifacts over soft images.
  • Image Selection and Preparation

    • Choose subjects with similar-looking faces for character consistency.
    • Include a variety of poses, backgrounds, and clothing for comprehensive training.
    • Avoid images lower than 512×512 pixels unless using tools like Magnific AI for upscaling.
    • Crop images tightly to focus on the subject, leaving a minimal border.
    • Adjust bucketing so that bucket dimensions are multiples of 64 pixels.
  • Captions and Training

    • Captions are crucial for focusing training on relevant image parts.
    • For low-resolution images, use Topaz AI or Magnific AI for upscaling.
    • Avoid describing mutable features (e.g., eye color, facial expressions) to retain character consistency.
    • If not focusing on clothing in training, describe it minimally in captions.
  • Dataset Preparation

    • High-quality input leads to high-quality output.
    • Opt for more quantity and variety where possible.
    • Prioritize quality over quantity.
    • Upscale images only as a last resort.
  • Captioning

    • Don’t rely on automated captioning tools like BLIP and Deepbooru for now; use them as a starting point at most.
    • Manually caption in a style similar to how you prompt.
    • Follow a structured approach for consistency.
    • Use detailed descriptions for elements that are not the main focus.
    • Utilize class tags strategically to bias or de-bias the learning process.
    • Ensure consistency in captioning across the dataset.
    • Avoid repetitive wording to prevent undue weighting.
    • Be mindful of tag ordering in captions.
    • Leverage the model’s existing knowledge base in captioning.
  • Training

    • Experiment with training settings; there’s no one-size-fits-all approach.
    • Trust the training process and the quality of your images.
  • Use Kohya_ss GUI for Captioning and Training
