
AnimateDiff and ControlNet V2V: Your Guide to Generating Videos with Videos

ControlNet V2V, which stands for Video to Video, is the latest craze in AI generative art. When you combine it with AnimateDiff, you can create animations that will make your jaw drop. And the fun doesn’t stop there; we’re just warming up, and this trend is sizzling hot. The animation industry is about to get shaken up, and it’s all happening in the blink of an eye.

I’ve been a 3D animation fan and enthusiast for years. But traditional animation often fails to capture the magic of concept art. With AI, we’re bridging that gap and making concept art come to life, which is something I’ve rarely seen before, and when it did exist, it was very complicated.

But with great power comes great risk. Video to video can also be misused, even fueling fake content online, like people impersonating attractive women on OnlyFans. But let’s not talk about that. Let’s focus on how to use this technology for good. In this guide, we will learn how to generate amazing animations with AnimateDiff and ControlNet V2V.


To use ControlNet V2V, you need to have the following components installed on your system: ControlNet, AnimateDiff extension, and ControlNet Tile model. You can find the installation instructions and links for each of them below.

What is ControlNet V2V Mode?

ControlNet V2V is a mode of ControlNet that lets you use a video to guide your animation. In this mode, each frame of your animation is matched to its own frame from the source video, instead of every frame being guided by the same image. This mode can make your animations smoother and more realistic, but it needs more memory and processing time.

  • Control Weight:

    ControlNet weight is a setting that decides how closely your animation follows the reference image or video. A higher weight means that your animation will stick closer to the reference, while a lower weight gives the model more freedom to be unique. You can change the weight by moving the slider in the WebUI.

  • Starting & Ending Control Step:

    Starting Control Step and Ending Control Step decide when, during the generation of each frame, the ControlNet guidance is applied. Both are fractions of the sampling steps: the Starting Control Step is the point where ControlNet kicks in, and the Ending Control Step is the point where it lets go.

    For example, if you set the Starting Control Step to 0 and the Ending Control Step to 0.5, ControlNet will guide the first half of the sampling steps, and the rest of the generation is left to the plain AnimateDiff output. You can change both values by sliding the sliders in the WebUI; the short sketch after this list makes the timing concrete.

  • Preprocessor Resolution:

    Preprocessor resolution is a setting that changes the quality of the control map that ControlNet makes from your reference image or video. The control map is a simple version of the features and edges of your reference that tells the AnimateDiff model what to do.

    The higher the preprocessor resolution, the more detail the control map will have, but the more time and VRAM it takes to make. The lower the preprocessor resolution, the less detail the control map will have, but the faster and lighter it is to make. You can change the preprocessor resolution by sliding the sliders in the WebUI.
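
To make the Starting and Ending Control Step timing concrete, here is a minimal plain-Python sketch (the function name is my own) that maps those two fractions onto the sampling steps of a single generation, assuming the 0-to-1 fraction behaviour described above.

```python
def controlled_steps(total_steps: int, start: float, end: float) -> range:
    """Return the sampling steps during which ControlNet guidance is active,
    given Starting/Ending Control Step as fractions of the whole run."""
    first = int(round(total_steps * start))
    last = int(round(total_steps * end))
    return range(first, last)

# 20 sampling steps, Starting Control Step 0, Ending Control Step 0.5:
# ControlNet guides steps 0-9, then the model finishes the frame on its own.
print(list(controlled_steps(20, 0.0, 0.5)))

# 20 steps, start 0.2, end 1.0: the first few steps are free, then ControlNet
# locks the composition in for the rest of the run.
print(list(controlled_steps(20, 0.2, 1.0)))
```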

Tile is a ControlNet model that lets you make huge images, and it is very handy if your GPU is not very strong.
You can learn how to use it to upscale images HERE, but if you’re into video-to-video generation, you need to know about Tile/Blur.

Tile/Blur is a feature of ControlNet V2V that lets you decide how sharp the output images are. It feels like it was made for video-to-video generation, and it can really boost the quality of your animations while matching the colors of the original source.

ControlNet V2V uses ControlNet tile, a model that can improve an image by using another image as a guide. For example, you can use ControlNet tile to make a blurry image look clearer and more detailed by using a sharp image as a reference.

The preprocessors

Tile/Blur samples colors very well and stays very accurate to the original source, so you don’t have to worry about losing the original style or mood of your reference. Each of its preprocessors tweaks the input image in a different way:

  • none:

    Use this when you want to keep the input image as it is, and just let ControlNet tile do its magic. This may work well for some images, but not for others. For example, if you are working with anime images, you may get blurry or harmonious images, which means they look too smooth or too much like the guide image.

  • tile_resample:

    Use this when you want to change the size of the input image to match the size of the guide image. This may help ControlNet tile to change the details more precisely, but it may also create some glitches or noise, which means some weird pixels or patterns in the image.

  • tile_colorfix:

    Use this when you want to fix color mismatch issues, which are common with anime images. A color mismatch means that the input image and the guide image have different color shades, which may cause some trouble with ControlNet tile.

    For example, the input image may have more red hues, while the guide image may have more blue hues. This preprocessor can make the input image and the guide image have the same color shades, which may help ControlNet tile to change the details better. But it may also change the original color of the input image, which you may not like.

  • tile_colorfix+Sharp:

    Use this when you want to fix the color mismatch issues and also sharpen the input image. This may help ControlNet tile to change the details better and also make the output image look more clear and realistic. But it may also make the output image look too sharp or fake, which you may not enjoy.

  • blur_gaussian:

    Use this when you want to smooth the input image. This may help you avoid the harmonious effect, which means that the output image looks too much like the guide image and loses its uniqueness. But it may also make the output image look too soft or blurry, which you may not prefer.

The other sliders are specific to each preprocessor. Here are the explanations of the other sliders:

blur_gaussian:

This preprocessor blurs the input image with a Gaussian filter before feeding it into ControlNet tile. This may smooth the output image and avoid the harmonious effect, but it may also lose some details or make the image look too soft. The other sliders for this preprocessor are:

  • Sigma:

    This slider lets you choose the standard deviation of the Gaussian filter. The lower the value, the smaller the filter. The higher the value, the larger the filter. You can use this slider to control the amount of blurring applied to the input image.

    For example, if you set the sigma to a low value, the input image will be less blurred and the output image will be more detailed. If you set the sigma to a high value, the input image will be more blurred and the output image will be more smooth.
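
To get a feel for what the Sigma slider does, here is a small OpenCV sketch that blurs a frame with a Gaussian filter at two different sigmas. It is only my own approximation of the idea, not the extension’s actual code, and the file names are placeholders.

```python
import cv2

frame = cv2.imread("frame_0001.png")  # any frame from your source video

# A small sigma keeps most of the detail; a large sigma smooths it away.
light_blur = cv2.GaussianBlur(frame, (0, 0), sigmaX=2)
heavy_blur = cv2.GaussianBlur(frame, (0, 0), sigmaX=10)

cv2.imwrite("frame_sigma2.png", light_blur)
cv2.imwrite("frame_sigma10.png", heavy_blur)
```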

tile_colorfix:

This preprocessor color-corrects the input image to match the color distribution of the reference image before feeding it into ControlNet tile. This may fix the color offset issues with anime images, but it may also alter the original color of the input image. The other slider for this preprocessor is:

  • Variation:

    This slider lets you choose the amount of variation in the color correction. The lower the value, the more exact the color correction. The higher the value, the more random the color correction. You can use this slider to control the balance between the original color and the reference color in the output image.

    For example, if you set the variation to a low value, the output image will have more of the reference color and less of the original color. If you set the variation to a high value, the output image will have more of the original color and less of the reference color.
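
If the Variation idea feels abstract, the sketch below shows one common, simplified way of matching an input frame’s colors to a reference: shift each channel’s mean and standard deviation. The real tile_colorfix preprocessor is more sophisticated, so treat this purely as an illustration; the file names are placeholders.

```python
import cv2
import numpy as np

def match_color_stats(image: np.ndarray, reference: np.ndarray) -> np.ndarray:
    """Shift each color channel of `image` so its mean/std match `reference`."""
    img = image.astype(np.float32)
    ref = reference.astype(np.float32)
    for c in range(3):
        i_mean, i_std = img[..., c].mean(), img[..., c].std() + 1e-6
        r_mean, r_std = ref[..., c].mean(), ref[..., c].std()
        img[..., c] = (img[..., c] - i_mean) / i_std * r_std + r_mean
    return np.clip(img, 0, 255).astype(np.uint8)

frame = cv2.imread("input_frame.png")
guide = cv2.imread("reference_frame.png")
cv2.imwrite("color_matched.png", match_color_stats(frame, guide))
```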

tile_colorfix+sharp:

This preprocessor color-corrects and sharpens the input image before feeding it into ControlNet tile. This may fix the color offset issues and enhance the details of the output image, but it may also make the image look too sharp or unnatural. The other sliders for this preprocessor are:

  • Variation:

    This slider is the same as the one for tile_colorfix. It lets you choose the amount of variation in the color correction. The lower the value, the more exact the color correction. The higher the value, the more random the color correction. You can use this slider to control the balance between the original color and the reference color in the output image.

  • Sharpness:

    This slider lets you choose the amount of sharpening applied to the input image. The lower the value, the less sharpening. The higher the value, the more sharpening. You can use this slider to control the amount of detail or contrast in the output image. For example, if you set the sharpness to a low value, the output image will be more smooth or soft. If you set the sharpness to a high value, the output image will be more crisp or hard.
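
The Sharpness slider behaves a lot like a classic unsharp mask: blur the image, then add the removed detail back in, scaled by a strength value. Here is a short OpenCV sketch of that idea (my own helper with placeholder file names, not the preprocessor’s exact code).

```python
import cv2

def unsharp(image, strength: float = 0.5, sigma: float = 3.0):
    """Sharpen by adding back the detail that a Gaussian blur removes."""
    blurred = cv2.GaussianBlur(image, (0, 0), sigma)
    # strength 0 leaves the image untouched; larger values look crisper, then harsh.
    return cv2.addWeighted(image, 1 + strength, blurred, -strength, 0)

frame = cv2.imread("input_frame.png")
cv2.imwrite("sharpened.png", unsharp(frame, strength=0.8))
```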

tile_resample:

This preprocessor resamples the input image to match the size of the reference image before feeding it into ControlNet tile. This may improve the quality of the output image, but it may also introduce some artifacts or noise. The other slider for this preprocessor is:

  • Down Sampling Rate:

    This slider lets you choose how much the control image is reduced in size before it is fed into ControlNet tile. Down sampling is a process that makes an image smaller by removing some of its pixels. The lower the value, the fewer pixels are removed and the more detail from your source survives.

    The higher the value, the more pixels are removed and the looser the guidance becomes. You can use this slider to control how much detail from the control image reaches the model: a low rate keeps the output close to the source, while a high rate gives the model more freedom to invent details. The sketch below shows the effect of the factor.
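
Here is a small OpenCV sketch of what that down sampling factor amounts to. The extension handles the resizing internally, so this is only my own illustration with placeholder file names.

```python
import cv2

def downsample(image, rate: float):
    """Shrink the control image by `rate` (2.0 = half the width and height)."""
    h, w = image.shape[:2]
    return cv2.resize(image, (int(w / rate), int(h / rate)),
                      interpolation=cv2.INTER_AREA)

control = cv2.imread("control_frame.png")
cv2.imwrite("control_rate2.png", downsample(control, 2.0))  # keeps more detail
cv2.imwrite("control_rate8.png", downsample(control, 8.0))  # much looser guidance
```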

To wrap up, the preprocessors are different ways of tweaking the input image for ControlNet tile, which is a model that can change the details of an image by using another image as a guide. The preprocessors can help you make the output image better, depending on what kind of images you are using. You can pick from five preprocessors: none, tile_resample, tile_colorfix, tile_colorfix+sharp, and blur_gaussian.

Each preprocessor has its own sliders that let you change different things about the input image, like the size, the color, and the sharpness. You can also use the Tile/Blur strength slider to decide how sharp the output image is.

When you use ControlNet v2v to make videos with videos, you should try different preprocessors and sliders to see what suits your project. There is no one best way, as different images may need different settings. The goal is to make realistic and cool animations that match your vision.

Context Batch Size is a setting that decides how many frames from your video are passed to the motion module at once while the new video is being made. The frames carry the movement of the video, and the size of this window affects how smooth and consistent the result looks. But using a bigger window also takes more time and memory; a small sketch after the list below shows how a clip is covered by these windows.

The best number of frames per context depends on the type of motion module you are using. The motion module is the part of AnimateDiff that generates the movement, and there are several different ones.

Each type of motion module works best with a certain number of frames. For example, SD1.5 works best with 16 frames, and SDXL and HotShotXL work best with 8 frames. You can choose the number of frames yourself, or you can let the program choose it for you. The program will try to find the best number of frames for your video. Below is a test of Context Batch Size using SD 1.5

  • A low Context Batch Size can make the new video look choppy and inconsistent.

    This means that the movement in the new video will not be smooth and natural, and it may change suddenly or randomly. For example, if you use a low Context Batch Size to make a video of a person dancing, the person may look like they are skipping or glitching, and their dance moves may not match the music.

    A low Context Batch Size can also make the new video look blurry or noisy. This means that the details and colors in the new video will not be clear and sharp, and they may have unwanted dots or lines.

    For example, if you use a low Context Batch Size to make a video of a landscape, the trees and mountains may look fuzzy and distorted, and they may have artifacts or noise. A low Context Batch Size can make the new video faster to make, but it will also make it lower in quality.

  • A high Context Batch Size can make the new video look smooth and consistent.

    This means that the movement in the new video will be fluid and natural, and it will match the input video. For example, if you use a high Context Batch Size to make a video of a person dancing, the person will look like they are moving gracefully and rhythmically, and their dance moves will match the music. A high Context Batch Size can also make the new video look sharp and clean.

    This means that the details and colors in the new video will be clear and vivid, and they will not have any unwanted dots or lines.

    For example, if you use a high Context Batch Size to make a video of a landscape, the trees and mountains will look crisp and realistic, and they will not have any artifacts or noise. A high Context Batch Size can make the new video better in quality, but it will also make it slower to make.
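
To see how the Context Batch Size covers a whole clip, here is a simplified sketch of a sliding window over the frames. The real AnimateDiff scheduler overlaps and blends its windows in a more sophisticated way, so treat this as an illustration only; the function name and the overlap value are my own.

```python
def context_windows(total_frames: int, batch_size: int, overlap: int = 4):
    """Split `total_frames` into overlapping windows of `batch_size` frames."""
    windows, start = [], 0
    step = max(batch_size - overlap, 1)
    while start < total_frames:
        windows.append(list(range(start, min(start + batch_size, total_frames))))
        start += step
    return windows

# A 113-frame clip with the SD1.5-friendly batch size of 16:
for window in context_windows(113, 16):
    print(window[0], "to", window[-1])
```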

This guide will show you how to use ControlNet v2v with four different methods. The first method is to upload a single image in ControlNet’s ‘Single Image’ tab and use it to animate a video. The second method is to upload a video in the ‘Video Source’ tab and use it as the control image for another video. The third method is to provide a video path in the ‘Video Path’ tab and use it as the control image for another video.

The fourth method is to upload a folder of image sequences in the ‘Batch’ tab and use them as the control images for another video. For this method, you need to input a directory of the folder that contains your image sequences.

  • Image to Video Using ControlNet’s ‘Single Image’ Upload

  • Video to Video Using ControlNet’s ‘Video Source’ Upload

  • Video to Video Using ControlNet’s ‘Video Path’ Upload

  • Video Sequence to Video Sequence Using ControlNet’s ‘Batch’

In this tutorial, we will use the prompts and settings below to create a realistic animation of Gil Devera, a friend of mine and a professional bodybuilder and fitness coach for Dynamic G Fitness. I will use the majicMIX lux fine-tuned model by Merjic, which you can find below.

Different models have different strengths and weaknesses. I had to try many models before I found one that works well with animation. This model is made by Merjic, who is one of my favorite creators.

You can use my prompts and settings, or you can use your own. Just remember to keep your prompts within 75 tokens, as shown in the link below, and to use short negative prompts to avoid changing the scenes.

Prompts:
(best quality, masterpiece, 1man, shirtless body builder, jeans, highest detailed)full body photo, ultra detailed, (textured_clothing), black_background, (intricate details, hyperdetailed:1.15), detailed, (official art, extreme detailed, highest detailed),
Negative Prompts:
EasyNegative, bad-hands-5
The negative prompts are based on embeddings, which you can download by clicking on them.
Model: majicMIX lux (Click to Download)
Seed: 862816124
Sampler: Euler A or DPM2
CFG Scale: 6
Steps: 20
Clip Skip: 2
Height: 768
Width: 768
Download Video Source
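
If you would rather drive these settings from a script than from the WebUI, below is a minimal sketch of the AUTOMATIC1111 txt2img API call with the values above. It assumes the WebUI is running locally with the --api flag at the default address. The AnimateDiff and ControlNet parts are set up in their extension panels (or via their own alwayson_scripts payloads, whose exact schema depends on your extension versions), so they are left out here.

```python
import requests

payload = {
    "prompt": "(best quality, masterpiece, 1man, shirtless body builder, jeans, "
              "highest detailed) full body photo, ultra detailed, (textured_clothing), "
              "black_background, (intricate details, hyperdetailed:1.15), detailed, "
              "(official art, extreme detailed, highest detailed)",
    "negative_prompt": "EasyNegative, bad-hands-5",
    "seed": 862816124,
    "sampler_name": "Euler a",
    "cfg_scale": 6,
    "steps": 20,
    "width": 768,
    "height": 768,
    "override_settings": {"CLIP_stop_at_last_layers": 2},  # Clip Skip 2
}

# Assumes the WebUI was launched with --api enabled.
response = requests.post("http://127.0.0.1:7860/sdapi/v1/txt2img", json=payload)
response.raise_for_status()
images = response.json()["images"]  # base64-encoded results
```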

Many issues with AnimateDiff stem from the prompts you use. To understand and solve them, you need to know how the prompts affect the generation process. Read this guide to learn more.

In this method, you will learn how to upload a single image in ControlNet’s ‘Single Image’ tab and use it to animate a video. This is useful when you want to create a simple animation from a still image, but be aware that using a single picture as a reference is a very strong signal that will affect all frames of your animation with the same weight.

This means that your animation will look very similar to the picture, and it will be hard to achieve significant motion or variation. An advanced weight update is planned, which means that in the future you will be able to adjust the weight for each frame individually.

To create an animation from a single image, follow these steps:

  • Copy the image prompt from the options I gave you or write your own.

  • Go to the AnimateDiff tab and turn it on.

  • Set the Number of Frames to 32 and the FPS to 8.

  • Go to the ControlNet tab and turn it on. Then, click on Pixel perfect.

  • Choose OpenPose as the ControlNet model.

    (You can try other models later.)

  • Choose Balanced as the Control Mode.

    (This mode balances the effect of the image prompt and the ControlNet model. You can experiment with other modes to see how they differ.)

  • Click Generate to see your animation. It will resemble the pose of the source image.

  • To add some variation to the animation, change the Control Weight to 0.3.

    This will reduce the influence of the ControlNet source on the animation.

  • The Start/End Control Step options are not very useful for single image animations.

    You can adjust them to see if they have any impact.

I experimented with the Control Weight parameter for this animation. I set it to 0 and then increased it to 0.3 to allow some variation from the original pose. This way, the animation was not too rigid or static.

I also tried to modify the weight at different times of the animation using the Starting Control Step and Ending Control Step options. However, these did not seem to change the animation much, since they control when ControlNet applies during each frame’s sampling steps rather than which frames it applies to. Per-frame control will have to wait for the advanced weight update feature mentioned in the first paragraph, which is not available yet. I hope this feature will be added soon, as it would enable more control and flexibility over the animation. Please let me know if you find a way to use these options effectively.

In this method, you will learn how to upload a video in the ‘Video Source’ tab and use it as the control image for another video. This is useful when you want to transfer the motion and style from one video to another, such as a dance or a speech.

Using Photoshop, I merged 3 images of Gil Devera in different poses to make a short video. I stretched the 3 frames to last 3 seconds, which gave me a 113-frame video at 30 FPS. This is the source video that I will use here.

To create an animation from a source video follow these steps:

  • Copy the image prompt from the options I gave you or write your own.

  • Go to the AnimateDiff tab and turn it on.

  • Upload your source video in the Video Source uploader.

  • You don’t need to set the Number of frames and FPS.

    AnimateDiff will automatically detect them for you.

  • Go to the ControlNet tab and turn it on. Then, click on Pixel perfect.

  • Choose OpenPose as the ControlNet model. (You can try other models later.)

  • Choose Balanced as the Control Mode. (This mode balances the effect of the image prompt and the ControlNet model. You can experiment with other modes to see how they differ.)

  • Click Generate to see your animation. It will resemble the poses in the source video.

In this video, I am comparing two different preprocessors: Soft Edge and OpenPose. They have different effects on the edges of the objects in the scene. I will explain how they work and what are their advantages and disadvantages.

Soft Edge is a preprocessor that traces every edge on screen with soft boundaries. I like this because it matches my style, but it also captures some background elements that I may not want to have edges. For example, you can see the edges of the wall and the floor behind me.

OpenPose is a preprocessor that only detects the skeletal pose of the human body.

It is good for avoiding unwanted background edges, but it also creates some flickering and changing effects on the background. This can be fixed by using different prompts, or by combining multiple ControlNet preprocessors to control the scene better.

The model that I am using for this demonstration may not be the best one for this task, so you may want to experiment with other models as well. If you want to learn more about the ControlNet preprocessor and how it works, you can check out my blog post where I explain it in detail.

Understanding the ControlNet Preprocessors (Coming Soon)


In this method, you will learn how to provide a video path in the ‘Video Path’ tab and use it as the control image for another video. This is useful when you want to use a video that is stored online or on your computer as the control image, but this comes with a catch.

A more precise title for this section would be Image Sequence to Video Using a Video Path, since you need to convert your video to an image sequence before using it as the control image. You can use any video that you have access to, either online or on your computer, as long as you provide its path in the ‘Video Path’ tab. I will explain how to create an image sequence with Photoshop in the next section. Please follow the steps there before proceeding. If you already have your video source, then you can skip this step.

How to Create an Image Sequence from your videos:

  • Open Photoshop and go to File > Export > Render to Video.

  • Under Name, enter a short and memorable name for your video.

  • Choose the folder where you want to save the image sequence.

  • Under Output, select Photoshop Image Sequence as the format.

  • Under Size, set the width and height to 512×512 pixels.

    This will reduce the processing time and memory usage.

  • Under Frame Rate, you can keep the default value or adjust it according to your preference.

    A higher frame rate will result in more images and longer rendering time.

  • Click Render to start the conversion process.

  • Wait for the image sequence to be generated.

    You can check the progress in the status bar.
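
If you do not have Photoshop, the same kind of image sequence can be produced with a few lines of Python and OpenCV. This is a minimal sketch under my own file and folder names: point it at your clip and it writes numbered 512×512 PNG frames into a folder that you can later use as the video path or batch input directory.

```python
import os
import cv2

video_path = "source_clip.mp4"   # your video file
output_dir = "image_sequence"    # the folder you will point AnimateDiff/ControlNet at
os.makedirs(output_dir, exist_ok=True)

cap = cv2.VideoCapture(video_path)
index = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break  # end of the clip
    frame = cv2.resize(frame, (512, 512), interpolation=cv2.INTER_AREA)
    cv2.imwrite(os.path.join(output_dir, f"frame_{index:05d}.png"), frame)
    index += 1
cap.release()
print(f"Wrote {index} frames to {output_dir}")
```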

To create an animation from a video path follow these steps:

  • Copy the image prompt from the options I gave you or write your own.

  • Go to the AnimateDiff tab and turn it on.

    You don’t need to change the settings, as they depend on the number of images in your image sequence.

  • In the AnimateDiff control panel:

    Go to Video Path and paste the path of the folder on your local drive that holds your image sequence (see the small check script after these steps).

  • My path looks like this:

    C:\Users\AndyH\Desktop\Julie

  • Go to the ControlNet tab and turn it on.

    Then, click on Pixel perfect.

  • Choose a different preprocessor to use.

    I will use sketch as the ControlNet model. (You can try other models later.)

  • Choose Balanced as the Control Mode.

    (This mode balances the effect of the image prompt and the ControlNet model. You can experiment with other modes to see how they differ.)

  • Click Generate to see your animation.

    It will resemble the pose of the source image.
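
Before you paste a folder path into Video Path, it can save a failed run to confirm that the folder really contains a numbered image sequence. Here is a tiny sketch (my own helper, not part of the extension) that lists what AnimateDiff would find there, using my path from above as an example.

```python
import os

sequence_dir = r"C:\Users\AndyH\Desktop\Julie"  # the folder you plan to paste as the Video Path

frames = sorted(f for f in os.listdir(sequence_dir)
                if f.lower().endswith((".png", ".jpg", ".jpeg")))

print(f"Found {len(frames)} frames")
print("First:", frames[0] if frames else "none")
print("Last: ", frames[-1] if frames else "none")
```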

These techniques could be useful for restoring images or retouching videos to age or de-age them. I will write more about this in another blog.

In this method, you will learn how to upload a folder of image sequences in the ‘Batch’ tab with the Input Directory and use them as the control images for another video. This is useful when you want to use multiple images as the control images for different frames of the output video, such as a slideshow or a comic. Essentially this will just be like the tutorial above, but this time you’re using ControlNet’s Batch tab to source your video.

To create an animation from video batch path follow these steps:

  • Follow the steps in the previous section to create your image sequence.

  • Copy the image prompt from the options I gave you or write your own.

  • Go to the AnimateDiff tab and turn it on.

    You don’t need to change the settings, as they depend on the number of images in your image sequence.

  • Go to ControlNet, then switch to the Batch tab.

  • In the Input Directory field, paste the path of the folder on your local drive that holds your image sequence.

    It has to be a folder of images, not a video file.

  • My path looks like this: C:\Users\AndyH\Desktop\Julie

  • Make sure ControlNet is enabled.

    Then, click on Pixel perfect.

  • Choose a different preprocessor to use.

    I will use sketch as the ControlNet model. (You can try other models later.)

  • Choose Balanced as the Control Mode.

    (This mode balances the effect of the image prompt and the ControlNet model. You can experiment with other modes to see how they differ.)

  • Click Generate to see your animation.

    It will resemble the pose of the source image.

I wanted to see how different ControlNet settings would affect the animations created by AnimateDiff, using a video path as the control image. I used the same image prompts as before, but I used some new video sources of Julie. The image prompts were of a male body builder, so the animations looked weird with Julie’s videos. Some of the results were funny and unrealistic, but also interesting and creative.

I tested different Control types and Preprocessors, using the default Preprocessor for each Control type:

  • Tile/Blur:

    This Control type copied the environment and the color of the scene very well. It also copied some of the elements from the original source, not just the pose.

  • Lineart:

    This Control type made the animation look like a painting. It captured the pose and the movements, but it lost the color accuracy.

  • Depth:

    This Control type showed the depth of field, but it ignored a lot of fine details.

  • Soft Edge:

    This is my favorite Control type, as it traces edges like the Canny models do but keeps them soft. It made the animation look more dreamy, but it didn’t capture the original colors of the video source.

  • Multi-ControlNet Units:

    I used multi ControlNet to add another ControlNet unit using Tile/Blur, to get both the lines and the color samples. When I added a third ControlNet unit, the animation looked very similar to the original source, but very blurry. This may be because of the low resolution Preprocessor and the low resolution source image, combined with the Soft Edge Preprocessor.

Final Thoughts

I have always loved animation, and I have seen how it has changed over the years. But I always felt that there was something missing, something that polygons and rigging could not capture when it came to videogames. I wanted to see animations that looked like concept art, that had the same level of detail and expression without the polygon overlaps and hard deformation in skin wrinkles. And I think AI is the way to achieve that. With AnimateDiff and ControlNet V2V, I can create animations that look like moving concept art.

I can also change the motion and style of any video, which is super cool. This is a new kind of animation, and it will change the industry forever.

But AnimateDiff in Automatic1111 is not very good right now. It’s very GPU hungry, so you tend to crash a lot or the results are not consistent. I have an RTX 4090 and it still doesn’t work properly. That’s why many people prefer using ComfyUI. There are many workflows available on ComfyUI that are shared and used by many. It requires less GPU power and runs much better.

It will give you much more consistent and smoother results than what you can get in A1111. If you’re looking to really try out AnimateDiff, you should be looking at ComfyUI right now. I will be covering it in the future because I love what you can do with AnimateDiff.

I hope the extensions for A1111 get more development in the future, because I prefer A1111’s simple UI. You don’t have to see the back-end stuff and, well, I just don’t like nodes!! They look like spaghetti noodles and they make me hungry!!


