[ComfyUI] FramePack First-and-Last-Frame Workflow: Generating Smooth AI Videos with Precise Control Over the Video Creation Process

What is FramePack

In the early days of AI generation applications, the most popular tools were ChatGPT, Midjourney, and Stable Diffusion. The latter two are known for their image generation capabilities, with Stable Diffusion standing out in particular thanks to its open-source nature, which means it can theoretically be used for free.

However, when Stable Diffusion was first open-sourced, it didn’t gain much traction. The generated images were unpredictable, making it less convenient than ChatGPT, and the quality was far inferior to Midjourney. So how did Stable Diffusion rise to prominence and become one of the three leading AI image generation tools alongside ChatGPT and Midjourney?

Then Lvmin Zhang developed a plugin for Stable Diffusion called ControlNet, which for the first time enabled precise control over the image generation process using semantic conditions (such as skeletal poses and depth maps), making Stable Diffusion the best AI image generation tool for precise control at the time.

In 2025, this genius proved himself once again by open-sourcing the video generation model FramePack. Using an efficient neural network structure, it performs frame-by-frame (or section-by-section) prediction, using the previous few frames to predict the ones that follow. This makes it easy to generate 1-minute-long videos and, in theory, even infinitely long ones. Just as importantly, this innovation allows large numbers of frames to be processed on ordinary laptop GPUs, even older laptops with 6GB of VRAM, without performance issues.

A next-frame prediction neural network structure generates the video step by step.

An innovative tree-based inference structure compresses the input context to a constant length, so the generation workload does not grow with video length (a conceptual code sketch of this idea appears below).

A drift-free sampling method effectively addresses quality degradation issues in video generation.
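To make the constant-length context idea above concrete, here is a minimal conceptual sketch. This is not FramePack's actual code; the function names, token budgets, and pooling scheme are purely illustrative. The point is that older frames are compressed more aggressively than recent ones, so the total context size stays bounded no matter how long the history grows:

```python
# Conceptual sketch only, not FramePack's implementation: pack a frame history
# of arbitrary length into a roughly constant-size context by compressing
# older frames more heavily than recent ones.
import numpy as np

def compress(frame, n_tokens):
    """Toy stand-in for patchifying: average-pool a frame down to n_tokens values."""
    flat = frame.reshape(-1)
    edges = np.linspace(0, flat.size, n_tokens + 1, dtype=int)
    return np.array([flat[a:b].mean() for a, b in zip(edges[:-1], edges[1:])])

def pack_context(frames, base_tokens=1024):
    """The newest frame keeps the full token budget; each older frame gets half as many.

    The geometric series 1024 + 512 + 256 + ... is bounded by about 2 * base_tokens,
    so the context length (and hence the cost of predicting the next frame section)
    stays roughly constant regardless of how many frames came before.
    """
    packed = []
    for age, frame in enumerate(reversed(frames)):   # age 0 = most recent frame
        budget = max(base_tokens // (2 ** age), 1)
        packed.append(compress(frame, budget))
    return np.concatenate(packed)

# Whether the history holds 16 frames or 900, the packed context stays bounded,
# which is why generation does not slow down as the video gets longer.
history = [np.random.rand(64, 64, 3).astype(np.float32) for _ in range(16)]
print(pack_context(history).shape)
```

The next frame section is then predicted from this fixed-size context, and the newly generated frames are appended to the history, which is repacked for the following step.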

FramePack advantages

Its advantages include:

High-efficiency processing of large numbers of frames: Even consumer-grade laptop GPUs (including the RTX 30XX, 40XX, and 50XX series) can handle long frame sequences and run a 13B model.

Batch training: Similar to image diffusion training, it uses large batch sizes to improve training efficiency.

Frame-by-frame feedback: Because prediction proceeds frame by frame (or section by section), users can see generated frames in real time before the entire video is finished, providing immediate visual feedback.

Low memory requirements: Generating a 1-minute (60-second) video requires as little as 6GB of GPU VRAM.

Videos generated with FramePack can be up to 1,000 frames long at 30 frames per second, with extremely high consistency between frames, smooth and natural motion, and none of the drifting, flickering, or character deformation issues seen with traditional methods.

Most excitingly, despite the high quality of videos generated by FramePack, the hardware requirements are not high, with a minimum of 6GB of VRAM required.

Currently, it supports 30-series, 40-series, and 50-series graphics cards; users with 20-series cards who are interested are encouraged to test on their own. First, take a look at the sample video results.

Below is an introduction to using FramePack to generate first-and-last-frame videos.

Upgrade the plugin

Before using the first-and-last-frame feature, you need to upgrade kijai's FramePackWrapper plugin to the latest version. You can use the plugin update feature in ComfyUI or Hui Shi, or use the git clone command.
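If you go the git clone route, the following is a minimal sketch of the idea. Note the assumptions: the repository is assumed to be kijai/ComfyUI-FramePackWrapper (verify the exact name on the plugin's GitHub page), and ComfyUI is assumed to live at ./ComfyUI; adjust both for your setup.

```python
# Minimal sketch: clone (or update) the FramePackWrapper plugin into custom_nodes.
# The repository URL and the ComfyUI path are assumptions; adjust for your install.
import subprocess
from pathlib import Path

custom_nodes = Path("ComfyUI") / "custom_nodes"              # your ComfyUI install
plugin_dir = custom_nodes / "ComfyUI-FramePackWrapper"
repo_url = "https://github.com/kijai/ComfyUI-FramePackWrapper.git"  # assumed URL

if plugin_dir.exists():
    # Already installed: pull the latest version.
    subprocess.run(["git", "pull"], cwd=plugin_dir, check=True)
else:
    # Fresh install: clone the repository into custom_nodes.
    subprocess.run(["git", "clone", repo_url], cwd=custom_nodes, check=True)
```

After cloning or pulling, restart ComfyUI so the updated nodes are loaded.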

Here, we will use the simplest method: download the plugin from the address below and place it in the custom_nodes directory of your local ComfyUI installation.

GitHub

Note: After extraction, make sure the folder name does not include the "-main" suffix that GitHub appends to downloaded archives.

Open the workflow

After updating the plugin, navigate to the example_workflows directory under the plugin and open the workflow.

Workflow name: framepack_hv_example.json

After opening the workflow, you will see a start image and an end image. Add the start and end images here to control the first and last frames of the video.

Create the start and end frame images

We will use Google's ImageFX to create the start and end frame images, as it excels at maintaining character consistency.

First, provide a prompt to generate an image of a Chinese girl wearing a sleeveless yellow sweater.

After the character is set, start creating the first-frame image. To maintain character consistency, lock the seed, then provide a prompt for the girl reading a book in a library. Select an appropriate result from the generated images as the first frame.

Next, change the scene prompt to “woman standing by the window” and select an image from the generated images as the last frame image.

Generate the video

Drag the first and last frame images into the workflow, with the prompt “Girl puts down the book and walks to the window.”

Although both images show a girl in a library, the scenes are quite different. Let’s run it and see if FramePack can generate a complete video from the two images.

The results came back quickly. Because the backgrounds of the two images differ significantly, the video frames are not seamless: FramePack simply transitioned between the two images automatically.

I then changed the prompt to: "The girl puts down her book, leaves the desk, and walks to the window," emphasizing the action of leaving the desk and walking to the window. Because the scenes differ so much and the original clip did not leave enough time for the motion, I also increased the video duration to 10 seconds.

The final video still showed transitions between the two scenes.

Therefore, when creating first-and-last-frame videos, it is best to use two images with the same background, such as the two images below. The prompt I provided was also simple: "Drinking coffee."

The resulting video meets our expectations, and the motion is much smoother and more natural. However, note that the backgrounds of the two images I provided are not completely identical, so the generated video still shows a slight frame jump. This is something to keep in mind when creating videos in the future.
