Turning Images into Videos: A Programmable and Cost-Effective Approach
Introduction
With the help of AI models, we can now turn a prompt or an image into a video using so-called “text2video” or “image2video” technology. However, this type of video creation can be expensive and time-consuming. What if you could programmatically turn a static image into an engaging video clip at a fraction of the cost? In this article, I’ll show you how I built a cost-effective Image2Video open-source tool that converts images into short video clips (5 seconds for roughly $0.20) using programmable APIs. This approach cuts costs and offers flexibility and efficiency, making it ideal for social media content creation.
Cost-Effective Video Generation
I have been researching image2video models and APIs for a while. As of today, there are two types of API services to choose from. One is the monthly subscription mode, in which you pay a monthly fee to receive credits and then consume them by making API calls. The other is the pay-as-you-go mode, in which you either pay for the API calls made during a billing period or prepay for credits and spend them as you call the APIs.
The monthly subscription mode usually costs more. For example, the Runway Gen-3 Alpha Turbo model offers a $28 monthly plan that works out to $0.31 for 5 seconds of video. Another example is the Kling AI v1.6 model, whose $1400 monthly plan works out to $0.28 for 5 seconds of video.
If you are working on a project that starts small and gradually grows, you will likely choose the pay-as-you-go mode, as I do. The Kling models hosted by Pi API cost $0.13 per 5 seconds of video, with a minimum of $5 charged as credits to run the APIs. The Kling-v1.6-standard model hosted by the Replicate API charges $0.28 per 5 seconds of video without requiring any upfront payment. The Stable Video Diffusion model hosted by Stability AI charges $0.20 per 5 seconds of video, with a $10 prepayment for 1,000 credits (4 credits per second).
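To put these pay-as-you-go numbers side by side, here is a minimal sketch that computes the cost per clip and per 100 clips for each option; the prices are simply the figures quoted above and may change over time.

# Rough cost comparison for the pay-as-you-go options quoted above.
# Prices are per 5-second clip and may change; check each provider for current rates.
PRICES_PER_CLIP = {
    "Kling via Pi API": 0.13,
    "Kling-v1.6-standard via Replicate": 0.28,
    "Stable Video Diffusion via Stability AI": 0.20,  # 20 credits at $0.01 per credit
}

def cost_for_clips(price_per_clip: float, num_clips: int) -> float:
    return price_per_clip * num_clips

for name, price in PRICES_PER_CLIP.items():
    print(f"{name}: ${price:.2f} per clip, ${cost_for_clips(price, 100):.2f} for 100 clips")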
In my open-source tool, I integrated the Pi API, Replicate API, and Stability API to provide a balance between quality and cost — empowering creators to produce high-quality videos without breaking the bank.
Programmable Video Generation: Efficiency & Flexibility
Now let’s look at the code for these three APIs. Each integration has two parts: submitting a request with the image and a prompt, and polling the request status to retrieve the video file once the remote server has finished processing it.
# Pi API code
import http.client
import json
import os
import time

import requests

api_key = os.getenv("PI_API_KEY")


def image2video(image_url: str, prompt: str, output_path: str) -> None:
    conn = http.client.HTTPSConnection("api.piapi.ai")

    def get_task(task_id: str) -> dict:
        # Fetch the current status of a submitted task.
        headers = {"x-api-key": api_key}
        conn.request("GET", f"/api/v1/task/{task_id}", headers=headers)
        res = conn.getresponse()
        data = res.read().decode("utf-8")
        return json.loads(data)

    def download_video(task: dict) -> None:
        # Download the watermark-free video referenced by the completed task.
        video_url = task["data"]["output"]["works"][0]["video"]["resource_without_watermark"]
        response = requests.get(video_url)
        response.raise_for_status()
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        with open(output_path, "wb") as f:
            f.write(response.content)

    # Submit the video-generation task.
    payload = json.dumps({
        "model": "kling",
        "task_type": "video_generation",
        "input": {
            "image_url": image_url,
            "prompt": prompt,
            "negative_prompt": "distort the image, show anything that is not in the image, like human hand or fingers.",
            "cfg_scale": 0.5,
            "duration": 5,
            "aspect_ratio": "9:16",
            "camera_control": {
                "type": "simple",
                "config": {
                    "horizontal": 0,
                    "vertical": 0,
                    "pan": -10,
                    "tilt": 0,
                    "roll": 0,
                    "zoom": 0,
                }
            },
            "mode": "std",
            "version": "1.6"
        },
        "config": {
            "service_mode": "",
            "webhook_config": {"endpoint": "", "secret": ""}
        }
    })
    headers = {"x-api-key": api_key, "Content-Type": "application/json"}
    conn.request("POST", "/api/v1/task", payload, headers)
    res = conn.getresponse()
    data = res.read().decode("utf-8")
    task_id = json.loads(data)["task_id"]

    # Poll every 15 seconds until the task completes, fails, or the timeout expires.
    timeout = 600
    task = get_task(task_id)
    while timeout > 0 and task.get("data", {}).get("status") not in ["Completed", "Failed"]:
        time.sleep(15)
        timeout -= 15
        task = get_task(task_id)
    if task.get("data", {}).get("status") == "Completed":
        download_video(task)
# Replicate API code
import os
import time

import replicate
import requests

api_token = os.environ["REPLICATE_API_TOKEN"]


def image2video(image_path: str, prompt: str, output_path: str) -> None:
    # Submit the prediction with the start image and prompt.
    with open(image_path, "rb") as image_file:
        prediction = replicate.predictions.create(
            model="kwaivgi/kling-v1.6-standard",
            input={
                "prompt": prompt,
                "duration": 5,
                "cfg_scale": 0.5,
                "start_image": image_file,
                "aspect_ratio": "9:16",
                "negative_prompt": (
                    "distort the image, show anything that is not in the image, "
                    "like human hand or fingers."
                )
            }
        )

    # Poll every 10 seconds until the prediction finishes or the timeout expires.
    timeout_seconds = 600
    while prediction.status not in {"succeeded", "failed", "canceled"} and timeout_seconds > 0:
        time.sleep(10)
        timeout_seconds -= 10
        prediction.reload()

    if prediction.status == "succeeded":
        # Download the generated video to the requested path.
        video_url = prediction.output
        os.makedirs(os.path.dirname(output_path), exist_ok=True)
        response = requests.get(video_url)
        with open(output_path, "wb") as f:
            f.write(response.content)
# Stability AI API code
import os
import time

import requests

api_key = os.environ["STABILITY_AI_API_KEY"]


def get_generation_status(generation_id: str, api_key: str) -> requests.Response:
    # Poll the result endpoint; a 200 response carries the finished video bytes.
    url = f"https://api.stability.ai/v2beta/image-to-video/result/{generation_id}"
    return requests.get(
        url,
        headers={
            "accept": "video/*",
            "authorization": api_key,
        }
    )


def write_video_file(content: bytes, output_path: str) -> None:
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    with open(output_path, "wb") as f:
        f.write(content)


def image2video(image_path: str, prompt: str, output_path: str, api_key: str) -> None:
    # Submit the image-to-video generation request.
    post_url = "https://api.stability.ai/v2beta/image-to-video"
    with open(image_path, "rb") as image_file:
        response = requests.post(
            post_url,
            headers={"authorization": api_key},
            files={"image": image_file},
            data={
                "seed": 0,
                "cfg_scale": 1.8,
                "motion_bucket_id": 127,
                "prompt": prompt,
            },
        )
    response_json = response.json()
    generation_id = response_json.get("id")

    # Poll every 15 seconds until the video is ready or the timeout expires.
    timeout = 600
    wait_interval = 15
    status_response = get_generation_status(generation_id, api_key)
    while status_response.status_code != 200 and timeout > 0:
        time.sleep(wait_interval)
        timeout -= wait_interval
        status_response = get_generation_status(generation_id, api_key)
    if status_response.status_code == 200:
        write_video_file(status_response.content, output_path)
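All three integrations follow the same submit-then-poll pattern, so the tool can hide them behind a single entry point. Below is a minimal dispatcher sketch; the module names pi_api, replicate_api, and stability_api are hypothetical placeholders for the three image2video functions above, not the repository's actual layout.

# Hypothetical dispatcher sketch: route a request to one of the three providers.
# The module names below are placeholders for the three image2video functions
# shown above; the actual repository may organize them differently.
import os
from typing import Callable

import pi_api          # assumed to expose image2video(image_url, prompt, output_path)
import replicate_api   # assumed to expose image2video(image_path, prompt, output_path)
import stability_api   # assumed to expose image2video(image_path, prompt, output_path, api_key)

PROVIDERS: dict[str, Callable[..., None]] = {
    "pi": pi_api.image2video,
    "replicate": replicate_api.image2video,
    "stability": stability_api.image2video,
}

def generate_video(provider: str, image: str, prompt: str, output_path: str) -> None:
    # Pick the provider-specific function and forward the arguments it expects.
    if provider not in PROVIDERS:
        raise ValueError(f"Unknown provider: {provider!r}")
    if provider == "stability":
        PROVIDERS[provider](image, prompt, output_path, os.environ["STABILITY_AI_API_KEY"])
    else:
        PROVIDERS[provider](image, prompt, output_path)

# Example usage:
# generate_video("replicate", "input.jpg", "Make the river flow slowly.", "output/video.mp4")

Note that the Pi API version takes a public image URL while the other two take a local file path, so a real dispatcher would also need to bridge that difference, for example by uploading the file first.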
Using Gradio UI for Video Generation
You can also run the Gradio UI integrated into the tool to quickly turn any image into a video by following the steps below:
1. Download the code and run the Gradio app:
git clone https://github.com/vincent-linktime/tools.git
cd tools
conda create -n tools python=3.11
conda activate tools
pip install -r requirements.txt
pip install -e .
export REPLICATE_API_TOKEN=your_replicate_api_token
export STABILITY_AI_API_KEY=your_stability_ai_api_key
export PI_API_KEY=your_pi_api_key
python -m tools.image2video.app
2. Upload an Image: Upload an image via a Gradio interface.
3. Input a Prompt: Provide a custom caption or prompt that guides the video conversion process.
4. Select a Model: Choose either the Replicate or Stability model. Note that the Stability model accepts only images with one of the following dimensions: 1024x576, 576x1024, or 768x768.
5. Conversion Process: When you click “Run,” the tool processes the image and prompt to generate a video, updating you with a “Processing…” message along the way.
6. View the Result: The generated video is displayed for you to review once the conversion is complete.
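If you are curious what the UI code roughly looks like, here is an illustrative Gradio sketch. It is not the repository’s actual app.py; it assumes a generate_video helper like the dispatcher sketched earlier and stubs it out so the snippet stays self-contained.

# Illustrative Gradio UI sketch (not the repository's actual app.py).
import gradio as gr

def generate_video(provider: str, image_path: str, prompt: str, output_path: str) -> None:
    # Stub: wire this to one of the image2video functions (or a dispatcher) above.
    raise NotImplementedError("Hook up an image2video implementation here.")

def run(image_path: str, prompt: str, provider: str) -> str:
    # Generate the clip and return its path so Gradio can display it.
    output_path = "output/video.mp4"
    generate_video(provider.lower(), image_path, prompt, output_path)
    return output_path

demo = gr.Interface(
    fn=run,
    inputs=[
        gr.Image(type="filepath", label="Input image"),
        gr.Textbox(label="Prompt"),
        gr.Dropdown(["Replicate", "Stability"], value="Replicate", label="Model"),
    ],
    outputs=gr.Video(label="Generated video"),
    title="Image2Video",
)

if __name__ == "__main__":
    demo.launch()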
How Does the Image2Video Model Work?
To understand how a single photo and a simple description can be turned into a video, we need to look at the whole process:
- Step 1: Input Submission
You start the process by providing a still image along with a text prompt. The prompt describes the kind of movement or transformation you expect, for example, “Make this river flow slowly.” This initial input is the foundation for the transformation.
- Step 2: AI Interpretation
The Image2Video model first studies your image, identifying key visual details such as colors, shapes, and textures. It then uses your prompt to determine how these elements should change over time. You can think of the model as planning a series of small changes based on your request, like sketching a rough storyboard.
- Step 3: Frame Generation
Once the model understands your image and prompt, it creates a sequence of frames in which each frame is a slightly modified version of the previous one, simulating a smooth transition from a static image to dynamic movement. It is like making a flipbook: you make a tiny change on each page.
- Step 4: Video Assembly
Finally, all the individual frames are stitched together to form a short video. When played quickly, these frames create the illusion of continuous motion, turning your still image into a lively video.
Both the Kling-v1.6-standard and Stable Video Diffusion models follow this basic approach. They differ in technical details and performance, but their core idea remains the same: by generating a series of carefully altered frames, they bring a static image to life, all driven by your simple prompt.
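To make the flipbook analogy concrete, here is a toy sketch of the last two steps. It fakes “frame generation” by nudging the still image a few pixels per frame and then assembles the frames into a short clip; a real image2video model replaces the nudging with learned, prompt-conditioned changes. The function name toy_image2video is just for illustration.

# Toy illustration of Steps 3 and 4: generate slightly shifted frames from a still
# image and stitch them into a clip. A real image2video model replaces the pixel
# shift with learned, prompt-conditioned changes between frames.
# Requires: pip install imageio imageio-ffmpeg numpy
import imageio.v2 as imageio
import numpy as np

def toy_image2video(image_path: str, output_path: str, num_frames: int = 24, fps: int = 12) -> None:
    frame = imageio.imread(image_path)
    with imageio.get_writer(output_path, fps=fps) as writer:
        for i in range(num_frames):
            # "Frame generation": each frame is a tiny modification of the previous one
            # (here, a horizontal shift of i pixels to mimic a slow camera pan).
            shifted = np.roll(frame, shift=i, axis=1)
            # "Video assembly": append the frame to the growing clip.
            writer.append_data(shifted)

# toy_image2video("river.jpg", "river_pan.mp4")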
What Is Coming in the Near Future?
As I write this blog, ByteDance has just released (on February 3rd, 2025) a paper on OmniHuman, a human video generation framework. It can generate human videos from a single human image plus motion signals (e.g., audio only, video only, or a combination of both). The demos are stunning. Image2video technology will evolve rapidly, and API costs will keep coming down. I will keep an eye on this trend and post timely updates. Please follow me and stay tuned.
GitHub Link
Here is the GitHub repo for the code used in this article. Feedback is welcome. Happy coding!