Try Out The Model
Overworld Stream: https://overworld.stream
What is Waypoint-1?
Waypoint-1 is Overworld’s real-time interactive video diffusion model. Users control and prompt it with text, mouse, and keyboard input. Given one or more starting frames, Waypoint-1 generates a world that the user can step into and interact with.
At its core, Waypoint-1 is a frame-causal rectified flow transformer trained on 10,000 hours of diverse video game footage paired with control inputs and text captions. It is a latent model, meaning it is trained on compressed frames rather than raw pixels. Where other models may limit control to basic camera movements, Waypoint-1 supports unrestricted mouse movement and immediate keyboard input with no added latency, which makes it well suited to real-time interaction.
How was it trained?
Waypoint-1 was trained with diffusion forcing, a method in which the model learns to denoise future frames given past frames. A causal attention mask ensures that tokens in each frame can attend only to tokens in their own frame or in past frames, never in future frames. Each frame is noised independently during training, so the model learns to denoise every frame separately.
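The frame-level causal mask described above can be sketched in a few lines. This is a minimal illustration, not Waypoint-1's actual implementation: the function name frame_causal_mask and the tiny sizes (3 frames, 2 tokens per frame) are assumptions chosen for readability.

```python
def frame_causal_mask(num_frames: int, tokens_per_frame: int) -> list[list[bool]]:
    """Build a frame-causal attention mask (hypothetical helper).

    mask[q][k] is True when query token q may attend to key token k,
    i.e. when k belongs to the same frame as q or to an earlier frame.
    """
    total = num_frames * tokens_per_frame
    mask = []
    for q in range(total):
        q_frame = q // tokens_per_frame  # frame index of the query token
        row = [(k // tokens_per_frame) <= q_frame for k in range(total)]
        mask.append(row)
    return mask


# Illustrative sizes only: 3 frames of 2 tokens each.
mask = frame_causal_mask(num_frames=3, tokens_per_frame=2)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# Prints:
# xx....
# xx....
# xxxx..
# xxxx..
# xxxxxx
# xxxxxx
```

Unlike a token-level causal mask, tokens inside the same frame can see each other in both directions; only later frames are hidden.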
Diffusion forcing has a drawback: the training setup does not match frame-by-frame inference, and this mismatch lets errors accumulate over long rollouts. To counter this, the team post-trained the model with self forcing, a technique that trains the model under conditions that match its inference behavior so that it keeps producing realistic outputs. Self forcing also improves the model’s efficiency.
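To see why a gap between training and inference conditions hurts long rollouts, consider the toy sketch below. It is not Waypoint-1's training code and it does not implement self forcing: the one-dimensional "world" and the slightly biased "model" are hypothetical stand-ins. It only shows that an error which stays small when the model always sees ground-truth context grows steadily once the model is fed its own outputs.

```python
# Toy illustration only: neither function comes from Waypoint-1 or WorldEngine.

def true_next(x: float) -> float:
    """Ground-truth dynamics: the state advances by exactly 1 per frame."""
    return x + 1.0


def model_next(x: float) -> float:
    """Hypothetical learned model with a small bias on every step."""
    return x + 1.1


def teacher_forced_errors(x0: float, steps: int) -> list[float]:
    """Per-step error when the model always sees the ground-truth previous state."""
    errors, x = [], x0
    for _ in range(steps):
        target = true_next(x)
        errors.append(abs(model_next(x) - target))
        x = target  # context is reset to ground truth after every step
    return errors


def rollout_errors(x0: float, steps: int) -> list[float]:
    """Per-step error when the model is fed its own previous output."""
    errors, x_true, x_model = [], x0, x0
    for _ in range(steps):
        x_true = true_next(x_true)
        x_model = model_next(x_model)  # context is the model's own prediction
        errors.append(abs(x_model - x_true))
    return errors


tf = teacher_forced_errors(0.0, 50)
ro = rollout_errors(0.0, 50)
print(f"error at step 50 with ground-truth context: {tf[-1]:.2f}")
print(f"error at step 50 in a free-running rollout: {ro[-1]:.2f}")
```

With ground-truth context the error stays at the per-step bias; in the free-running rollout the same bias compounds with every frame. Training under inference-like conditions is meant to close exactly this kind of gap.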
The Inference Library: WorldEngine
WorldEngine serves as Overworld’s high-performance inference library, enabling real-time interactive world model streaming. Built for simplicity and extensibility, this library is optimized for low latency and high throughput. It incorporates a runtime loop specifically designed for interaction, processing context frame images and user inputs before outputting image frames for real-time streaming.
When tested with Waypoint-1-Small (2.3B parameters) on a 5090 GPU, WorldEngine sustains approximately 30,000 token-passes per second, which yields 30 frames per second at 4 denoising steps or 60 frames per second at 2 steps. This performance comes from several targeted optimizations:
- AdaLN Feature Caching: This technique avoids repetitive conditioning projections by caching and reusing them, provided that both prompt conditioning and timesteps remain unchanged.
- Static Rolling KV Cache + Flex Attention: A fixed-size KV cache that rolls forward over past frames, paired with Flex Attention, so keys and values from earlier frames are reused rather than recomputed.
- Matmul Fusion: A standard inference optimization that combines QKV projections into a single operation.
- Torch Compile: Uses torch.compile(fullgraph=True, mode="max-autotune", dynamic=False) for additional performance gains.
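The throughput figures quoted before the optimization list are consistent with each other, which a quick calculation shows. The sketch assumes that a "token-pass" means one token processed in one denoising step; that reading is an interpretation, not a definition given in this post. Under it, both operating points need the same number of denoising passes per second, and the approximate 30,000 figure implies roughly 250 tokens per frame.

```python
TOKEN_PASSES_PER_SEC = 30_000  # approximate figure quoted in the text


def passes_per_second(fps: int, steps: int) -> int:
    """Denoising passes needed per second: one pass per step per frame."""
    return fps * steps


def implied_tokens_per_frame(fps: int, steps: int,
                             token_passes_per_sec: float = TOKEN_PASSES_PER_SEC) -> float:
    """Tokens per frame implied by the quoted throughput (derived, not measured)."""
    return token_passes_per_sec / passes_per_second(fps, steps)


for fps, steps in [(30, 4), (60, 2)]:
    print(f"{fps} FPS at {steps} steps: "
          f"{passes_per_second(fps, steps)} passes/s, "
          f"about {implied_tokens_per_frame(fps, steps):.0f} tokens per frame")
```

Halving the number of steps doubles the frame rate because the per-second budget of denoising passes stays the same.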
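Of the optimizations listed above, AdaLN feature caching is the easiest to illustrate in isolation. The sketch below is a plain memoization pattern under stated assumptions, not WorldEngine's code: the class AdaLNCache, the function toy_projection, and the 4-step timestep schedule are all hypothetical, and the stand-in projection is arbitrary arithmetic where a real model would run learned layers.

```python
class AdaLNCache:
    """Caches conditioning projections keyed by (prompt, timestep). Hypothetical."""

    def __init__(self, project):
        self._project = project  # function (prompt, timestep) -> features
        self._cache = {}
        self.misses = 0          # how often the projection actually ran

    def get(self, prompt: str, timestep: float):
        key = (prompt, timestep)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._project(prompt, timestep)
        return self._cache[key]


def toy_projection(prompt: str, timestep: float) -> tuple[float, float]:
    """Stand-in for a conditioning projection; returns a fake (scale, shift)."""
    return (len(prompt) * timestep, timestep + 1.0)


cache = AdaLNCache(toy_projection)
schedule = [1.0, 0.75, 0.5, 0.25]  # assumed 4-step denoising schedule

# Generate 10 frames with an unchanged prompt and an unchanged schedule.
for _frame in range(10):
    for t in schedule:
        scale, shift = cache.get("herd goats in a valley", t)

print(f"lookups: {10 * len(schedule)}, projections computed: {cache.misses}")
```

As long as the prompt and the timesteps stay the same, the projection runs once per timestep rather than once per timestep per frame. Changing the prompt produces new keys, so the projections are recomputed.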
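A minimal usage example:
from world_engine import WorldEngine, CtrlInput

# Create the inference engine
engine = WorldEngine("Overworld/Waypoint-1-Small", device="cuda")

# Set the text prompt
engine.set_prompt("A game where you herd goats in a beautiful valley")

# Optional: seed the context with a specific frame.
# uint8_img is a placeholder for a uint8 image that you supply.
img = engine.append_frame(uint8_img)

# Generate three frames, one per controller input
for controller_input in [
    CtrlInput(button={48, 42}, mouse=[0.4, 0.3]),
    CtrlInput(mouse=[0.1, 0.2]),
    CtrlInput(button={95, 32, 105}),
]:
    img = engine.gen_frame(ctrl=controller_input)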
Build with World Engine
Overworld is hosting a world_engine hackathon on January 20, 2026, starting at 10 AM PST and running for eight hours. Teams of 2-4 members are welcome, and the winning team receives a 5090 GPU. The event is an opportunity for developers to show their creativity and technical skills while collaborating with founders, engineers, hackers, and investors.