We're excited to announce Oasis, the first experiential, real-time, open-world AI model. It's an interactive video experience, but one generated end-to-end by a transformer on a frame-by-frame basis. Oasis takes in user keyboard and mouse input and generates the experience in real time, internally simulating physics, rules, and graphics. The model learned to let users move around, jump, pick up items, break blocks, and more, all by watching demonstrations directly. We view Oasis as the first step in our research towards foundation models that simulate more complex interactive worlds, thereby replacing the classic demo engine for a future driven by AI.
Achieving Oasis requires a combination of two fundamental advances: improvements in model architecture, so that the model can capture the entire world and simulate it, and breakthroughs in model inference technology, so that users can interact with the model in real time with minimal latency. For the former, we adopt the emerging state-of-the-art approach of diffusion training combined with transformer models [1, 2], an approach inspired by advanced large language models (LLMs), to train an autoregressive model that generates video frame by frame, conditioned on the user's actions at that instant. For the latter, we currently use Decart’s proprietary inference framework, which is built to provide peak utilization of NVIDIA H100 Tensor Core GPUs for transformer workloads. We have also built the model to support Etched’s upcoming Sohu chip.
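To make the frame-by-frame idea concrete, here is a minimal sketch of an action-conditioned autoregressive rollout with a diffusion-style denoising loop. It is an illustration under loud assumptions, not the Oasis implementation: `toy_denoiser` is a hypothetical stand-in for the learned transformer (it simply shifts the previous frame by the action), and the linear blending schedule, context length, and function names are all invented for this example.

```python
import numpy as np


def toy_denoiser(noisy_frame, context_frames, action, noise_level):
    """Hypothetical stand-in for a learned transformer denoiser.

    A real model would predict the clean frame from the noisy frame, the
    frame history, and the user action. This toy version ignores the noisy
    frame and returns the last context frame shifted horizontally by `action`.
    """
    return np.roll(context_frames[-1], shift=action, axis=1)


def generate_next_frame(context_frames, action, rng, num_steps=4):
    """Sample one frame by iterative denoising, conditioned on history and action."""
    # Start from pure Gaussian noise with the same shape as a frame.
    frame = rng.standard_normal(context_frames[-1].shape)
    for step in range(num_steps):
        noise_level = 1.0 - step / num_steps  # decreases towards 0 (assumed schedule)
        predicted_clean = toy_denoiser(frame, context_frames, action, noise_level)
        # Move part of the way towards the prediction; the final step
        # (blend == 1.0) lands on the predicted clean frame.
        blend = 1.0 / (num_steps - step)
        frame = frame + blend * (predicted_clean - frame)
    return frame


def rollout(first_frame, actions, context_len=2, seed=0):
    """Autoregressive loop: each new frame is appended and fed back as context."""
    rng = np.random.default_rng(seed)
    frames = [first_frame]
    for action in actions:
        context = frames[-context_len:]  # most recent frames only
        frames.append(generate_next_frame(context, action, rng))
    return frames


# Usage: a 4x4 "world" with one lit pixel; actions move it right, right, left.
first_frame = np.zeros((4, 4))
first_frame[0, 0] = 1.0
frames = rollout(first_frame, actions=[1, 1, -1])
print(len(frames))  # one initial frame plus one generated frame per action
```

The point of the sketch is the control flow: each frame is sampled by denoising from noise, the user's action at that instant is an input to every denoising step, and generated frames re-enter the context for the next step.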
We're releasing Oasis's code and the weights of a model you can run locally, along with a live demo of a larger checkpoint. Today, using Decart's proprietary inference platform, we show that real-time transformer-based video is possible and can be streamed across the web as a live experience. When Etched's transformer ASIC, Sohu, is released, we could run models like Oasis in 4K resolution. Together, we believe fast transformer inference is the missing link to making high-quality, affordable generative real-time video a new fundamental interface.
While Oasis is an impressive technical demo, we believe this research is only the beginning of a journey towards more complex foundation models that enable real-time human-AI interaction on a new level. This may revolutionize a wide variety of experiences by providing an interactive video interface that puts control in the hands of the user. Imagine a world where this integration is so tight that foundation models augment modern entertainment platforms by generating content on the fly according to user preferences. Or imagine a gaming experience that opens new possibilities for user interaction, such as textual and audio prompts guiding the experience (e.g., “imagine that there is a pink elephant chasing me down”).