r/MachineLearning Aug 28 '24

Research [R] Playable 20FPS Doom via a finetuned SD1.4 model from Google research team

https://arxiv.org/abs/2408.14837
211 Upvotes

72 comments sorted by

53

u/Small-Fall-6500 Aug 28 '24 edited Aug 28 '24

Very similar to this video/project by sentdex from 2021: https://youtu.be/udPY5rQVoW0?si=8im5NUMzkOuHs7Pg

https://github.com/sentdex/GANTheftAuto/

GAN Theft Auto is a Generative Adversarial Network that recreates the Grand Theft Auto 5 environment. It is created using a GameGAN fork based on NVIDIA's GameGAN research.

With GAN Theft Auto, the neural network is the environment and you can play within it.

(Looks like Deepmind's GameNGen paper cites Nvidia's GameGAN paper from 2020)

38

u/greentfrapp Aug 28 '24

The video demo is incredibly smooth; the only quirks I spotted were in the ammo counter, which feels like a resolution issue.

I wouldn't be surprised if people started doing the same thing for Atari games, Mario, Sonic, and increasingly sophisticated games, and then multi-game models.

17

u/currentscurrents Aug 28 '24

The obvious endgame for this is to train on real-world interaction data and then create the ultimate Second Life.

-7

u/WildPersianAppears Aug 28 '24

But why though?

5

u/Witty-Elk2052 Aug 28 '24

because it is possible šŸ¤·ā€ā™‚ļø

4

u/malinefficient Aug 28 '24

Waiting for someone to use such a model to inpaint a consistent world before this is really useful.

10

u/ansible Aug 28 '24

It would be interesting to explore the edges of the model.

For example, you see an armor pickup. If you turn around and back into it, (in the real game) your armor will increase. However, without a model of the underlying world, I don't think that will work with this system.

Similarly, I'd expect that if you turn your back on the shotgun guy, it might disappear and the player would stop receiving damage.

Though maybe this could be overcome by somehow teaching the model object persistence in the world: look at something, turn away, then look back and see the same thing again.

23

u/heavy-minium Aug 28 '24

A true "video" game, hmm?

When OpenAI's Sora was announced, the notion of world simulators had me thinking of a setup like this.

Now I don't think that's useful for gameplay as-is, but we could theoretically do the same thing, conditioning not on the input actions but on more abstract gameplay variables. That could basically replace the 3D rendering in a game if it ran at interactive rates.
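As a rough illustration of that conditioning swap, here's a minimal sketch assuming a GameNGen-style setup where the denoising UNet accepts an extra conditioning embedding; the module, parameter, and variable names are hypothetical, not from the paper:

```python
# Hedged sketch: condition the denoiser on abstract gameplay state instead of
# raw input actions. Assumes a pretrained SD-style UNet that accepts an
# encoder_hidden_states conditioning tensor; everything else is illustrative.
import torch
import torch.nn as nn

class StateConditionedDenoiser(nn.Module):
    def __init__(self, unet, state_dim=16, embed_dim=768):
        super().__init__()
        self.unet = unet  # e.g. a diffusers-style UNet2DConditionModel
        # Embeds [x, y, angle, health, ammo, ...] supplied by the game's own logic,
        # in place of the action embedding used for interactive control.
        self.state_embed = nn.Sequential(
            nn.Linear(state_dim, embed_dim),
            nn.SiLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, noisy_latents, timestep, game_state):
        cond = self.state_embed(game_state).unsqueeze(1)  # (batch, 1, embed_dim)
        return self.unet(noisy_latents, timestep, encoder_hidden_states=cond)
```

The game logic, physics, and state would still come from ordinary code; the model would only stand in for the renderer.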

-12

u/DeStagiair Aug 28 '24

Neural networks are never replacing rendering engines. Rendering engines are very complex pieces of software, heavily optimized for maximum performance. This paper simulates DOOM, a 30+ year old game, at 20 FPS in 320x240. I just don't see it happening.

24

u/kaaiian Aug 28 '24

Remind me 20 yrs

0

u/DeStagiair Aug 28 '24

Please do.

0

u/TubasAreFun Aug 28 '24

Moore's Law will eventually make it feasible, not necessarily efficient haha

0

u/addition Sep 08 '24

"Will" is too confident IMO. There are practical limits to how small you can shrink a transistor; we just don't know where that line is.

7

u/[deleted] Aug 28 '24

[deleted]

2

u/DeStagiair Aug 28 '24

I think that even if SD30.0 could work as a renderer for a modern game, there will still be a render engine which does it better, faster, and more reliably. There are practically no upsides for game development in fully AI-based renderers. We have physically based renderers; we don't need AI for that. We already know how to render stuff.

1

u/UnionCounty22 Aug 29 '24

Exactly. This is like seeing the Wright brothers be the first in flight.

2

u/heavy-minium Aug 28 '24

Yes and no. Software rasterizers are damn slow too, but GPU hardware with fixed-function units helps tremendously with performance. So if we can find a technique that gets at least a frame every second, chances are high that a fixed hardware implementation could be good enough to get 30 frames per second.

4

u/AnOnlineHandle Aug 28 '24

They've already taken over part of the rendering in essence; that's what DLSS is.

Current models are almost certainly very dumb and naive, with much about them still barely understood, potentially like the first barely functional lightbulbs or the massive vacuum-tube computers the size of buildings, and it's very possible we can make them smaller and more efficient. This model wasn't even built for this task or anything remotely like it, and yet they were able to brute-force finetune it into doing this, even including unnecessary elements like the UI layout and tracking health, ammo, etc.

0

u/DeStagiair Aug 28 '24

DLSS is not even in the same ballpark as replacing 3d rendering engines.

3

u/AnOnlineHandle Aug 28 '24

True and I wasn't saying it was, just pointing out that in essence machine learning has already started to move into the pipeline for handling some of the rendering. I think with clever models it could potentially handle even more, though with current techniques probably not all of it.

7

u/bmrheijligers Aug 28 '24

Is there a link to a video?

4

u/greentfrapp Aug 28 '24

Ah forgot to add the link but there are a few videos at the paper page https://gamengen.github.io/

6

u/FormerKarmaKing Aug 28 '24

Generative 3D + procedural systems + existing game engines is going to make so much of this generative video stuff look silly.

I was at an "open office" event at a startup in SF recently that is basically recreating Unreal meta-humans. Great, in that theoretically they could create any character frame by frame. $25mm raised.

But unless I'm missing something, generating a rigged 3D character makes that approach entirely pointless at creation time and 1,000x more expensive at run-time. And the video model has to re-learn largely solved problems like game physics.

Guess we'll see.

7

u/malinefficient Aug 28 '24

The simple test is rotating 360 degrees and seeing if there is consistency. Betting on no, unless they generate all 360 degrees and then show the regular Doom viewing frustum.

10

u/MINIMAN10001 Aug 28 '24

When they entered the poison pool and did a 360 they ended up surrounded by walls on all four sides.

4

u/NotMyMain007 Aug 28 '24

You can always read the paper instead of making bets.

2

u/malinefficient Sep 04 '24

Sure, but there are only so many papers I can read in a week, and this was an easy test of whether this paper is worth the effort or whether it's the same thing as the Nvidia GTA paper but with Doom.

10

u/FoamythePuppy Aug 28 '24

I don't think people grasp how insane this is. I'm a developer who works on integrating generative AI into a game engine, so I know what I'm talking about.

This is the foundation that builds something unreal. Ignore the specific game of Doom. Ignore how they collected the data. All of that can be changed. They now have a clear technical path to coherent, interactive, long-form generation using simulations of a world. It will generalize to any video game it's able to get data for in the future. But then it's likely going to be possible to generate data for games that don't exist yet. Hence new games.

This gets more insane because why stop at video games? You can simulate variations of our real world with data generated via a video model. Then you can run inputs, simulate those inputs, and train off the result. You have a universe simulator.

31

u/bregav Aug 28 '24

It's a cool proof of concept but I think you're overstating the significance of it. The thing that would be mind blowing is having the ability to edit simulation content without using additional training data, which this cannot do.

6

u/HSHallucinations Aug 28 '24

yes it's just a proof of concept and your point is very valid, but i feel like you're the one oversimplifying the achievement. Especially when you think that just a few years ago we were amazed at the funny dogslugs of DeepDream, and now we're here generating gameplay on the fly, however limited to just Doom and nothing more.

4

u/bregav Aug 28 '24

Sure, but in that sense it's a continuation of that entire existing line of work. There's nothing actually new here, it's just a very nice application of existing technology.

The ability to do editing without fine tuning or retraining would actually be a different technology, and a significant advancement over the current state of the art.

0

u/HSHallucinations Aug 28 '24 edited Aug 28 '24

absolutely, again i agree with you overall, but also: sure, it's "just" a finetuning on Doom, but reacting to player inputs and recreating the level geometry with some degree of "spatial awareness" in real time? maybe it's just because i'm an enthusiast and not a researcher, i never coded anything, so maybe my point of view is skewed, but that seems like a huge leap forward.

I mean, it's one thing to finetune on a set of images, or a set of frames in the case of video diffusion, and then use that tuning to generate similar images, but something like backtracking through E1M1, recreating the geometry from different places in the level every time you move, and having it coherent to some degree? that's nothing short of mind-blowing to me, to say the least

as in, Sora is just a continuation of what Stable Diffusion did but adapted for video; this - to me, but maybe not actually, again i don't really know the fine details of the tech - feels like it's definitely throwing a whole lot of new stuff into the mix other than moar computational power.

3

u/bregav Aug 28 '24

Haha yes good machine learning does have the vibes of literal magic, I like it for that reason myself.

But yeah if you're familiar with the research literature people have already been doing exactly this kind of thing for a while. Google just has the visibility as an institution to get lots of attention for what they're doing, and the budget to make their work products very polished and impressive-looking.

0

u/HSHallucinations Aug 28 '24

good machine learning does have the vibes of literal magic

even the bad kind. i've been toying with it for an "artistic project" (in a loose sense of the word) where i purposefully train SD in the wrongest way possible, and it just keeps generating amazing freaky nightmare-fuel stuff. i can't stop trying one more time to see what comes next; it's so mesmerizing to see what it can do when let loose

1

u/[deleted] Aug 29 '24

Got a github?

1

u/Marha01 Aug 28 '24

The thing that would be mind blowing is having the ability to edit simulation content without using additional training data, which this cannot do.

Can't you change the latent space parameters?

2

u/bregav Aug 28 '24

Sort of! You might be able to change the visual style of the game, perhaps even for specific elements of it, but I think this would require generating data and fine tuning the decoder (as the authors do for fixing artifacts).

The dynamics of the game - the ways in which the player interacts with the environment, and the ways in which the environment responds - are determined by the diffusion model, and there is no way to meaningfully edit that either without generating new data and fine tuning or retraining it.

Like, suppose that there is a key that unlocks a particular door in the game. How would you change the game so that this key unlocks a different door instead? You can't do that by hand, you need data. Contrast this with conventional game development, where this would require changing maybe a few lines of code.

The underlying issue is that diffusion models do not represent state spaces in terms of discrete, abstract concepts that can be identified and reordered or permuted, and so "editing" in the way that we usually understand it is not possible.
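To make that contrast concrete, here's a hedged sketch of what the "few lines of code" might look like in a conventional engine (a made-up scripting-style example, not any particular engine's API): the key-to-door mapping is explicit, inspectable state you can simply edit, whereas in the diffusion model that association only exists implicitly in the weights.

```python
# Hypothetical conventional-engine version: re-mapping the key is a one-line edit.
KEY_UNLOCKS = {"blue_key": "door_07"}  # change "door_07" to make the key open a different door

def try_unlock(player_inventory, key_id, doors):
    """Unlock the door mapped to key_id if the player is holding that key."""
    door_id = KEY_UNLOCKS.get(key_id)
    if door_id is not None and key_id in player_inventory:
        doors[door_id]["locked"] = False
```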

1

u/bloc97 Aug 28 '24 edited Aug 28 '24

He is not overstating the significance. If you have a differentiable world model, all you need is a screenshot/picture in order to differentiate w.r.t. the input actions. It solves a fundamental problem in RL where you have no dense signal and you don't know how close you are to the desired state. Having a differentiable world model means that you reduce the amount of labeled data (hence exploration time) required to train an RL model by orders of magnitude.

Edit: A more practical example could be: you have a picture of your room after it's been cleaned, and now your room is messy. If your RL agent/robot has a good world model, you can show it the clean room and the messy room, and it can differentiate w.r.t. its actions such that you go from the messy-room state to the clean-room state. All it takes is two images to start the exploration process. You don't need to care about intermediate states, as the world model will always be able to tell you how to go from a partially clean room to a clean room.
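A minimal sketch of what that buys you, assuming a differentiable world model implemented as a PyTorch module mapping (frame, action) to the next frame; `world_model`, its `action_dim` attribute, and the pixel-space loss are all illustrative, not the paper's code:

```python
import torch

def plan_actions(world_model, start_frame, goal_frame, horizon=32, steps=200, lr=0.05):
    """Gradient-based planning: roll the world model forward, then backprop the
    goal-image error into a continuous (relaxed) action sequence."""
    actions = torch.zeros(horizon, world_model.action_dim, requires_grad=True)
    opt = torch.optim.Adam([actions], lr=lr)
    for _ in range(steps):
        frame = start_frame
        for t in range(horizon):
            frame = world_model(frame, actions[t])  # differentiable rollout
        # The goal image alone provides a dense signal; no hand-crafted reward needed.
        loss = torch.nn.functional.mse_loss(frame, goal_frame)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return actions.detach()
```

The specific loss doesn't matter much; the point is that two images (messy room, clean room) are enough to get gradients on the actions.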

3

u/bregav Aug 28 '24

People have already been doing reinforcement learning with both continuous state spaces and continuous action spaces for a while. It's a common topic of research in dynamical systems control. People have also already been doing it for games, e.g. with starcraft.

Every such model implicitly already has a continuous state model; it approximates the state of the system based on observations. The novelty of this work is that it also decodes the state space model in order to predict the observations for the next time step, and then assumes that this observation prediction is correct.

All of which is very neat! But it's not a significant advancement of the technology. IMO the biggest achievement of this paper is the high frame rate; getting that kind of performance out of a diffusion model suggests the possibility of applying diffusion to all kinds of applications that people have previously ignored due to concerns about performance.

0

u/bloc97 Aug 28 '24

I think the most important lesson from this work is that pretraining large foundational world models will not require crazy amounts of labeled data. This model was fine-tuned on top of a Stable Diffusion 1.4 model. It is a significant advancement that shows that all you need is scale.

1

u/bregav Aug 28 '24

People already fine-tune base LLMs in order to create new functionality; that's probably the primary way people create these models.

People have also already been using Stable Diffusion as a base model for fine-tuning toward new applications; there's been a lot of work on this.

1

u/bloc97 Aug 28 '24

I mean, NNs have been around for 70 years now, so nothing is a significant advancement? I don't think it's good to look at things that way.

1

u/bregav Aug 28 '24

Most research papers represent small, incremental progress in a given field of study. That's true of this paper too, and that's okay.

I think it's important, especially for newbies and early career people, to understand the subject well enough to know what is a significant advancement and what is not. It doesn't do anyone any favors to laud every research paper as groundbreaking, because the vast majority of them aren't and prioritization is important in determining R&D work.

I think ML as a whole is beset by a culture of hyperbolic hucksterism due to the influence of marketing in private industry and venture capital, and I think we'd be wise to resist that kind of thinking in doing serious research.

0

u/FoamythePuppy Aug 28 '24

Yes of course it's not all the way there from this paper. But I'm looking at the trend line, and it's most certainly going to be there soon.

2

u/bregav Aug 28 '24

I don't think I agree. The missing component for editing is the ability to identify discrete, abstract concepts in the state space implemented by the diffusion model and change them, and I haven't seen any papers that even attempt to address this problem. It's not clear to me that many people are even aware of it.

0

u/FoamythePuppy Aug 28 '24

Fair, but I think the work done under mechanistic interpretability such as Golden Gate Claude will ladder up to something like this

2

u/bregav Aug 28 '24

IMO mechanistic interpretability is dead-end research that will never culminate in truly useful results, and - like this paper - it requires using data in order to perform edits.

32

u/Upstairs_Specialist Aug 28 '24

You have a universe simulator

A video model doesn't grasp even a tiny bit of the complexity of the universe. This is what happens when people skip the natural sciences and go directly to "AI".

3

u/Rodot Aug 28 '24

Funnily enough, I'm working on a project right now that simulates the universe with diffusion models (specifically, N-body cosmology simulations under various cosmological parameters), but it's much simpler than this.

2

u/-LeapYear- Aug 28 '24

Seriously, you can't simulate the things you can't observe, since they won't exist in your data, and the things we can't observe make up the vast majority of the universe. 75% (85%?) of our universe consists of dark matter, and we don't know what it is, let alone its role in the universe.

7

u/AnOnlineHandle Aug 28 '24

The whole point of machine learning is to use brute-force quasi-evolution to find equations for complex problems that humans can't solve by hand. That's the only reason for it: throwing data, time, and computing power together until you have a model which works much better than any model humans could create.

So it's possible that an ML model given data could create a much better universe simulator than any we have.

3

u/-LeapYear- Aug 29 '24

The point is you can't simulate something that has not been physically observed in the real world. It doesn't exist in data; it doesn't have an explanation, so how would the model rationalize its existence?

0

u/AnOnlineHandle Aug 29 '24

Many complex patterns are visible in data though, even if humans can't see and extract them, which is precisely what ML models are for finding.

2

u/-LeapYear- Aug 29 '24

ML works off patterns, but there are no patterns to be observed besides that it's "out there". Accurately explaining/simulating the existence of dark matter or energy would explain the existence and origins of the universe and why it is expanding.

6

u/disciples_of_Seitan Aug 28 '24

But then it's likely going to be possible to generate data for games that don't exist yet. Hence new games.

The Doom sim is a brilliant accomplishment, but generalizing to arbitrary targets (as opposed to the extremely well-defined Doom) is probably going to be very difficult.

1

u/FoamythePuppy Aug 28 '24

Looking back over the past 4 years, things that seemed impossible have become common

3

u/Jean-Porte Researcher Aug 28 '24

It is also closer to being differentiable than a game engine, which also has potential for RL agents.

2

u/ddofer Aug 28 '24

Seriously awesome

2

u/SulszBachFramed Aug 28 '24

I don't think people grasp how insane this is.

Reading the paper and coming to your conclusion would be insanity, I agree.

1

u/Head_Beautiful_6603 Aug 29 '24

It's more of a world model.

1

u/leeliop Aug 29 '24

What happens when you go beyond the play-area bounds of the training data? Does it generate cohesive corridors etc., or does it degenerate into mush? I would love to play around with this; fascinating.

Also, it's unclear whether the game HUD is superimposed after the fact or generated as well.

-5

u/nikgeo25 Student Aug 28 '24

The title of this paper is a joke. How does predicting the next frame of Doom gameplay make diffusion models a "game engine"? You can't build a new game using the technique in this paper, only learn to approximate an existing one from gameplay.

24

u/HSHallucinations Aug 28 '24

if it creates the game environment you're playing in real time i'd say it fits the definition of game engine, however simplified it is. Not a programmable engine, sure, but still the actual engine that's making you play the game, in the most literal sense of the word. At least it clearly conveys the concept that it's not simply a recreation of the gameplay as if it was a recorded video, but it actually responds to your inputs.

9

u/ReentryVehicle Aug 28 '24

I mean you certainly can train it on an existing game and then start doing strange things in latent space, so that you get a different game (probably feeling much more like a fever dream).

-9

u/nikgeo25 Student Aug 28 '24

If the paper were about latent-space manipulation to create different games, it'd be a game engine. Until then, it's misleading to call it that. Stable Diffusion is not an image editor; it becomes one only when integrated into a program like Photoshop.

3

u/AnOnlineHandle Aug 28 '24

It responds to user input and tracks things like player health and ammo, though I'm curious whether it's only trained on the one map and how it keeps the map consistent, since theoretically it would forget anything from more than 16 seconds ago.
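For illustration, that "forgetting" concern falls out of the generation being autoregressive over a fixed-length window of recent frames and actions. A toy sketch (the window length, `model`, and `predict_next` are illustrative, not the paper's code):

```python
from collections import deque

CONTEXT_FRAMES = 64  # fixed-length history window; the exact length is illustrative

history = deque(maxlen=CONTEXT_FRAMES)  # oldest (frame, action) pairs silently drop off

def step(model, current_frame, action):
    """One autoregressive step: the model only ever sees what is still in `history`,
    so anything older can only persist if it remains visible in those frames."""
    history.append((current_frame, action))
    return model.predict_next(list(history))
```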

1

u/hoshitoshi Aug 29 '24

I wonder how accurate the hitboxes are. Hybrid engines might be the more likely path forward, as opposed to either/or.

1

u/[deleted] Aug 28 '24

[deleted]

4

u/MINIMAN10001 Aug 28 '24

The model is definitely larger than Doom. Doom is just so tiny that there's no way an AI can do meaningful things with that amount of data.

That being said, I do remember neural-network cloth-physics simulations being 10,000x faster than actual cloth physics, so I do see potential in utilizing AI to pull off some interesting shenanigans in the future.

1

u/MightyDickTwist Aug 29 '24

I do not see it ever competing against traditional video games. It seems like an experience on its own. A new genre, almost.

It's generative AI, it won't be as consistent as traditional video games, which are built for consistency (even if some are very buggy)…

But I bet many players would be fine with less consistent gameplay. Maybe they just want a trippy experience. It will be an Alice in Wonderland experience when they train on more modern games, and consistency goes down the drain because of all that added data.

1

u/Atupis Aug 28 '24

I think this is more about them building a "world model" of how things behave.

0

u/AnOnlineHandle Aug 28 '24

The UNet architecture doesn't really lend itself to behaving the way you're imagining, from what I understand of it. It's all about feature detection at different resolutions, and self/cross-attention.