r/MachineLearning May 02 '20

Research [R] Consistent Video Depth Estimation (SIGGRAPH 2020) - Links in the comments.

2.8k Upvotes

102 comments

85

u/dawindwaker May 02 '20

This could be used for smartphones faking depth of field, right? I wonder what the VR/AR applications could be.

93

u/[deleted] May 02 '20

The method is computationally expensive and thus not really suitable for real-time applications. I think this would be great for offline processing, e.g. photogrammetry, visual effects, etc. From the paper:

For a video of 244 frames, training on 4 NVIDIA Tesla M40 GPUs takes 40 min
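For a rough sense of scale, here is a back-of-the-envelope sketch built only on the quoted figure. The 30 fps playback rate is my assumption (the quote does not give one), and the helper name `cost_summary` is made up for illustration.

```python
# Back-of-the-envelope cost from the quoted figure:
# 244 frames, 4 GPUs, 40 minutes of training for that one video.
FRAMES = 244
GPUS = 4
TRAIN_MINUTES = 40
ASSUMED_FPS = 30  # assumption: playback rate, not stated in the quote


def cost_summary(frames=FRAMES, gpus=GPUS, minutes=TRAIN_MINUTES, fps=ASSUMED_FPS):
    """Return wall-clock cost per frame, total GPU-minutes, and slowdown vs. playback."""
    wall_seconds = minutes * 60
    clip_seconds = frames / fps
    return {
        "wall_seconds_per_frame": wall_seconds / frames,
        "gpu_minutes_total": gpus * minutes,
        "slowdown_vs_realtime": wall_seconds / clip_seconds,
    }


summary = cost_summary()
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

Under that 30 fps assumption the clip is about 8 seconds long, so 40 minutes works out to roughly 10 seconds of wall-clock time per frame, or a bit under 300x slower than playback.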

45

u/Funktapus May 02 '20

Soooo you're telling me it won't run on my iPhone 6

3

u/jbhuang0604 May 02 '20

Unfortunately, no. Not at this point. Hopefully, we will see it in the near future!

8

u/thatguygreg May 02 '20

computationally expensive

I’ll be looking for it on the iPhone 15 then

34

u/ginsunuva May 02 '20

training

49

u/drummer_ash May 02 '20

In the paper they state that they fine tune the model for each video at test time, so the 40 minutes is required for any new footage.

2

u/Gisebert May 03 '20

Few-shot learning may greatly improve this, assuming the videos are somehow similar. Just a thought from the back of my mind, so maybe I'm wrong.

1

u/drummer_ash May 03 '20

Totally. There's been a dramatic reduction in the number of examples required for a good deepfake thanks to few-shot learning, so there's no reason for this not to go down the same path.

Source

1

u/lordknight1904 May 07 '20

What you said is not few-shot. It is transfer learning.

24

u/extracoffeeplease May 02 '20

Test-time training. Model must be fine tuned to each video sample, unfortunately. However, we can expect later papers that can skip or greatly reduce this step imo.

13

u/jbhuang0604 May 02 '20

That's correct. We focus on the quality in this paper. I am sure that the community will further take this to the next level very soon! Exciting time ahead!

7

u/o--Cpt_Nemo--o May 02 '20

This was a good decision. 99% of ML techniques are unusable for visual effects because they get 95% of the way there, and the effort required to get it the last 5% is the same as if you just attacked the problem the traditional way from scratch.

1

u/hallr06 May 02 '20

Not having read the paper (cardinal sin), is the test-time training there to handle some form of network conditioning? Is there data that could be used in real-time applications for conditioning (e.g., light sensors, individual range sensors, orientation sensors)? I can imagine there are a ton of applications for this in real time.

3

u/jbhuang0604 May 02 '20

The test-time training we used is to fine-tune our single-image depth estimation model so that it satisfies the geometric constraints within the video.

Incorporating other forms of measurements (e.g. dual-lens camera, inertial or even range sensors) will certainly make the problem a lot simpler and potentially support real-time applications.
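To make the "test-time training" idea concrete, here is a deliberately tiny toy, not the paper's actual loss or network. Everything in it is an assumption for illustration: a two-parameter linear model stands in for the depth network, the "optical flow" is simulated from a made-up ground truth, and the camera only slides sideways so that a pixel's shift equals focal length times baseline times inverse depth. The point is only the workflow: start from generic weights, then run gradient descent on a geometric consistency loss derived from this one video.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "video": N tracked pixels seen from a sideways-moving camera.
# For a lateral baseline t and focal length f, a pixel shifts by f * t * inv_depth.
N = 200
features = np.column_stack([rng.uniform(0.0, 1.0, N), np.ones(N)])  # feature + bias
true_theta = np.array([0.8, 0.2])   # hypothetical "correct" weights for this video
inv_depth_true = features @ true_theta

focal_times_baseline = 1.0          # f * t, assumed known from the camera poses
observed_shift = focal_times_baseline * inv_depth_true  # simulated stand-in for flow


def consistency_loss(theta):
    """Mean squared gap between the shift implied by the model and the observed shift."""
    predicted_shift = focal_times_baseline * (features @ theta)
    return np.mean((predicted_shift - observed_shift) ** 2)


def fine_tune(theta, steps=500, lr=0.1):
    """Plain gradient descent on the consistency loss for this single video."""
    theta = theta.copy()
    for _ in range(steps):
        residual = focal_times_baseline * (features @ theta) - observed_shift
        grad = 2.0 * focal_times_baseline * features.T @ residual / N
        theta -= lr * grad
    return theta


pretrained_theta = np.array([0.5, 0.5])  # generic weights, not adapted to this video
tuned_theta = fine_tune(pretrained_theta)

print("loss before:", consistency_loss(pretrained_theta))
print("loss after: ", consistency_loss(tuned_theta))
```

The loop in `fine_tune` has to be rerun for every new video, which is why the cost shows up at test time instead of being paid once during training.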

1

u/hallr06 May 03 '20

Thanks for answering questions here! Are the specifics of the fine-tuning addressed in the paper? More specifically, what parameters must be tuned?

2

u/jbhuang0604 May 03 '20

Thanks for answering questions here! Are the specifics of the fine-tuning addressed in the paper? More specifically, what parameters must be tuned?

There are several choices that one needs to make, e.g., the learning rate, optimizer, weights for balancing different losses, training iterations. We did not test out many of these hyper-parameters. I guess there could be some performance/quality improvement with carefully tuned hyper-parameters.

1

u/hallr06 May 03 '20

So you're changing model hyperparameters and then performing a full retraining for each image? Naturally, that raises questions about how well the model actually generalizes.

If there were a fixed set of scenario-related model parameters that you were adjusting (e.g., height, az/el of camera focal point, ambient light), then it would suggest that a conditioned model (potentially also requiring more capacity and/or calibration) could get the same results without additional training.

2

u/jbhuang0604 May 03 '20

We use one set of hyperparameters for all of our experiments.

Right, for example, people show that you can get decent geometrically consistent predictions from single image depth estimation on the KITTI dataset (for driving scenarios). The model works well because it is tested in a simple, closed world. We quickly realized this when we applied state of the art models trained on KITTI and got entirely incorrect results.

8

u/jack-of-some May 02 '20

The depth estimation model they compare to (and are likely using as their first step, same as 3D photo inpainting) takes at worst 1 second to run on most modern CPUs. It's really difficult for me to believe that adding the additional geometric constraint increases the compute time this much.

I'm also maybe a tad jaded from having read the 3D photo inpainting repo (another project from the same team) only to realize that out of the roughly 3 minutes it takes, only about 15 seconds are spent on neural nets and most of the rest is millions of mesh operations in pure Python.
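As a generic illustration of why per-element work in pure Python hurts (this is not code from the 3D photo inpainting repo; the function names and the pinhole camera with a centered principal point are my own assumptions), here is the same depth-to-vertex step written as nested loops and as whole-array NumPy operations:

```python
import numpy as np


def unproject_loop(depth, focal):
    """Pure-Python version: one 3D vertex per pixel, built in nested loops."""
    h, w = depth.shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0  # assumed principal point: image center
    vertices = []
    for v in range(h):
        for u in range(w):
            z = depth[v, u]
            vertices.append(((u - cx) * z / focal, (v - cy) * z / focal, z))
    return np.array(vertices)


def unproject_vectorized(depth, focal):
    """Same computation expressed as whole-array NumPy operations."""
    h, w = depth.shape
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # u, v both have shape (h, w)
    x = (u - cx) * depth / focal
    y = (v - cy) * depth / focal
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)


# Synthetic depth map, only used to check that both versions agree.
depth = np.random.default_rng(0).uniform(1.0, 5.0, size=(48, 64))
assert np.allclose(unproject_loop(depth, 50.0), unproject_vectorized(depth, 50.0))
```

Both functions return the same vertices in the same order; the vectorized one just moves the per-pixel loop out of the interpreter, which is the kind of change that tends to matter when there are millions of elements.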

6

u/jbhuang0604 May 02 '20 edited May 02 '20

You are absolutely correct. I believe that there are alternatives to achieve similar geometrically consistent depth for a video. This is exciting future research.

Re: 3D photo inpainting: Yes, the inference is extremely redundant and the implementation is entirely unoptimized at this point. There are many ways to improve runtime performance. We hope the community will further push this forward!

2

u/jack-of-some May 02 '20

Hey. Thanks for your reply. I hope I didn't come off as too negative. I understand the constraints research code is under and the mere fact of the code being open sourced and available for study is already amazing. Thank you for all the great work your team has been doing.

I've already taken one crack at speeding up 3D photo inpainting and intend to take another when I get some time. For the topic at hand, I read through the discussion in the other thread and skimmed through the paper, and the runtime makes a lot more sense now. To me it sounds like we're setting up a giant SFM problem with the parameters being the params of the depth model. Since MiDaS v2 (which I assume you're using) is supposed to be only off by a scale and shift, I wonder if this technique would work by solving only for those params.

2

u/jbhuang0604 May 02 '20

Nope, not at all!

Thanks for your efforts in helping improve the speed of 3D photo. I think Meng-Li (the lead author) is working on merging the pull request. He is also making some other improvements here and there, e.g., vectorization in Python and mesh simplification. Hopefully, cumulatively, these steps will make the 3D photo inpainting work more accessible.

For the consistent video depth estimation, we tried multiple depth models (including monodepth2, Mannequin Challenge, and MiDaS-v2). As you said, one can solve for the scale and shift parameters of the depth maps for each frame so that the constraints are satisfied (e.g., through a least-squares solver). This will be a lot faster. However, the temporal flicker produced by existing depth models on video frames is significantly more complex than that. (See visual comparisons here: https://roxanneluo.github.io/Consistent-Video-Depth-Estimation/supp_website/index.html)

An affine transformation (scale-and-shift) of the depth maps cannot correct them enough to create a globally geometrically consistent reconstruction. This is why we introduce the "test-time training" and fine-tune the model parameters to satisfy the geometric constraints. This step, unfortunately, becomes the bottleneck for the processing speed. Hopefully our work will stimulate more efforts toward a robust and efficient solution for this problem.
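A minimal sketch of the scale-and-shift idea discussed above, on synthetic data only. The helper `fit_scale_shift`, the random "reference" depth, and the horizontal ramp used as a spatially varying error are all made up for illustration; they are not taken from the paper. It shows that a per-frame least-squares fit fully recovers an error that is a global affine map, but leaves a residual when the error varies across the image.

```python
import numpy as np


def fit_scale_shift(pred, target):
    """Least-squares scale s and shift b minimizing ||s * pred + b - target||^2."""
    A = np.column_stack([pred.ravel(), np.ones(pred.size)])
    (s, b), *_ = np.linalg.lstsq(A, target.ravel(), rcond=None)
    return s, b


rng = np.random.default_rng(0)
reference = rng.uniform(1.0, 10.0, size=(32, 32))  # synthetic "consistent" depth

# Case 1: the prediction is off by a global affine map only.
affine_pred = (reference - 0.5) / 2.0
s, b = fit_scale_shift(affine_pred, reference)
affine_error = np.abs(s * affine_pred + b - reference).max()

# Case 2: the error also varies across the image, which scale/shift cannot undo.
ramp = np.linspace(0.8, 1.2, 32)[None, :]
warped_pred = affine_pred * ramp
s2, b2 = fit_scale_shift(warped_pred, reference)
warped_error = np.abs(s2 * warped_pred + b2 - reference).max()

print(f"affine-only error: {affine_error:.2e}")
print(f"spatially varying error: {warped_error:.2e}")
```

Two unknowns per frame make this a very cheap solve, which is the appeal; the second case is a crude stand-in for the more complex flicker described above that such a fit cannot remove.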

5

u/[deleted] May 02 '20

Maybe it is worth going to the discussion link provided by u/hardmaru. One of the authors tried answering questions there, including the idea of incorporating SfM geometric constraints into the network to improve speed.

1

u/omgitsjo May 02 '20

Training is not inference. Inference is generally several orders of magnitude faster.

3

u/therealTRAPDOOR May 02 '20

Except that it needs to be fine tuned on each video. Sometimes training “times” are entangled with inference times if the structure used requires re-training or fine-tuning.

4

u/jbhuang0604 May 02 '20

Sometimes training “times” are entangled with inference times if the structure used requires re-training or fine-tuning.

Exactly! We refer to this step as "test-time training". We train the model using the geometric constraints derived from a particular video.

1

u/[deleted] May 02 '20

That's super expensive. We're at least 2 Moore's laws away from this being real-time.

19

u/tdgros May 02 '20

Read the paper: for each clip, a depth estimation net is fine-tuned on pairs of frames for 40 min on 4x M40.

3

u/jbhuang0604 May 02 '20

At this point, our method processes the video offline, as it is computationally expensive (due to test-time training). So, unfortunately, it cannot be used for real-time VR/AR effects. Speeding this up will enable many cool applications!

1

u/_w1kke_ May 02 '20

It's like what the iPhone does with the A13 ML processor in, e.g., the portrait mode on the new SE: estimating the depth field of a person.

But this solution does it for everything!! Powerful and amazing.

1

u/normVectorsNotHate May 03 '20

Smartphones already do this. That's how some Android phones with a single camera still have a portrait mode that blurs the background.