February 28, 2019

As introduced in our blog post on techniques for generating 3D information, we’re going to talk a bit more in depth (pun intended) about depth prediction using fancy deep neural networks.

“Why is it interesting to DigitalBridge?”, you may ask. As we’ve mentioned before, we’re trying to obtain a 3D scan of a room which is metrically accurate even in the presence of featureless walls. This is quite common in indoor scenes, which are not reconstructed well using traditional methods. You can see a few examples of what I mean below:

ARKit/ARCore have similar problems detecting planes without distinctive features, as you can see on the right.

This is a known computer vision hurdle that many researchers are trying to solve. How do you understand where you are in a room and what the room looks like if everything looks exactly the same? Just imagine standing in a completely uniformly coloured room. I like this example of a featureless room from a Chris & Jack video.

Well, to be fair, the problem is slightly different.

We use our amazing brain and our really wide field of view, close to 210 degrees, to infer that things with uniform colour should belong to some structure. We’re also able to infer the right shape (for example plane for a planar wall) by checking how the floor or wall junction looks. Besides, our vision system is really good, and we can detect slight light patterns and imperfections even in very uniform surfaces. Unfortunately algorithms can’t do that yet.

Obviously, we’re not free from optical illusions and perception problems either, especially when surfaces are highly transparent:

But let’s stay on topic.

Why not? Everyone is using it!… That would be the easy answer.

In reality, the availability of very large datasets, especially those that pair colour images with depth images from depth sensors (which can measure depth even on featureless regions), naturally enables the use of Deep Networks to learn 3D structure. In fact, when a suitable deep depth estimation architecture is fed with thousands and thousands of examples, it surprisingly starts to exhibit some of our abilities, such as perceiving depth from monocular views (one eye closed) and estimating depth on featureless regions.

**But why is that? What makes Deep Learning suitable for the task?**

That’s a difficult question to answer, and researchers are still investigating why it performs so well, but the general belief is that convolutional architectures (picture below) are well suited to image data because they capture spatial relations among neighbouring pixels. Stacking layers with non-linear activation functions helps the network learn increasingly abstract features, which improves the expressive power of the model.

There are a lot of very good tutorials and books to learn about Deep Networks (Deep Learning Book and cs231n just to name a few), but very briefly, they consist of multiple layers trained to minimise a loss function, which measures how close we are to “true” labels, on the training dataset. The objective function varies with the problem and the choice made by the designer of the network. In the case of depth prediction, usually considered a regression problem, a loss function which measures the difference between the predicted depth and the true one is used.
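To make the idea of a depth regression loss a bit more concrete, here is a minimal NumPy sketch. It is only an illustration, not the loss used by any of the networks mentioned in this post: the function name `depth_l1_loss`, the choice of a mean absolute error, and the convention that a ground-truth depth of 0 means "no measurement" are all assumptions.

```python
import numpy as np

def depth_l1_loss(pred, target, valid_min=1e-6):
    """Mean absolute difference between predicted and true depth.

    Assumption: pixels whose ground-truth depth is <= valid_min carry no
    sensor measurement and are left out of the average.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    valid = target > valid_min
    if not valid.any():
        # No usable ground truth in this image: nothing to penalise.
        return 0.0
    return float(np.abs(pred[valid] - target[valid]).mean())

# Toy 2x2 "depth maps" in metres; one ground-truth pixel is missing (0.0).
predicted = np.array([[1.0, 2.0], [3.0, 4.0]])
ground_truth = np.array([[1.5, 0.0], [3.0, 5.0]])
loss = depth_l1_loss(predicted, ground_truth)
```

During training, a value like `loss` would be computed on batches of images and minimised with respect to the network weights. Other choices (squared error, errors in log-depth) fit the same pattern.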

It is perhaps surprising that we can make these complex behaviours emerge from just a single cost function, without any modeling assumption on how to handle featureless walls or anything else, which highlights the power (and mystery) of Deep Learning.

Recent and past research has focused on trying to predict depth from just a single view. This is an “ill-posed” problem: an infinite number of scene geometries could generate the same image. For example, think about gradually “enlarging” an object while moving away from it; its apparent size in the image will stay exactly the same.
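The "enlarging while moving away" ambiguity can be checked numerically with a simple pinhole camera model. This is a small sketch under assumed values: the focal length, principal point and the four 3D points are made up for the example.

```python
import numpy as np

def project(points, focal=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of Nx3 camera-frame points to Nx2 pixel coordinates.

    The intrinsics (focal, cx, cy) are arbitrary illustrative values.
    """
    points = np.asarray(points, dtype=np.float64)
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    return np.stack([focal * x / z + cx, focal * y / z + cy], axis=1)

# Four corners of a small object between 2 m and 3 m from the camera.
small_near = np.array([
    [0.5, 0.5, 2.0],
    [-0.5, 0.5, 2.0],
    [0.5, -0.5, 3.0],
    [-0.5, -0.5, 3.0],
])

# The same object made three times bigger and three times further away.
large_far = 3.0 * small_near

pixels_near = project(small_near)
pixels_far = project(large_far)
same_image = np.allclose(pixels_near, pixels_far)
```

Both objects land on the same pixels, so `same_image` is true: from a single image alone, nothing distinguishes the two scenes, which is why a monocular network has to rely on learned context.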

In general, the problem is even more complex since there are about as many degrees of freedom as pixels in the image. We actually use a lot of learned “context” that we embed in our inference process (e.g., how big a specific object should be, or what the depth relation between neighbouring points is); we have learned these concepts because we know how perspective works and because we continuously observe the world over time. This also means that if we are presented with a specifically constructed optical illusion, our minds have a really hard time figuring out how it all works:

Many interesting methods have been proposed for monocular depth prediction, for example FCRN, Eigen et al., DORN. Even though deeper architectures (such as ResNet-50 in FCRN) generally perform very well on external benchmark datasets (NYUD, SceneNN) and overall generate satisfactorily accurate depth maps, our tests with these architectures show that the current state of the art is not able to achieve the accuracy and robustness we need (see figures below). One possible reason may be that these deep architectures suffer from poor generalization performance. If training is performed on a specific dataset, such as NYUD, and the network is tested on a completely different one - different in terms of environments, field of view, context in the image and/or lighting - the network may not generalize well and may thus perform poorly. Generalization performance is also correlated to how well and robustly the network approximates the function we want. Deep networks are mainly black boxes trained end to end, which don’t let us peek easily into what is being learnt, making interpretation of their results a rather challenging task.

As we can see from the pictures above, there are also issues in constraining depth estimation on featureless walls. Even though the network surprisingly understands that there is some correlation between pixels on the white wall, it fails to understand that it is a planar surface and thus generates a smooth wobbly surface. A possible mitigation of these issues could be retraining on the rooms we are collecting, which is something worth investigating in the future.

As I mentioned earlier, our fantastic brain actually sorts many things out by letting us explore the world and observe it from multiple points of view, while in the background a “model” of the world is continuously built and updated. In computer vision and robotics terms, this goes by the name of “SLAM” - simultaneous localization and mapping.

While a variety of algorithms have been developed for various sensors, a reliable and fast monocular (one camera is cheaper) dense SLAM is still hard to achieve. Notably, a few methods perform quite well (DTAM and Remode), but they require big GPUs, modeling assumptions and approximations to solve complex optimization problems.

What if we combined Deep Learning with SLAM? That is at the basis of many recent works (CNN-SLAM and CodeSLAM to mention a few) and they obtain very good results, combining traditional geometric methods and deep learning.

Point-cloud resulting from fusion and filtering of multiple monocular depth images

Initial tests fusing independent depths coming from deep monocular depth estimates did not seem to result in more accurate point-clouds (figure above). Even though this method seems on the right track to build full 3D point-clouds, we believe that more coherent and consistent depth maps will make this fusion process converge more easily to the correct depth. It seems that for now the accuracy we need is beyond what monocular approaches can offer.
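For readers unfamiliar with how depth images turn into a point cloud, the basic geometric step can be sketched as follows. This is not our pipeline: the helper names `backproject` and `fuse`, the pinhole intrinsics and the camera poses are hypothetical, and the filtering step is left out entirely.

```python
import numpy as np

def backproject(depth, focal, cx, cy):
    """Turn an HxW depth map into an Nx3 point cloud in the camera frame.

    Pixels with non-positive depth are treated as invalid and dropped.
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]          # pixel row (v) and column (u) indices
    x = (u - cx) * depth / focal
    y = (v - cy) * depth / focal
    points = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]

def fuse(depth_maps, poses, focal, cx, cy):
    """Naive fusion: move every cloud into the world frame and concatenate.

    Each pose is an assumed (R, t) pair mapping camera coordinates to world
    coordinates, i.e. p_world = R @ p_cam + t.
    """
    clouds = []
    for depth, (rotation, translation) in zip(depth_maps, poses):
        cam_points = backproject(depth, focal, cx, cy)
        clouds.append(cam_points @ rotation.T + translation)
    return np.vstack(clouds)

# Two tiny 2x2 depth maps seen from two made-up camera positions.
depth_a = np.full((2, 2), 2.0)
depth_b = np.full((2, 2), 2.0)
poses = [
    (np.eye(3), np.zeros(3)),
    (np.eye(3), np.array([0.0, 0.0, 1.0])),
]
cloud = fuse([depth_a, depth_b], poses, focal=1.0, cx=0.5, cy=0.5)
```

Simply concatenating clouds like this makes the problem visible: if the individual depth maps disagree with each other, the fused cloud contains several inconsistent copies of the same wall.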

As usual in these research fields, it doesn’t take long for people to come up with new ideas. Stereo, as explained in our previous blog post, is the traditional way to infer depth and has far fewer degrees of freedom than monocular depth prediction. In fact, if we have perfect correspondences and knowledge of the cameras, we can recover the scene up to a single scaling factor. If we formalize the problem in a less “ill-posed” way than monocular depth prediction, there is a higher chance that we will be able to learn to predict depth more robustly, provided we have enough training data and a suitable architecture.
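For the simplest stereo setup, a rectified pair of identical cameras, the relation between correspondences and depth is the classic `depth = focal * baseline / disparity`. The sketch below uses made-up numbers and a hypothetical helper name; it also shows where the single scaling factor comes from.

```python
import numpy as np

def depth_from_disparity(disparity, focal_px, baseline):
    """Depth for a rectified stereo pair: depth = focal_px * baseline / disparity.

    Zero (or negative) disparity is mapped to infinite depth.
    """
    disparity = np.asarray(disparity, dtype=np.float64)
    depth = np.full(disparity.shape, np.inf)
    positive = disparity > 0
    depth[positive] = focal_px * baseline / disparity[positive]
    return depth

# Disparities in pixels for three correspondences (illustrative values).
disparities = np.array([50.0, 25.0, 0.0])

# With a 10 cm baseline the first two points are about 1 m and 2 m away.
depth_10cm = depth_from_disparity(disparities, focal_px=500.0, baseline=0.1)

# If the baseline were actually 20 cm, every depth would simply double.
depth_20cm = depth_from_disparity(disparities, focal_px=500.0, baseline=0.2)
```

The same correspondences give the same scene shape for any baseline; only the overall scale changes. That is the one remaining degree of freedom, compared with roughly one per pixel in the monocular case.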

Many recent research works (GC-Net, MVDepthNet, DeepTAM) try to use single or multiple pairs of stereo images to infer depth. Different formulations exist, but what all of the approaches have in common is the combination of classic “geometric” concepts with techniques that can learn from large amounts of data. One of the main problems is understanding which features or points in one image correspond to the same points in another image. In particular, a formulation that I found interesting recently uses built-in depth volumes, similar to DTAM, where evidence from the disparities of each pair of pixels is accumulated in the volume. The function that we’re trying to learn is then the mapping between this cost volume and the depth. Combining everything in our mapping pipeline, described briefly in our previous blog post, is then fairly straightforward.
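To give a feel for what a cost volume is, here is a toy version for a rectified pair. It is a hand-written stand-in, not the formulation of any of the papers above: the per-pixel absolute difference, the synthetic images and the final `argmin` are all assumptions made for illustration. In the learned approaches, the `argmin` step is what gets replaced by a network that maps the volume to depth.

```python
import numpy as np

def build_cost_volume(left, right, max_disp):
    """Stack one matching-cost image per candidate disparity.

    volume[d, y, x] compares left pixel (y, x) with right pixel (y, x - d).
    Positions where x - d falls outside the image keep an infinite cost.
    """
    h, w = left.shape
    volume = np.full((max_disp, h, w), np.inf)
    for d in range(max_disp):
        volume[d, :, d:] = np.abs(left[:, d:] - right[:, : w - d])
    return volume

# Synthetic pair: the right image is the left one shifted by 3 pixels,
# i.e. every point in the scene has a true disparity of 3.
rng = np.random.default_rng(0)
true_disp = 3
left = rng.random((8, 32))
right = np.roll(left, -true_disp, axis=1)

volume = build_cost_volume(left, right, max_disp=6)

# Winner-take-all: pick the disparity with the lowest cost at each pixel.
disparity = volume.argmin(axis=0)
```

On this synthetic pair the winner-take-all result recovers the shift of 3 pixels everywhere except near the left border, where some candidates fall outside the image. On a real white wall many candidates have almost the same cost, which is exactly where a learned mapping from volume to depth can help.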

Our tests on a few of our rooms show that deep learning stereo significantly improves on our previous monocular depth predictor: the point clouds are of higher quality and show more accurate depth even in areas with fewer features.

Deep Stereo reconstruction of a room from SceneNN dataset

Traditional Stereo Reconstruction

Deep Learning Stereo Reconstruction

Deep learning, whether we like it or not, is shaking up the foundations of many fields, including 3D Computer Vision. We’re seeing many promising methods popping up, which suggests that this field is moving quite quickly, maybe also thanks to Deep Learning.

At DigitalBridge we value research and are always interested in state of the art methods that can benefit our products.

The problem is still far from being solved and we hope to contribute to research in this field with new methods or applications, altogether pushing the boundaries of what is possible.

In the meantime, we will try to get even better point clouds by combining what we learnt from these approaches with additional insights, such as structural constraints for featureless areas.

As our user base grows and we obtain more data, we will be able to tune the experience for every customer of our products.