To help anyone design a new kitchen or bathroom with ease, we wanted a better solution than manually measuring a room. DigitalBridge lets you automatically capture a dense 3D point cloud of a room; feeding that cloud into our floor plan estimation algorithm returns a complete 3D floor plan. We also use dense 3D reconstructions of a customer’s room to enhance our augmented and virtual reality products.
Over the next couple of months, we’ll illustrate in detail how we use stereo vision and deep learning to obtain dense 3D point clouds. However, in this post we’re going to give an overview of the most commonly used techniques to acquire 3D information, briefly illustrate the theory behind them and highlight their strengths and weaknesses.
For context, we have a wealth of experience with compact devices that generate 3D information, from structured-light scanners such as Google Tango (which produce very accurate 3D point clouds, and which we used when we first developed our floor plan estimation algorithm) to the Kinect and Occipital’s Structure Sensor. However, these devices aren’t compact enough to carry around, and we probably couldn’t convince customers to buy a Kinect just to use our products. For these reasons, we shifted our attention to devices that everyone already owns: smartphones. Most smartphones don’t have depth cameras yet, although manufacturers are steadily integrating them. Furthermore, the newest models already embed very powerful augmented reality platforms such as ARKit and ARCore, which provide visual SLAM capabilities. With a little effort, we can take the information provided by ARKit and ARCore, combine it with passive depth prediction techniques such as stereo vision, and obtain a 3D point cloud from a phone without having to carry around other bulky devices.
Depth perception, which is the basis of 3D point cloud reconstruction, is the ability of a system (biological or artificial) to determine how far things are from the observer and to retrieve 3D information about the state of the environment. This ability is essential for living beings to traverse the environment safely without colliding with dangerous objects. Besides safe navigation, depth perception is important for gaining a general understanding of the surroundings and interacting with them properly. For example, a mobile robot can make a time-optimal plan to collect different objects if it knows their relative distances.
The technological advancement of artificial devices that perceive depth has, for the most part, been inspired by nature. Over thousands of years, different species have developed different biological techniques to collect 3D information and cope with their surroundings. These techniques can be either active or passive, and each has its pros and cons. Active techniques generally take measurements in a controlled way, so they tend to be more accurate than passive ones. Moreover, active techniques have a relatively simpler principle of operation than their passive counterparts, which require more effort for calibration and inference. However, these advantages come at the expense of manufacturing cost: a passive device like a camera is significantly cheaper than an active laser scanner, by two to three orders of magnitude.
An active system transmits a signal, of some form, towards the target object, then measures and interprets the reflected signal to determine the depth.
An example of such a system is echolocation, used by bats and dolphins: they transmit pulses of ultrasound that propagate through the environment. Upon reflecting off targets, these waves are captured by the bat or dolphin and translated into distance. Sonar is the equivalent artificial device that does exactly that, and the theory behind it is pretty simple: if $v$ is the speed of sound, and $t$ is the time of flight from the instant of transmission to the instant of reception, then the distance $d$ between the sonar and the target satisfies $2d = vt$. Generally, the speed of sound $v$ propagating through a certain medium, e.g. air, is known, and by measuring the time $t$, the distance $d$ is simply half the product of these two quantities.
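The time-of-flight relation above fits in a couple of lines; here's a minimal sketch (the default of 343 m/s, the speed of sound in air at roughly 20 °C, is our illustrative assumption):

```python
def sonar_distance(t, v=343.0):
    """Distance to a target from the round-trip time of an ultrasound pulse.

    t: time of flight in seconds (transmit to receive)
    v: speed of sound in the medium, metres per second
    """
    # 2d = v * t  ->  d = v * t / 2
    return v * t / 2.0
```

For example, a pulse that returns after 10 ms corresponds to a target roughly 1.7 m away.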
The sonar output is the distance to the closest object within the cone generated by the propagated ultrasound wave. This phenomenon introduces uncertainty in the exact location of the target relative to the sonar.
Pros and cons: Sonar is already quite accurate, but by using a different type of wave it’s possible to gain even more information about the location of the target; an example of such a device is a LIDAR. Nonetheless, sonar devices are relatively small and quite affordable compared with LIDAR. A few interesting projects have proposed embedding a sonar in a smartphone, but this kind of technology is still mainly used for ground and underwater vehicles.
A laser scanner, also known as LIDAR, uses focused electromagnetic waves, i.e., laser, to measure the phase shift between the transmitted and received signals to estimate the distance to the target. Typically, a rotating mirror is carefully mounted near the sensor to measure the distance at different angles which results in a 2D point cloud as shown below. Upgrading to a 3D laser scanner requires transmitting different laser beams at different vertical angles, and then applying the same principle above.
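Each mirror angle yields one range reading, so a full sweep is a list of (angle, range) pairs that can be converted into the 2D point cloud mentioned above. A minimal sketch (the function and parameter names are ours, chosen for illustration):

```python
import math

def scan_to_points(ranges, angle_min, angle_increment):
    """Convert a 2D laser sweep into Cartesian points in the sensor frame.

    ranges: list of measured distances, one per beam
    angle_min: angle of the first beam, in radians
    angle_increment: angular step of the rotating mirror, in radians
    """
    points = []
    for i, r in enumerate(ranges):
        theta = angle_min + i * angle_increment
        # Polar (r, theta) -> Cartesian (x, y)
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points
```

A 3D scanner does the same with an extra vertical angle per beam, adding a z-coordinate to each point.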
Pros and cons: Due to its high accuracy and long range, this type of scanner is very popular in self-driving cars and autonomous robots. However, devices that use this technique are usually bulky and quite expensive compared to others.
Another type of active depth technique is the structured-light scanner. It is similar to LIDAR in that it transmits electromagnetic waves and measures their reflections; however, the way the sensor interprets depth is completely different. To simplify the principle of operation, consider a laser projector, shown in the figure below, that transmits a beam towards the target at a known angle $\alpha$. The reflected beam is captured by a CMOS camera located at distance $b$ from the projector, which measures the position $u$. If the focal length of the camera $f$ is known, then by using similar triangles, and a bit of trigonometry, the depth $z$ is easily determined. This concept can be generalised to capture depth in 3D space by introducing another angle $\beta$ with respect to the $y$-axis. In practice though, instead of mounting a laser projector on top of a motorised rig, it’s easier to project a predefined pattern known as structured light towards the scene as shown below, then measure the reflected pattern. Matching the two patterns allows triangulation to estimate depth at each point of the transmitted grid.
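The single-beam triangulation can be sketched by intersecting the projector's ray with the camera's back-projected ray. This is a sketch under assumed sign conventions (projector at the origin, camera pinhole at $(b, 0)$ looking along $z$, $\alpha$ measured from the baseline, $u$ and $f$ in the same units):

```python
import math

def structured_light_depth(alpha, b, f, u):
    """Triangulate the depth z of a single projected beam.

    alpha: projection angle of the beam, radians from the baseline
    b: baseline between projector and camera
    f: camera focal length
    u: horizontal pixel offset where the reflected spot is observed
    """
    # Projector ray: x = z / tan(alpha)
    # Camera ray:    x = b + z * u / f   (back-projection of pixel u)
    # Setting the two equal and solving for z:
    #   z * (cot(alpha) - u / f) = b
    return b / (1.0 / math.tan(alpha) - u / f)
```

As a sanity check, a target at $(x, z) = (1, 2)$ seen with $b = 0.5$ and $f = 1$ lands at $u = 0.25$ and triangulates back to $z = 2$.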
Pros and cons: This technique works well for capturing 3D point clouds; however, it fails to determine the depth of reflective surfaces such as mirrors or glass, as the projected light does not reflect back towards the sensor. Furthermore, it is usually not as effective in outdoor environments due to its limited range. Devices that use structured-light patterns are much cheaper than LIDARs but more expensive than simple cameras. There might be, however, a promising future for this technology; in fact, the iPhone X front camera is already equipped with a structured-light scanner, which is used for face recognition tasks.
The dominant method for depth perception in terrestrial life is vision, where land-living beings use light to see and make sense of their environment. This type of technique is passive, i.e., it measures the signals coming directly from the target object without the need to transmit any source signal. Depth can be recovered from passive vision techniques using different approaches, either from a single image or from several images of the scene. For applications that run on smartphones or iPads, passive depth techniques are a more viable, and much cheaper, alternative to attaching any kind of projector to those devices.
Typically, humans recover depth information using two eyes via a process called stereopsis. Since the eyes are separated by some distance, each observes the same scene from a slightly different point of view. The disparity between the two images is interpreted by the brain as a depth map, where far points correspond to small disparities and closer points to large disparities.
Replicating stereopsis can be done artificially using two cameras separated horizontally by a distance $b$ known as the baseline. For each 3D point in the scene, if we know its corresponding pixel in the first image as well as in the second one, we can triangulate and recover the depth using simple geometry, as shown in the figure above. The most difficult step in this approach is matching each pixel in the first image with at most one pixel in the second image; this is known as the correspondence problem, and there are many methods in the computer vision literature that solve it efficiently.
Pros and cons: Stereo vision returns the best results in regions of high intensity gradient, e.g. edges or corners, which make the correspondence problem relatively easy to solve. However, it does not perform well in low-texture regions such as plain walls. On the other hand, when some assumptions are met (e.g., there is no change in illumination, the baseline is fixed and the cameras are calibrated), it is possible to implement this kind of technique on any smartphone with a built-in camera.
One might say: “if I cover my right eye and observe the surroundings with my left eye only, I can still determine, roughly, how far things are!” How can we explain that? Well, the answer is: our magnificent brains! Over time, as we observe the same object, an apple for example, many, many times, certain neurons in our brains fire more often than others, which strengthens certain connections between them. These strengthened connections resemble the act of learning a particular skill, in this case, judging the size and relative depth of an apple. That is the essence of deep learning depth prediction from a single image.
The advancement of computing technology, along with the internet revolution, paved the path to designing powerful artificial neural networks that are able to learn different skills in computer vision. An artificial neural network is a mathematical structure that consists of simple units stacked in layers like LEGO bricks. Each unit simulates a brain neuron and activates if the sum of its inputs exceeds a threshold. The learning process of such a network alters the weight of each input contributing to a neuron’s output. A neural network can be used to solve a classification problem, e.g., to identify the class of an object in an image, such as an apple or an orange. Another application of neural networks in computer vision is depth prediction. With the huge amount of images online, it became possible to train a network with many hidden layers, i.e. a deep neural network, to estimate disparity from a single image or pairs of images, and then determine the 3D structure of the scene.
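A single unit of such a network can be sketched in a few lines (a toy illustration, not how production depth networks are built; a real depth-prediction model stacks millions of these units):

```python
import numpy as np

def neuron(inputs, weights, bias):
    """One artificial unit: a weighted sum of inputs passed through
    a ReLU activation, so the unit only 'fires' when the weighted
    sum exceeds the threshold encoded by the bias."""
    activation = np.dot(inputs, weights) + bias
    return max(0.0, activation)
```

Training adjusts `weights` and `bias` so the unit fires for the patterns it should respond to, which is the artificial analogue of the strengthened connections described above.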
Pros and cons: The predicted depth is usually only correct up to scale, unless the camera intrinsic parameters are known, which means metric information cannot be retrieved very easily. This method, however, can predict the depth of featureless surfaces, unlike traditional stereo vision methods, which are effective only at edges or corners. It’s important to note that depth estimation using deep learning depends greatly on the structure of the network as well as the training dataset.
Both active and passive depth techniques have their pros and cons. To overcome these limitations, several works that combine active and passive techniques in a single system have been proposed. For example, this work adds a texture projector to a stereo setup, and the UltraStereo Depth System uses stereo vision, a structured-light sensor and some machine learning tricks to obtain a very accurate 3D structure of the scene. Although this is still an active research topic, many accurate and effective techniques have already been proposed, so we expect to see incredible applications that take advantage of these systems very soon.
Obtaining 3D information is a crucial part of DigitalBridge software. For example, we use a mixture of the above techniques to generate a simplified room structure and estimate a 3D floor plan. These techniques allow us to build a high-quality virtual tour of the room, which helps consumers kick-start their renovation project and gives them confidence in purchasing their new bathroom or kitchen.
In this blog post, we’ve illustrated some of the most popular techniques in obtaining 3D information. In the next couple of posts, we’ll explore passive technologies in more depth, so stay tuned!