How we use Convolutional Neural Networks to estimate the layout of a room
At DigitalBridge, one of our core technologies in the area of machine learning and Artificial Intelligence (AI) is the Fully-Convolutional Deep Neural Network. While this is a daunting name, it refers to a set of conceptual building blocks, which this post aims to describe in an accessible way. We will start by describing a single neuron, then show how multiple neurons can be chained together to form a neural network. From there we will go on to describe what it means for a neural network to be deep, and finally for it to be Fully-Convolutional.
Key tasks for which we use these systems in production include layout estimation and segmentation of rooms. Segmentation lets us divide your room into meaningful regions, so that we can apply wallpaper to a region without crossing the boundaries of objects such as pictures and windows. This post covers our Room Layout Estimation, which is also based on segmentation. In contrast with our full segmentation system (which works with any number of segments of any type of object or category), the Room Layout Estimation system is only concerned with locating walls, ceilings, and floors in your photos, which in turn allows us to create a 3D representation of your room. Below we can see a colour-coded example of a segmentation used for Room Layout Estimation.
So what is a Neural Network? To begin with, I’ll describe the simplest case of a single ‘neuron’ via a model called Logistic Regression. In Machine Learning we typically refer to items of data as ‘examples’, and to every bit of information that we know about them as ‘features’. Often we also know what type of thing an example is. We refer to the set of possible types as the ‘classes’, and to the class of a particular example as its ‘label’.
Imagine that we have a dataset of 100 people’s weights and heights, and that all of these people are either basketball players or sumo wrestlers. If we were to plot this data it may look something like this, with red crosses being basketball players, and blue circles being sumo wrestlers:
Our Logistic Regression model learns to find a line to separate these two classes, so in future we can take new data about sports people and decide which class they belong to. To define the line we need three numbers known as ‘parameters’. In this case they are a weight for each feature and a ‘bias’ which can be thought of like the intercept of the line on the graph. The model outputs a probability that a given example belongs to one of the classes, which can also be seen as a confidence.
The predicted probability comes from feeding our weighted sum (plus bias!) into an ‘activation function’, in this case a function called the Logistic Sigmoid, seen below.
The line corresponds to feature values where the model is unsure to which of the two classes a point along it would belong. Finding the best parameters to describe a set of data is called ‘Learning’, and after learning we can view the corresponding line our neuron has found, plotted along with our feature values.
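The prediction step described above can be sketched in a few lines of NumPy. This is only an illustration: the parameter values are made up by hand rather than learned, and the choice of basketball player as the "positive" class, along with the names `sigmoid` and `predict_proba`, are my own assumptions.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: squashes any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(features, weights, bias):
    """Probability that each example belongs to the positive class.

    The weighted sum (plus bias) is fed through the activation function.
    """
    return sigmoid(features @ weights + bias)

# Hypothetical parameters, picked by hand for illustration (not learned).
# Three numbers define the line: one weight per feature, plus the bias.
weights = np.array([0.05, -0.04])  # for height (cm) and weight (kg)
bias = -5.0

examples = np.array([
    [200.0, 100.0],  # tall and comparatively light
    [175.0, 150.0],  # shorter and heavier
])

# Assumed convention: values near 1 mean "basketball player",
# values near 0 mean "sumo wrestler".
probabilities = predict_proba(examples, weights, bias)
print(probabilities)
```

The separating line is the set of points where the weighted sum plus bias is exactly zero, which is where the sigmoid outputs 0.5 and the model is undecided.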
For our room layout estimation we wish to classify every pixel in an image into one of five categories. These are floor, ceiling, left wall, centre wall, and right wall (we make the assumption that we will see at most 3 walls in a single photo).
This raises a couple of questions:
- How do we discriminate between five different classes?
- Where do our features come from? Can a single pixel tell us anything about ‘wallness’ vs. ‘ceilingness’?
The first one has a simple solution. We train five neurons, and each of them learns to discriminate between its own class and the rest. So one of them learns to distinguish ceiling vs. not-ceiling, one learns floor vs. not-floor, etc. We then take the most confident neuron to be the prediction.
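As a rough sketch of this "one neuron per class" idea, the snippet below runs five logistic neurons on the same features and picks the most confident one. The weights and the three-feature input are hypothetical values chosen by hand, and `predict_class` is a name I have made up for illustration.

```python
import numpy as np

CLASSES = ["floor", "ceiling", "left wall", "centre wall", "right wall"]

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_class(features, weights, biases):
    """Run one neuron per class and return the most confident class.

    weights has one row per class, biases has one entry per class.
    """
    confidences = sigmoid(weights @ features + biases)
    return CLASSES[int(np.argmax(confidences))], confidences

# Hypothetical parameters for five neurons looking at three features.
weights = np.array([
    [1.0, -1.0, 0.0],    # floor vs. not-floor
    [-1.0, 1.0, 0.0],    # ceiling vs. not-ceiling
    [0.0, 0.0, 1.0],     # left wall vs. rest
    [0.5, 0.5, -1.0],    # centre wall vs. rest
    [-0.5, -0.5, 0.5],   # right wall vs. rest
])
biases = np.zeros(5)

features = np.array([2.0, 0.0, 0.5])
label, confidences = predict_class(features, weights, biases)
print(label, confidences)
```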
The second is not so straightforward, and the full details are beyond the scope of this post. What it boils down to is that we can chain together many layers of neurons that feed into one another: the activations output by the neurons in one layer define a new set of features for the next. Each layer learns to discover more informative features for the next layer, up until the final layer where the decision is made. This is the intuition behind the term ‘Deep Learning’: a deep hierarchy of neurons automatically learns informative features to solve the problem at hand.
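To make the chaining concrete, here is a minimal sketch of a forward pass through a small stack of layers. The layer sizes are arbitrary, the weights are random rather than learned, and the `forward` helper is hypothetical; the only point is that each layer's activations become the next layer's features.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward(features, layers):
    """Feed features through each layer in turn.

    Each layer is a (weights, biases) pair with one row of weights per
    neuron. The activations of one layer are the features of the next.
    """
    activations = features
    for weights, biases in layers:
        activations = sigmoid(weights @ activations + biases)
    return activations

# Assumed sizes: 3 input features, two hidden layers of 8 neurons,
# and a final layer of 5 neurons (one per class).
layer_sizes = [3, 8, 8, 5]
rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(n_out, n_in)), np.zeros(n_out))
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])
]

output = forward(np.array([0.2, 0.5, 0.9]), layers)
print(output.shape)
```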
Now that we have covered the Deep Neural Network part, I’ll outline what Fully-Convolutional means. For images we begin with three features for every pixel: a red, a green, and a blue value. Taken individually these tell us very little about the image, but by looking at groups of neighbouring pixels we can understand a little bit more.
A convolution involves arranging our weights in a ‘kernel’. The kernel has a height and a width, and so views a small window of our image. To produce a feature or prediction from a 3x3 window we need one weight for each feature of each pixel in the window, that is 9 times the number of features per pixel, plus our ‘bias’ from before. So in the first layer of our neural network, where each pixel has 3 features, one convolutional neuron has 3x3x3+1 = 28 parameters: height, times width, times features, plus bias. We slide our kernel across the image to get one prediction, or feature, for every 3x3 group of neighbouring pixels. Each layer can have any number of convolutional neurons, so a layer can detect several features at every position.
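The sliding-window computation and the parameter count can be sketched as follows. This is a deliberately naive loop for a single convolutional neuron, with no padding and no activation function; the averaging kernel and the tiny all-ones image are made-up inputs, and `conv2d_single_neuron` is a hypothetical helper name.

```python
import numpy as np

def conv2d_single_neuron(image, kernel, bias):
    """Slide one kernel over an image of shape (height, width, features).

    Returns one value per window position. Without padding, the output
    is slightly smaller than the input.
    """
    kernel_h, kernel_w, _ = kernel.shape
    out_h = image.shape[0] - kernel_h + 1
    out_w = image.shape[1] - kernel_w + 1
    output = np.empty((out_h, out_w))
    for row in range(out_h):
        for col in range(out_w):
            window = image[row:row + kernel_h, col:col + kernel_w, :]
            # Weighted sum over the whole window, plus the bias.
            output[row, col] = np.sum(window * kernel) + bias
    return output

# A 3x3 kernel over 3 features (red, green, blue) that averages its window.
kernel = np.full((3, 3, 3), 1.0 / 27.0)
bias = 0.0
n_parameters = kernel.size + 1  # 3 * 3 * 3 weights + 1 bias = 28

image = np.ones((5, 6, 3))  # tiny stand-in for an RGB photo
feature_map = conv2d_single_neuron(image, kernel, bias)
print(n_parameters, feature_map.shape)
```

In practice a library would do this far more efficiently, and would usually apply an activation function to the result, but the arithmetic is the same.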
Traditionally Convolutional Neural Networks would take an image, learn features using multiple stacked convolutional layers and then classify the entire image into a category by aggregating all of the features in the final convolutional layer together. As such it would produce its final output from a ‘dense’ layer, which looked at the features from every point over the image. In contrast a Fully-Convolutional network has no final dense layer, and predicts a class for every point in the image. This allows us to generalise the problem from ‘what object is in this image’, to ‘what objects are in this image, and where are they’.
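The difference between the two kinds of final layer can be illustrated with array shapes. In this sketch the feature values and weights are random placeholders, the sizes are arbitrary, and the per-pixel classifier is written as a 1x1 convolution, which is one common way to build such a layer rather than a description of our production network.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width, n_features, n_classes = 4, 6, 8, 5

# Stand-in for the features produced by the last convolutional layer.
features = rng.normal(size=(height, width, n_features))

# Traditional: a dense layer sees every feature at every position at once
# and produces a single set of class scores for the whole image.
dense_weights = rng.normal(size=(height * width * n_features, n_classes))
image_scores = features.reshape(-1) @ dense_weights
image_label = int(np.argmax(image_scores))

# Fully-Convolutional: the same small set of weights is applied at every
# position (a 1x1 convolution), giving class scores for every pixel.
pixel_weights = rng.normal(size=(n_features, n_classes))
pixel_scores = features @ pixel_weights        # (height, width, n_classes)
label_map = np.argmax(pixel_scores, axis=-1)   # (height, width)

print(image_scores.shape, label_map.shape)
```

Note that the dense layer's weight count depends on the image size, while the per-pixel layer's does not.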
Back to Room Layout Estimation: the convolutional neural network assigns a class to each pixel in the image. These pixels are grouped into segments as shown in Fig. 1 (b). Given such a segmentation, how do we extract the room layout? One approach is to draw lines between adjacent segments, and then find the polygons that represent the walls, ceiling and floor. This approach, though simple, can be inaccurate because the neural network occasionally misclassifies pixels.
A more robust approach is to incorporate some features from the image itself along with the result from the neural network. Lines can be extracted from the image by looking at the spatial gradient of pixel intensity values. Lines that are parallel in the 3D world converge to a single point when projected onto the 2D image plane. We call these points vanishing points, and they can be used to find the lines where two planes meet in the image, for example the ceiling and a wall.
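As a small illustration of the geometry, the sketch below finds where two image lines meet using homogeneous coordinates, a standard trick in projective geometry. It is not necessarily how our system computes vanishing points; the point coordinates are invented and the helper names are hypothetical. A real system would also have to combine many noisy line segments rather than two exact lines.

```python
import numpy as np

def line_through(p, q):
    """Homogeneous representation of the line through image points p and q."""
    return np.cross([p[0], p[1], 1.0], [q[0], q[1], 1.0])

def intersection(line_a, line_b):
    """Point where two image lines meet, or None if they are parallel."""
    x, y, w = np.cross(line_a, line_b)
    if abs(w) < 1e-12:
        return None  # parallel in the image: no finite meeting point
    return np.array([x / w, y / w])

# Two hypothetical edges, e.g. the top and bottom of a wall, which are
# parallel in 3D but converge in the photo.
edge_a = line_through((0.0, 0.0), (2.0, 1.0))
edge_b = line_through((0.0, 6.0), (2.0, 5.0))

vanishing_point = intersection(edge_a, edge_b)
print(vanishing_point)
```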
Combining these two sources of information — the output from the convolutional neural network and the vanishing points — we are able to estimate the layout of the room as shown below in Fig. 6.