At DigitalBridge we create products that aid the design and visualisation of kitchens, bedrooms and bathrooms. Using sophisticated computer vision algorithms we allow a user to design and visualise their new kitchen, bedroom or bathroom in their existing room. By understanding the user’s current space we can make informed decisions on how to assist them in this process.
Scene understanding is the process of using a photo (or set of photos) to return semantic knowledge of a scene’s contents. One aspect of scene understanding is determining the objects in a photo. This may be performed at the object level, through object detection, or at the pixel level through semantic segmentation. This blog post will focus on the latter.
For each pixel in an image, semantic segmentation estimates, for each class in a defined set of object classes, the probability that the pixel belongs to that class. Each pixel can then be assigned a label by finding its most likely class. In the example below, taken from the VOC2012 dataset, the dog is segmented from the chair.
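As an illustration, this final labelling step is just an argmax over the class dimension. A minimal NumPy sketch, where random values stand in for real network output:

```python
import numpy as np

# Hypothetical network output: per-pixel probabilities for 21 VOC classes
# over a small 4x4 image (random values stand in for real predictions).
num_classes, height, width = 21, 4, 4
rng = np.random.default_rng(0)
probs = rng.random((num_classes, height, width))
probs /= probs.sum(axis=0, keepdims=True)  # normalise so each pixel sums to 1

# Assign each pixel the label of its most likely class.
labels = probs.argmax(axis=0)
print(labels.shape)  # (4, 4)
```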
But why are pixel-wise predictions useful to DigitalBridge? They allow us to perform operations on only those pixels that belong to a specific class. For example, if we can determine which pixels form a wall, we can redecorate those pixels with the user's chosen wallpaper. Alternatively, if a user wishes to visualise how a new suite would look in their bathroom, they may choose to remove all existing suite pixels from a photograph.
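As a toy sketch of this idea, assuming we already have a per-pixel label map and a hypothetical wall class index:

```python
import numpy as np

# Hypothetical inputs: an RGB image and a per-pixel label map from
# semantic segmentation, with class 1 standing in for "wall".
WALL = 1
image = np.zeros((4, 4, 3), dtype=np.uint8)
labels = np.array([[1, 1, 0, 0],
                   [1, 1, 0, 0],
                   [0, 0, 2, 2],
                   [0, 0, 2, 2]])

# Recolour only the wall pixels with a flat "wallpaper" colour; a real
# product would sample a wallpaper texture rather than a single colour.
wallpaper_colour = np.array([200, 180, 150], dtype=np.uint8)
result = image.copy()
result[labels == WALL] = wallpaper_colour
```

The boolean mask `labels == WALL` is what restricts the operation to a single class; every other pixel is left untouched.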
Existing approaches to semantic segmentation use a deep convolutional neural network. The details of deep convolutional neural networks are beyond the scope of this blog post, but essentially they are a type of machine learning model that maps input to output. In this context, the input is an image and the output a set of pixel-wise predictions.
New semantic segmentation algorithms are typically assessed by the mean Intersection over Union (mIoU) on the VOC2012 dataset. The IoU is calculated for each class at the pixel level as:

IoU = 100 × true positives / (true positives + false positives + false negatives)
where true-positives are pixels that belong to the class and are correctly predicted as the class; false-negatives are pixels that belong to the class but are predicted as a different class; and false-positives are pixels that belong to a different class but are predicted as the class. The image below illustrates these sets of pixels for the dog in the image above. In the Difference image, true-positives are shown in yellow, false-positives in red and false-negatives in green.
The IoU is a value between zero and 100, where a larger value indicates a more accurate segmentation. The mIoU is then the mean value across all the classes in the dataset.
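The per-class IoU and the mIoU can be computed directly from a predicted label map and a ground-truth label map. A minimal NumPy sketch on a toy 2×2 example with two classes:

```python
import numpy as np

def iou(pred, target, cls):
    """Per-class Intersection over Union, as a value between 0 and 100."""
    tp = np.sum((pred == cls) & (target == cls))  # correctly predicted as cls
    fp = np.sum((pred == cls) & (target != cls))  # predicted cls, actually other
    fn = np.sum((pred != cls) & (target == cls))  # actually cls, predicted other
    denom = tp + fp + fn
    return 100.0 * tp / denom if denom else 0.0

# Toy 2x2 label maps: ground truth and a prediction with one error.
target = np.array([[0, 0], [1, 1]])
pred = np.array([[0, 1], [1, 1]])

per_class = [iou(pred, target, c) for c in (0, 1)]
miou = np.mean(per_class)  # mean over all classes in the dataset
```

Here class 0 scores 50 (one pixel correct, one missed) and class 1 scores roughly 66.7 (two correct, one false-positive), giving an mIoU of about 58.3.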
The state-of-the-art on the VOC2012 dataset achieves an mIoU of 86.9; however, the 21 classes in the VOC2012 dataset are not limited to furniture items and as such are unsuitable for our use here at DigitalBridge. Consequently, we wish to adapt a state-of-the-art model to our needs.
Transfer learning is the process of adapting an existing model to a new task. In our case we wish to adapt an existing model, pre-trained for semantic segmentation on the VOC2012 dataset, to a new dataset: NYUd 40, a subset of NYUd v2. NYUd 40 comprises 40 classes commonly found in indoor rooms: structure classes such as wall and floor, furniture classes such as sofa and bed, and prop classes such as books and clothes. Transfer learning proceeds by providing examples from the new dataset to the model and letting it adapt to the new pixel-wise mapping. When the new dataset is small, transfer learning can outperform training a model from scratch, because parts of the pre-trained model are generic and independent of the task.
The model achieved an mIoU of 49.0. The visual accuracy of the segmentation is illustrated in the images below. The reduction in performance against VOC2012 can be partially explained by the increase in the number of classes from 21 to 40. It is also exaggerated by poor performance on certain classes. This will be explored further below, but first we will consider some qualitative results.
Below are three original images and their corresponding algorithm output images. In the algorithm output images each pixel is coloured by the class of its highest probability.
The algorithm output for three images from the NYUd 40 dataset. Left) The original images. Right) The algorithm output. Each pixel is coloured by the class with the highest probability.
In general the model is able to distinguish the structure classes and the larger furniture classes. For example, in the second image, the dark green segment indicates the table. However, it struggles with smaller props that ‘decorate’ the room, for example the objects on the coffee table in the first image. This is supported by the class IoUs listed in the table below. In general the structure and furniture classes have relatively high IoUs, whereas props such as clothes, box and bag have relatively low IoUs.
Some of the mid-range IoUs can be explained by conceptual similarity between classes. For example, a curtain and a blind are both window coverings that even a human may struggle to distinguish. In future, better results could be achieved by reducing the set of possible classes to only those that are conceptually distinct. For instance, we might merge the blind and curtain classes into a window-coverings super-class.
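A sketch of such a merge, applied to a label map before scoring. The class indices here are hypothetical, chosen purely for illustration:

```python
import numpy as np

# Hypothetical class indices: curtain and blind map to one super-class.
CURTAIN, BLIND, WINDOW_COVERING = 16, 17, 40
merge = {CURTAIN: WINDOW_COVERING, BLIND: WINDOW_COVERING}

# A toy label map containing a curtain pixel, a blind pixel and two others.
labels = np.array([[16, 17],
                   [3, 3]])

# Remap conceptually similar classes; unlisted classes pass through unchanged.
merged = np.vectorize(lambda c: merge.get(c, c))(labels)
```

After merging, a curtain pixel predicted as a blind (or vice versa) no longer counts against the IoU, since both fall in the same super-class.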
These results are promising and with further investigation the model could be used in one of DigitalBridge’s products. Further to this, it should be noted that semantic segmentation is an active research field and new improved models are frequently released. With an improved model, we should expect better results. At DigitalBridge we are well placed to recognise the exciting new developments in the semantic segmentation field and translate these into kitchen, bedroom and bathroom visualisations.