How we use image semantic segmentation

At DigitalBridge we create products that aid the design and visualisation of kitchens, bedrooms and bathrooms. Using sophisticated computer vision algorithms, we allow a user to design and visualise their new kitchen, bedroom or bathroom in their existing room. By understanding the user’s current space, we can make informed decisions about how best to assist them in this process.

Scene understanding is the process of using a photo (or set of photos) to extract semantic knowledge of a scene’s contents. One aspect of scene understanding is determining the objects in a photo. This may be performed at the object level, through object detection, or at the pixel level, through semantic segmentation. This blog post will focus on the latter.

For each pixel in an image, semantic segmentation estimates the probability that the pixel belongs to each of a set of defined object classes. Each pixel can then be assigned a label by finding its most likely class. In the example below, taken from the VOC2012 dataset, the dog is segmented from the chair.

An example of semantic segmentation. Left) The original image. Right) The resulting semantic segmentation image, when the classes dog and chair are specified.
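To make the labelling step concrete, here is a minimal sketch assuming the model returns a per-pixel probability for each class. The class names and array shapes are illustrative, not the actual model output:

```python
import numpy as np

# Illustrative subset of the VOC2012 classes.
CLASSES = ["background", "chair", "dog"]

# Hypothetical model output: one probability per pixel per class
# (shape: height x width x num_classes), summing to 1 at each pixel.
probabilities = np.random.dirichlet(np.ones(len(CLASSES)), size=(4, 4))

# Assign each pixel the label of its most likely class.
labels = probabilities.argmax(axis=-1)   # shape: (4, 4), values 0..2
label_names = np.take(CLASSES, labels)   # map class indices to names

print(labels)
print(label_names)
```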

But why are pixel-wise predictions useful to DigitalBridge? They allow us to perform operations on only those pixels that belong to a specific class. For example, if we can determine which pixels form a wall, we can redecorate those pixels with the user’s chosen wallpaper. Alternatively, if a user wishes to visualise how a new suite would look in their bathroom, they may choose to remove all existing suite pixels from a photograph.
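As an illustrative sketch, once we have a per-pixel label map the redecoration step reduces to simple boolean masking. The `WALL` label id and the function name here are hypothetical:

```python
import numpy as np

WALL = 1  # hypothetical label id for the "wall" class

def redecorate_walls(image, labels, wallpaper):
    """Replace only the wall pixels of `image` with the wallpaper texture.

    image:     the original photo, shape (H, W, 3)
    labels:    per-pixel class ids from the segmentation model, shape (H, W)
    wallpaper: a tiled texture the same shape as the image
    """
    result = image.copy()
    wall_mask = labels == WALL            # boolean (H, W) mask of wall pixels
    result[wall_mask] = wallpaper[wall_mask]
    return result
```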

What’s out there?

Existing approaches to semantic segmentation use deep convolutional neural networks. The details of these networks are beyond the scope of this blog post, but essentially they are a type of machine learning model that maps an input to an output. In this context, the input is an image and the output is a set of pixel-wise class predictions.

A deep learning model is used to make class probability predictions for each pixel in the image. The image on the right visualises, for each pixel, the most likely class.
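For readers who want to experiment, a model of this kind can be loaded in a few lines. The sketch below uses torchvision’s DeepLabV3 as a stand-in, since the post does not name the exact architecture we use:

```python
import torch
from torchvision import models, transforms
from PIL import Image

# Load a segmentation model pre-trained on the 21 VOC2012 classes.
# DeepLabV3 is illustrative; it is not necessarily our production model.
model = models.segmentation.deeplabv3_resnet101(pretrained=True).eval()

# Standard ImageNet normalisation expected by the pre-trained backbone.
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("room.jpg").convert("RGB")   # any indoor photo
batch = preprocess(image).unsqueeze(0)          # shape: (1, 3, H, W)

with torch.no_grad():
    logits = model(batch)["out"]                # shape: (1, 21, H, W)

prediction = logits.argmax(dim=1).squeeze(0)    # most likely class per pixel
```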

New semantic segmentation algorithms are typically assessed by their mean Intersection over Union (mIoU) on the VOC2012 dataset. The IoU is calculated for each class at the pixel level as:

$$\text{IoU} = \frac{\text{true positives}}{\text{true positives} + \text{false positives} + \text{false negatives}} \times 100$$

where true positives are pixels that belong to the class and are correctly predicted as the class; false negatives are pixels that belong to the class but are incorrectly predicted as a different class; and false positives are pixels that belong to a different class but are predicted as the class. The image below illustrates these sets of pixels for the dog in the image above. In the difference image, true positives are shown in yellow, false positives in red and false negatives in green.

The different sets of pixels that make up the Intersection over Union. From left to right: the original image, the expected pixels for the dog class, the predicted pixels for the dog class, and the difference image illustrating the three Intersection over Union sets. Yellow: true positives, green: false negatives, red: false positives.

The IoU is expressed as a percentage between zero and 100, where a larger value indicates a more accurate segmentation. The mIoU is then the mean IoU across all the classes in the dataset.
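A minimal sketch of this calculation, assuming `predicted` and `expected` are integer label maps of the same shape:

```python
import numpy as np

def class_iou(predicted, expected, class_id):
    """Pixel-level IoU for one class, as a percentage."""
    pred_mask = predicted == class_id
    true_mask = expected == class_id
    intersection = np.logical_and(pred_mask, true_mask).sum()  # true positives
    union = np.logical_or(pred_mask, true_mask).sum()          # TP + FP + FN
    return 100.0 * intersection / union if union else float("nan")

def mean_iou(predicted, expected, num_classes):
    """mIoU: the mean of the per-class IoUs."""
    ious = [class_iou(predicted, expected, c) for c in range(num_classes)]
    return np.nanmean(ious)
```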

The state of the art on the VOC2012 dataset achieves a mIoU of 86.9; however, the 21 classes in the VOC2012 dataset are not focused on furniture items and as such are unsuitable for our use here at DigitalBridge. Consequently, we wish to adapt a state-of-the-art model to our needs.

Transfer learning a model

Transfer learning is the process of adapting an existing model to a new task. In our case we wish to adapt an existing model, pre-trained for semantic segmentation on the VOC2012 dataset, to a new dataset: NYUd 40, a subset of NYUd v2. NYUd 40 comprises 40 classes commonly found in indoor rooms, including structure classes such as wall and floor, furniture classes such as sofa and bed, and prop classes such as books and clothes. Transfer learning proceeds by providing examples from the new dataset to the model and letting it adapt to the new pixel-wise mapping. When the dataset is small, transfer learning can outperform training a model from scratch, because parts of the pre-trained model are generic and independent of the task.
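A hedged sketch of this procedure, again using torchvision’s DeepLabV3 as a stand-in for the pre-trained model. The hyperparameters, loop and `loader` are illustrative, not our production training code:

```python
import torch
from torch import nn, optim
from torchvision import models

NUM_CLASSES = 40  # the NYUd 40 label set

# Start from a model pre-trained for 21-class VOC2012 segmentation.
model = models.segmentation.deeplabv3_resnet101(pretrained=True)

# Swap the final 21-class prediction layer for a 40-class one.
model.classifier[4] = nn.Conv2d(256, NUM_CLASSES, kernel_size=1)

# Optionally freeze the backbone: its features are generic and task-independent.
for param in model.backbone.parameters():
    param.requires_grad = False

criterion = nn.CrossEntropyLoss(ignore_index=255)  # 255 marks unlabelled pixels
optimiser = optim.SGD((p for p in model.parameters() if p.requires_grad),
                      lr=1e-3, momentum=0.9)

def fine_tune(loader, epochs=10):
    """`loader` yields (images, labels): (N, 3, H, W) floats and (N, H, W) ids."""
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            optimiser.zero_grad()
            logits = model(images)["out"]      # shape: (N, 40, H, W)
            loss = criterion(logits, labels)
            loss.backward()
            optimiser.step()
```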

So how did the model do?

The model achieved a mIoU of 49.0. The visual accuracy of the segmentation is illustrated in the images below. The reduction in performance relative to VOC2012 can be partially explained by the increase in the number of classes from 21 to 40; it is also exaggerated by poor performance on certain classes. This will be explored in more detail later, but first we will consider some qualitative results.

Below are three original images and their corresponding algorithm output images. In the algorithm output images each pixel is coloured by the class of its highest probability.

The algorithm output for three images from the NYUd 40 dataset. Left) The original images. Right) The algorithm output. Each pixel is coloured by the class of its highest probability.

In general the model is able to distinguish the structure classes and the larger furniture classes. For example, in the second image, the dark green segment indicates the table. However, it struggles with the smaller props that ‘decorate’ the room, for example the objects on the coffee table in the first image. This is supported by the class IoUs listed in the table below. In general the structure and furniture classes have relatively high IoUs, whereas props such as clothes, box and bag have relatively low IoUs.

The Intersection over Union for each of the 40 classes.

Some of the mid-range IoUs can be explained by conceptual similarity between classes. For example, a curtain and a blind are both window coverings that even a human may struggle to distinguish. In the future, better results could be achieved by reducing the set of possible classes to only those that are conceptually distinct. For instance, we may choose to merge the blind and curtain classes into a window-coverings super-class.
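Such a merge can be performed as a simple post-processing step on the label map. The class ids below are hypothetical, not the actual NYUd 40 indices:

```python
import numpy as np

# Hypothetical NYUd 40 ids for two conceptually similar classes.
CURTAIN, BLIND = 16, 13
WINDOW_COVERING = BLIND  # reuse one id as the merged super-class

def merge_window_coverings(labels):
    """Map curtain and blind predictions into one super-class."""
    merged = labels.copy()
    merged[labels == CURTAIN] = WINDOW_COVERING
    return merged
```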

These results are promising, and with further investigation the model could be used in one of DigitalBridge’s products. It should also be noted that semantic segmentation is an active research field and new, improved models are frequently released; with an improved model, we should expect better results. At DigitalBridge we are well placed to recognise exciting new developments in the semantic segmentation field and translate them into kitchen, bedroom and bathroom visualisations.

Summary

  • This blog post described how DigitalBridge have trained a model to perform semantic segmentation on an image of an indoor room.
  • Promising results are achieved on the room structure and the larger furniture classes.
  • With further investigation we believe the model could be applied to enhance DigitalBridge’s products.