Go Selfies — How to do photo background removal using deep learning

Hi, everyone. This time, let's talk about how to do segmentation with deep learning. For this task, I use photo background removal as the example; of course, you can apply the code to any segmentation task. The code is based on fastai, and you can find my model and interface code on GitHub: github.com/NancyWu2168/Background-removal-using-deep-learning. Let's move on to each step now!

1. Background

Consider one situation: you need an ID photo with a red background, but you only have one with a blue background. Or you want to make any composed image you like. The aim of this task is therefore to build a segmentation model that changes the background of a single RGB image. I also designed an app using OpenCV for daily use. The workflow is shown in Figure 1.

Figure 1. The framework of the idea. Given one image, our method extracts the figure and obtains the mask. Then choose whatever background you like and finish the background removal.

U-Net is an encoder-decoder architecture for image segmentation, especially in medical images, proposed by Ronneberger et al. They named it after its U shape (as shown in Figure 2). The architecture contains two parts. The first is the encoder part, which captures the context in the image. The other is the decoder part, also called the symmetric expanding path, which enables precise localization using transposed convolutions. The specialty of this model is that the feature maps from the convolution path are concatenated with the feature maps in the up-sampling path. It is a fully convolutional network, so it accepts input images of any size.

Figure 2. U-Net architecture. The feature maps after the convolution layers are concatenated with the feature maps in the up-sampling path (green arrows).

Rather than simply using U-Net, two different classification models, VGG16 and ResNet34, are combined with it as the encoder part. The decoder part up-samples the feature maps from the previous layer and concatenates them with the corresponding feature maps from the encoder. Together, the two parts make up a U-Net-like model, called ResU-Net and VGU-Net, as shown in Figure 3.

Figure 3. The architecture of the proposed model. It is divided into two parts, the encoder part (red arrows) and the decoder part (green arrows).
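To make the decoder idea concrete, here is a minimal PyTorch sketch (not code from my repository) of a single up-sampling block: it enlarges the feature map with a transposed convolution and concatenates the corresponding encoder feature map before convolving, which is the skip-connection concatenation marked by the green arrows in Figure 2. The class name `UpBlock` and the channel sizes are illustrative assumptions.

```python
# Minimal sketch of a U-Net decoder block: upsample, concatenate the encoder
# feature map (skip connection), then convolve. Channel sizes are examples only.
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        # transposed convolution doubles the spatial size and halves the channels
        self.up = nn.ConvTranspose2d(in_ch, in_ch // 2, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch // 2 + skip_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # e.g. 16x16 -> 32x32
        x = torch.cat([x, skip], dim=1)  # concatenate the encoder feature map
        return self.conv(x)

# toy check: a 512-channel 16x16 decoder input and a 256-channel 32x32 encoder skip
block = UpBlock(in_ch=512, skip_ch=256, out_ch=256)
out = block(torch.randn(1, 512, 16, 16), torch.randn(1, 256, 32, 32))
print(out.shape)  # torch.Size([1, 256, 32, 32])
```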
For the encoder part, simply by changing the last fully connected layers into convolutional layers, the feature maps retain local information from the raw image. The two encoder models have some differences. VGG16 contains 13 convolutional layers, with a pooling layer inserted after each block to decrease the spatial dimensions. ResNet34 has skip connections between layers, which make the network deeper without enlarging the number of parameters, as shown in Figure 4.

Figure 4. ResNet building block. Short skip connections are used in these blocks.

The decoder part acts as a symmetric expanding path to the encoder part, which recovers precise localization information. At the end of each decoder unit, the number of channels is halved while the feature map dimensions are doubled. A 1×1 convolutional layer is appended to the last layer, followed by a sigmoid function that computes the per-pixel probability at each image location. The final output mask is obtained from the class of each pixel.

Given two sets X and Y, such as the background and the foreground in this task, the Dice coefficient is defined as

Dice(X, Y) = 2|X ∩ Y| / (|X| + |Y|),

where |X| and |Y| refer to the number of pixels in each set. Dice rather than IoU is chosen because it takes less computation. The closer it is to 1, the better the performance.

This is a binary classification problem, where y_p is the prediction and y_t is the label of each pixel. A confusion table describes the classification performance on the validation dataset, from which accuracy, sensitivity, and specificity can be calculated. This task focuses most on accuracy, defined as

Accuracy = (TP + TN) / (TP + TN + FP + FN).

A proper dataset with matching masks is the foundation of the model. Our dataset comes from two large public datasets: the COCO dataset and the Pascal VOC dataset. COCO includes about 80,000 images in more than 90 classes, and Pascal VOC includes 11,000 images in 20 categories. Since the objects here are selfies, only images containing people are kept in our dataset. The filtering process is shown in Figure 5. First, keep the images with persons in them. Then, manually drop the images with many people, leaving only images with one or two persons. In addition, the person should cover more than 30% of the image, to exclude extremely small foregrounds. Finally, the masks are converted so that the background is black and the person is white. In total, the final dataset contains 1,600 images, of which 300 are used for validation (if you would like this dataset, please contact me or leave a message below).

Since the dataset is not very big, three types of augmentation are used in this task: rotation, flipping, and color adjustment. Rotation and flipping change the position of the foreground at various angles, and color adjustment changes the light intensity; together they reduce the influence of position and color and, at the same time, enlarge the number of images to alleviate overfitting.

Instead of using the original resolution of each image, two input resolutions are used for training: 128 × 128 and 512 × 512. The models were trained with different resolutions and different encoder networks. The basic model chosen for this task is U-Net, and the encoder networks are ResNet34 and VGG16. The batch size is 64 at resolution 128 and 16 at resolution 512, in consideration of memory and speed. A cyclical learning rate was applied. All models are trained with the chosen learning rate for 50 epochs.
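As a rough illustration of this training setup, here is a minimal sketch in fastai v1 style (this is not the exact code from my notebooks; the folder names, the random 20% split, and the learning rate are assumptions on my part):

```python
# Minimal training sketch (fastai v1 style). Folder names, the 20% split and the
# learning rate below are assumptions, not the exact values from this article.
from fastai.vision import *

path_img, path_lbl = 'data/images', 'data/masks'   # hypothetical folders
codes = ['background', 'person']                   # mask pixel classes
get_mask = lambda x: f'{path_lbl}/{x.stem}.png'    # image file -> mask file

data = (SegmentationItemList.from_folder(path_img)
        .split_by_rand_pct(0.2)
        .label_from_func(get_mask, classes=codes)
        # rotation, flipping and lighting changes, as described above
        .transform(get_transforms(max_rotate=20., max_lighting=0.3),
                   size=128, tfm_y=True)           # or size=512 with bs=16
        .databunch(bs=64)
        .normalize(imagenet_stats))

# ResNet34 encoder + U-Net decoder ("ResU-Net"); the ImageNet-pretrained encoder
# is frozen by default, which matches the transfer-learning strategy below.
learn = unet_learner(data, models.resnet34, metrics=dice)

learn.lr_find()                     # pick a learning rate from the loss curve
lr = 1e-3                           # hypothetical value read off the plot
learn.fit_one_cycle(50, slice(lr))  # 50 epochs with a cyclical (one-cycle) schedule
learn.save('resunet-128')
```

For the 512 × 512 models, the same pipeline applies with size=512 and bs=16; swapping models.vgg16_bn for models.resnet34 gives the VGU-Net variant.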
Transfer learning is a good way to build a robust model and to converge fast: instead of learning from scratch, the weights are pre-trained on the ImageNet dataset. ImageNet is a large visual database built for visual recognition research, and it is helpful when similar problems are being solved. This model freezes the convolutional base as its fine-tuning strategy, in which the convolutional base is kept in its original form and only its output is fed to the classifier.

Figure 6. The output of the best model. All examples come from the test data. The first column shows results for ID photos; the second column shows normal frontal photos; the last column is a photo taken from a distance.

Figure 7. The comparison of different models.

Figure 8. Confusion matrix. The values on the diagonal are the correct predictions. The darker the color, the better the prediction.

Figure 8 shows one of the results from the test images. From it, the accuracy is 98.8%, the sensitivity is 98.3%, and the specificity is 99%, which indicates good performance.

In this task, I applied a segmentation method aimed at background removal for photos, so people can change to any background in just seconds. In total, I compared 8 different models based on different encoder models and input resolutions, and found that the ResNet34 encoder combined with 512 × 512 resolution gives the best performance on both training and validation data. In addition, the pre-training step has a great impact on model accuracy. Finally, I turned the model into a basic product for daily use. The simple interface is shown in Figure 9.

Figure 9. Go Selfies for everyday use. People can choose what they want to segment: first click Open and find the file, then click Begin Remove, wait a moment, and check the result. Finally, change the background.

The result in Figure 9 does not look great; the reason is that this test image does not use the best model, as it is just an example of the interface.

This task is not a perfect one, since so many newer algorithms have been proposed. However, it gives you a sense of photo segmentation, and from now on you can work on it by yourself. First, find the task you want to do, then prepare a dataset of images with masks. You can use my method or your own. Last but not least, try to build it into an app or something else so that other people can use it. Now, it's your turn to make something interesting!

The GitHub link is: github.com/NancyWu2168/Background-removal-using-deep-learning. It has two .ipynb files, which show how to train the model and design the interface from scratch. Please let me know if you have any questions or if any mistakes exist. My contact email: yunanwu2168@outlook.com

Written by Yunan Wu, a first-year PhD student interested in AI for medical healthcare and image processing.
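As a closing sketch, this is roughly what the background-swap step behind the Figure 9 interface looks like with OpenCV, assuming the fastai `learn` object trained above; the file names here are hypothetical:

```python
# Rough sketch of the background swap: predict the person mask with the trained
# learner, then composite the person onto a new background with OpenCV.
import cv2
import numpy as np
from fastai.vision import open_image

img_path, bg_path = 'selfie.jpg', 'red_background.jpg'  # hypothetical inputs

pred, _, _ = learn.predict(open_image(img_path))         # per-pixel class prediction
mask = pred.data.squeeze().numpy().astype(np.uint8)      # 1 = person, 0 = background

img = cv2.imread(img_path)
h, w = img.shape[:2]
mask = cv2.resize(mask, (w, h), interpolation=cv2.INTER_NEAREST)
bg = cv2.resize(cv2.imread(bg_path), (w, h))

# keep the person pixels from the original photo, take the rest from the new background
composite = np.where(mask[..., None] == 1, img, bg)
cv2.imwrite('composite.jpg', composite)
```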

