Instance Segmentation with Mask R-CNN in PyTorch

In this post, we will discuss some of the theory behind Mask R-CNN and how to use pre-trained Mask R-CNN models in PyTorch.

1. Semantic segmentation, object detection and instance segmentation

Two related tasks have been described before:

1. Semantic Segmentation: In semantic segmentation, we assign a class label (e.g. Dog, Cat, Person, Background, etc.) to each pixel in the image.

2. Object Detection: In object detection, we assign a class label to the bounding box that contains the object.

A very natural idea is to combine the two: we want to find a bounding box around each object and also determine which pixels inside that box belong to the object. In other words, we want a mask that indicates (using color or grayscale values) which pixels belong to the same object. Algorithms that produce such masks are called instance segmentation algorithms. Mask R-CNN is one such algorithm.

There are two differences between instance segmentation and semantic segmentation:

1. In semantic segmentation, every pixel is assigned a class label. In instance segmentation, only the pixels that belong to detected objects are labeled.

2. In semantic segmentation, we do not distinguish between instances of the same class. For example, all pixels belonging to the class "Person" in semantic segmentation will be assigned the same color/value in the mask. In instance segmentation, they are assigned different values, so we can tell which pixel corresponds to which person. To learn more about image segmentation, check out the post where we have explained it in detail.
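As a small illustration of the second difference, the sketch below builds a toy 4×6 scene containing two people. The arrays and the class id 1 for "person" are made up for this example; they are not model output.

```python
import numpy as np

PERSON = 1  # hypothetical class id, used only in this toy example

# Semantic segmentation: a single label map; every "person" pixel has the same value.
semantic = np.zeros((4, 6), dtype=np.uint8)
semantic[1:3, 0:2] = PERSON   # first person
semantic[1:3, 4:6] = PERSON   # second person

# Instance segmentation: one boolean mask per object, plus one class label per mask.
instance_masks = np.zeros((2, 4, 6), dtype=bool)
instance_masks[0, 1:3, 0:2] = True   # first person
instance_masks[1, 1:3, 4:6] = True   # second person
instance_labels = [PERSON, PERSON]

# The semantic map cannot separate the two people; the instance masks can.
print(np.unique(semantic))               # values present in the semantic map
print(instance_masks.sum(axis=(1, 2)))   # pixel count of each instance
```

Merging the instance masks with a logical OR recovers the semantic "person" region, which is why instance segmentation can be seen as carrying strictly more information for the detected objects.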

Mask R-CNN structure

The network structure of Mask R-CNN is an extension of the Faster R-CNN that we discussed previously.

Recall that the Faster R-CNN architecture has the following components:

Convolutional layers: The input image is passed through several convolutional layers to create a feature map. If you are a beginner, think of a convolutional layer as a black box that takes a 3-channel input image and outputs an "image" with a much smaller spatial dimensionality (7×7) but very many channels (512).

Region Proposal Network (RPN): The output of the convolutional layers is used to train a network that proposes regions likely to enclose objects.

Classifier: the same feature map is also used to train a classifier that assigns labels to the objects in the box.

Furthermore, recall that Faster R-CNN is faster than Fast R-CNN because the feature map is computed once and reused by the RPN and the classifier. Mask R-CNN takes this idea a step further. In addition to feeding the feature map to the RPN and the classifier, Mask R-CNN uses it to predict a binary mask for the object inside each bounding box. One way to think about the mask prediction part of Mask R-CNN is as a fully convolutional network (FCN) for semantic segmentation. The only difference is that in Mask R-CNN the FCN is applied to each bounding box, and it shares the convolutional layers with the RPN and the classifier. The following figure shows a very high-level architecture.
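The idea of predicting a mask inside each bounding box can be sketched with NumPy alone. This is only an illustration: the 2×2 probability grid, the box coordinates, the helper name paste_box_mask, and the nearest-neighbour upsampling via np.kron are all assumptions made for the example, whereas the real model uses learned layers and interpolation.

```python
import numpy as np

def paste_box_mask(box_probs, box, image_shape, threshold=0.5):
    """Upsample a small per-box probability grid and paste it into a full-size mask.

    box is (x1, y1, x2, y2) in integer pixel coordinates (illustrative assumption).
    """
    x1, y1, x2, y2 = box
    box_h, box_w = y2 - y1, x2 - x1
    grid_h, grid_w = box_probs.shape
    # Nearest-neighbour upsampling; assumes the box size is a multiple of the grid size.
    scale = np.ones((box_h // grid_h, box_w // grid_w))
    upsampled = np.kron(box_probs, scale)
    full = np.zeros(image_shape, dtype=bool)
    # Pixels outside the box stay False; inside, the soft mask is thresholded.
    full[y1:y2, x1:x2] = upsampled > threshold
    return full

probs = np.array([[0.9, 0.2],
                  [0.8, 0.7]])
mask = paste_box_mask(probs, box=(2, 1, 6, 5), image_shape=(6, 8))
print(mask.astype(int))
```

The result is a full-image binary mask that is zero everywhere outside the box, which matches the shape of the masks used later in this post.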

2. Using Mask R-CNN in PyTorch [code]

In this section, we will learn how to use pre-trained Mask R-CNN models in PyTorch.

2.1. Inputs and outputs

The Mask R-CNN model expects a list of n tensors, one per image, each of shape (c, h, w) with elements in the range 0-1. The images may have different sizes.

n is the number of images

c is the number of channels 3 for RGB images

h is the height of the image

w is the width of the image
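The conversion performed by a ToTensor-style preprocessing step can be sketched with NumPy alone. The function name to_chw_float is hypothetical; it mimics the effect (an H×W×C uint8 image becomes a C×H×W float array in the range 0-1) rather than calling torchvision.

```python
import numpy as np

def to_chw_float(image_hwc_uint8):
    """Convert an H x W x C uint8 image to a C x H x W float32 array in [0, 1]."""
    chw = np.transpose(image_hwc_uint8, (2, 0, 1))
    return chw.astype(np.float32) / 255.0

# A made-up 2 x 3 RGB image standing in for a decoded photo.
image = np.array([[[255, 0, 0], [0, 255, 0], [0, 0, 255]],
                  [[128, 128, 128], [0, 0, 0], [255, 255, 255]]], dtype=np.uint8)

tensor_like = to_chw_float(image)
batch = [tensor_like]          # the model takes a list with one entry per image
print(tensor_like.shape)       # (3, 2, 3): channels, height, width
print(tensor_like.min(), tensor_like.max())
```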

The model returns:

Coordinates of the bounding boxes

Labels of the classes that the model predicts to be present in the input image, and the scores corresponding to the labels

A mask for each detected object.
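As a rough picture of the returned structure, the sketch below builds a mock prediction for one image, using NumPy arrays in place of tensors. The numbers are invented; only the key names ('boxes', 'labels', 'scores', 'masks') and the array layouts follow the printed output shown later in this post.

```python
import numpy as np

num_objects, height, width = 2, 4, 5

# Mock prediction for a single image (invented numbers, realistic shapes).
prediction = {
    "boxes": np.array([[0.0, 0.0, 2.0, 3.0],
                       [1.0, 1.0, 4.0, 3.0]]),            # (N, 4): x1, y1, x2, y2
    "labels": np.array([16, 23]),                         # (N,): indices into the category list
    "scores": np.array([0.99, 0.30]),                     # (N,): confidence, highest first
    "masks": np.zeros((num_objects, 1, height, width)),   # (N, 1, H, W): values in 0-1
}
output = [prediction]   # one such dict per input image

for key, value in output[0].items():
    print(key, value.shape)
```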

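2.2 Pre-trained model

# Newer torchvision releases replace pretrained=True with a weights= argument
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
# Switch to inference mode
model.eval()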

2.3 Model predictions

COCO_INSTANCE_CATEGORY_NAMES = [
  '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
  'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
  'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
  'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
  'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
  'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
  'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
  'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
  'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
  'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
  'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
  'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
 
def get_prediction(img_path, threshold):
    # Load the image and convert it to a (c, h, w) float tensor with values in 0-1
    img = Image.open(img_path).convert('RGB')
    transform = T.Compose([T.ToTensor()])
    img = transform(img)
    pred = model([img])
    print('pred')
    print(pred)
    # Scores come sorted from high to low, so count the leading detections above the threshold
    pred_score = pred[0]['scores'].detach().numpy()
    num_keep = int((pred_score > threshold).sum())
    print("masks>0.5")
    print(pred[0]['masks']>0.5)
    # Binarize the soft masks; squeeze(1) drops only the channel axis, leaving [n x h x w]
    masks = (pred[0]['masks']>0.5).squeeze(1).detach().cpu().numpy()
    print("this is masks")
    print(masks)
    pred_class = [COCO_INSTANCE_CATEGORY_NAMES[i] for i in list(pred[0]['labels'].numpy())]
    pred_boxes = [[(i[0], i[1]), (i[2], i[3])] for i in list(pred[0]['boxes'].detach().numpy())]
    # Keep only the detections whose score exceeds the threshold
    masks = masks[:num_keep]
    pred_boxes = pred_boxes[:num_keep]
    pred_class = pred_class[:num_keep]
    return masks, pred_boxes, pred_class

The code works as follows:

The image is loaded from the image path.

The image is converted to an image tensor using PyTorch transforms.

The image is passed through the model to get predictions.

The masks, predicted classes, and bounding box coordinates are extracted from the model output.
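The score-thresholding step can be isolated into a small helper. The sketch below is plain Python; count_above_threshold is a hypothetical name, and it assumes the scores are sorted in descending order, as they appear in the printed output in this post. It returns 0 when nothing passes, so the caller can slice the results safely.

```python
def count_above_threshold(scores, threshold):
    """Return how many leading scores are strictly greater than threshold.

    Assumes scores are sorted from highest to lowest.
    """
    count = 0
    for score in scores:
        if score <= threshold:
            break
        count += 1
    return count

scores = [0.9912, 0.9910, 0.9894, 0.2994, 0.2108]
keep = count_above_threshold(scores, threshold=0.5)
print(keep)                                  # 3
print(count_above_threshold(scores, 0.999))  # 0: nothing kept, and no exception
```

Slicing with `[:keep]` then yields the detections to draw, including the empty case.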

The mask for each predicted object is given a random color from a set of 11 predefined colors in order to visualize the mask on the input image.

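def random_colour_masks(image):
    # Paint the True pixels of a boolean [h x w] mask with one randomly chosen colour
    colours = [[0, 255, 0],[0, 0, 255],[255, 0, 0],[0, 255, 255],[255, 255, 0],[255, 0, 255],[80, 70, 180],[250, 80, 190],[245, 145, 50],[70, 150, 250],[50, 190, 190]]
    r = np.zeros_like(image).astype(np.uint8)
    g = np.zeros_like(image).astype(np.uint8)
    b = np.zeros_like(image).astype(np.uint8)
    # Pick any of the 11 colours and assign its channels to the masked pixels
    r[image == 1], g[image == 1], b[image == 1] = colours[random.randrange(0, len(colours))]
    # Stack the three channels into an [h x w x 3] image
    coloured_mask = np.stack([r, g, b], axis=2)
    return coloured_mask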

The code prints some intermediate information to help analyze the processing.
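The boolean-indexing trick used to color a mask can be tried on a tiny array without any model. This is a minimal sketch: colour_mask is a hypothetical helper, and the 3×3 mask and the fixed green color are made up so that the result is deterministic.

```python
import numpy as np

def colour_mask(mask, colour):
    """Return an H x W x 3 uint8 image with `colour` wherever the boolean mask is True."""
    coloured = np.zeros(mask.shape + (3,), dtype=np.uint8)
    coloured[mask] = colour      # boolean indexing selects only the object's pixels
    return coloured

mask = np.array([[False, True, False],
                 [True,  True, True],
                 [False, True, False]])
rgb = colour_mask(mask, colour=(0, 255, 0))
print(rgb.shape)             # (3, 3, 3)
print(rgb[1, 1].tolist())    # [0, 255, 0]: inside the mask
print(rgb[0, 0].tolist())    # [0, 0, 0]: outside the mask
```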

2.4 Example Segmentation Workflow

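def instance_segmentation_api(img_path, threshold=0.5, rect_th=3, text_size=3, text_th=3):
    masks, boxes, pred_cls = get_prediction(img_path, threshold)
    # OpenCV loads images as BGR; convert to RGB for display with matplotlib
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    for i in range(len(masks)):
        rgb_mask = random_colour_masks(masks[i])
        # Blend the coloured mask into the image with weights 1 and 0.5
        img = cv2.addWeighted(img, 1, rgb_mask, 0.5, 0)
        # OpenCV drawing functions expect integer pixel coordinates
        top_left = tuple(int(v) for v in boxes[i][0])
        bottom_right = tuple(int(v) for v in boxes[i][1])
        cv2.rectangle(img, top_left, bottom_right, color=(0, 255, 0), thickness=rect_th)
        cv2.putText(img, pred_cls[i], top_left, cv2.FONT_HERSHEY_SIMPLEX, text_size, (0, 255, 0), thickness=text_th)
    plt.figure(figsize=(20, 30))
    plt.imshow(img)
    plt.xticks([])
    plt.yticks([])
    plt.show()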

The masks, predicted classes and bounding boxes are obtained via get_prediction.

Each mask is given a random color from the 11 colors and is blended into the image with weights 1:0.5 using OpenCV.

The bounding box is drawn with the class name on it.

The final output is displayed.
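The 1:0.5 overlay amounts to a saturated weighted sum of the image and the colored mask. The NumPy sketch below reproduces that arithmetic for illustration; blend is a hypothetical helper rather than the OpenCV call itself, and it rounds and clips the result to the 0-255 range.

```python
import numpy as np

def blend(image, overlay, alpha=1.0, beta=0.5, gamma=0.0):
    """Compute alpha*image + beta*overlay + gamma, clipped to valid uint8 values."""
    mixed = alpha * image.astype(np.float32) + beta * overlay.astype(np.float32) + gamma
    return np.clip(np.rint(mixed), 0, 255).astype(np.uint8)

image = np.full((2, 2, 3), 100, dtype=np.uint8)      # flat grey image
overlay = np.zeros((2, 2, 3), dtype=np.uint8)
overlay[0, 0] = (0, 255, 0)                          # one green mask pixel

result = blend(image, overlay)
print(result[0, 0])   # masked pixel: green channel brightened
print(result[1, 1])   # unmasked pixel: unchanged
```

Because the image keeps weight 1, pixels outside every mask are left untouched, while masked pixels are tinted toward the mask color and saturate at 255.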

The full code is below:

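from PIL import Image
import matplotlib.pyplot as plt
import torch
import torchvision.transforms as T
import torchvision
import numpy as np
import cv2
import random

# Load the pre-trained Mask R-CNN model and switch it to inference mode
model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
model.eval()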
COCO_INSTANCE_CATEGORY_NAMES = [
  '__background__', 'person', 'bicycle', 'car', 'motorcycle', 'airplane', 'bus',
  'train', 'truck', 'boat', 'traffic light', 'fire hydrant', 'N/A', 'stop sign',
  'parking meter', 'bench', 'bird', 'cat', 'dog', 'horse', 'sheep', 'cow',
  'elephant', 'bear', 'zebra', 'giraffe', 'N/A', 'backpack', 'umbrella', 'N/A', 'N/A',
  'handbag', 'tie', 'suitcase', 'frisbee', 'skis', 'snowboard', 'sports ball',
  'kite', 'baseball bat', 'baseball glove', 'skateboard', 'surfboard', 'tennis racket',
  'bottle', 'N/A', 'wine glass', 'cup', 'fork', 'knife', 'spoon', 'bowl',
  'banana', 'apple', 'sandwich', 'orange', 'broccoli', 'carrot', 'hot dog', 'pizza',
  'donut', 'cake', 'chair', 'couch', 'potted plant', 'bed', 'N/A', 'dining table',
  'N/A', 'N/A', 'toilet', 'N/A', 'tv', 'laptop', 'mouse', 'remote', 'keyboard', 'cell phone',
  'microwave', 'oven', 'toaster', 'sink', 'refrigerator', 'N/A', 'book',
  'clock', 'vase', 'scissors', 'teddy bear', 'hair drier', 'toothbrush'
]
def get_prediction(img_path, threshold):
    # Load the image and convert it to a (c, h, w) float tensor with values in 0-1
    img = Image.open(img_path).convert('RGB')
    transform = T.Compose([T.ToTensor()])
    img = transform(img)
    pred = model([img])
    print('pred')
    print(pred)
    # Scores come sorted from high to low, so count the leading detections above the threshold
    pred_score = pred[0]['scores'].detach().numpy()
    num_keep = int((pred_score > threshold).sum())
    print("masks>0.5")
    print(pred[0]['masks']>0.5)
    # Binarize the soft masks; squeeze(1) drops only the channel axis, leaving [n x h x w]
    masks = (pred[0]['masks']>0.5).squeeze(1).detach().cpu().numpy()
    print("this is masks")
    print(masks)
    pred_class = [COCO_INSTANCE_CATEGORY_NAMES[i] for i in list(pred[0]['labels'].numpy())]
    pred_boxes = [[(i[0], i[1]), (i[2], i[3])] for i in list(pred[0]['boxes'].detach().numpy())]
    # Keep only the detections whose score exceeds the threshold
    masks = masks[:num_keep]
    pred_boxes = pred_boxes[:num_keep]
    pred_class = pred_class[:num_keep]
    return masks, pred_boxes, pred_class
 
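def random_colour_masks(image):
    # Paint the True pixels of a boolean [h x w] mask with one randomly chosen colour
    colours = [[0, 255, 0],[0, 0, 255],[255, 0, 0],[0, 255, 255],[255, 255, 0],[255, 0, 255],[80, 70, 180],[250, 80, 190],[245, 145, 50],[70, 150, 250],[50, 190, 190]]
    r = np.zeros_like(image).astype(np.uint8)
    g = np.zeros_like(image).astype(np.uint8)
    b = np.zeros_like(image).astype(np.uint8)
    # Pick any of the 11 colours and assign its channels to the masked pixels
    r[image == 1], g[image == 1], b[image == 1] = colours[random.randrange(0, len(colours))]
    # Stack the three channels into an [h x w x 3] image
    coloured_mask = np.stack([r, g, b], axis=2)
    return coloured_mask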
 
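def instance_segmentation_api(img_path, threshold=0.5, rect_th=3, text_size=3, text_th=3):
    masks, boxes, pred_cls = get_prediction(img_path, threshold)
    # OpenCV loads images as BGR; convert to RGB for display with matplotlib
    img = cv2.imread(img_path)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    for i in range(len(masks)):
        rgb_mask = random_colour_masks(masks[i])
        # Blend the coloured mask into the image with weights 1 and 0.5
        img = cv2.addWeighted(img, 1, rgb_mask, 0.5, 0)
        # OpenCV drawing functions expect integer pixel coordinates
        top_left = tuple(int(v) for v in boxes[i][0])
        bottom_right = tuple(int(v) for v in boxes[i][1])
        cv2.rectangle(img, top_left, bottom_right, color=(0, 255, 0), thickness=rect_th)
        cv2.putText(img, pred_cls[i], top_left, cv2.FONT_HERSHEY_SIMPLEX, text_size, (0, 255, 0), thickness=text_th)
    plt.figure(figsize=(20, 30))
    plt.imshow(img)
    plt.xticks([])
    plt.yticks([])
    plt.show()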

2.5 Examples

Example 1: Chickens, which the model recognizes as birds

instance_segmentation_api('')

Input image:

Output results:

Printing information during processing:

pred
[{'boxes': tensor([[176.8106, 125.6315, 326.8023, 400.4467],
    [427.9514, 130.5811, 584.2725, 403.1004],
    [289.9471, 169.1313, 448.9896, 410.0000],
    [208.7829, 140.7450, 421.3497, 409.0258],
    [417.7833, 137.5480, 603.2806, 405.6804],
    [174.3626, 132.7247, 330.4560, 404.6956],
    [291.6709, 165.4233, 447.1820, 401.7686],
    [171.9978, 114.4133, 336.9987, 410.0000],
    [427.0312, 129.5812, 584.2130, 405.4166]], grad_fn=<StackBackward>), 'labels': tensor([16, 16, 16, 16, 20, 20, 20, 18, 18]), 'scores': tensor([0.9912, 0.9910, 0.9894, 0.2994, 0.2108, 0.1995, 0.1795, 0.1655, 0.0516],
    grad_fn=<IndexBackward>), 'masks': tensor([[[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],
 
    [[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],
 
    [[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],
 
    ...,
 
    [[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],
 
    [[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],
 
    [[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]]], grad_fn=<UnsqueezeBackward0>)}]
masks>0.5
tensor([[[[False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     ...,
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False]]],
 
    [[[False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     ...,
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False]]],
 
    [[[False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     ...,
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False]]],
 
    ...,
 
    [[[False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     ...,
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False]]],
 
    [[[False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     ...,
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False]]],
 
    [[[False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     ...,
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False]]]])
this is masks
[[[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]
 
 [[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]
 
 [[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]
 
 ...
 
 [[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]
 
 [[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]
 
 [[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]]

The thresholded masks are squeezed and converted to a NumPy array of shape [n x h x w] whose elements are boolean values, in preparation for the random coloring that follows. In random_colour_masks, the assignment to r[image == 1], g[image == 1] and b[image == 1] paints the pixels that belong to the actual object with a randomly chosen color, while all other pixels remain 0. This is a good demonstration of boolean-mask indexing on NumPy arrays, combined with Python's random module.
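When squeezing the masks it matters which axes are dropped: a bare squeeze() removes every axis of length 1, including the leading instance axis when there is exactly one detection. The NumPy sketch below uses mock arrays with the (N, 1, H, W) layout seen in the printed output; the sizes are arbitrary.

```python
import numpy as np

# Mock thresholded masks with the (N, 1, H, W) layout shown in the printed output.
one_detection = np.zeros((1, 1, 4, 5), dtype=bool)
three_detections = np.zeros((3, 1, 4, 5), dtype=bool)

print(three_detections.squeeze().shape)        # (3, 4, 5)
print(one_detection.squeeze().shape)           # (4, 5): the instance axis is gone too
print(one_detection.squeeze(axis=1).shape)     # (1, 4, 5): only the channel axis removed
```

Squeezing only axis 1 keeps the result three-dimensional, so a loop over the first axis still visits one [h x w] mask per object.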

Example 2: Brown Bear

instance_segmentation_api('', threshold=0.8)

Input image:

Output image:

Print the information:

pred
[{'boxes': tensor([[ 660.3120, 340.5351, 1235.1614, 846.9672],
    [ 171.7622, 426.9127, 756.6520, 784.9360],
    [ 317.9777, 184.6863, 648.0856, 473.6469],
    [ 283.0787, 200.8575, 703.7324, 664.4083],
    [ 354.9362, 308.0444, 919.0403, 812.0120]], grad_fn=<StackBackward>), 'labels': tensor([23, 23, 23, 23, 23]), 'scores': tensor([0.9994, 0.9994, 0.9981, 0.5138, 0.0819], grad_fn=<IndexBackward>), 'masks': tensor([[[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],
 
    [[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],
 
    [[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],
 
    [[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]],
 
    [[[0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     ...,
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.],
     [0., 0., 0., ..., 0., 0., 0.]]]], grad_fn=<UnsqueezeBackward0>)}]
masks>0.5
tensor([[[[False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     ...,
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False]]],
 
    [[[False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     ...,
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False]]],
 
    [[[False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     ...,
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False]]],
 
    [[[False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     ...,
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False]]],
 
    [[[False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     ...,
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False],
     [False, False, False, ..., False, False, False]]]])
this is masks
[[[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]
 
 [[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]
 
 [[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]
 
 [[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]
 
 [[False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]
 ...
 [False False False ... False False False]
 [False False False ... False False False]
 [False False False ... False False False]]]
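Reading the printed output for this example by hand: with threshold=0.8 only the first three detections survive, and label 23 maps to 'bear'. The sketch below repeats that bookkeeping in plain Python using the scores and labels printed above. The CATEGORY_EXCERPT dictionary is a hand-copied excerpt of the category list and kept_classes is a hypothetical helper, both included only to keep the snippet short.

```python
# Scores and labels copied from the printed prediction for the brown bear image.
scores = [0.9994, 0.9994, 0.9981, 0.5138, 0.0819]
labels = [23, 23, 23, 23, 23]

# Excerpt of COCO_INSTANCE_CATEGORY_NAMES (index -> name) for labels seen in this post.
CATEGORY_EXCERPT = {16: 'bird', 18: 'dog', 20: 'sheep', 23: 'bear'}

def kept_classes(scores, labels, threshold):
    """Return the class names of detections whose score exceeds the threshold."""
    return [CATEGORY_EXCERPT[label]
            for score, label in zip(scores, labels)
            if score > threshold]

print(kept_classes(scores, labels, threshold=0.8))        # ['bear', 'bear', 'bear']
print(len(kept_classes(scores, labels, threshold=0.5)))   # 4: the 0.5138 detection is kept too
```

With the default threshold of 0.5, the fourth, low-confidence box would also be drawn, which is presumably why a higher threshold was passed for this image.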

3. GPU and CPU time comparison

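import time

def check_inference_time(image_path, gpu=False):
    model = torchvision.models.detection.maskrcnn_resnet50_fpn(pretrained=True)
    model.eval()
    img = Image.open(image_path)
    transform = T.Compose([T.ToTensor()])
    img = transform(img)
    # Move both the model and the input to the selected device
    if gpu:
        model.cuda()
        img = img.cuda()
    else:
        model.cpu()
        img = img.cpu()
    # Time only the forward pass, not model loading or preprocessing
    start_time = time.time()
    pred = model([img])
    end_time = time.time()
    return end_time-start_time

# Average over 5 runs on each device
cpu_time = sum([check_inference_time('./', gpu=False) for _ in range(5)])/5.0
gpu_time = sum([check_inference_time('./', gpu=True) for _ in range(5)])/5.0
print('\n\nAverage Time take by the model with GPU = {}s\nAverage Time take by the model with CPU = {}s'.format(gpu_time, cpu_time))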

Results:

Average Time take by the model with GPU = 0.5736178874969482s
Average Time take by the model with CPU = 10.966966199874879s
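A note on methodology: CUDA kernels run asynchronously, so GPU timings tend to be more reliable when the device is synchronized (for example with torch.cuda.synchronize()) before each clock read, and when a warm-up run is excluded from the average. The sketch below shows that averaging pattern in plain Python; average_time and the dummy workload are hypothetical stand-ins for the model call, and no GPU is involved here.

```python
import time

def average_time(fn, repeats=5, warmup=1, sync=None):
    """Average wall-clock time of fn() over `repeats` runs after `warmup` untimed runs.

    sync is an optional callable invoked before each clock read; on a GPU it
    could be torch.cuda.synchronize (an assumption, not imported here).
    """
    for _ in range(warmup):
        fn()
    total = 0.0
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        total += time.perf_counter() - start
    return total / repeats

def dummy_workload():
    # Stand-in for model([img]); any callable works.
    return sum(i * i for i in range(10000))

print('Average time = {:.6f}s'.format(average_time(dummy_workload)))
```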

That concludes this walkthrough of instance segmentation in PyTorch using Mask R-CNN.