Computer vision is one of the most important fields of artificial intelligence. It is the science of building computers and software systems that can recognize objects and scenes, and it covers tasks such as object detection, image recognition, image generation and many more. Object detection in particular has many practical uses, and programmers and software developers keep finding new ones.

Deep learning techniques have made accurate object detection algorithms and methods practical.

Let's start with the concept: what is object detection with respect to AI?

What is object detection in a video file with respect to AI?

Object detection is a branch of computer vision in which computers locate, detect and recognize objects in images, including the individual frames of a video. Deep learning algorithms have made detecting objects in video practical, and specialized algorithms have been developed to detect, locate and recognize them. Two widely used examples are SSD (Single Shot Detection) and RetinaNet.

To be more precise, applying AI-based deep learning techniques to detect and recognize objects requires substantial computational power, applied mathematics and solid technical knowledge, often with thousands of lines of code.

With IBM Watson AI services, you can generate accurate results with a shorter turnaround time. The services aim to deliver high-quality solutions that combine human intelligence and machine intelligence.

IBM Watson provides a flexible environment for deploying AI applications. You can use its tools on large volumes of data to attain good throughput with GPU acceleration. Developing new object detection applications requires more computational power than CPUs alone provide, and GPUs let these models train and run in far less time. This is where Watson AI helps: it comes with GPU power.

Writing deep learning and neural network code from scratch is not easy for any developer; it takes a combination of human and machine intelligence. With that in mind, IBM Watson provides the required frameworks and models as reusable components.

Here I am going to discuss the Single Shot Detection (SSD) algorithm, which helps in object detection.

Single Shot Detection Algorithm

The SSD algorithm is one of the leading algorithms for detecting objects. It uses the multi-box concept to predict object positions and to handle the scale problem, that is, objects appearing at different sizes.

So, how do the computers or object detection algorithms detect the objects?

Algorithms that rescale the image to find objects of different sizes have to be trained for that, and this is difficult. With SSD's multi-box detection, we instead divide the image into several segments and construct boxes for every segment. The network then predicts, in a single pass, whether each box contains an object.
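To make the multi-box idea concrete, here is a minimal sketch that lays a grid over the image and builds a few boxes around the center of every cell. The function name `default_boxes` and the grid size, scale and aspect ratios are illustrative assumptions of mine, not the values a trained SSD model uses.

```python
from itertools import product
from math import sqrt


def default_boxes(grid_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Return boxes as (cx, cy, w, h), all normalized to the range [0, 1].

    grid_size, scale and aspect_ratios are hypothetical example values.
    """
    boxes = []
    for row, col in product(range(grid_size), repeat=2):
        # Center of the current grid cell.
        cx = (col + 0.5) / grid_size
        cy = (row + 0.5) / grid_size
        for ratio in aspect_ratios:
            # Every ratio keeps the same area; only the shape changes.
            w = scale * sqrt(ratio)
            h = scale / sqrt(ratio)
            boxes.append((cx, cy, w, h))
    return boxes


boxes = default_boxes(grid_size=3, scale=0.4)
print(len(boxes))  # 3 x 3 cells x 3 aspect ratios = 27
```

A real SSD network repeats this over several feature maps of different resolutions, so coarse grids cover large objects and fine grids cover small ones.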


Real Time Object Detection With Deep Learning:

Deep learning and neural networks are among the most powerful methods in computer vision, because they give the computer a brain of sorts with which to do detections.

Now I am going to detect objects inside a video.

Before proceeding further, don't forget to connect to the virtual platform where your code is processed for output.

As a next step, you need to import all the defined classes into your virtual environment.

You need to extract your folder into your virtual platform. It contains the data folder, which handles the transformations required for the input images in the video.

I created a file with the following name.

File name:

***VOC Dataset Classes*****

import os
import os.path
import sys
import torch
import torch.utils.data as data
import torchvision.transforms as transforms
from PIL import Image, ImageDraw, ImageFont
import cv2
import numpy as np
if sys.version_info[0] == 2:
    import xml.etree.cElementTree as ET
else:
    import xml.etree.ElementTree as ET

VOC_CLASSES = (  # always index 0
    'aeroplane', 'bicycle', 'bird', 'boat',
    'bottle', 'bus', 'car', 'cat', 'chair',
    'cow', 'diningtable', 'dog', 'horse',
    'motorbike', 'person', 'pottedplant',
    'sheep', 'sofa', 'train', 'tvmonitor')

# for making bounding boxes pretty
COLORS = ((255, 0, 0, 128), (0, 255, 0, 128), (0, 0, 255, 128),
          (0, 255, 255, 128), (255, 0, 255, 128), (255, 255, 0, 128))

class AnnotationTransform(object):
    """Transforms a VOC annotation into a list of bbox coords and label index.
    Initialized with a dictionary lookup of classnames to indexes.

    Arguments:
        class_to_ind (dict, optional): dictionary lookup of classnames -> indexes
            (default: alphabetic indexing of VOC's 20 classes)
        keep_difficult (bool, optional): keep difficult instances or not
            (default: False)
        height (int): height
        width (int): width
    """

    def __init__(self, class_to_ind=None, keep_difficult=False):
        self.class_to_ind = class_to_ind or dict(
            zip(VOC_CLASSES, range(len(VOC_CLASSES))))
        self.keep_difficult = keep_difficult

    def __call__(self, target, width, height):
        """
        Arguments:
            target (annotation): the target annotation to be made usable;
                will be an ET.Element
        Returns:
            a list containing lists of bounding boxes [bbox coords, class index]
        """
        res = []
        for obj in target.iter('object'):
            difficult = int(obj.find('difficult').text) == 1
            if not self.keep_difficult and difficult:
                continue  # skip objects flagged as difficult
            name = obj.find('name').text.lower().strip()
            bbox = obj.find('bndbox')

            pts = ['xmin', 'ymin', 'xmax', 'ymax']
            bndbox = []
            for i, pt in enumerate(pts):
                cur_pt = int(bbox.find(pt).text) - 1
                # scale height or width
                cur_pt = cur_pt / width if i % 2 == 0 else cur_pt / height
                bndbox.append(cur_pt)
            label_idx = self.class_to_ind[name]
            bndbox.append(label_idx)
            res += [bndbox]  # [xmin, ymin, xmax, ymax, label_ind]
            # img_id = target.find('filename').text[:-4]

        return res  # [[xmin, ymin, xmax, ymax, label_ind], ... ]
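To see what this transform produces, here is a small standalone sketch that applies the same logic to an annotation held in a string. The XML sample, the `parse_objects` helper and the two-entry `CLASS_TO_IND` table are made up for illustration; only the element names follow the Pascal VOC layout that the class above reads.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation, shaped like a Pascal VOC XML file.
SAMPLE_XML = """
<annotation>
  <size><width>500</width><height>375</height><depth>3</depth></size>
  <object>
    <name>dog</name>
    <difficult>0</difficult>
    <bndbox><xmin>101</xmin><ymin>76</ymin><xmax>251</xmax><ymax>301</ymax></bndbox>
  </object>
  <object>
    <name>person</name>
    <difficult>1</difficult>
    <bndbox><xmin>1</xmin><ymin>1</ymin><xmax>51</xmax><ymax>101</ymax></bndbox>
  </object>
</annotation>
"""

# Positions of the two classes inside the VOC_CLASSES tuple.
CLASS_TO_IND = {"dog": 11, "person": 14}


def parse_objects(root, width, height, keep_difficult=False):
    """Return [xmin, ymin, xmax, ymax, label_index] rows, coordinates in [0, 1]."""
    rows = []
    for obj in root.iter("object"):
        if not keep_difficult and int(obj.find("difficult").text) == 1:
            continue  # skip objects flagged as difficult
        name = obj.find("name").text.lower().strip()
        box = obj.find("bndbox")
        row = []
        for i, tag in enumerate(("xmin", "ymin", "xmax", "ymax")):
            value = int(box.find(tag).text) - 1  # VOC pixel indices start at 1
            # x values are divided by the width, y values by the height
            row.append(value / width if i % 2 == 0 else value / height)
        row.append(CLASS_TO_IND[name])
        rows.append(row)
    return rows


root = ET.fromstring(SAMPLE_XML)
rows = parse_objects(root, width=500, height=375)
print(rows)  # [[0.2, 0.2, 0.5, 0.8, 11]]
```

The person is dropped because it is flagged as difficult; passing `keep_difficult=True` would return both rows.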

class VOCDetection(data.Dataset):
    """VOC Detection Dataset Object

    input is image, target is annotation

    Arguments:
        root (string): filepath to VOCdevkit folder.
        image_set (string): imageset to use (eg. 'train', 'val', 'test')
        transform (callable, optional): transformation to perform on the
            input image
        target_transform (callable, optional): transformation to perform on the
            target `annotation`
            (eg: take in caption string, return tensor of word indices)
        dataset_name (string, optional): which dataset to load
            (default: 'VOC2007')
    """

    def __init__(self, root, image_sets, transform=None, target_transform=None,
                 dataset_name='VOC2007'):
        self.root = root
        self.image_set = image_sets
        self.transform = transform
        self.target_transform = target_transform
        self.name = dataset_name
        self._annopath = os.path.join('%s', 'Annotations', '%s.xml')
        self._imgpath = os.path.join('%s', 'JPEGImages', '%s.jpg')
        self.ids = list()
        for (year, name) in image_sets:
            rootpath = os.path.join(self.root, 'VOC' + year)
            for line in open(os.path.join(rootpath, 'ImageSets', 'Main', name + '.txt')):
                self.ids.append((rootpath, line.strip()))

    def __getitem__(self, index):
        im, gt, h, w = self.pull_item(index)

        return im, gt

    def __len__(self):
        return len(self.ids)

    def pull_item(self, index):
        img_id = self.ids[index]

        target = ET.parse(self._annopath % img_id).getroot()
        img = cv2.imread(self._imgpath % img_id)
        height, width, channels = img.shape

        if self.target_transform is not None:
            target = self.target_transform(target, width, height)

        if self.transform is not None:
            target = np.array(target)
            img, boxes, labels = self.transform(img, target[:, :4], target[:, 4])
            # to rgb
            img = img[:, :, (2, 1, 0)]
            # img = img.transpose(2, 0, 1)
            target = np.hstack((boxes, np.expand_dims(labels, axis=1)))
        return torch.from_numpy(img).permute(2, 0, 1), target, height, width
        # return torch.from_numpy(img), target, height, width

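    def pull_image(self, index):
        """Returns the original image object at index.

        Note: not using self.__getitem__(), as any transformations passed in
        could mess up this functionality.
        """
        # Assumed body: read the untransformed image with OpenCV, as pull_item does.
        img_id = self.ids[index]
        return cv2.imread(self._imgpath % img_id, cv2.IMREAD_COLOR)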

The layer folder is used for detections such as multi-box detection in the SSD algorithm. Now you need to import the required libraries.

#importing the libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from data import v2 as cfg
from ..box_utils import match, log_sum_exp

class MultiBoxLoss(nn.Module):
    def __init__(self, num_classes, overlap_thresh, prior_for_matching,
                 bkg_label, neg_mining, neg_pos, neg_overlap, encode_target,
                 use_gpu=True):
        super(MultiBoxLoss, self).__init__()
        self.use_gpu = use_gpu
        self.num_classes = num_classes
        self.threshold = overlap_thresh
        self.background_label = bkg_label
        self.encode_target = encode_target
        self.use_prior_for_matching = prior_for_matching
        self.do_neg_mining = neg_mining
        self.negpos_ratio = neg_pos
        self.neg_overlap = neg_overlap
        self.variance = cfg['variance']

    def forward(self, predictions, targets):
        loc_data, conf_data, priors = predictions
        num = loc_data.size(0)
        priors = priors[:loc_data.size(1), :]
        num_priors = (priors.size(0))
        num_classes = self.num_classes

        # match priors (default boxes) and ground truth boxes
        loc_t = torch.Tensor(num, num_priors, 4)
        conf_t = torch.LongTensor(num, num_priors)
        for idx in range(num):
            truths = targets[idx][:, :-1].data
            labels = targets[idx][:, -1].data
            defaults = priors.data
            match(self.threshold, truths, defaults, self.variance, labels,
                  loc_t, conf_t, idx)
        if self.use_gpu:
            loc_t = loc_t.cuda()
            conf_t = conf_t.cuda()
        # wrap targets
        loc_t = Variable(loc_t, requires_grad=False)
        conf_t = Variable(conf_t, requires_grad=False)

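        pos = conf_t > 0
        num_pos = pos.long().sum(1, keepdim=True)

        # Localization Loss (Smooth L1)
        # Shape: [batch,num_priors,4]
        pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        loc_p = loc_data[pos_idx].view(-1, 4)
        loc_t = loc_t[pos_idx].view(-1, 4)
        loss_l = F.smooth_l1_loss(loc_p, loc_t, size_average=False)

        # Confidence loss with hard negative mining. This step is assumed to
        # follow the standard SSD recipe: rank the negatives by their loss and
        # keep negpos_ratio negatives for every positive.
        batch_conf = conf_data.view(-1, self.num_classes)
        loss_c = log_sum_exp(batch_conf) - batch_conf.gather(1, conf_t.view(-1, 1))
        loss_c = loss_c.view(num, -1)
        loss_c[pos] = 0  # ignore positives while ranking negatives
        _, loss_idx = loss_c.sort(1, descending=True)
        _, idx_rank = loss_idx.sort(1)
        num_neg = torch.clamp(self.negpos_ratio * num_pos, max=pos.size(1) - 1)
        neg = idx_rank < num_neg.expand_as(idx_rank)
        keep = pos | neg
        conf_p = conf_data[keep.unsqueeze(2).expand_as(conf_data)].view(-1, self.num_classes)
        loss_c = F.cross_entropy(conf_p, conf_t[keep], size_average=False)

        # Normalize both losses by the number of matched (positive) priors
        N = num_pos.sum().float()
        loss_l /= N
        loss_c /= N
        return loss_l, loss_c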
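Two ideas carry most of the loss above: priors are matched to ground-truth boxes by their overlap, and the localization error of the matched priors is measured with a smooth L1 penalty. The sketch below shows both in plain Python. The helper names `iou` and `smooth_l1`, the sample boxes and the 0.5 overlap threshold are my own assumptions for illustration; the real `match` helper in box_utils does more, such as encoding offsets with the variances.

```python
def iou(box_a, box_b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def smooth_l1(pred, target):
    """Sum of smooth L1 terms: 0.5 * d^2 if |d| < 1, otherwise |d| - 0.5."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


truth = (0.0, 0.0, 2.0, 2.0)
priors = [(0.0, 0.0, 2.0, 1.0), (3.0, 3.0, 4.0, 4.0)]

# A prior counts as positive when it overlaps the ground truth enough.
positives = [p for p in priors if iou(p, truth) >= 0.5]
print(positives)                            # [(0.0, 0.0, 2.0, 1.0)]
print(smooth_l1([0.5, 3.0], [0.0, 0.0]))    # 0.125 + 2.5 = 2.625
```

Small errors are penalized quadratically and large ones linearly, which keeps a few badly placed boxes from dominating the gradient.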

After preparing the datasets and layers, you can proceed to the actual coding part, where you will run detections on the video.

Code for Object Detection based on IBM AI deep learning techniques:

Step 1: Open a new file, name it real time object detection, and import the required libraries.

# Importing the libraries
import torch
from torch.autograd import Variable
import cv2
from data import BaseTransform, VOC_CLASSES as labelmap
from ssd import build_ssd
import imageio

Step 2: Define the function that will perform the detections.

# Defining a function that will do the detections

def detect(frame, net, transform): # We define a detect function that takes as inputs a frame, an SSD neural network, and a transformation to apply to the images, and that returns the frame with the detector rectangles.

    height, width = frame.shape[:2]    # We get the height and the width of the frame.
    frame_t = transform(frame)[0]    # Applying the transformation to our frame.
    x = torch.from_numpy(frame_t).permute(2, 0, 1)   # Convert the frame into a torch tensor.
    x = Variable(x.unsqueeze(0))    # We add a fake dimension corresponding to the batch.
    y = net(x)    # We feed the neural network ssd with the image and we get the output y.
    detections = y.data    # We create the detections tensor contained in the output y: [batch, class, occurrence, (score, x0, y0, x1, y1)].
    scale = torch.Tensor([width, height, width, height])   # We create a tensor object of dimensions [width, height, width, height].
    for i in range(detections.size(1)): # For every class:
        j = 0       # We initialize the loop variable j that will correspond to the occurrences of the class.
        while detections[0, i, j, 0] >= 0.6:    # We take into account all the occurrences j of the class i that have a matching score larger than 0.6.
            pt = (detections[0, i, j, 1:] * scale).numpy()    # We get the coordinates of the points at the upper left and the lower right of the detector rectangle.
            cv2.rectangle(frame, (int(pt[0]), int(pt[1])), (int(pt[2]), int(pt[3])), (255, 0, 0), 2)    # We draw a rectangle around the detected object.
            cv2.putText(frame, labelmap[i - 1], (int(pt[0]), int(pt[1])), cv2.FONT_HERSHEY_SIMPLEX, 2, (255, 255, 255), 2, cv2.LINE_AA)    # We put the label of the class right above the rectangle.
            j += 1     # We increment j to get to the next occurrence.
    return frame     # We return the original frame with the detector rectangle and the label around the detected object.
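The inner loop of the function does two things: it stops at the first detection whose score falls below 0.6, and it multiplies the normalized coordinates by the frame size to get pixels. Here is that logic on its own, without the network or OpenCV. The helper name `boxes_to_pixels` and the sample detections are hypothetical, and the sketch assumes, as the while loop above does, that the detections of a class arrive sorted by descending score.

```python
def boxes_to_pixels(detections, width, height, threshold=0.6):
    """Convert (score, xmin, ymin, xmax, ymax) rows to pixel rectangles.

    Coordinates are expected in [0, 1] and rows sorted by descending score.
    """
    kept = []
    for score, xmin, ymin, xmax, ymax in detections:
        if score < threshold:
            break  # rows are sorted, so nothing further can qualify
        kept.append((int(xmin * width), int(ymin * height),
                     int(xmax * width), int(ymax * height)))
    return kept


# Hypothetical detections for one class on a 640 x 480 frame.
dets = [(0.9, 0.25, 0.5, 0.75, 1.0), (0.3, 0.0, 0.0, 0.5, 0.5)]
rects = boxes_to_pixels(dets, width=640, height=480)
print(rects)  # [(160, 240, 480, 480)]
```

Raising the threshold gives fewer but more reliable rectangles; lowering it shows more candidates at the cost of false positives.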

Step 3: You need to create the SSD neural network as follows.

# Creating the SSD neural network

net = build_ssd('test')      # We create an object that is our neural network ssd.
net.load_state_dict(torch.load('ssd300_mAP_77.43_v2.pth', map_location = lambda storage, loc: storage))  # We get the weights of the neural network from another one that is pretrained (ssd300_mAP_77.43_v2.pth).

Step 4: Now it's time to create the transformation.

# Creating the transformation

transform = BaseTransform(net.size, (104/256.0, 117/256.0, 123/256.0))   # We create an object of the Base Transform class, a class that will do the required transformations so that the image can be the input of the neural network.

Step 5: Now with the following code you can do some object detection on a video.

# Doing some Object Detection on a video

reader = imageio.get_reader('funny_dog.mp4') # We open the video.
fps = reader.get_meta_data()['fps'] # We get the frame rate (frames per second).
writer = imageio.get_writer('output.mp4', fps = fps) # We create an output video with the same frame rate.
for i, frame in enumerate(reader): # We iterate over the frames of the input video:
    frame = detect(frame, net.eval(), transform) # We call our detect function (defined above) to detect the object on the frame.
    writer.append_data(frame) # We add the next frame in the output video.
    print(i) # We print the number of the processed frame.

Step 6: Close the file which handles detections.

writer.close() # We close the process that handles the creation of the output video.

With the code above you can detect the objects in a video, and the SSD algorithm does the heavy lifting. IBM AI services also help developers carry out similar tasks, identifying or detecting objects in a video on a secured and encrypted platform. Now it's time to deploy your code on the IBM AI cloud to generate accurate, high-quality results.

Register for the IBM Developer Day event to talk with experts and explore trending technologies.