Computer vision is one of the most important fields of artificial intelligence. It is the science of building computers and software systems that can recognize objects and scenes. It covers tasks such as object detection, image recognition, image generation and many more. Object detection in particular has many practical uses, and programmers and software developers keep finding new ones.
Deep learning techniques have made accurate object detection algorithms and methods practical.
Let’s start with the concept itself: what is object detection with respect to AI?
What is object detection in a video file with respect to AI?
Object detection is a branch of computer vision in which a computer locates, detects and recognizes objects in images or in the frames of a video. Deep learning algorithms have made detecting objects in video practical, and specialized algorithms have been developed for exactly this task. Two widely used examples are SSD (Single Shot Detection) and RetinaNet.
To be more precise, applying AI-based deep learning techniques to detect and recognize objects requires substantial computational power, applied mathematics and solid technical knowledge, often with thousands of lines of code.
With IBM Watson AI services, you can generate accurate results with a shorter turnaround time. These services aim to combine human intelligence and machine intelligence into high-quality solutions.
IBM Watson provides a flexible environment for deploying AI applications. You can use its tools with large volumes of data to achieve good throughput with GPU acceleration. Object detection applications need more computational power than CPUs alone provide; GPUs shorten training and inference time considerably, and Watson AI offers GPU resources for this purpose.
Writing deep learning and neural network code from scratch is not easy for any developer. With that in mind, IBM Watson provides the required frameworks and models as reusable components.
Next, I am going to discuss the Single Shot Detection (SSD) algorithm, which is used for object detection.
Single Shot Detection Algorithm
The SSD algorithm is one of the most popular algorithms for detecting objects. It uses the multi-box concept to predict object positions and to handle objects that appear at different scales.
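To make the scale handling a little more concrete: the SSD paper gives each feature map its own default-box scale, spaced linearly between a minimum and a maximum. The sketch below computes those scales. The function name `default_box_scales` is hypothetical, and the values 0.2 and 0.9 are the defaults reported in the SSD paper, not something the code in this article sets explicitly.

```python
def default_box_scales(num_feature_maps, s_min=0.2, s_max=0.9):
    """Return one default-box scale per feature map, spaced linearly.

    Scales are fractions of the input image size: early (fine) feature maps
    get small boxes, later (coarse) feature maps get large boxes.
    """
    if num_feature_maps < 1:
        raise ValueError("num_feature_maps must be at least 1")
    if num_feature_maps == 1:
        return [s_min]
    step = (s_max - s_min) / (num_feature_maps - 1)
    return [s_min + step * k for k in range(num_feature_maps)]


scales = default_box_scales(6)
print([round(s, 2) for s in scales])  # [0.2, 0.34, 0.48, 0.62, 0.76, 0.9]
```

Small scales let the detector pick up small objects and large scales cover objects that fill most of the frame, which is how a single pass can deal with very different object sizes.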
So, how do computers, or rather object detection algorithms, detect objects?
Training a detector that copes with objects of every size and position is difficult. SSD addresses this with multi-box detection: the image is divided into a grid of segments, several boxes are constructed for every segment, and the network predicts for each box whether it contains an object.
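The idea of constructing boxes for every segment can be sketched in a few lines. The example below builds boxes for one square grid, using the width and height formulas from the SSD paper (width = scale * sqrt(ratio), height = scale / sqrt(ratio)). The function name `grid_default_boxes` and the chosen aspect ratios are assumptions for illustration; this is not the prior-box code of the SSD implementation used later in this article.

```python
from itertools import product
from math import sqrt


def grid_default_boxes(grid_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    """Build default boxes (cx, cy, w, h) for one square feature map.

    All values are normalised to [0, 1] relative to the image size.
    """
    boxes = []
    for row, col in product(range(grid_size), repeat=2):
        # Centre of the current grid cell.
        cx = (col + 0.5) / grid_size
        cy = (row + 0.5) / grid_size
        for ratio in aspect_ratios:
            # Every ratio keeps the same area; only the shape changes.
            boxes.append((cx, cy, scale * sqrt(ratio), scale / sqrt(ratio)))
    return boxes


boxes = grid_default_boxes(grid_size=3, scale=0.5)
print(len(boxes))  # 3 * 3 cells * 3 aspect ratios = 27
print(boxes[0])
```

During training, each of these boxes is compared with the ground truth boxes, and the network learns an offset and a class score per box.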
Real Time Object Detection With Deep Learning:
Deep learning and neural networks are among the most powerful methods in computer vision, because they give the computer a learned model with which to perform detections.
Now I am going to detect objects inside a video.
Before proceeding, don’t forget to connect to the virtual platform where your code will be executed.
As a next step, import all the required classes into your virtual environment.
You need to extract your project folder into your virtual platform. It contains the data folder, which handles the transformations required for the input images in the video.
I created a file with the following name.
File name: voc0712.py
"""VOC Dataset Classes"""
import os
import os.path
import sys
import torch
import torch.utils.data as data
import torchvision.transforms as transforms
from PIL import Image, ImageDraw, ImageFont
import cv2
import numpy as np

if sys.version_info[0] == 2:
    import xml.etree.cElementTree as ET
else:
    import xml.etree.ElementTree as ET

VOC_CLASSES = (  # always index 0
    'aeroplane', 'bicycle', 'bird', 'boat',
    'bottle', 'bus', 'car', 'cat', 'chair',
    'cow', 'diningtable', 'dog', 'horse',
    'motorbike', 'person', 'pottedplant',
    'sheep', 'sofa', 'train', 'tvmonitor')

# for making bounding boxes pretty
COLORS = ((255, 0, 0, 128), (0, 255, 0, 128), (0, 0, 255, 128),
          (0, 255, 255, 128), (255, 0, 255, 128), (255, 255, 0, 128))


class AnnotationTransform(object):
    """Transforms a VOC annotation into a Tensor of bbox coords and label index.

    Initialized with a dictionary lookup of classnames to indexes.

    Arguments:
        class_to_ind (dict, optional): dictionary lookup of classnames -> indexes
            (default: alphabetic indexing of VOC's 20 classes)
        keep_difficult (bool, optional): keep difficult instances or not
            (default: False)
    """

    def __init__(self, class_to_ind=None, keep_difficult=False):
        self.class_to_ind = class_to_ind or dict(
            zip(VOC_CLASSES, range(len(VOC_CLASSES))))
        self.keep_difficult = keep_difficult

    def __call__(self, target, width, height):
        """
        Arguments:
            target (annotation): the target annotation to be made usable,
                will be an ET.Element
            width (int): width of the image
            height (int): height of the image
        Returns:
            a list containing lists of bounding boxes [bbox coords, class name]
        """
        res = []
        for obj in target.iter('object'):
            difficult = int(obj.find('difficult').text) == 1
            if not self.keep_difficult and difficult:
                continue
            name = obj.find('name').text.lower().strip()
            bbox = obj.find('bndbox')

            pts = ['xmin', 'ymin', 'xmax', 'ymax']
            bndbox = []
            for i, pt in enumerate(pts):
                cur_pt = int(bbox.find(pt).text) - 1
                # scale height or width
                cur_pt = cur_pt / width if i % 2 == 0 else cur_pt / height
                bndbox.append(cur_pt)
            label_idx = self.class_to_ind[name]
            bndbox.append(label_idx)
            res += [bndbox]  # [xmin, ymin, xmax, ymax, label_ind]
            # img_id = target.find('filename').text[:-4]

        return res  # [[xmin, ymin, xmax, ymax, label_ind], ... ]


class VOCDetection(data.Dataset):
    """VOC Detection Dataset Object.

    Input is image, target is annotation.

    Arguments:
        root (string): filepath to VOCdevkit folder.
        image_sets (list of (year, name) tuples): image sets to use
            (name is e.g. 'train', 'val', 'test')
        transform (callable, optional): transformation to perform on the
            input image
        target_transform (callable, optional): transformation to perform on the
            target `annotation`
            (eg: take in caption string, return tensor of word indices)
        dataset_name (string, optional): which dataset to load
            (default: 'VOC0712')
    """

    def __init__(self, root, image_sets, transform=None, target_transform=None,
                 dataset_name='VOC0712'):
        self.root = root
        self.image_set = image_sets
        self.transform = transform
        self.target_transform = target_transform
        self.name = dataset_name
        self._annopath = os.path.join('%s', 'Annotations', '%s.xml')
        self._imgpath = os.path.join('%s', 'JPEGImages', '%s.jpg')
        self.ids = list()
        for (year, name) in image_sets:
            rootpath = os.path.join(self.root, 'VOC' + year)
            with open(os.path.join(rootpath, 'ImageSets', 'Main', name + '.txt')) as f:
                for line in f:
                    self.ids.append((rootpath, line.strip()))

    def __getitem__(self, index):
        im, gt, h, w = self.pull_item(index)
        return im, gt

    def __len__(self):
        return len(self.ids)

    def pull_item(self, index):
        img_id = self.ids[index]

        target = ET.parse(self._annopath % img_id).getroot()
        img = cv2.imread(self._imgpath % img_id)
        height, width, channels = img.shape

        if self.target_transform is not None:
            target = self.target_transform(target, width, height)

        if self.transform is not None:
            target = np.array(target)
            img, boxes, labels = self.transform(img, target[:, :4], target[:, 4])
            # to rgb
            img = img[:, :, (2, 1, 0)]
            # img = img.transpose(2, 0, 1)
            target = np.hstack((boxes, np.expand_dims(labels, axis=1)))
        return torch.from_numpy(img).permute(2, 0, 1), target, height, width
        # return torch.from_numpy(img), target, height, width

    def pull_image(self, index):
        """Returns the original image object at index in PIL form.

        Note: not using self.__getitem__(), as any transformations passed in
        could mess up this functionality.
        """
        # The body of this method is cut off in this listing.
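To see what the annotation transform does with a single annotation, here is a small standalone sketch that follows the same logic as `AnnotationTransform.__call__`. The XML snippet is made up for illustration, and the helper name `parse_boxes` and the two-entry `CLASS_TO_IND` dictionary are hypothetical; only the standard library is needed.

```python
import xml.etree.ElementTree as ET

# Made-up annotation in the Pascal VOC layout, for illustration only.
SAMPLE_XML = """
<annotation>
  <size><width>500</width><height>400</height></size>
  <object>
    <name>dog</name>
    <difficult>0</difficult>
    <bndbox><xmin>101</xmin><ymin>41</ymin><xmax>301</xmax><ymax>241</ymax></bndbox>
  </object>
</annotation>
"""

CLASS_TO_IND = {"cat": 7, "dog": 11}  # same indices these names have in VOC_CLASSES


def parse_boxes(root, width, height, class_to_ind):
    """Return [[xmin, ymin, xmax, ymax, label_idx], ...] scaled to [0, 1]."""
    results = []
    for obj in root.iter("object"):
        if int(obj.find("difficult").text) == 1:
            continue  # skip objects flagged as difficult
        name = obj.find("name").text.lower().strip()
        bbox = obj.find("bndbox")
        coords = []
        for i, tag in enumerate(("xmin", "ymin", "xmax", "ymax")):
            value = int(bbox.find(tag).text) - 1  # same 1-pixel shift as the listing
            # x coordinates are divided by the width, y coordinates by the height.
            coords.append(value / width if i % 2 == 0 else value / height)
        results.append(coords + [class_to_ind[name]])
    return results


root = ET.fromstring(SAMPLE_XML)
parsed = parse_boxes(root, width=500, height=400, class_to_ind=CLASS_TO_IND)
print(parsed)  # [[0.2, 0.1, 0.6, 0.6, 11]]
```

Normalising the coordinates to the range 0 to 1 means the boxes stay valid after the image is resized for the network.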
Another folder, layer, is used for SSD-specific operations such as multi-box detection. Start by importing the required libraries.
# importing the libraries
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.autograd import Variable
from data import v2 as cfg
from ..box_utils import match, log_sum_exp


class MultiBoxLoss(nn.Module):

    def __init__(self, num_classes, overlap_thresh, prior_for_matching,
                 bkg_label, neg_mining, neg_pos, neg_overlap, encode_target,
                 use_gpu=True):
        super(MultiBoxLoss, self).__init__()
        self.use_gpu = use_gpu
        self.num_classes = num_classes
        self.threshold = overlap_thresh
        self.background_label = bkg_label
        self.encode_target = encode_target
        self.use_prior_for_matching = prior_for_matching
        self.do_neg_mining = neg_mining
        self.negpos_ratio = neg_pos
        self.neg_overlap = neg_overlap
        self.variance = cfg['variance']

    def forward(self, predictions, targets):
        loc_data, conf_data, priors = predictions
        num = loc_data.size(0)
        priors = priors[:loc_data.size(1), :]
        num_priors = priors.size(0)
        num_classes = self.num_classes

        # match priors (default boxes) and ground truth boxes
        loc_t = torch.Tensor(num, num_priors, 4)
        conf_t = torch.LongTensor(num, num_priors)
        for idx in range(num):
            truths = targets[idx][:, :-1].data
            labels = targets[idx][:, -1].data
            defaults = priors.data
            match(self.threshold, truths, defaults, self.variance, labels,
                  loc_t, conf_t, idx)
        if self.use_gpu:
            loc_t = loc_t.cuda()
            conf_t = conf_t.cuda()
        # wrap targets
        loc_t = Variable(loc_t, requires_grad=False)
        conf_t = Variable(conf_t, requires_grad=False)

        pos = conf_t > 0
        num_pos = pos.sum(dim=1, keepdim=True)  # number of positive priors per image

        # Localization Loss (Smooth L1)
        # Shape: [batch,num_priors,4]
        pos_idx = pos.unsqueeze(pos.dim()).expand_as(loc_data)
        loc_p = loc_data[pos_idx].view(-1, 4)
        loc_t = loc_t[pos_idx].view(-1, 4)
        loss_l = F.smooth_l1_loss(loc_p, loc_t, size_average=False)

        # NOTE: the computation of the confidence loss loss_c is not shown in
        # this listing; it must be defined before the normalisation below.

        N = num_pos.data.sum()
        loss_l /= N
        loss_c /= N
        return loss_l, loss_c
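Two ideas in this listing are easy to try out in isolation: deciding whether a default box matches a ground truth box via an overlap threshold, and the Smooth L1 localization loss. The sketch below shows both in plain Python. The helper names `iou` and `smooth_l1` are hypothetical and this is not the implementation of `match` in `box_utils`; it only illustrates the underlying arithmetic.

```python
def iou(box_a, box_b):
    """Jaccard overlap of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def smooth_l1(pred, target):
    """Smooth L1 loss summed over coordinates (quadratic below 1, linear above)."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff if diff < 1.0 else diff - 0.5
    return total


truth = (0.0, 0.0, 2.0, 2.0)
prior = (1.0, 0.0, 3.0, 2.0)
overlap = iou(truth, prior)
print(overlap)          # intersection area 2, union area 6
print(overlap >= 0.5)   # False: not a positive match at a threshold of 0.5
print(smooth_l1([0.5, 3.0], [0.0, 0.0]))  # 0.125 + 2.5 = 2.625
```

Smooth L1 behaves like a squared error for small differences and like an absolute error for large ones, so a few badly placed boxes do not dominate the loss.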
After preparing the dataset and the layers, you can proceed to the actual coding part, where you will run detections on the video.
Code for Object Detection based on IBM AI deep learning techniques:
Step 1: Open a new file, name it real time object detection, and import the required libraries.
# Importing the libraries
import torch
from torch.autograd import Variable
import cv2
from data import BaseTransform, VOC_CLASSES as labelmap
from ssd import build_ssd
import imageio
Step 2: Define the function that will perform the detections.
# Defining a function that will do the detections
def detect(frame, net, transform):
    """Return the frame with a rectangle and a label drawn around each detected object.

    Inputs are a frame, an SSD neural network, and the transformation to apply to the frame.
    """
    height, width = frame.shape[:2]  # height and width of the frame
    frame_t = transform(frame)[0]  # apply the transformation and keep the transformed image
    x = torch.from_numpy(frame_t).permute(2, 0, 1)  # convert the frame into a torch tensor, channels first
    x = Variable(x.unsqueeze(0))  # add a fake dimension corresponding to the batch
    y = net(x)  # feed the image to the SSD neural network
    detections = y.data  # detections tensor: [batch, class, occurrence, (score, x0, y0, x1, y1)]
    scale = torch.Tensor([width, height, width, height])  # scales normalized coordinates to pixels
    for i in range(detections.size(1)):  # for every class
        j = 0  # index of the occurrence of class i
        # Consider every occurrence j of class i with a matching score of at least 0.6.
        while j < detections.size(2) and detections[0, i, j, 0] >= 0.6:
            # Upper-left and lower-right corners of the detector rectangle.
            pt = (detections[0, i, j, 1:] * scale).numpy()
            # Draw a rectangle around the detected object.
            cv2.rectangle(frame, (int(pt[0]), int(pt[1])), (int(pt[2]), int(pt[3])), (255, 0, 0), 2)
            # Put the label of the class right above the rectangle.
            cv2.putText(frame, labelmap[i - 1], (int(pt[0]), int(pt[1])), cv2.FONT_HERSHEY_SIMPLEX, 2, (255, 255, 255), 2, cv2.LINE_AA)
            j += 1  # move on to the next occurrence
    return frame  # the original frame with rectangles and labels drawn on it
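The core of the detect function is the step that keeps only confident detections and converts their normalized coordinates into pixel positions. Here is a minimal standalone sketch of just that step, without PyTorch or OpenCV. The function name `keep_confident` and the sample rows are made up for illustration, and the sketch assumes the rows are already sorted by descending score, which the while loop in the function above also relies on.

```python
def keep_confident(rows, frame_width, frame_height, threshold=0.6):
    """Convert [score, xmin, ymin, xmax, ymax] rows into pixel boxes.

    Assumes rows are sorted by descending score, so scanning can stop at the
    first row that falls below the threshold.
    """
    scale = (frame_width, frame_height, frame_width, frame_height)
    kept = []
    for score, *coords in rows:
        if score < threshold:
            break
        kept.append(tuple(int(c * s) for c, s in zip(coords, scale)))
    return kept


rows = [
    [0.92, 0.125, 0.25, 0.5, 0.75],
    [0.71, 0.5, 0.25, 0.875, 0.75],
    [0.30, 0.0, 0.0, 0.25, 0.25],  # below the threshold, dropped
]
pixel_boxes = keep_confident(rows, frame_width=640, frame_height=480)
print(pixel_boxes)  # [(80, 120, 320, 360), (320, 120, 560, 360)]
```

Raising the threshold gives fewer but more reliable boxes; lowering it shows more detections at the cost of more false positives.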
Step 3: Create the SSD neural network as follows.
# Creating the SSD neural network
net = build_ssd('test')  # create the SSD neural network in its test phase
# Load the weights of a pretrained network (ssd300_mAP_77.43_v2.pth), keeping the tensors on the CPU.
net.load_state_dict(torch.load('ssd300_mAP_77.43_v2.pth', map_location=lambda storage, loc: storage))
Step 4: Now it’s time to create the transformation.
# Creating the transformation
# BaseTransform applies the transformations required so that a frame can be the input of the neural network.
transform = BaseTransform(net.size, (104/256.0, 117/256.0, 123/256.0))
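If you are wondering what such a base transformation typically involves, the usual recipe is to resize the frame to the square input size the network expects and to subtract a per-channel mean. The sketch below shows that idea on a tiny image made of nested lists. The function name `base_transform_sketch`, the nearest-neighbour resize and the mean values are all assumptions for illustration; the real `BaseTransform` in the data package may differ in detail.

```python
def base_transform_sketch(image, size, mean):
    """Resize an image to size x size (nearest neighbour) and subtract a mean.

    `image` is a list of rows, each row a list of 3-channel pixel tuples.
    """
    src_h, src_w = len(image), len(image[0])
    out = []
    for y in range(size):
        src_y = y * src_h // size  # nearest source row
        row = []
        for x in range(size):
            src_x = x * src_w // size  # nearest source column
            pixel = image[src_y][src_x]
            row.append(tuple(float(v) - m for v, m in zip(pixel, mean)))
        out.append(row)
    return out


# A toy frame, 2 rows by 4 columns: left half dark, right half bright.
toy_frame = [[(10, 10, 10)] * 2 + [(200, 200, 200)] * 2 for _ in range(2)]
resized = base_transform_sketch(toy_frame, size=2, mean=(104.0, 117.0, 123.0))
print(resized[0])  # [(-94.0, -107.0, -113.0), (96.0, 83.0, 77.0)]
```

Subtracting the mean centres the pixel values around zero, which matches how the pretrained network saw its training images.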
Step 5: With the following code you can run object detection on a video.
# Doing some Object Detection on a video
reader = imageio.get_reader('funny_dog.mp4')  # open the input video
fps = reader.get_meta_data()['fps']  # frame rate of the input video (frames per second)
writer = imageio.get_writer('output.mp4', fps=fps)  # create the output video with the same frame rate
for i, frame in enumerate(reader):  # iterate over the frames of the input video
    frame = detect(frame, net.eval(), transform)  # detect the objects on the frame with the function defined above
    writer.append_data(frame)  # add the processed frame to the output video
    print(i)  # print the index of the processed frame
Step 6: Close the writer that handles the output video.
writer.close()  # close the process that handles the creation of the output video
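One thing to keep in mind: if a frame fails halfway through, the writer is never closed and the output file may be left incomplete. A possible variation, sketched below, wraps the loop in try/finally so the writer is always closed. The names `process_video` and `ListWriter` are hypothetical; `ListWriter` is only a stand-in so the sketch runs without a video file, and it mimics the `append_data` and `close` methods used above.

```python
def process_video(reader, writer, process_frame):
    """Run process_frame on every frame and always close the writer."""
    count = 0
    try:
        for frame in reader:
            writer.append_data(process_frame(frame))
            count += 1
    finally:
        writer.close()  # runs even if a frame raises an exception
    return count


class ListWriter:
    """Stand-in for a video writer so the sketch runs without video files."""

    def __init__(self):
        self.frames = []
        self.closed = False

    def append_data(self, frame):
        self.frames.append(frame)

    def close(self):
        self.closed = True


demo_writer = ListWriter()
frames_written = process_video([1, 2, 3], demo_writer, lambda f: f * 10)
print(frames_written, demo_writer.frames, demo_writer.closed)  # 3 [10, 20, 30] True
```

With the objects from this article, the call would look like `process_video(reader, writer, lambda frame: detect(frame, net.eval(), transform))`.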
With the code above you can detect the objects in a video, with the SSD algorithm doing the heavy lifting for the detections. IBM AI services also help developers perform similar tasks, identifying or detecting objects in a video on a secured and encrypted platform. Now it’s time to deploy your code with the IBM AI cloud to generate accurate, high-quality results.
Register for the IBM Developer Day event to talk with experts and explore trending technologies.