Features

Here are a few of the key features and concepts of the object detection module.

Cameras

The object detection features have been tested with Orlaco cameras, but any Ethernet camera can be used. We use GStreamer pipelines to fetch the camera stream and route it into the applications. You can read more about cameras and pipelines in the Camera Module.
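As a rough sketch, the snippet below opens such a camera stream in Python using a GStreamer pipeline through OpenCV (which must be built with GStreamer support). The RTSP address and the pipeline elements are placeholders and will differ between cameras and platforms.

    import cv2

    # Hypothetical camera address; replace with your Ethernet camera's stream URL.
    # The pipeline decodes the stream and hands BGR frames to the application.
    pipeline = (
        "rtspsrc location=rtsp://192.168.0.10/stream latency=0 ! "
        "decodebin ! videoconvert ! video/x-raw,format=BGR ! "
        "appsink drop=true max-buffers=1"
    )

    cap = cv2.VideoCapture(pipeline, cv2.CAP_GSTREAMER)
    if not cap.isOpened():
        raise RuntimeError("Failed to open the GStreamer pipeline")

    while True:
        ok, frame = cap.read()  # frame is a HxWx3 BGR numpy array
        if not ok:
            break
        # ... pass the frame on to pre-processing and detection ...

    cap.release()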

Camera resolutions of up to around 1280x800 are supported. For best performance, we recommend not using a higher resolution than necessary.

Models

AI-based models that are to run on the Google Coral Edge TPU must be in the TensorFlow Lite format. TensorFlow Lite is a variant of the TensorFlow toolset, adapted for mobile and embedded applications. The performance of the object detection depends on the complexity of the model and the size of the image it takes as input.

In the example included in this guide, the result of the detection is expected to be a vector of 32-bit floating-point numbers representing 10 rectangles. The vector is divided into four parts, each with a specific meaning. See the table below.

Part  Size  Count  Content
1     4*N   40     X/Y/Width/Height values for positioning the rectangles.
2     N     10     Label indexes for classifying the 10 objects.
3     N     10     Confidence levels of the 10 objects.
4     1     1      Number of outputs, i.e. N=10.
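Purely as an illustration of the layout in the table, the sketch below splits such a flat output vector into its four parts. The function name and the assumption that the parts lie back to back in this order are ours; verify the exact tensor layout of the model you use.

    def parse_detections(output, n=10):
        """Split a flat float vector into boxes, label indexes,
        confidence scores and the detection count, per the table above."""
        boxes = [output[i * 4:(i + 1) * 4] for i in range(n)]  # N x (X, Y, Width, Height)
        labels = output[4 * n:5 * n]                           # N label indexes
        scores = output[5 * n:6 * n]                           # N confidence levels
        count = int(output[6 * n])                             # number of outputs, i.e. N
        return boxes, labels, scores, count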

Along with the model, a label file is available that contains indexes and labels for the classification of each object, for example person, apple, car, and so on.
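A minimal sketch of reading such a label file, assuming the common Coral format with one "index label" pair per line (for example "0 person"):

    def load_labels(path):
        """Read a label file into a dict of index -> label."""
        labels = {}
        with open(path) as f:
            for line in f:
                index, label = line.strip().split(maxsplit=1)
                labels[int(index)] = label
        return labels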

The confidence level indicates how certain the model is about each detection. It can be used to filter out invalid or uninteresting objects before the result is displayed to the user.

In the case of the SSDLite MobileDet model, the vector always contains 10 detected rectangles. If fewer than 10 objects are detected, the confidence levels of the remaining entries are set to 0.0, and they should be ignored when rendering the result.
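Building on the parse_detections and load_labels sketches above, such filtering could look like this; the threshold value is just an example and should be tuned for the application:

    CONFIDENCE_THRESHOLD = 0.5  # example value, tune for the application

    # 'output' is the raw float vector from the model, as described above.
    boxes, labels, scores, count = parse_detections(output)
    label_names = load_labels("coco_labels.txt")  # hypothetical file name
    detections = [
        (box, label_names[int(label)], score)
        for box, label, score in zip(boxes, labels, scores)
        if score >= CONFIDENCE_THRESHOLD  # also drops the padded 0.0 entries
    ]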

Another important aspect of the TensorFlow models is the image size that they require as input. The SSDLite MobileDet model, for example, expects a square of 320x320 pixels. In most cases, this input rectangle is much smaller than the actual camera feed, so a pre-processing step that crops or scales each frame is needed before it is sent to the inference accelerator. On the Google Coral website, you can see several examples of different models and their required input frame sizes.

Furthermore, the image needs to be in 24-bit RGB color, i.e. 3 bytes per pixel, so if the image contains an alpha channel or uses a different pixel format, it needs to be converted before being used in the detection. You can see an example of such a conversion in the TensorFlow Recipe, if you search for the COLOR_YUV2RGB_YV12 color conversion code.

In the example, the camera frame is simply scaled down to the 320x320 pixel square without maintaining the aspect ratio. Tests show that the detection still gives satisfying results, but this is an area where some optimization may be possible.
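A minimal sketch of such a pre-processing step with OpenCV, assuming the camera delivers YV12 frames as in the TensorFlow Recipe mentioned above:

    import cv2

    def preprocess(frame_yv12):
        """Convert a YV12 frame to 24-bit RGB and scale it to the
        320x320 input square expected by SSDLite MobileDet. As in the
        example, the aspect ratio is not preserved."""
        rgb = cv2.cvtColor(frame_yv12, cv2.COLOR_YUV2RGB_YV12)
        return cv2.resize(rgb, (320, 320))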

The following models have been used for testing the Google Coral accelerator with the displays.

Model                                                           Type of result                  DPS*
posenet_mobilenet_v1_075_481_641_quant_decoder_edgetpu.tflite   Pose detection                  50
ssd_mobilenet_v2_face_quant_postprocess_edgetpu.tflite          Face detection                  30
ssdlite_mobiledet_coco_qat_postprocess_edgetpu.tflite           Identifying 90 objects          55
tf2_mobilenet_v1_1.0_224_ptq_edgetpu.tflite                     Classification of 1000 objects  200

* DPS (detections per second) is an estimate and may depend on other background processes.

All of the models are included in the Machinelearningdemo.

Performance

AI-based operations are known to be computationally heavy. With the assistance of the Google Coral accelerator, it is possible to get high-performing object detection while still leaving processing power for the rest of the application.

With the SSDLite MobileDet model mentioned above as a benchmark, tests show that up to 50 detections per second are achievable with the mPCIe-based Coral accelerator. With the USB dongle version, the figure is up to 20 DPS.
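A simple way to estimate the DPS figure for your own setup is to time repeated inference over a batch of pre-processed frames; run_inference is a placeholder for whatever invokes the model:

    import time

    def measure_dps(run_inference, frames, warmup=5):
        """Estimate detections per second by timing repeated inference."""
        for frame in frames[:warmup]:   # let the accelerator warm up first
            run_inference(frame)
        start = time.monotonic()
        for frame in frames[warmup:]:
            run_inference(frame)
        elapsed = time.monotonic() - start
        return (len(frames) - warmup) / elapsed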

Another heavy operation is displaying the video feed itself; the higher the resolution, the more resources are needed to render a smooth feed on the screen. Significant performance improvements can be achieved by reducing the camera resolution so that it is no higher than the application needs.