Pose estimation is a computer vision and deep learning method whose goal is to detect a person and their pose in a given image. This is done by locating specific landmarks such as the head, shoulders, elbows, hands, hips, knees, and feet. By tracking the position and orientation of these body parts, a rough estimate of the person and all their movements is obtained. It's basically like AI doing the Glowstick man lockdown challenge!
Before we get into that, let’s take a quick look at the Flo Edge One, a must-have in every AI and robotics engineer’s toolbox. Here are some remarkable benchmarks that make it a top competitor in edge devices!
- Pre-installed with Ubuntu 22.04 and tools like ROS2, OpenCV, TFLite, etc.
- Qualcomm Adreno 630 GPU.
- 12 MP 4K camera at up to 30 fps.
- Runs PoseNet with a MobileNet backbone at a 17-millisecond inference time.
Human pose estimation is used to track the keypoints of human bodies, such as "left_knee", "right_hip", and so on. Keypoint tracking on live video has traditionally demanded heavy computational resources while still lacking accuracy, but with recent advances in hardware and model architectures, the task has become far more feasible. Today, the foundation of most image processing techniques is a very powerful tool called the convolutional neural network (CNN), and CNNs have been tailored specifically for pose estimation as well.
Typically, human pose estimation begins by identifying a person in the image. This is an object detection task: a person is detected and enclosed in a bounding box, and landmarks/keypoints are then detected within that box for live pose tracking. Modern deep learning methods have achieved several breakthroughs in both 2D and 3D pose estimation as well as multi-person pose estimation. In this blog, we will look at PoseNet, a very commonly used 2D single-person pose estimation architecture that is both fast and lightweight.
Dataset and Model:
As mentioned earlier, most pose estimation models are two-step architectures that first detect human bounding boxes and then estimate keypoints within those boxes. The model has been trained on the COCO benchmark dataset with 17 identifiable keypoints – "nose", "left_eye", "right_eye", "left_ear", "right_ear", "left_shoulder", "right_shoulder", "left_elbow", "right_elbow", "left_wrist", "right_wrist", "left_hip", "right_hip", "left_knee", "right_knee", "left_ankle", "right_ankle". Each keypoint is annotated as (x, y, v), where x and y are the coordinates of the keypoint and v indicates whether it is visible.
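In COCO format, one person's keypoints are stored as a flat list of 51 numbers (17 triplets of (x, y, v)). A minimal sketch of unpacking such an annotation into named keypoints (the sample values below are made up for illustration):

```python
# COCO keypoint names in annotation order.
KEYPOINT_NAMES = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

def unpack_keypoints(flat):
    """Convert a flat COCO list [x1, y1, v1, x2, y2, v2, ...]
    into {name: (x, y, v)}; v == 2 means labeled and visible."""
    assert len(flat) == 3 * len(KEYPOINT_NAMES)
    return {
        name: (flat[3 * i], flat[3 * i + 1], flat[3 * i + 2])
        for i, name in enumerate(KEYPOINT_NAMES)
    }

# Illustrative (made-up) annotation: nose at (120, 80), visible;
# all other keypoints unlabeled.
sample = [120, 80, 2] + [0, 0, 0] * 16
kps = unpack_keypoints(sample)
print(kps["nose"])  # (120, 80, 2)
```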
PoseNet is supported by a MobileNet backbone, a lightweight architecture well suited to web deployments and edge devices like the Flo Edge One. It takes a 257 x 257 RGB image (from a video stream, camera stream, or still image) as input and produces four 9 x 9 tensors with channel sizes 17, 34, 32, and 32. Of these four tensors, the first two – the heatmaps and the offsets – are used to calculate the position and confidence score of each of the 17 keypoints.
Heatmaps:
This tensor contains 17 channels, one for each identifiable keypoint. A single channel contains the heatmap for its corresponding keypoint, indicating the estimated locations of that point along with a confidence score for each; the most likely location is picked based on these scores.
Offsets:
This tensor contains 34 channels, twice the number of identifiable keypoints. It refines the coarse heatmap locations: the first 17 channels give the offsets along the y axis, while the last 17 give the offsets along the x axis.
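Putting the two tensors together, decoding works roughly like this: for each keypoint, take the argmax cell of its heatmap, scale that cell back to image space (output stride 32 for a 257-input, 9 x 9-output model), and add the cell's offset. Below is a minimal NumPy sketch with synthetic tensors; the decode logic follows the standard PoseNet scheme, and the shapes match the model described above:

```python
import numpy as np

OUTPUT_STRIDE = 32  # (257 - 1) / (9 - 1)
NUM_KEYPOINTS = 17

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_keypoints(heatmaps, offsets):
    """heatmaps: (9, 9, 17) raw logits; offsets: (9, 9, 34),
    first 17 channels are y-offsets, last 17 are x-offsets.
    Returns positions (17, 2) as (y, x) pixels and scores (17,)."""
    positions = np.zeros((NUM_KEYPOINTS, 2))
    scores = np.zeros(NUM_KEYPOINTS)
    for k in range(NUM_KEYPOINTS):
        # Most confident cell for this keypoint.
        i, j = np.unravel_index(np.argmax(heatmaps[:, :, k]),
                                heatmaps.shape[:2])
        y = i * OUTPUT_STRIDE + offsets[i, j, k]
        x = j * OUTPUT_STRIDE + offsets[i, j, k + NUM_KEYPOINTS]
        positions[k] = (y, x)
        scores[k] = sigmoid(heatmaps[i, j, k])
    return positions, scores

# Synthetic check: a strong peak for keypoint 0 at cell (4, 4).
hm = np.full((9, 9, 17), -5.0)
hm[4, 4, 0] = 5.0
off = np.zeros((9, 9, 34))
off[4, 4, 0] = 3.0    # y-offset
off[4, 4, 17] = -2.0  # x-offset
pos, sc = decode_keypoints(hm, off)
print(pos[0])  # [131. 126.]  ->  (4*32 + 3, 4*32 - 2)
```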
Now let’s take a look at how this model can be run efficiently on the Flo Edge One. This impressive device boasts a light GPU that delivers smooth results while maintaining a high level of accuracy. With its onboard camera, inbuilt IMU, and GNSS capabilities, it truly has it all, and it comes pre-installed with Ubuntu 22.04, ROS2, OpenCV, and various other tools, making it a perfect choice for your robotics ventures. The best part? Even amidst the semiconductor crisis, the Flo Edge One is affordable and ready to ship, so you can get started ASAP without breaking the bank.
Want to build a hobby project version of the AMP suit from the Avatar movies? (Of course, not to destroy a whole species but just for fun.) Equip your bot with a human pose estimation model and control it just by gestures! With the Flo Edge One's 12 MP camera, run the pose estimation model and track your actions. Then, based on a set of predetermined relative gestures, match the detected pose and have the robot perform the corresponding action in response.
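As a toy example of such gesture matching, a "right hand raised" check can compare the wrist and shoulder keypoints returned by the model. The dictionary format and score threshold here are illustrative assumptions, not part of PoseNet itself:

```python
def right_hand_raised(kps, min_score=0.5):
    """kps maps keypoint name -> (x, y, score); image y grows downward.
    Returns True if the right wrist is clearly above the right shoulder."""
    wrist = kps["right_wrist"]
    shoulder = kps["right_shoulder"]
    if wrist[2] < min_score or shoulder[2] < min_score:
        return False  # not confident enough to act on this frame
    return wrist[1] < shoulder[1]

# Illustrative frames: wrist above shoulder -> trigger the robot action.
raised = {"right_wrist": (150, 60, 0.9), "right_shoulder": (140, 120, 0.95)}
lowered = {"right_wrist": (150, 200, 0.9), "right_shoulder": (140, 120, 0.95)}
print(right_hand_raised(raised), right_hand_raised(lowered))  # True False
```

A real controller would add hysteresis (require the gesture to hold for a few frames) so the robot doesn't react to single-frame jitter.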
The inference time of this model was around 17 milliseconds, with the end-to-end pipeline running at 6-7 FPS. The PoseNet architecture is light and was easy to load onto the Flo Edge One GPU. The confidence scores for individual keypoints were in the range of 0.6-1.0, while the average score was 0.88-0.92. The accuracy of the model was quite decent, but it had trouble detecting keypoints when the face was not visible.
Since the development of PoseNet, many other highly accurate and fast models have been designed and implemented, such as OpenPose, MoveNet Lightning, MoveNet Thunder, etc. Some of these models are computationally heavy, making them a better choice for hardware that can bear the load rather than for edge devices.
The pose estimation model is lightweight and fast when run on the Flo Edge GPU and gives good tracking results. Coupled with the 12 MP onboard camera, a wide range of systems can be developed for use cases like 3D modelling, robot training, console control, surveillance, and many more.