3D Object Detection: Bringing Depth to YOLO on the Edge
In this post, using only RAiV, we will detect objects, estimate the scene's depth map, and then use the depth map together with the camera parameters to localize the detected objects in 3D.
Road So Far
In the previous posts, we have shown how to:
Now it's time to combine and extend the code from those posts to achieve 3D localization of the detected objects.
Challenge: Image Rectification
If you examined the depth map closely in the previous post on depth estimation, you may have noticed that it did not align perfectly with either image of the stereo pair. This is because the stereo image pair was rectified before depth estimation to make the images perfectly horizontally aligned.
What is image rectification?
Image rectification is the process of geometrically transforming two images (from the left and right cameras) so that they appear as if they were taken by two perfectly horizontally aligned cameras.
Horizontally aligned cameras are important for depth from stereo: when the images are aligned, the disparity search can be restricted to a single horizontal line, which greatly speeds up depth map estimation.
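Once the disparity is found, depth follows from the standard pinhole relation Z = f · B / d, where f is the focal length in pixels, B the baseline between the cameras, and d the disparity. A minimal sketch of that relation; the focal length, baseline, and disparity values below are illustrative, not RAiV's actual calibration:

```python
def depth_from_disparity(focal_px, baseline_mm, disparity_px):
    """Depth in mm from disparity, using Z = f * B / d.

    focal_px     -- focal length in pixels
    baseline_mm  -- distance between the two cameras in mm
    disparity_px -- horizontal pixel shift of the same scene point
    """
    if disparity_px <= 0:
        return float("inf")  # no measurable shift: point at (or beyond) infinity
    return focal_px * baseline_mm / disparity_px

# Example: f = 700 px, baseline = 60 mm, disparity = 21 px
print(depth_from_disparity(700, 60, 21))  # -> 2000.0 mm (2 m)
```

Note the inverse relationship: nearby objects produce large disparities, distant ones small disparities, which is why the usable depth range is bounded by the disparity search window.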
However, a perfectly horizontally aligned camera pair is hard to achieve in the real world. Image rectification overcomes this alignment issue: during calibration of the stereo camera pair, rectification matrices are estimated for each camera, and these matrices are used to bring the stereo images into perfect horizontal alignment.
Ok, so what?
So, you cannot use object coordinates detected on the stereo image pair directly on the estimated depth map. You have to convert the object coordinates to rectified coordinates first.
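Ignoring lens distortion, that conversion is a homography built from the camera intrinsics K, the rectifying rotation R, and the rectified intrinsics that calibration produces: p' ~ K_new · R · K⁻¹ · p. A pure-Python sketch with made-up matrix values for illustration (RAiV's actual calibration matrices come from the stereo calibration step):

```python
def k_inverse(fx, fy, cx, cy):
    """Analytic inverse of an intrinsic matrix K = [[fx,0,cx],[0,fy,cy],[0,0,1]]."""
    return [[1.0 / fx, 0.0, -cx / fx],
            [0.0, 1.0 / fy, -cy / fy],
            [0.0, 0.0, 1.0]]

def mat_mul(A, B):
    """3x3 matrix product."""
    return [[sum(A[i][k] * B[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def to_rectified(u, v, K_new, R_rect, K_inv):
    """Map an original pixel (u, v) into the rectified image via the
    homography H = K_new @ R_rect @ K^-1, then dehomogenize."""
    H = mat_mul(mat_mul(K_new, R_rect), K_inv)
    x, y, w = (H[i][0] * u + H[i][1] * v + H[i][2] for i in range(3))
    return x / w, y / w

# Illustrative intrinsics; with an identity rectifying rotation and the
# same intrinsics on both sides, a pixel maps to itself.
K_new = [[700.0, 0.0, 363.0], [0.0, 700.0, 292.0], [0.0, 0.0, 1.0]]
R_id = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(to_rectified(100.0, 50.0, K_new, R_id,
                   k_inverse(700.0, 700.0, 363.0, 292.0)))
# -> (100.0, 50.0), up to floating-point rounding
```

In practice you would apply this mapping to each corner of a detected bounding box before sampling the depth map.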
Prepare & Upload the Code
In the code below, we combine the object detection and depth map estimation code from the previous posts, then use their outputs together with the camera parameters to localize objects in 3D.
You can find this example, with all the necessary modules, in our GitHub repository. Please download the example code from the repository and upload it to RAiV via the web interface.
import time

# For accessing the data pipeline
from qCU_Data import qCUData
# For YOLO helper functions
from YOLOv8ObjectDetector import YOLOv8ObjectDetector
# For depth estimation
from StereoDepthEstimator import StereoDepthEstimator
import depthUtils

def main():
    # Create interface
    theQCUData = qCUData()

    # Initialize shared memory
    if not theQCUData.init():
        print("Failed to initialize shared memory")
        return

    # Initialize COCO classes
    COCO_CLASSES = [
        "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train",
        "truck", "boat", "traffic light", "fire hydrant", "stop sign",
        "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep",
        "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella",
        "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard",
        "sports ball", "kite", "baseball bat", "baseball glove", "skateboard",
        "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork",
        "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange",
        "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair",
        "couch", "potted plant", "bed", "dining table", "toilet", "tv",
        "laptop", "mouse", "remote", "keyboard", "cell phone", "microwave",
        "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase",
        "scissors", "teddy bear", "hair drier", "toothbrush"
    ]

    # Initialize the YOLO detector post-processor
    objDetector = YOLOv8ObjectDetector(ai_classes=COCO_CLASSES, confidence_threshold=0.5, iou_threshold=0.45)

    # Initialize OpenCV's depth estimation algorithm
    depthScale = 0.5
    depthMinMM = 250  # 50.0
    depthMaxMM = 650  # 5500.0
    depthEstimator = StereoDepthEstimator(
        scale_factor=depthScale,
        # The depth values are in millimeters ("mm")
        min_depth=depthMinMM,
        max_depth=depthMaxMM,
    )

    # Field of view of the 2.8 mm EFL lens
    lensHFov = 81.20  # degrees
    lensVFov = 69.71  # degrees

    # Enter the processing loop
    try:
        while True:
            # Get AI data
            ai_data = theQCUData.getDataAi()
            if ai_data:
                if 'error' in ai_data:
                    print(f"Error occurred: {ai_data['error']}")
                else:
                    # Postprocess the ai_data
                    detected_objects = objDetector.detect_objects(ai_data)

                    # NOTE: 1. We are processing AI output images. Stereo camera output can also be processed.
                    #       2. Due to the stereo camera setup, the output depth map size is 726x585.
                    # Process the AI processor output images
                    # (memConfig is the shared-memory configuration set up in the previous posts)
                    depthMap = depthUtils.getDepthFromStereo(ai_data, memConfig, depthEstimator)

                    # Get the AI image preprocessing parameters
                    aiHeader = ai_data['header']
                    aiPrepro = aiHeader.imPreproPrms

                    detected_objsNDepth_image = []
                    for obj in detected_objects:
                        # Convert YOLO coordinates to image coordinates
                        bbox_img_float = objDetector.yolo_to_coords_float(aiPrepro, obj['bbox'])
                        bbox_img_int = [int(coord) for coord in bbox_img_float]

                        # Direction of the bounding-box center, in degrees
                        ctrDirectionDegrees = objDetector.get_center_degree(aiPrepro, bbox_img_float, lensHFov, lensVFov)

                        # Get the depth of the object
                        obj_depth_min, obj_depth_max, obj_depth_median = depthEstimator.get_depth_of_rect(depthMap, bbox_img_int)

                        detected_objsNDepth_image.append({
                            'class_id': obj['class_id'],
                            'class_name': obj['class_name'],
                            'confidence': obj['confidence'],
                            'bbox': bbox_img_int,
                            'depth_min': float(obj_depth_min),
                            'depth_max': float(obj_depth_max),
                            'depth_med': float(obj_depth_median),
                            'ctrDirectDeg': ctrDirectionDegrees
                        })

                    # Print the 3D localization results
                    print(detected_objsNDepth_image)
            else:
                # Wait to avoid high CPU utilization
                time.sleep(0.1)
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        print("Cleanup completed")

if __name__ == "__main__":
    main()
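Each printed entry carries a median depth and a center direction, which together pin down the object's 3D position in the camera frame. A minimal sketch of that last step, assuming `ctrDirectDeg` holds (horizontal, vertical) angles in degrees from the optical axis and the depth is measured along that axis (check the SDK documentation for the exact conventions):

```python
import math

def localize_3d(depth_mm, ctr_direct_deg):
    """Convert a depth (mm) plus a (horizontal, vertical) viewing
    direction in degrees into camera-frame X, Y, Z coordinates in mm."""
    h_deg, v_deg = ctr_direct_deg
    x = depth_mm * math.tan(math.radians(h_deg))  # offset right of the optical axis
    y = depth_mm * math.tan(math.radians(v_deg))  # offset along the vertical axis
    return x, y, depth_mm

# An object dead-center in the image at 500 mm sits on the optical axis:
print(localize_3d(500.0, (0.0, 0.0)))  # -> (0.0, 0.0, 500.0)
```

Applied to every entry in `detected_objsNDepth_image`, this turns the script's per-object depth and direction into full 3D coordinates.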
Live Action: Feed the Data Pipeline
Now, to trigger the data pipeline, press the "Snapshot" button. As soon as the image is displayed on the web interface, the PC side prints the detected objects with their 3D coordinates.
What is Next?
Check our Python SDK:
RAiV Python SDK
Check our GitHub repository for sample code:
Our GitHub Repository