In this post, using only RAiV, we will detect objects, estimate the scene's depth map, and then use the depth map together with the camera parameters to localize the detected objects in 3D.

Road So Far

In the previous posts, we have shown how to:

Now it's time to combine and extend the previous code to achieve 3D localization of the detected objects.

Challenge: Image Rectification

In the previous post on depth estimation, if you examined the depth map closely, you may have noticed that it did not align perfectly with either image of the stereo pair. This is because the stereo image pair was rectified before depth estimation to make the two images perfectly horizontally aligned.

What is image rectification?

Image Rectification is the process of geometrically transforming two images (from the left and right cameras) so that they appear as if they were taken by two perfectly horizontally aligned cameras.

In depth from stereo, it is important to have horizontally aligned cameras: if the images are horizontally aligned, the disparity search can be restricted to a single horizontal line, which greatly speeds up depth map estimation.
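Once the horizontal disparity is known, depth follows directly from the focal length and the distance between the two cameras (the baseline). A minimal sketch with toy numbers (the focal length and baseline below are assumptions for illustration, not RAiV's actual calibration):

```python
# Depth from disparity for horizontally aligned (rectified) cameras:
#   depth = focal_length_px * baseline / disparity
focal_length_px = 700.0   # focal length in pixels (assumed toy value)
baseline_mm = 60.0        # distance between the two cameras (assumed toy value)

def disparity_to_depth_mm(disparity_px: float) -> float:
    """Convert a horizontal disparity (pixels) to depth (mm)."""
    return focal_length_px * baseline_mm / disparity_px

# A larger disparity means the object is closer:
near = disparity_to_depth_mm(100.0)  # 420.0 mm
far = disparity_to_depth_mm(10.0)    # 4200.0 mm
```

Note the inverse relationship: depth resolution is best for nearby objects and degrades as the disparity shrinks.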

However, perfectly horizontally aligned cameras are hard to achieve in the real world. Image rectification was introduced to overcome this alignment issue. During the calibration of the stereo camera pair, rectification matrices are estimated for each camera, and these matrices are then used to align the stereo images perfectly horizontally.

Ok, so what?

So, you cannot use the object coordinates detected on the stereo image pair directly on the estimated depth map. You have to convert the object coordinates to rectified coordinates first.
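For a distortion-free camera, this conversion is a homography built from the intrinsic matrix and the rectifying rotation. A minimal sketch with assumed toy matrices (RAiV's real values come from its stereo calibration):

```python
import numpy as np

# Toy intrinsics and rectification (assumed, distortion-free sketch)
K = np.array([[700.0,   0.0, 320.0],
              [  0.0, 700.0, 240.0],
              [  0.0,   0.0,   1.0]])
R_rect = np.eye(3)     # rectifying rotation from calibration
K_rect = K.copy()      # rectified camera matrix (first 3x3 of P)

def to_rectified(u: float, v: float) -> tuple:
    """Map a pixel from the original image into the rectified image.

    With no lens distortion, rectification is the homography
    K_rect @ R_rect @ inv(K) applied in homogeneous coordinates.
    """
    H = K_rect @ R_rect @ np.linalg.inv(K)
    x = H @ np.array([u, v, 1.0])
    return (x[0] / x[2], x[1] / x[2])

u_r, v_r = to_rectified(100.0, 50.0)  # identity rotation: maps to itself
```

For a bounding box, apply the mapping to each corner and take the axis-aligned box around the results.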

Prepare & Upload the Code

In the code below, we will combine the previous object detection and depth map estimation code, then use their outputs together with the camera parameters to localize the objects in 3D.

You can find this example with all the necessary modules in our GitHub repository. Please download the example code from the repository and upload it to RAiV via the web interface.

import time

# For accessing data pipeline
from qCU_Data import qCUData

# For Yolo Helper Functions
from YOLOv8ObjectDetector import YOLOv8ObjectDetector

# For Depth Estimation
from StereoDepthEstimator import StereoDepthEstimator
import depthUtils



def main():
    # Create interface
    theQCUData = qCUData()

    # Initialize shared memory
    if not theQCUData.init():
        print("Failed to initialize shared memory")
        return

    # Initialize COCO classes
    COCO_CLASSES = [
        "person", "bicycle", "car", "motorcycle", "airplane", "bus", "train",
        "truck", "boat", "traffic light", "fire hydrant", "stop sign",
        "parking meter", "bench", "bird", "cat", "dog", "horse", "sheep",
        "cow", "elephant", "bear", "zebra", "giraffe", "backpack", "umbrella",
        "handbag", "tie", "suitcase", "frisbee", "skis", "snowboard",
        "sports ball", "kite", "baseball bat", "baseball glove", "skateboard",
        "surfboard", "tennis racket", "bottle", "wine glass", "cup", "fork",
        "knife", "spoon", "bowl", "banana", "apple", "sandwich", "orange",
        "broccoli", "carrot", "hot dog", "pizza", "donut", "cake", "chair",
        "couch", "potted plant", "bed", "dining table", "toilet", "tv",
        "laptop", "mouse", "remote", "keyboard", "cell phone", "microwave",
        "oven", "toaster", "sink", "refrigerator", "book", "clock", "vase",
        "scissors", "teddy bear", "hair drier", "toothbrush"
    ]

    # Initialize Yolo detector post processor
    objDetector = YOLOv8ObjectDetector(ai_classes=COCO_CLASSES, confidence_threshold=0.5, iou_threshold=0.45)

    # Initialize OpenCV's depth estimation algorithms
    depthScale = 0.5
    depthMinMM = 250 #50.0
    depthMaxMM = 650 #5500.0
    depthEstimator = StereoDepthEstimator(
        scale_factor=depthScale,
        # The depth values are in millimeters ("mm")
        min_depth=depthMinMM,
        max_depth=depthMaxMM,
    )

    # Enter processing loop
    try:
        while True:
            # Get Ai data
            ai_data = theQCUData.getDataAi()
            if ai_data:
                if 'error' in ai_data:
                    print(f"Error occurred: {ai_data['error']}")
                else:
                    # Postprocess the ai_data
                    detected_objects = objDetector.detect_objects(ai_data)

                    # NOTE: 1. We are processing the AI output images. The stereo camera output can also be processed
                    #       2. Due to the stereo camera setup, the output depth map size is 726x585

                    # Process the AI processor output images
                    # (memConfig is the shared-memory configuration; we assume here
                    # that the qCUData interface exposes it -- adjust to your SDK version)
                    memConfig = theQCUData.getMemConfig()
                    depthMap = depthUtils.getDepthFromStereo(ai_data, memConfig, depthEstimator)

                    # Get the image ai preprocessing parameters
                    aiHeader = ai_data['header']
                    aiPrepro = aiHeader.imPreproPrms

                    detected_objsNDepth_image = []
                    for obj in detected_objects:

                        # Convert yolo coordinates to image coordinates
                        bbox_img_float = objDetector.yolo_to_coords_float(aiPrepro, obj['bbox'])
                        bbox_img_int = [int(coord) for coord in bbox_img_float]

                        # For lens EFL 2.8mm
                        lensHFov = 81.20  # degrees
                        lensVFov = 69.71  # degrees
                        ctrDirectionDegrees = objDetector.get_center_degree(aiPrepro, bbox_img_float, lensHFov, lensVFov)

                        # Get depth of the object
                        obj_depth_min, obj_depth_max, obj_depth_median = depthEstimator.get_depth_of_rect(depthMap, bbox_img_int)
                        detected_objsNDepth_image.append({
                            'class_id': obj['class_id'],
                            'class_name': obj['class_name'],
                            'confidence': obj['confidence'],
                            'bbox': bbox_img_int,
                            'depth_min': float(obj_depth_min),
                            'depth_max': float(obj_depth_max),
                            'depth_med': float(obj_depth_median),
                            'ctrDirectDeg': ctrDirectionDegrees
                        })

                    # Print the 3D localization results
                    print(detected_objsNDepth_image)
            else:
                # Wait to avoid high CPU utilization
                time.sleep(0.1)
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        print("Cleanup completed")


if __name__ == "__main__":
    main()
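The script above reports a median depth and the center direction angles for each object; the remaining step to full camera-frame coordinates is a small trigonometric conversion. A sketch, assuming the depth is measured along the optical axis and `ctrDirectDeg` holds (horizontal, vertical) angular offsets from the image center (check your setup's sign conventions):

```python
import math

def localize_3d(depth_mm: float, h_deg: float, v_deg: float) -> tuple:
    """Turn a median depth and the object's center direction into
    camera-frame coordinates (X right, Y down, Z forward), treating
    the depth value as the distance along the optical (Z) axis.
    """
    x = depth_mm * math.tan(math.radians(h_deg))
    y = depth_mm * math.tan(math.radians(v_deg))
    return (x, y, depth_mm)

# An object 500 mm away, 10 degrees to the right and on the horizon:
x, y, z = localize_3d(500.0, 10.0, 0.0)
```

If your depth map instead stores the straight-line range to the object, scale by the cosines of the angles before projecting.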

Live Action: Feed the Data Pipeline

Now, to trigger the data pipeline, press the "Snapshot" button. As soon as the image is displayed on the web interface, the PC side will display the detected objects with their 3D coordinates.

3D Object Localization with RAiV

What is Next?

Check our Python SDK:

RAiV Python SDK

Check our GitHub repository for sample code:

Our GitHub Repository