Learning-Based Feature Selection for ORB-SLAM3
Date: Dec 14, 2023
Summary
Integrated YOLOv5 with ORB-SLAM3 to filter out features on dynamic objects and improve data association in visual SLAM. After ORB keypoints are extracted, YOLOv5 detections identify dynamic classes; any ORB features inside those bounding boxes are dropped, and tracking proceeds as usual. We evaluate on EuRoC, KITTI, and a custom NUance driving dataset; results show near-perfect EuRoC trajectories, no degradation on KITTI 00 (matching vanilla ORB-SLAM3's ≈0.5% translational drift), and improved straight-line tracking on the low-FPS NUance run. We also compare against DROID-SLAM as a learning-based baseline.
Motivation
Dynamic objects corrupt epipolar geometry and feature correspondences, degrading visual odometry and SLAM. Our objective was to reduce the impact of moving objects in real driving scenes (occlusions, motion, calibration and distortion issues) and produce better camera trajectories, especially on NUance, a challenging real-world dataset often used at Northeastern.
Datasets
- EuRoC — indoor visual-inertial MAV sequences with ground truth; few dynamic objects; good for sanity checks.
- KITTI — outdoor driving sequences with stereo/LiDAR/IMU (2012); relatively controlled scenes where most vehicles are parked and static.
- NUance (Boston) — custom ROS bag from the NUance self-driving car (cameras + IMU + GPS); many dynamic objects (pedestrians, cyclists, traffic). Captured at ~2 FPS, which stresses SLAM during turns.
System & Method
Baseline and insertion point
We keep the standard ORB-SLAM3 pipeline (front-end tracking, back-end mapping and loop closing). After ORB features are extracted on each frame, we run YOLOv5 to obtain bounding boxes and remove any ORB features that fall inside those boxes (assumed dynamic). The rest of ORB-SLAM3 proceeds unchanged; the filtering step is sketched below.
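As a minimal sketch (in Python with OpenCV, purely illustrative; the actual integration lives in ORB-SLAM3's C++ tracking front-end), the filtering step reduces to a point-in-box test over detected keypoints. The function name and the (x1, y1, x2, y2) box format are assumptions for illustration:

```python
# Illustrative sketch of the filtering step; the real integration runs inside
# ORB-SLAM3's C++ tracking thread. `boxes` is assumed to be a list of
# (x1, y1, x2, y2) detections already restricted to dynamic classes.
import cv2

def filter_dynamic_keypoints(gray, boxes):
    """Detect ORB keypoints, then drop any that fall inside a dynamic box."""
    orb = cv2.ORB_create(nfeatures=2000)
    keypoints = orb.detect(gray, None)

    def inside_any_box(kp):
        x, y = kp.pt
        return any(x1 <= x <= x2 and y1 <= y <= y2 for (x1, y1, x2, y2) in boxes)

    static_kps = [kp for kp in keypoints if not inside_any_box(kp)]
    # Descriptors are computed only for the surviving (presumed static) keypoints.
    static_kps, descriptors = orb.compute(gray, static_kps)
    return static_kps, descriptors
```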
Detector and classes
YOLOv5 is used for its practical C++ integration and speed. We flag common dynamic classes: “person”, “car”, “motorbike”, “bus”, “train”, “truck”, “boat”, “bird”, “cat”, “dog”, “horse”, “sheep”, “cow”, “bear”; features within these boxes are discarded (a detection sketch follows).
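For reference, a hedged sketch of the detection side using YOLOv5's public torch.hub interface (the project itself runs the detector from C++; exact label strings depend on the model's class map, e.g. YOLOv5's COCO weights use “motorcycle” rather than “motorbike”):

```python
# Sketch of obtaining dynamic-class boxes with YOLOv5 via torch.hub; the class
# list mirrors the one above. Exact names must match the model's label map
# (YOLOv5's COCO weights use "motorcycle", not "motorbike"), hence both appear.
import torch

DYNAMIC_CLASSES = {"person", "car", "motorbike", "motorcycle", "bus", "train",
                   "truck", "boat", "bird", "cat", "dog", "horse", "sheep",
                   "cow", "bear"}

model = torch.hub.load("ultralytics/yolov5", "yolov5s", pretrained=True)

def dynamic_boxes(image_bgr):
    """Return (x1, y1, x2, y2) boxes whose predicted class is dynamic."""
    results = model(image_bgr[..., ::-1])  # YOLOv5 hub models expect RGB
    detections = results.pandas().xyxy[0]  # one DataFrame per input image
    keep = detections[detections["name"].isin(DYNAMIC_CLASSES)]
    return list(zip(keep.xmin, keep.ymin, keep.xmax, keep.ymax))
```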
Experiments
Setup
We tested in order of increasing complexity: EuRoC → KITTI → NUance. Evaluation was both qualitative (trajectory and map overlays) and quantitative, via L1 error against ground truth (or GPS).
Metrics
- Qualitative: visual alignment and map continuity.
- Quantitative: L1 absolute error vs. the ground-truth/GPS trajectory (see the sketch below).
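A minimal sketch of this metric, assuming the estimated and ground-truth trajectories are already time-associated and expressed in the same frame:

```python
# L1 trajectory error: mean absolute per-axis difference between matched
# estimated and ground-truth (or GPS) positions.
import numpy as np

def l1_error(est_xy, gt_xy):
    """est_xy, gt_xy: (N, 2) arrays of matched 2D positions."""
    abs_err = np.abs(np.asarray(est_xy) - np.asarray(gt_xy))
    return abs_err.mean(axis=0)  # -> [L1_x, L1_y]
```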
Results
EuRoC
- Trajectory closely follows ground truth (2D overlay).
- L1 error: X = 0.0226, Y = 0.0257 (very small residuals).

KITTI
- Vanilla ORB-SLAM3 achieves ~0.5% translational drift; the YOLO-filtered variant matches this, confirming no regression and similar compute behavior (the drift metric is sketched below).
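For context, the KITTI translational-drift figure is the relative translation error averaged over subsequences of 100–800 m of driven path. A simplified sketch, assuming trajectories as lists of 4×4 pose matrices:

```python
import numpy as np

def path_lengths(poses):
    """Cumulative driven distance (m) at each frame, from 4x4 poses."""
    d = [0.0]
    for a, b in zip(poses[:-1], poses[1:]):
        d.append(d[-1] + np.linalg.norm(b[:3, 3] - a[:3, 3]))
    return d

def translational_drift_percent(gt, est,
                                lengths=(100, 200, 300, 400, 500, 600, 700, 800)):
    """Simplified KITTI-style metric: relative translation error over length."""
    dist = path_lengths(gt)
    errors = []
    for first in range(0, len(gt), 10):  # sample start frames every 10 frames
        for L in lengths:
            last = next((i for i in range(first, len(gt))
                         if dist[i] - dist[first] >= L), None)
            if last is None:
                break  # remaining lengths are longer; no subsequence fits
            rel_gt = np.linalg.inv(gt[first]) @ gt[last]
            rel_est = np.linalg.inv(est[first]) @ est[last]
            delta = np.linalg.inv(rel_est) @ rel_gt  # residual relative motion
            errors.append(np.linalg.norm(delta[:3, 3]) / L)
    return 100.0 * float(np.mean(errors))
```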

NUance (Boston)
- Low FPS (~2 Hz) is the primary failure driver during sharp turns (only ~4 frames across a 1–2 s maneuver).
- With YOLO filtering, straight segments track better than vanilla, but 90° turns still fail because too few frames are available; the map breaks around the U-turn.

DROID-SLAM baseline
- On a cropped NUance route (avoiding the U-turn), DROID-SLAM produces a trajectory close to the GPS ground truth, underscoring the value of learned feature selection and dynamic-object handling, albeit with heavy compute (an NVIDIA Tesla V100 on the Discovery cluster).

Limitations & Lessons
1) Frame rate matters: For car-scale motion, 20–30 FPS is a practical floor; 2 FPS is insufficient at intersections.
2) Turns vs straights: Filtering helped straight-line tracking but couldn’t compensate for too few frames during rapid turns.
3) Proof of concept: on EuRoC and KITTI, the method does not break ORB-SLAM3's performance, supporting feasibility.
4) Deep SLAM trade-offs: DROID-SLAM is promising but computationally heavy for onboard self-driving deployments.
Credits & Assets
- Presented the project together with Tarun Srinivasan and Thomas Rowan.
- Demo: DROID-SLAM on NUance.