Visual-SLAM Developer Roadmap - 2026

Visual-SLAM is a special case of 'Simultaneous Localization and Mapping' in which a camera gathers the exteroceptive sensory data.
Below is a set of topics you need to understand in Visual-SLAM, ordered from absolute-beginner level to what you need to become a Visual-SLAM engineer or researcher.
Visual-SLAM is often portrayed as a rather difficult topic: many think that good C++ programming skills and a deep understanding of mathematics are necessary.
On top of that, there are not many courses for beginners, especially in non-English languages.
I made these charts to share my thoughts and experience of studying Visual-SLAM, in the hope that beginners can get a grasp of where to start.
Purpose of these Roadmaps
The purpose of these roadmaps is to give you a general overview of Visual-SLAM and to guide you if you are unsure where to start.
Note to Beginners
Acknowledge that SLAM has a relatively high entry barrier. This is not because it requires understanding difficult mathematics, but because it requires equipping yourself with many different skills. Don't feel overwhelmed: you don't need to learn everything when you are just getting started. Instead, enjoy the journey itself and progress topic by topic. The result will be very rewarding.
Table of Contents
| Level | Topic | Focus |
|---|---|---|
| 1 | Beginner | Programming, Math, Projective Geometry, Camera, Image |
| 2 | Getting Familiar | Feature matching, MVG, Optimization, Factor Graph, Mapping, Sensors |
| 3 | Monocular SLAM | Feature/Direct/Hybrid/Learning-based, Foundation Model, Neural Representation, Semantic |
| 4 | RGB-D SLAM | KinectFusion, ElasticFusion, BundleFusion, DSP-SLAM |
| 5 | Deep Learning + SLAM | A. Frontend · B. Backend · C. Systems · D. Scene Understanding |
| 6 | VIO / VINS | Filter-based (MSCKF) vs Optimization-based (VINS-Mono, OKVIS2-X) |
| 7 | Stereo SLAM | S-PTAM, ORB-SLAM2/3 stereo, LDSO |
| 8 | Collaborative SLAM | CCM-SLAM, Kimera-Multi, Swarm-SLAM |
| 9 | LiDAR & Visual-LiDAR | LOAM, FAST-LIO2, LVI-SAM, R3LIVE, FAST-LIVO2 |
| 10 | Event Camera SLAM | EVO, Ultimate-SLAM, DEVO |
| 11 | World Models & Spatial AI | GAIA-1, Cosmos, VLM/VLA, Generative 3D |
Level 1: Beginner
Programming
- C++: Pointer, OOP
- Python
- Bash/Linux: Basic terminal usage
Mathematics
- Basic Probability & Statistics: Gaussian distribution, Bayes' theorem
- Basic Linear Algebra: Vectors & Matrices, Determinant, Dot & Cross product, Rank, Inverse matrix, Transpose matrix, SVD, Eigenvalues/Eigenvectors
- Logarithm & Exponential
- Basic Calculus: Differentiation, Taylor expansion
Projective Geometry
- Pinhole camera model → Image projection
- Camera calibration: Intrinsic/Extrinsic parameters, Lens distortion
- Rigid body motion: Euler/Quaternion/Rotation Matrix, Projective space & Vanishing point, Homogeneous transformation
- Epipolar geometry → Essential & Fundamental matrix
- Triangulation
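As a first exercise for the pinhole camera model, here is a minimal sketch of image projection. It assumes the 3D point is already expressed in the camera frame and ignores lens distortion; the intrinsic values (`fx`, `fy`, `cx`, `cy`) are hypothetical numbers for a 640x480 camera, not taken from any real calibration.

```python
def project_point(point_cam, fx, fy, cx, cy):
    """Project a 3D point (camera frame, metres) to pixel coordinates (u, v)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    # Perspective division followed by the intrinsic scaling and offset.
    u = fx * x / z + cx
    v = fy * y / z + cy
    return u, v


# Hypothetical intrinsics: focal lengths in pixels, principal point at the centre.
fx, fy, cx, cy = 500.0, 500.0, 320.0, 240.0
u, v = project_point((0.2, -0.1, 2.0), fx, fy, cx, cy)
print(f"pixel = ({u:.1f}, {v:.1f})")
```

Moving the point further away (larger `z`) pulls its projection towards the principal point, which is the basic intuition behind perspective and vanishing points.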
Camera Device
- Lens, Sensor, Resolution/ISO/Aperture
Image Data
- Colour image, Resolution, Grayscale image
- Thresholding, Gaussian blur
- Corner detector: Harris corner
- Edge detector: Sobel & Canny Edge
- Stereovision, RGB-D, Disparity, Depth
Level 2: Getting Familiar with SLAM
Programming
- C++: OOP, Modern C++, Data structures & Algorithms, Compilers, CMake/Makefile/Ninja, Design patterns, OpenCV C++
- C
- Git/GitHub
- OpenCV (opencv-python)
- Python: Deep learning, Graph plots, System scripts
- Bash/Linux: ssh, CLI text editor/Vim/tmux
- Concurrency: SIMD-SSE/AVX/Neon, OpenMP, CUDA
- Mobile: Android (Java/Kotlin), iOS (Objective-C/Swift)
- Maths library: Eigen, Ceres-solver/GTSAM/g2o
- C++/Python interop: PyBind11, nanobind
- Docker
- C#: COLMAP, Unity AR, Microsoft Hololens
- CI/CD: GitHub Actions, Apache Airflow
- ROS/ROS2
- Simulation: Gazebo, Isaac Sim
Image Processing
- Keypoints → Detector/Descriptor
- SIFT, FAST, ORB, AKAZE
- Deep features: R2D2, SuperPoint
- Image pyramid, oFAST, rBRIEF
Local Feature Matching
- Brute-Force, FLANN, Kd-Tree
- LSH, Multi-probe LSH, HBST
- SuperGlue
Global Feature Matching
- Bag of Visual Words, NetVLAD
- Deep image retrieval, Hierarchical localization
Feature Tracking
- Optical flow, KLT Tracker
Multiple View Geometry
- 2D-2D correspondence: Essential/Fundamental, Homography
- 2D-3D correspondence: P3P, PnP, SVD
- 3D-3D correspondence: ICP
Outlier Rejection
- RANSAC, PROSAC, M-Estimator, MAXCON, Convex relaxation
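To make the RANSAC idea concrete, here is a minimal sketch on a toy problem: fitting a 2D line to points contaminated with gross outliers. In SLAM the model would be an essential matrix or a PnP pose rather than a line, but the loop is the same: sample a minimal set, fit a model, count inliers, keep the best. The threshold, iteration count and data are hypothetical.

```python
import random


def fit_line(p, q):
    """Line through two points as (a, b, c) with a*x + b*y + c = 0 and unit normal."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    if norm == 0.0:
        return None  # degenerate sample: both points coincide
    return a / norm, b / norm, -(a * x1 + b * y1) / norm


def ransac_line(points, threshold=0.1, iterations=200, seed=0):
    """Return the line with the largest inlier set, and that inlier set."""
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(iterations):
        model = fit_line(*rng.sample(points, 2))  # minimal sample: 2 points
        if model is None:
            continue
        a, b, c = model
        # Point-to-line distance is |a*x + b*y + c| because (a, b) is unit length.
        inliers = [pt for pt in points if abs(a * pt[0] + b * pt[1] + c) < threshold]
        if len(inliers) > len(best_inliers):
            best_model, best_inliers = model, inliers
    return best_model, best_inliers


# 20 points on y = 2x + 1, plus 5 gross outliers.
points = [(0.5 * i, 2.0 * (0.5 * i) + 1.0) for i in range(20)]
points += [(1.0, 9.0), (3.0, -4.0), (5.0, 20.0), (7.0, 0.0), (8.0, 30.0)]
model, inliers = ransac_line(points)
print(f"{len(inliers)} inliers out of {len(points)} points")
```

A real pipeline would usually refit the model on all inliers afterwards (for example with least squares or an M-estimator), which this sketch leaves out.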
Least Squares Optimisation
- Reprojection error, Bundle adjustment
- Non-linear optimisation, Lie algebra
- Lie groups: SO(3), SE(3)
- Gauss-Newton, Levenberg-Marquardt
- Pose graph optimization
- Schur complement / Sparsity
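The Gauss-Newton iteration listed above can be tried on a one-parameter toy problem before tackling bundle adjustment. This is a minimal sketch, assuming a scalar model y = exp(a·x) and noise-free data; the function name and the data are hypothetical. In BA the scalar `JtJ` becomes a large sparse matrix and the division becomes a linear solve, which is where the Schur complement comes in.

```python
import math


def gauss_newton_exp(xs, ys, a0=0.0, iterations=20):
    """Estimate a in y = exp(a * x) by Gauss-Newton on r_i = exp(a * x_i) - y_i."""
    a = a0
    for _ in range(iterations):
        jtj = 0.0  # J^T J (a scalar here because there is one parameter)
        jtr = 0.0  # J^T r
        for x, y in zip(xs, ys):
            pred = math.exp(a * x)
            r = pred - y      # residual
            j = x * pred      # d r / d a
            jtj += j * j
            jtr += j * r
        if jtj == 0.0:
            break
        step = -jtr / jtj     # solve the normal equation (J^T J) step = -J^T r
        a += step
        if abs(step) < 1e-12:
            break
    return a


xs = [0.1 * i for i in range(1, 11)]
ys = [math.exp(0.7 * x) for x in xs]  # synthetic data generated with a = 0.7
a_hat = gauss_newton_exp(xs, ys)
print(f"estimated a = {a_hat:.6f}")
```

Levenberg-Marquardt differs only in adding a damping term to `jtj` before the solve, which makes the step more conservative when the linearisation is poor.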
Motion Model
- Proprioceptive sensor: IMU, Wheel
- Odometry (pose)
Observation Model
- Exteroceptive sensor: Camera, LiDAR
- Landmark (Map)
- Joint optimisation, MLE & MAP
Factor Graph Optimisation
Mapping
- Point cloud, Occupancy grid mapping, TSDF, Surfel, Voxel map
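Of the map representations above, TSDF is the one whose update rule is easiest to show in a few lines. The sketch below is a simplified per-voxel running average in the style used by volumetric fusion systems; the truncation distance, the unit observation weight and the weight cap are assumptions chosen for illustration.

```python
def update_tsdf_voxel(tsdf, weight, sdf_measurement, truncation=0.1, max_weight=100.0):
    """Fuse one signed-distance observation (metres) into a voxel.

    Returns the new (tsdf, weight). The stored value is normalised to [-1, 1].
    """
    # Truncate and normalise the measured signed distance.
    d = max(-1.0, min(1.0, sdf_measurement / truncation))
    # Weighted running average; each observation gets weight 1 in this sketch.
    new_tsdf = (tsdf * weight + d) / (weight + 1.0)
    new_weight = min(weight + 1.0, max_weight)
    return new_tsdf, new_weight


# A voxel observed twice: 5 cm and then 3 cm in front of the surface.
tsdf, weight = 0.0, 0.0
tsdf, weight = update_tsdf_voxel(tsdf, weight, 0.05)
tsdf, weight = update_tsdf_voxel(tsdf, weight, 0.03)
print(f"tsdf = {tsdf:.2f}, weight = {weight:.0f}")
```

The surface is then extracted where the fused value crosses zero, typically by raycasting or marching cubes.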
Sensors
- Camera device: Wide/telecentric lens, Lens MTF, CCD/CMOS, Rolling/Global shutter, Exposure/ISO, Stereovision, RGB-D, Structured light, Active IR/ToF
- LiDAR → Visual-LiDAR fusion
- IMU → VIO
- RADAR → Sensor fusion, Extended Kalman filter
- Sonar
- Multi-sensor calibration: Camera-IMU, Camera-LiDAR
Evaluation
- Metrics: ATE (Absolute Trajectory Error), RPE (Relative Pose Error)
- Datasets: KITTI, TUM RGB-D, EuRoC
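For intuition about ATE, here is a minimal sketch of the RMSE part of the metric. It assumes the estimated and ground-truth positions are already time-associated and expressed in the same frame; real evaluations first align the two trajectories with a rigid or similarity transform, which is omitted here. The trajectories are made-up numbers.

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square translational error between two aligned trajectories."""
    if len(estimated) != len(ground_truth) or not estimated:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_error = 0.0
    for p_est, p_gt in zip(estimated, ground_truth):
        squared_error += sum((e - g) ** 2 for e, g in zip(p_est, p_gt))
    return math.sqrt(squared_error / len(estimated))


ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimated = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 0.2, 0.0)]  # slow sideways drift
ate = ate_rmse(estimated, ground_truth)
print(f"ATE RMSE = {ate:.4f} m")
```

RPE follows the same pattern but compares relative motions over a fixed interval instead of absolute positions, so it measures local drift rather than global consistency.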
Next Levels
Monocular SLAM · VIO/VINS · Stereo SLAM · Visual-LiDAR Fusion · RGB-D SLAM · Collaborative SLAM · Deep SLAM/Localization
Level 3: Monocular Visual-SLAM
Key Concepts
- VO vs SLAM — VO is local (no loop closure), SLAM includes global map + loop closure
- Scale ambiguity — Fundamental limitation of monocular SLAM; absolute scale is unrecoverable from images alone
- Covisibility graph — Shared map point visibility between keyframes; core data structure in ORB-SLAM
- Visual Place Recognition (VPR) — Recognising previously visited places for loop closure
- Self-supervised depth — Learning monocular depth without ground truth (Monodepth2, Godard 2019)
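The scale ambiguity above can be checked numerically. The sketch below assumes an ideal pinhole camera with identity rotation and a hypothetical focal length; it scales the whole scene and the camera translation by the same factor and shows that the image does not change.

```python
def project(point_world, cam_position, focal_px=500.0):
    """Pinhole projection for a camera at cam_position with identity rotation."""
    x, y, z = (p - c for p, c in zip(point_world, cam_position))
    return focal_px * x / z, focal_px * y / z


def scaled(vector, s):
    """Multiply every component of a 3-vector by s."""
    return tuple(s * c for c in vector)


point = (0.5, 0.2, 4.0)   # a landmark
cam = (0.3, 0.0, 0.0)     # second camera, translated along x
s = 3.0                   # arbitrary global scale factor

uv_original = project(point, cam)
uv_scaled = project(scaled(point, s), scaled(cam, s))
print(uv_original, uv_scaled)
```

Both calls give the same pixel up to floating-point rounding, so images alone cannot tell a small nearby scene from a large distant one. Stereo baselines, IMUs or known object sizes are the usual ways to fix the scale.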
Feature-based SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| Visual Odometry | Nister 2004 | Fundamental matrix, Triangulation, VO (local-only, no loop closure) |
| MonoSLAM | Davison 2007 | First real-time monocular SLAM, EKF-based, single camera, sparse 3D map, probabilistic feature initialization |
| PTAM | Klein & Murray 2007 | FAST feature, Tracking, Frontend/Backend separation, Parallel threads, Keyframe, Mapping, Bundle adjustment, Manual initialisation |
| Visual-SLAM why filter? | Strasdat 2012 | Bundle adjustment, Scale-aware BA, Motion-only BA |
| ORB-SLAM | Mur-Artal 2015 | ORB keypoint, Automatic initialisation (Homography vs Fundamental selection), Tracking thread, Sliding-window BA, Local mapping, Large-scale, Loop closure, Bag of visual words, Global optimisation, Covisibility graph, Map point management (culling, merging) |
| Pop-up SLAM | Yang 2016 | Line/Plane features |
| PL-SLAM | Pumarola 2017 | Point/Line features |
| ORB-SLAM2 | Mur-Artal 2017 | → Stereo SLAM, → RGB-D SLAM |
| CubeSLAM | Yang 2019 | Monocular 3D cuboid detection + SLAM, 9-DoF object representation |
| OpenVSLAM | Sumikura 2019 | - |
| Stella-VSLAM | (fork) 2021 | OpenVSLAM successor, license reboot |
| UcoSLAM | Munoz-Salinas 2019 | Fiducial markers |
| DeepFusion | Laidlow 2019 | - |
| ORB-SLAM3 | Campos 2020 | Monocular + Stereo + VIO, Multi-map, IMU integration |
| DXSLAM | Li 2020 | Deep features for SLAM |
| PyCuVSLAM | NVIDIA 2026 | Python + CUDA GPU-accelerated VSLAM toolkit (cuVSLAM wrapper) |
Direct SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| DTAM | Newcombe 2011 | Dense mapping, Keyframe mapping, GPGPU |
| LSD-SLAM | Engel 2014 | Photometric error minimisation, High gradient pixels/edges, Large scale, Loop closure, Pose graph optimisation |
| DSO | Engel 2016 | Photometric bundle adjustment, Sliding window BA, No loop closure/global optimisation |
| LDSO | Gao 2018 | DSO + Loop closure (BoW-based), addresses DSO's main weakness |
| CNN-SLAM | Tateno 2017 | Depth from LSD-SLAM + deep depth, Semantic label |
| DVSO | Yang 2018 | Deep single image depth estimation, StackNet |
| Basalt | Usenko 2020 | Non-linear recovery (→ primarily VIO, see Level 6) |
| D3VO | Yang 2020 | Deep single image depth estimation, Deep pose, Deep aleatoric uncertainty |
Hybrid (Feature + Direct)
| System | Author/Year | Key Concepts |
|---|---|---|
| SVO | Forster 2014 | FAST feature detection, Direct-based feature tracking, Bundle adjustment |
| SVO2 | Forster 2017 | Multi-camera/Fisheye, Probabilistic depth estimation, Direct method convergence, Sparse method |
| Stereo DSO | Wang 2017 | → Stereo SLAM |
| VI-DSO | von Stumberg 2018 | → VIO/VINS |
Learning-based SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| DROID-SLAM | Teed 2021 | Differentiable BA, dense optical flow, end-to-end learned |
| TartanVO | Wang 2021 | Generalizable visual odometry |
| DPV-SLAM / DPVO | Teed 2023 | Lightweight successor to DROID-SLAM, patch-based visual odometry |
| MAC-VO | Qiu 2024 | Learning-based stereo VO, metrics-aware covariance |
| VoT | Yugay 2025 | Visual Odometry with Transformers |
Foundation Model SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| DUSt3R | Wang 2024 | Pointmap regression from image pairs, no calibration needed |
| MASt3R | Leroy 2024 | DUSt3R + local feature matching |
| MASt3R-SLAM | Murai 2024 | Real-time dense SLAM from MASt3R |
| VGGT | Wang (Meta) 2025 | Feed-forward inference of poses, depths, pointmaps, tracks from N views (CVPR 2025 Best Paper) |
| VGGT-SLAM | 2025 | VGGT as frontend for real-time SLAM |
| VGGT-SLAM 2.0 | 2026 | Improved VGGT-SLAM |
| VGGT-Geo | 2026 | Probabilistic geometric fusion of VGGT priors for dense indoor SLAM |
| IGGT | Li 2026 | VGGT + VLM: language-grounded 3D geometry |
| AMB3R | Wang 2025 | MASt3R frontend + Transformer backend for SfM/SLAM |
| MASt3R-Fusion | WHU 2025 | MASt3R-SLAM + IMU + GNSS fusion |
SfM Tools
| System | Author/Year | Key Concepts |
|---|---|---|
| InstantSfM | 2025 | GPU-accelerated SfM pipeline, 40× faster than COLMAP |
Neural Representation SLAM
NeRF-based
| System | Author/Year | Key Concepts |
|---|---|---|
| iMAP | Sucar 2021 | First NeRF-SLAM, single MLP, real-time tracking/mapping |
| BARF | Lin 2021 | Bundle-Adjusting NeRF, coarse-to-fine positional encoding, joint pose + NeRF optimization (not full SLAM) |
| NICE-SLAM | Zhu & Peng 2022 | Hierarchical feature grid (coarse/mid/fine), scalable |
| Co-SLAM | Wang 2023 | Hash grid (Instant-NGP) + coordinate encoding, 5-10× faster than NICE-SLAM |
| ESLAM | Johari 2023 | Tri-plane representation, O(N²) vs O(N³) memory |
| Point-SLAM | Sandström 2023 | Neural point cloud based |
| NeRF-SLAM | Rosinol 2023 | NeRF + classical SLAM pipeline |
| NICER-SLAM | Zhu 2024 | RGB-only NeRF-SLAM (no depth sensor), monocular depth integration |
| vMAP | Kong 2023 | Object-level NeRF-SLAM, per-object neural fields |
| GO-SLAM | Zhang 2023 | Global optimization + NeRF-SLAM, loop closure + global BA |
3DGS-based
| System | Author/Year | Key Concepts |
|---|---|---|
| SplaTAM | Keetha 2024 | First 3DGS-SLAM, RGB-D, silhouette-guided densification |
| MonoGS | Matsuki 2024 | Monocular 3DGS-SLAM, depth network + triangulation fusion |
| GS-ICP SLAM | Ha 2024 | Gaussian-to-Gaussian ICP (Mahalanobis distance), geometric tracking |
| Photo-SLAM | Huang 2024 | Explicit geometry + implicit appearance (MLP color), anti-aliasing |
| RTG-SLAM | 2024 | Real-time focus, adaptive Gaussian budget, Jetson Orin 25 FPS |
| EGG-Fusion | ZJU 2025 | Gaussian surfel fusion, information-filter-based, real-time 24 FPS |
| Online-Mono-3DGS (MODP) | 2025 | ORB-SLAM3 tracking + Hierarchical Gaussian Management |
| ActiveSplat | Li 2025 | Active mapping with 3DGS + Voronoi-based path planning |
| Open-S3SLAM | 2026 | Open-set semantic 3DGS SLAM for smartphones (ICRA 2026) |
| LEGS | 2025 | Language Embedded Gaussian Splats, real-time language-queryable 3D |
Semantic / Language-Grounded SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| ConceptFusion | Jatavallabhula (MIT) 2023 | CLIP features fused into 3D map, open-vocabulary language queries |
| LERF | Kerr 2023 | Language Embedded Radiance Fields, DINO multi-scale, NeRF + CLIP |
| OpenScene | Peng (ETH) 2023 | Language features back-projected to 3D point clouds |
| ConceptGraphs | Gu 2023 | Open-vocabulary 3D Scene Graph, SAM + CLIP + LLM spatial relations |
| SpatialLLM | Mao 2025 | Point cloud → LLM, structured indoor modeling as Python scripts |
Also see: LEGS, Open-S3SLAM (3DGS-based section above); Open-YOLO 3D (Level 5 Object Detection)
Level 4: RGB-D Visual-SLAM
RGB-D Camera Devices
- Intel RealSense D series
- Microsoft Kinect v1/v2
- Azure Kinect DK
- Occipital Structure Core
- Orbbec Astra
GPGPU Programming
Systems
| System | Author/Year | Key Concepts |
|---|---|---|
| ICP | Besl & McKay 1992 | - |
| DTAM | Newcombe 2011 | - |
| KinectFusion | Newcombe 2011 | GPGPU, Tracking (project depth → 3D, surface normal, coarse-to-fine ICP), Mapping (volumetric integration, TSDF), Robust to small scene changes, Cannot model deformation, Map growth cubic, Room-size only |
| Double Window Optimisation | Strasdat 2011 | - |
| Kintinuous | Whelan 2012 | Volume shift, Geometric, Photometric, DBoW + SURF, Optimisation, Loop closure |
| RGBD-SLAM-V2 | Endres 2013 | Tracking (colour image, visual features, depth image, point cloud, transformation), Mapping (OctoMap 2013) |
| SLAM++ | Salas-Moreno 2013 | Object-oriented SLAM |
| DVO | Kerl 2013 | Keyframe, Depth, Direct method, Optimisation, Loop closure |
| RTAB-Map | Labbé 2014 | Loop closure, Map merge, Multi-session memory management |
| MRS-Map | Stuckler 2014 | - |
| ElasticFusion | Whelan 2015 | Active: frame-to-model tracking (photometric + geometric), joint optimisation, fused surfel-based model reconstruction · Inactive: local loop closure (model-to-model local surface, submodel separation), global loop closure (randomised fern encoding, non-rigid space deformation) |
| DynamicFusion | Newcombe 2015 | 6D motion field, Deformable scene |
| ORB-SLAM2 | Mur-Artal 2017 | Bundle adjustment, Sparse reconstruction |
| BundleFusion | Dai 2016 | Local-to-global optimisation, Sparse RGB feature, Coarse global pose estimation, Fine pose refinement (geometric + photometric) |
| SemanticFusion | McCormac 2016 | Deep Learning CNN, Deep Semantic SLAM |
| InfiniTAM v3 | Prisacariu 2017 | Tracking (scene raycast, depth image, RGB image), Relocalisation (random ferns), Mapping (TSDF reconstruction, voxel hashing, surfel reconstruction) |
| Fusion++ | McCormac & Clark 2018 | Deep Learning CNN, Mask-RCNN instance segmentation, Object-level SLAM, No prior, Object-level TSDF reconstruction |
| PointFusion / DenseFusion | Xu 2018 / Wang 2019 | RGB-D object pose estimation, Tracking, Relocalisation, Loop closure detection |
| BAD SLAM | Schops 2019 | Direct RGB-D bundle adjustment, Surfel-based map |
| RTAB-Map v2 | Labbé 2019 | RGB-D/LiDAR, Light-source detection (2016) |
| MoreFusion | Wada & Sucar 2020 | DL instance segmentation, Object-level volumetric fusion, Volumetric pose prediction, 3D scene reconstruction, Collision-based refinement, Semantic SLAM, Object pose estimation, CAD object fitting |
| NodeSLAM | Sucar & Wada 2020 | Occupancy VAE, Object-level SLAM (→ also in Level 5 Latent Representation) |
| Kimera / 3D Dynamic Scene Graph | Rosinol 2020 | Kimera-VIO, Kimera-Mesher, Kimera-PGMO, Kimera-Semantics, Kimera-DSG |
| DSP-SLAM | Wang (UCL) 2021 | DeepSDF shape prior + ORB-SLAM2, object-level dense reconstruction (mono/stereo/LiDAR) |
Level 5: Applying Deep Learning
Level 5 is organized into four pillars:
A. Frontend — learned perception components replacing hand-crafted modules
B. Backend — learned/certifiable optimization replacing classical solvers
C. Systems — end-to-end deep VO/SLAM pipelines
D. Scene Understanding — semantic, language, and relational reasoning on SLAM maps
A. Deep Frontend — Perception
Feature Detection & Matching
| System | Author/Year | Key Concepts |
|---|---|---|
| NetVLAD | Arandjelovic 2016 | VLAD, place recognition |
| SuperPoint | DeTone 2017 | Homographic Adaptation, Self-supervised, VGG encoder + detector/descriptor heads |
| HardNet | Mishchuk 2017 | Learned local descriptor |
| R2D2 | Revaud 2019 | Repeatable + Reliable detector/descriptor, explicit repeatability/reliability maps |
| Key.Net | Barroso-Laguna 2019 | Learned keypoint detector |
| HF-Net | Sarlin 2019 | Global feature, Local feature, Visual localization |
| SuperGlue | Sarlin 2020 | Self/Cross-attention GNN, Sinkhorn optimal assignment, dustbin for outliers |
| DISK | Tyszkiewicz 2020 | Policy gradient (RL) training, match success/failure as reward |
| Patch NetVLAD | Hausler 2021 | Multi-scale patch-level VLAD |
| LoFTR | Sun 2021 | Detector-free, Transformer coarse-to-fine dense matching |
| LightGlue | Lindenberger 2023 | Adaptive depth/width, 5-10× faster than SuperGlue |
| XFeat | Potje 2024 | 0.3M params, 1400 FPS (RTX 4090), 64-dim descriptor, embedded-friendly |
| RoMa | Edstedt 2024 | DINOv2 foundation feature + coarse-to-fine dense matching |
| DeDoDe | Edstedt 2024 | Decoupled detector and descriptor ("Detect, Don't Describe; Describe, Don't Detect"), trained separately |
| RoMa v2 | Edstedt 2026 | Improved RoMa |
Depth Estimation
| System | Author/Year | Key Concepts |
|---|---|---|
| MonoDepth | Godard 2016 | Left-Right photometric consistency, self-supervised |
| MiDaS | Ranftl 2020 | Multi-dataset mixing, scale-and-shift invariant loss, relative depth |
| DPT | Ranftl 2021 | Dense Prediction Transformer (ViT backbone), global context |
| ZoeDepth | Bhat 2023 | Zero-shot metric depth, Metric Bins Module |
| Metric3D | Yin 2023 | Camera intrinsic-conditioned metric depth, Canonical Camera Space |
| Depth Anything | Yang 2024 | 62M images, foundation model for monocular depth |
| Depth Anything V2 | Yang 2024 | Improved with synthetic data, better edge preservation |
| Marigold | Ke 2024 | Stable Diffusion for depth, fine detail, uncertainty via sampling |
| Align3R | Lu 2025 | Video temporal consistency, DUSt3R-based, CVPR 2025 Highlight |
| Masked Depth Modeling (LingBot-Depth) | 2026 | Fixes RGB-D failures on glass/mirrors/metal |
Optical Flow & Scene Flow
| System | Author/Year | Key Concepts |
|---|---|---|
| FlowNet | Dosovitskiy 2015 | First end-to-end deep optical flow (FlowNetSimple / FlowNetCorr) |
| FlowNet 2.0 | Ilg 2017 | Stacked networks, classical-level accuracy |
| PWC-Net | Sun 2018 | Pyramid-Warping-Cost volume, coarse-to-fine, 8.4M params |
| FlowNet3D | Liu 2019 | Point cloud scene flow, PointNet++ based |
| RAFT | Teed 2020 | All-Pairs Correlation + iterative ConvGRU update, ECCV Best Paper |
| RAFT-3D | Teed 2021 | Scene flow (3D motion) from RAFT |
| FlowFormer | Huang 2022 | Transformer on cost volume tokens, global context |
| SEA-RAFT | 2024 | Efficient RAFT variant for real-time |
Camera Pose Regression & Relocalization
| System | Author/Year | Key Concepts |
|---|---|---|
| PoseNet | Kendall 2015 | CNN-based 6-DoF pose regression (APR), GoogLeNet backbone |
| DSAC | Brachmann 2017 | Differentiable RANSAC, Scene Coordinate Regression (SCR) |
| DSAC++ | Brachmann 2018 | Self-supervision, RGB-D support |
| CNN Pose Regression Limitations | Sattler 2019 | Pose regression ≈ image retrieval performance |
| LM-Reloc | von Stumberg 2020 | Deep direct relocalization |
| DSAC* | Brachmann 2021 | Improved learning stability |
| ACE | Brachmann 2023 | Accelerated Coordinate Encoding, 5-min training per scene |
| ACE Zero | Brachmann 2024 | Zero-shot SCR, no pre-built 3D map needed |
| ACE-G | Brachmann 2024 | Generalizable SCR via cross-attention, new scenes without fine-tuning |
| ACE-SLAM | Tang 2024 | Neural implicit real-time SLAM, network weights = map |
| hloc | Sarlin 2019+ | Hierarchical Localization: coarse (NetVLAD) → fine (SuperGlue) pipeline |
Object Detection & Segmentation for SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| YOLO (v1→v11) | Redmon 2016→2024 | Real-time object detection, Ultralytics ecosystem |
| DETR | Carion 2020 | Transformer detection, anchor-free, no NMS |
| RT-DETR | Lv (Baidu) 2023 | Real-time DETR, YOLO-speed + Transformer quality |
| SAM | Kirillov 2023 | Segment Anything, prompt-based, Foundation Model |
| SAM 2 | Meta 2024 | Video segmentation, Memory Attention, temporal consistency |
| Grounding DINO | Liu 2023 | Text-prompted detection → SAM pipeline (Grounded SAM) |
| Open-YOLO 3D | Boudjoghra 2025 | 2D open-vocab detection → 3D instance seg, 16× faster |
B. Deep Backend — Optimization
Differentiable Bundle Adjustment
| System | Author/Year | Key Concepts |
|---|---|---|
| BA-Net | Tang 2019 | FPN + differentiable LM layer, end-to-end SfM (ICLR) |
| DROID-SLAM | Teed 2021 | Dense optical flow + differentiable dense BA, all-pixels reprojection |
| DPVO | Teed 2023 | Patch-based DROID-SLAM, 30+ FPS real-time |
| Theseus | Pineda (Meta) 2022 | Differentiable nonlinear optimization library (PyTorch) |
| LieTorch | Teed 2021 | Lie group operations for PyTorch (SE(3)/SO(3)) |
Certifiably Optimal Algorithms
| System | Author/Year | Key Concepts |
|---|---|---|
| SE-Sync | Rosen 2019 | Certifiable pose graph optimization via SDP + Riemannian opt |
| TEASER++ | Yang 2020 | Point cloud registration, 90%+ outlier robust, TLS + Max Clique (T-RO/RSS 2020) |
| GNC | Yang 2020 | Graduated Non-Convexity, continuation from convex → robust cost |
| QUASAR | Yang 2019 | Certifiable rotation search (Wahba problem with outliers), quaternion-based SDP relaxation + TLS cost |
Gaussian Belief Propagation & Graph Processors
| System | Author/Year | Key Concepts |
|---|---|---|
| FutureMapping 1 | Davison 2018 | Computational structure of Spatial AI, GBP for SLAM |
| FutureMapping 2 | Davison & Ortiz 2019 | GBP as core Spatial AI primitive, visual intro to GBP |
| BA on Graph Processor | Ortiz 2020 | Bundle Adjustment on Graphcore IPU, tile-based parallelism |
| DANCeRS | 2023 | GBP-based distributed consensus in robot swarms |
C. End-to-End Deep VO / SLAM Systems
Self-supervised & Learned VO
| System | Author/Year | Key Concepts |
|---|---|---|
| DeepVO | Wang 2017 | Supervised learning |
| SfM-Learner | Zhou 2017 | Unsupervised, deep depth + deep pose |
| DeMoN | Ummenhofer 2017 | Depth + Motion from two frames, encoder-decoder |
| UnDeepVO | Li 2018 | Stereo self-supervised, absolute scale recovery |
| DeepTAM | Zhou 2018 | Deep tracking and mapping, cost volume based |
| DeepV2D | Teed 2018 | Iterative depth from video, differentiable geometry layers |
| Depth from Video in the Wild | Gordon 2019 | Unconstrained video depth, learned camera intrinsics |
| Neural Ray Surfaces | Vasiljevic 2020 | Learned ray surface model, non-pinhole cameras |
| GradSLAM | Murthy 2020 | Differentiable SLAM framework (PyTorch, supports multiple SLAM backends) |
| DeepSLAM | Li 2020 | Tracking-Net, Mapping-Net, Loop-Net |
| MonoRec | Wimbauer 2021 | Self-supervised monocular 3D reconstruction, moving objects |
| TANDEM | Koestler 2021 | Real-time tracking + dense mapping via MVS depth, DSO-based |
| DROID-SLAM | Teed 2021 | Dense BA + correlation, SOTA on TartanAir/EuRoC (→ see Differentiable BA) |
| DPVO | Teed 2023 | Patch-based lightweight DROID (→ see Differentiable BA) |
Latent Representation SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| CodeSLAM | Bloesch 2018 | Depth as 128-dim latent code, photometric BA on codes + poses |
| SceneCode | Zhi 2019 | Depth + semantic in single latent code, cross-modal constraints |
| DeepFactors | Czarnowski 2020 | Probabilistic depth codes + factor graph, GPU 30+ FPS |
| NodeSLAM | Sucar 2020 | Object-level shape codes, occupancy VAE per object |
| CodeMapping | Matsuki 2021 | Sparse SLAM + learned dense mapping, hybrid approach |
Neural Rendering (reference)
NeRF/3DGS-based SLAM systems → see Level 3: Neural Representation SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| NeRF | Mildenhall 2020 | Neural Radiance Fields, novel view synthesis (foundational) |
| DIFIX3D+ | 2025 | Single-step diffusion for 3D reconstruction artifact removal (post-processing) |
D. Scene Understanding
Benchmarks & Foundations
| System | Author/Year | Key Concepts |
|---|---|---|
| EFM3D | Straub (Meta) 2024 | Egocentric Foundation Model 3D benchmark, depth/surface/semantic from ego-video |
3D Scene Graph
| System | Author/Year | Key Concepts |
|---|---|---|
| Hydra | Hughes (MIT SPARK) 2022 | Real-time hierarchical Scene Graph (mesh→objects→places→rooms→buildings) |
| Hydra-Multi | Chang 2023 | Distributed multi-robot 3D Scene Graph |
| Clio | Maggio (MIT SPARK) 2024 | Open-set task-driven Scene Graph, CLIP embeddings per node |
| Khronos | Schmid (MIT SPARK) 2024 | Spatio-temporal Scene Graph, dynamic object history tracking |
| ConceptGraphs | Gu 2023 | Open-vocabulary 3D Scene Graph, SAM + CLIP + LLM relations (→ also in L3 Semantic) |
Level 6: VIO / VINS
Key Concepts
- Tightly-coupled vs Loosely-coupled — Joint vs separate optimization of visual and inertial measurements
- Filter-based vs Optimization-based — EKF approaches vs nonlinear optimization (BA)
- IMU preintegration — On-manifold IMU integration between keyframes (Forster 2015)
- IMU noise model — Bias, random walk, Allan variance
- Observability — Yaw and global position are unobservable in VIO
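To see why the IMU noise model matters, here is a minimal sketch, assuming a stationary 1D accelerometer whose only error is a constant bias. The bias value and sample rate are hypothetical. Double integration turns the bias into a position error that grows quadratically with time, which is the drift that visual measurements have to correct in a VIO system.

```python
def integrate_position(accelerations, dt):
    """Dead-reckon 1D position from acceleration samples, starting at rest."""
    velocity = 0.0
    position = 0.0
    for a in accelerations:
        # Constant-acceleration update over one sample interval.
        position += velocity * dt + 0.5 * a * dt * dt
        velocity += a * dt
    return position


bias = 0.05        # m/s^2, hypothetical uncorrected accelerometer bias
dt = 0.005         # s, i.e. a 200 Hz IMU
num_samples = 2000  # 10 seconds of data

# The sensor is not moving, so every sample is pure bias.
drift = integrate_position([bias] * num_samples, dt)
print(f"position drift after 10 s: {drift:.2f} m")
```

The result matches the closed form 0.5 · b · t², so doubling the integration time quadruples the drift. This is why bias is estimated as part of the state rather than treated as a fixed calibration value.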
Foundations
| Resource | Author/Year | Key Concepts |
|---|---|---|
| Introduction to Inertial Navigation | Woodman 2007 | IMU fundamentals, coordinate frames, error sources (essential prerequisite) |
| IMU Preintegration on Manifold | Forster 2015 | On-manifold preintegration, bias correction without re-integration |
| Quaternion kinematics for error-state KF | Sola 2017 | Quaternion math, error-state formulation |
Filter-based
| System | Author/Year | Key Concepts |
|---|---|---|
| MSCKF | Mourikis 2007 | Multi-State Constraint KF, efficient VIO without landmarks in state |
| ROVIO | Bloesch 2015 | Robocentric VIO, direct photometric tracking + EKF |
| OpenVINS | Geneva 2020 | Open-source MSCKF, modular, extensible |
Optimization-based
| System | Author/Year | Key Concepts |
|---|---|---|
| OKVIS | Leutenegger 2015 | Keyframe-based, tightly-coupled, sliding window optimization |
| VINS-Mono | Qin 2018 | Tightly-coupled, relocalization, loop closure, pose graph optimization |
| VINS-Fusion | Qin 2019 | Stereo + GPS fusion extension |
| maplab | Schneider 2018 | Multi-session visual-inertial mapping framework |
| Kimera-VIO | Rosinol 2020 | Fast VIO frontend for Kimera pipeline, structureless vision factors |
| Basalt | Usenko 2020 | Non-linear recovery, visual-inertial odometry + mapping |
| ORB-SLAM3 | Campos 2020 | VIO mode, multi-map, IMU initialization |
| DM-VIO | von Stumberg 2022 | Delayed Marginalization VIO: direct (photometric) monocular VIO, delayed marginalization, pose graph BA for IMU initialisation |
| OKVIS2 | Leutenegger 2022 | Multi-session, improved marginalization |
| AirVO | Xu 2023 | Point-line stereo visual odometry, illumination-robust |
| OKVIS2-X | Boche & Leutenegger 2025 | Multi-sensor SLAM (Visual+Inertial+Depth+LiDAR+GNSS), dense volumetric occupancy maps, submapping for large-scale (9km+), EuRoC/Hilti22 SOTA |
Level 7: Stereo SLAM
Key Concepts
- Stereo rectification — Epipolar alignment for efficient disparity search
- Disparity vs Depth — d = f·B/Z, baseline determines depth range/accuracy
- Scale observability — Stereo provides metric scale (unlike monocular)
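The disparity relation above, d = f·B/Z, can be inverted to get metric depth, and differentiating it shows how depth uncertainty grows with distance. The sketch below assumes a rectified stereo pair; the focal length, baseline and one-pixel disparity error are hypothetical values.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Metric depth Z = f * B / d for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(depth_m, focal_px, baseline_m, disparity_error_px=1.0):
    """First-order depth uncertainty: |dZ| = Z^2 / (f * B) * |dd|."""
    return depth_m ** 2 / (focal_px * baseline_m) * disparity_error_px


focal_px, baseline_m = 700.0, 0.12  # hypothetical rig: 700 px focal length, 12 cm baseline
near = depth_from_disparity(42.0, focal_px, baseline_m)
far = depth_from_disparity(4.2, focal_px, baseline_m)
print(f"near: {near:.1f} m +/- {depth_error(near, focal_px, baseline_m):.2f} m")
print(f"far:  {far:.1f} m +/- {depth_error(far, focal_px, baseline_m):.2f} m")
```

Because the error scales with Z², a point ten times further away is a hundred times less certain for the same disparity error. A longer baseline reduces the error but shrinks the overlap between the two views at close range.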
Systems
| System | Author/Year | Key Concepts |
|---|---|---|
| S-PTAM | Pire 2017 | Stereo PTAM, ROS-compatible, real-time |
| ORB-SLAM2 (stereo) | Mur-Artal 2017 | Stereo + RGB-D modes, loop closure, relocalization |
| StereoMSCKF | Sun 2018 | MSCKF with stereo, efficient for resource-constrained platforms |
| RTAB-Map | Labbé 2019 | Multi-sensor (stereo/RGB-D/LiDAR), memory management, large-scale |
| ORB-SLAM3 (stereo) | Campos 2020 | Multi-map, Atlas, stereo + IMU |
| Stella-VSLAM | Community 2022 | Open-source fork of OpenVSLAM, stereo support |
| LDSO | Gao 2018 | Direct sparse odometry with loop closure (monocular DSO extension) |
Level 8: Collaborative / Multi-Robot SLAM
Key Concepts
- Centralized vs Decentralized — Single server vs peer-to-peer map merging
- Inter-robot loop closure — Place recognition across robots with different viewpoints
- Communication constraints — Bandwidth-limited map sharing, sparse descriptors
- Map merging — Aligning submaps from different robots into a global map
Systems
| System | Author/Year | Key Concepts |
|---|---|---|
| C2TAM | Riazuelo 2014 | Cloud-based collaborative monocular SLAM |
| CCM-SLAM | Schmuck & Chli 2019 | Centralized collaborative monocular SLAM, robust to comm failures |
| DOOR-SLAM | Lajoie 2020 | Distributed, outlier-resilient SLAM with pairwise consistency |
| Kimera-Multi | Tian 2022 | Distributed multi-robot metric-semantic SLAM, mesh reconstruction |
| Swarm-SLAM | Lajoie 2024 | Decentralized, sparse, scalable C-SLAM, supports LiDAR/stereo/RGB-D |
| CoPeD-Advancing | Stathoulopoulos 2024 | Multi-robot collaborative perception for autonomous exploration |
| maplab 2.0 | Cramariuc 2023 | Multi-session, multi-robot visual-inertial mapping |
Level 9: LiDAR & Visual-LiDAR Fusion SLAM
Key Concepts
- LiDAR-Visual-Inertial (LVI) — Triple fusion for robust outdoor SLAM
- Tightly-coupled LiDAR-camera — Joint optimization of point cloud and visual features
- Direct LiDAR-camera alignment — Photometric/geometric alignment without feature extraction
- Degradation handling — Graceful fallback when one modality fails (e.g., LiDAR in rain, camera in darkness)
- Range image — 2D projection of LiDAR scans for efficient processing (SuMa, RangeNet++)
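The range image mentioned above is a spherical projection of each LiDAR return onto a 2D grid. Here is a minimal sketch of that mapping for a single point. The image size and vertical field of view are hypothetical values for a 64-beam sensor, and points outside the vertical field of view are simply clamped to the border rows, which a real pipeline would more likely discard.

```python
import math


def point_to_range_pixel(point, width=1024, height=64,
                         fov_up_deg=3.0, fov_down_deg=-25.0):
    """Map a 3D LiDAR point (sensor frame) to (row, col, range) in a range image."""
    x, y, z = point
    r = math.sqrt(x * x + y * y + z * z)
    if r == 0.0:
        raise ValueError("point at the sensor origin has no direction")
    yaw = math.atan2(y, x)     # horizontal angle, in [-pi, pi]
    pitch = math.asin(z / r)   # vertical angle
    fov_down = math.radians(fov_down_deg)
    fov = math.radians(fov_up_deg) - fov_down
    # Normalise both angles to [0, 1], then scale to image coordinates.
    u = 0.5 * (1.0 - yaw / math.pi) * width
    v = (1.0 - (pitch - fov_down) / fov) * height
    col = min(width - 1, max(0, int(u)))
    row = min(height - 1, max(0, int(v)))
    return row, col, r


row, col, rng = point_to_range_pixel((10.0, 0.0, 0.0))
print(row, col, rng)
```

Once every point has a pixel, neighbourhood queries and normal estimation become cheap image operations, which is what makes projective ICP on range images fast.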
LiDAR / LiDAR-Inertial SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| LOAM | Zhang 2014 | LiDAR odometry and mapping (foundational), edge + planar features |
| SuMa | Behley (Bonn) 2018 | Surfel-based LiDAR SLAM, projective ICP on range images |
| SuMa++ | Chen (Bonn) 2019 | SuMa + RangeNet++ semantics, semantic ICP weighting, dynamic object filtering |
| LIO-SAM | Shan 2020 | Tightly-coupled LiDAR-inertial, factor graph, GPS fusion |
| FAST-LIO2 | Xu 2022 | Direct LiDAR-inertial, ikd-Tree, extremely fast |
| PIN-SLAM | Pan (Bonn) 2024 | Neural point cloud LiDAR SLAM, point-to-SDF registration, elastic map deformation for loop closure |
Visual-LiDAR Fusion SLAM
| System | Author/Year | Key Concepts |
|---|---|---|
| LVI-SAM | Shan 2021 | LiDAR-Visual-Inertial via factor graph, LIO-SAM + VINS-Mono |
| R3LIVE | Lin 2022 | Real-time LiDAR-Visual-Inertial, dense RGB point cloud map |
| R3LIVE++ | Lin 2023 | Improved R3LIVE with mesh reconstruction |
| FAST-LIVO | Zheng 2022 | FAST-LIO + direct visual odometry, tightly-coupled LVI |
| FAST-LIVO2 | Zheng 2024 | Improved, sequential image processing, direct photometric fusion |
| OKVIS2-X | Boche 2025 | Visual+Inertial+Depth+LiDAR+GNSS configurable (also in Level 6) |
Resources
| Resource | Key Concepts |
|---|---|
| LiDAR-Visual-Inertial Survey (Zheng 2024) | Comprehensive survey of LVI SLAM systems |
Level 10: Event Camera SLAM
Key Concepts
- Event cameras (DVS) — Asynchronous per-pixel brightness change detection, μs temporal resolution
- Advantages — HDR (140dB+), no motion blur, low latency, low power
- Challenges — No absolute intensity, sparse asynchronous output, requires new algorithms
- Event representations — Event frames, time surfaces, voxel grids, spike tensors
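The "per-pixel brightness change detection" above is usually described with a simple contrast-threshold model: a pixel fires an event whenever its log intensity has moved by at least a threshold since the last event. The sketch below simulates that idealised model for one pixel. The threshold and the intensity samples are hypothetical, and real sensors add noise, refractory periods and per-pixel threshold variation.

```python
import math


def generate_events(intensities, timestamps, contrast_threshold=0.2):
    """Simulate one event-camera pixel; returns a list of (timestamp, polarity)."""
    events = []
    reference = math.log(intensities[0])  # log intensity at the last event
    for t, intensity in zip(timestamps[1:], intensities[1:]):
        delta = math.log(intensity) - reference
        # A large change produces several events of the same polarity.
        while abs(delta) >= contrast_threshold:
            polarity = 1 if delta > 0 else -1
            events.append((t, polarity))
            reference += polarity * contrast_threshold
            delta = math.log(intensity) - reference
    return events


intensities = [100.0, 100.0, 150.0, 150.0, 80.0]
timestamps = [0.000, 0.001, 0.002, 0.003, 0.004]
events = generate_events(intensities, timestamps)
for t, polarity in events:
    print(f"t = {t:.3f} s, polarity = {polarity:+d}")
```

Note that nothing is emitted while the intensity is constant, which is why event cameras give sparse output and no absolute intensity. Accumulating the polarities over a short time window per pixel is the simplest way to build an event frame.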
Foundations
| Resource | Author/Year | Key Concepts |
|---|---|---|
| Event-based Vision Survey | Gallego 2020 | Comprehensive survey of event camera algorithms |
| Awesome-Event-based-SLAM | KwanWaiPang | Curated GitHub list of event-based SLAM papers |
Systems
| System | Author/Year | Key Concepts |
|---|---|---|
| EVO | Rebecq 2017 | Event-based Visual Odometry, 3D reconstruction from events |
| ESVO | Zhou 2021 | Event-based Stereo Visual Odometry |
| Ultimate-SLAM | Vidal 2018 | Events + frames + IMU fusion |
| EKLT | Gehrig 2020 | Event-based KLT feature tracking |
| ESVIO | Chen 2023 | Event-based Stereo VIO |
| EDS | Hidalgo-Carrió 2022 | Event-aided direct sparse odometry |
| DEVO | Klenk 2024 | Deep event-based visual odometry (DPVO-style, patch-based) |
| VIO-GO | 2025 | Event-based VIO with optimized parameters for HDR scenarios |
Level 11: World Models & Spatial AI
World Models
| System | Author/Year | Key Concepts |
|---|---|---|
| GAIA-1 | Wayve 2023 | Driving World Model, action-conditioned future scene generation |
| Sora / DiT | OpenAI 2024 | Diffusion Transformer, spacetime patches, emergent 3D understanding |
| NVIDIA Cosmos | NVIDIA 2025 | World Foundation Model platform for Physical AI, synthetic data for AV/robots |
| World Labs / Marble | Fei-Fei Li 2026 | 3D world generation from images/video/text ($1B funding) |
| WorldVLA | Alibaba 2025 | Autoregressive action world model, learns physics for action generation |
| SceneDINO | 2025 | Feed-forward unsupervised semantic scene completion |
Generative 3D
| System | Author/Year | Key Concepts |
|---|---|---|
| DreamFusion | Poole 2023 | Text-to-3D via Score Distillation Sampling (SDS) + NeRF |
Vision-Language Models (VLM)
| System | Author/Year | Key Concepts |
|---|---|---|
| CLIP | Radford (OpenAI) 2021 | Contrastive image-text pretraining, 400M pairs, zero-shot |
| SigLIP | Zhai (Google) 2023 | Sigmoid loss CLIP, more efficient, better at small model sizes |
| BLIP-2 | Li (Salesforce) 2023 | Q-Former bridges frozen LLM + image encoder |
| LLaVA | Liu 2023 | LLaMA + vision, conversational VLM |
Vision-Language-Action Models (VLA)
| System | Author/Year | Key Concepts |
|---|---|---|
| RT-2 | Brohan (DeepMind) 2023 | Robot actions as text tokens, emergent generalization |
| OpenVLA | Kim 2024 | Open-source VLA, SigLIP + Llama 7B + Action Head |
| NaVILA | 2024 | Navigation-specialized VLA, SLAM integration for localization |
Resources
| Resource | Key Concepts |
|---|---|
| Awesome-Transformer-based-SLAM | Curated GitHub list of Transformer-based SLAM methods |
Study Resources
YouTube Lecture Series
Books
| Book | Author | Key Topics |
|---|---|---|
| Introduction to Visual SLAM | Xiang Gao et al. | VO, optimization, Lie algebra, backend, loop closure; best entry-level SLAM book |
| Photogrammetric Computer Vision | Wolfgang Förstner & Bernhard Wrobel | Camera geometry, estimation, 3D reconstruction; mathematically rigorous |
| Multiple View Geometry in Computer Vision | Richard Hartley & Andrew Zisserman | Epipolar geometry, trifocal tensor, reconstruction; THE bible |
| Computer Vision: Algorithms and Applications | Richard Szeliski | Feature detection, stereo, motion, 3D; comprehensive reference (2nd ed. free PDF) |
Code & Practice
| Resource | Link |
|---|---|
| changh95/slam_lecture_codes | GitHub: hands-on SLAM lecture code collection |
Wrap Up
If you think any of the roadmaps can be improved, please open a PR with your updates or submit an issue. I will continue to improve this roadmap, so you may want to watch/star this repository and revisit it later.
Also, check out my GitHub and blog :smiley_cat:
Contribution
- Open a pull request with improvements
- Discuss ideas in issues
- Spread the word
- Reach out to me directly at hyunggi.chang95[at]gmail.com.
Discussion
To discuss any topics or ask questions, please use the issue tab.
License
This roadmap is licensed under the MIT License:
Copyright © 2026 Hyunggi Chang.
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.