Visual-SLAM Developer Roadmap - 2026


Visual-SLAM is a special case of 'Simultaneous Localization and Mapping' in which a camera gathers the exteroceptive sensor data.

Below is a set of topics you need to understand in Visual-SLAM, ranging from absolute-beginner material to what you need to become a Visual-SLAM engineer or researcher.

Visual-SLAM is often portrayed as a rather difficult topic - many think good C++ programming skills and a deep understanding of mathematics are necessary.

At the same time, few courses are aimed at beginners, especially in languages other than English.

I made these charts to share my thoughts and experience of studying Visual-SLAM, and I hope they help beginners get a grasp of where to start.

Purpose of these Roadmaps

The purpose of these roadmaps is to give you a general overview of Visual-SLAM, and to guide you if you are confused about where to start.

Note to Beginners

Acknowledge that SLAM has a relatively high entry barrier - not because it requires understanding difficult mathematics, but because it requires equipping yourself with many different skills. Don't feel overwhelmed - you don't need to learn everything if you are just getting started. Instead, enjoy the journey itself and progress topic by topic. The result will be very rewarding.

Level	Topic	Focus
1	Beginner	Programming, Math, Projective Geometry, Camera, Image
2	Getting Familiar	Feature matching, MVG, Optimization, Factor Graph, Mapping, Sensors
3	Monocular SLAM	Feature/Direct/Hybrid/Learning-based, Foundation Model, Neural Representation, Semantic
4	RGB-D SLAM	KinectFusion, ElasticFusion, BundleFusion, DSP-SLAM
5	Deep Learning + SLAM	A. Frontend · B. Backend · C. Systems · D. Scene Understanding
6	VIO / VINS	Filter-based (MSCKF) vs Optimization-based (VINS-Mono, OKVIS2-X)
7	Stereo SLAM	S-PTAM, ORB-SLAM2/3 stereo, LDSO
8	Collaborative SLAM	CCM-SLAM, Kimera-Multi, Swarm-SLAM
9	LiDAR & Visual-LiDAR	LOAM, FAST-LIO2, LVI-SAM, R3LIVE, FAST-LIVO2
10	Event Camera SLAM	EVO, Ultimate-SLAM, DEVO
11	World Models & Spatial AI	GAIA-1, Cosmos, VLM/VLA, Generative 3D

Level 1: Beginner

Programming

C++: Pointer, OOP
Python
Bash/Linux: Basic terminal usage

Mathematics

Basic Probability & Statistics: Gaussian distribution, Bayes' theorem
Basic Linear Algebra: Vectors & Matrices, Determinant, Dot & Cross product, Rank, Inverse matrix, Transpose matrix, SVD, Eigenvalues/Eigenvectors
Logarithm & Exponential
Basic Calculus: Differentiation, Taylor expansion

Projective Geometry

Pinhole camera model → Image projection
Camera calibration: Intrinsic/Extrinsic parameters, Lens distortion
Rigid body motion: Euler/Quaternion/Rotation Matrix, Projective space & Vanishing point, Homogeneous transformation
Epipolar geometry → Essential & Fundamental matrix
Triangulation
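
As a starting point for the pinhole camera model listed above, here is a minimal sketch of image projection in plain Python. It assumes an ideal pinhole camera with no lens distortion; the intrinsics `fx`, `fy`, `cx`, `cy` and the sample point are hypothetical values chosen only for illustration.

```python
def project_point(point_cam, fx, fy, cx, cy):
    """Project a 3D point given in the camera frame to pixel coordinates.

    Assumes an ideal pinhole model without lens distortion.
    """
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera (z > 0)")
    u = fx * x / z + cx  # horizontal pixel coordinate
    v = fy * y / z + cy  # vertical pixel coordinate
    return u, v


# Hypothetical intrinsics for a 640x480 camera and a point 2 m in front of it.
u, v = project_point((0.2, -0.1, 2.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(u, v)
```

Dividing by `z` is what makes the projection lose depth: every point along the same viewing ray maps to the same pixel.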

Camera Device

Lens, Sensor, Resolution/ISO/Aperture

Image Data

Colour image, Resolution, Grayscale image
Thresholding, Gaussian blur
Corner detector: Harris corner
Edge detector: Sobel & Canny Edge
Stereovision, RGB-D, Disparity, Depth

Level 2: Getting Familiar with SLAM

Programming

C++: OOP, Modern C++, Data structures & Algorithms, Compilers, CMake/Makefile/Ninja, Design patterns, OpenCV C++
C
Git/GitHub
OpenCV (opencv-python)
Python: Deep learning, Graph plots, System scripts
Bash/Linux: ssh, CLI text editor/Vim/tmux
Concurrency: SIMD-SSE/AVX/Neon, OpenMP, CUDA
Mobile: Android (Java/Kotlin), iOS (Objective-C/Swift)
Maths library: Eigen, Ceres-solver/GTSAM/g2o
C++/Python interop: PyBind11, nanobind
Docker
C#: COLMAP, Unity AR, Microsoft Hololens
CI/CD: GitHub Actions, Apache Airflow
ROS/ROS2
Simulation: Gazebo, Isaac Sim

Image Processing

Keypoints → Detector/Descriptor
- SIFT, FAST, ORB, AKAZE
- Deep features: R2D2, Superpoint
Image pyramid, oFAST, rBRIEF

Local Feature Matching

Brute-Force, FLANN, Kd-Tree
LSH, Multi-probe LSH, HBST
Superglue
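
To make brute-force matching concrete, the sketch below matches binary descriptors (as produced by ORB-style detectors) using Hamming distance. It is a simplified illustration, not the OpenCV implementation: descriptors are represented as Python integers, and the nearest/second-nearest ratio test with `ratio=0.8` is an assumed filtering step added for the example.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors."""
    return bin(a ^ b).count("1")


def match_descriptors(query, train, ratio=0.8):
    """Brute-force matching with a nearest / second-nearest ratio test.

    Returns a list of (query_index, train_index, distance) tuples.
    """
    matches = []
    for qi, q in enumerate(query):
        dists = sorted((hamming(q, t), ti) for ti, t in enumerate(train))
        if len(dists) < 2:
            continue  # the ratio test needs two candidates
        (d1, t1), (d2, _) = dists[0], dists[1]
        if d1 < ratio * d2:  # keep only clearly unambiguous matches
            matches.append((qi, t1, d1))
    return matches


# Toy 8-bit descriptors (real ORB descriptors are 256 bits).
query = [0b11110000, 0b00001111]
train = [0b11110001, 0b00001111, 0b10101010]
matches = match_descriptors(query, train)
print(matches)
```

Brute force compares every query descriptor with every train descriptor, which is why approximate structures such as FLANN, Kd-trees or LSH become attractive as the number of features grows.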

Global Feature Matching

Bag of Visual Words, NetVLAD
Deep image retrieval, Hierarchical localization

Feature Tracking

Optical flow, KLT Tracker

Multiple View Geometry

2D-2D correspondence: Essential/Fundamental, Homography
2D-3D correspondence: P3P, PnP, SVD
3D-3D correspondence: ICP
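
For the 2D-2D case, the sketch below checks the epipolar constraint x2ᵀ E x1 = 0 with E = [t]× R on a synthetic correspondence. It assumes normalized image coordinates (intrinsics already removed) and the convention X2 = R·X1 + t; the rotation, translation and 3D point are hypothetical values.

```python
import math


def skew(t):
    """Return the 3x3 cross-product matrix [t]x."""
    tx, ty, tz = t
    return [[0.0, -tz, ty], [tz, 0.0, -tx], [-ty, tx, 0.0]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(a, v):
    return [sum(a[i][k] * v[k] for k in range(3)) for i in range(3)]


def epipolar_residual(R, t, x1, x2):
    """Evaluate x2^T E x1 with E = [t]x R (zero for a perfect correspondence)."""
    E = matmul(skew(t), R)
    return sum(p * q for p, q in zip(x2, matvec(E, x1)))


# Hypothetical relative pose: small rotation about the y axis plus a translation.
angle = 0.1
R = [[math.cos(angle), 0.0, math.sin(angle)],
     [0.0, 1.0, 0.0],
     [-math.sin(angle), 0.0, math.cos(angle)]]
t = [-0.5, 0.0, 0.1]

X1 = [0.3, -0.2, 4.0]                                 # point in camera-1 frame
X2 = [r + ti for r, ti in zip(matvec(R, X1), t)]      # same point in camera-2 frame
x1 = [c / X1[2] for c in X1]                          # normalized image coordinates
x2 = [c / X2[2] for c in X2]

residual = epipolar_residual(R, t, x1, x2)
residual_bad = epipolar_residual(R, t, x1, [x2[0], x2[1] + 0.1, 1.0])
print(residual, residual_bad)
```

A correct correspondence gives a residual near zero, while the perturbed point does not, which is the property outlier-rejection schemes exploit.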

Outlier Rejection

RANSAC, PROSAC, M-Estimator, MAXCON, Convex relaxation
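
The idea behind RANSAC can be sketched on a toy problem: fitting a 2D line to points contaminated by gross outliers. This is a minimal illustration with a fixed iteration count and a hypothetical inlier threshold; in SLAM the same loop is typically run with a minimal solver for an essential matrix or a PnP pose instead of a line.

```python
import random


def fit_line(p, q):
    """Line a*x + b*y + c = 0 through p and q, normalized so a^2 + b^2 = 1."""
    (x1, y1), (x2, y2) = p, q
    a, b = y2 - y1, x1 - x2
    norm = (a * a + b * b) ** 0.5
    if norm == 0.0:
        return None  # degenerate sample (identical points)
    a, b = a / norm, b / norm
    return a, b, -(a * x1 + b * y1)


def ransac_line(points, threshold=0.1, iterations=200, seed=0):
    """Return the line with the largest consensus set and its inliers."""
    rng = random.Random(seed)
    best_line, best_inliers = None, []
    for _ in range(iterations):
        line = fit_line(*rng.sample(points, 2))  # minimal sample: two points
        if line is None:
            continue
        a, b, c = line
        inliers = [pt for pt in points
                   if abs(a * pt[0] + b * pt[1] + c) < threshold]
        if len(inliers) > len(best_inliers):
            best_line, best_inliers = line, inliers
    return best_line, best_inliers


# Synthetic data: 20 points on y = 2x + 1 plus 5 gross outliers.
points = [(0.5 * i, 2.0 * (0.5 * i) + 1.0) for i in range(20)]
points += [(1.0, 9.0), (3.0, -4.0), (5.0, 20.0), (7.0, 0.0), (8.0, 30.0)]

line, inliers = ransac_line(points)
slope = -line[0] / line[1]
print(len(inliers), slope)
```

In practice the winning model is usually refined on all of its inliers with a least-squares fit, optionally with a robust (M-estimator) cost.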

Least Squares Optimisation

Reprojection error, Bundle adjustment
Non-linear optimisation, Lie algebra
Lie groups: SO(3), SE(3)
Gauss-Newton, Levenberg-Marquardt
Pose graph optimization
Schur complement / Sparsity
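
Gauss-Newton can be illustrated on a problem far smaller than bundle adjustment. The sketch below fits the two parameters of a hypothetical model y = a·exp(b·x) to noise-free synthetic data by repeatedly solving the 2×2 normal equations (JᵀJ)δ = Jᵀr. The model, data and initial guess are assumptions made for the example; real SLAM back ends use libraries such as Ceres, g2o or GTSAM and add damping (Levenberg-Marquardt) and robust costs.

```python
import math


def gauss_newton_exp(xs, ys, a, b, iterations=20):
    """Fit y = a * exp(b * x) by Gauss-Newton, starting from the guess (a, b)."""
    for _ in range(iterations):
        # Accumulate J^T J (2x2) and J^T r (2x1) for residuals r = y - a*exp(b*x).
        h11 = h12 = h22 = g1 = g2 = 0.0
        for x, y in zip(xs, ys):
            e = math.exp(b * x)
            r = y - a * e
            ja, jb = e, a * x * e  # partial derivatives of the model w.r.t. a and b
            h11 += ja * ja
            h12 += ja * jb
            h22 += jb * jb
            g1 += ja * r
            g2 += jb * r
        det = h11 * h22 - h12 * h12
        if abs(det) < 1e-15:
            break  # normal equations are singular; give up
        # Solve the 2x2 system with the closed-form inverse.
        da = (h22 * g1 - h12 * g2) / det
        db = (h11 * g2 - h12 * g1) / det
        a, b = a + da, b + db
        if abs(da) + abs(db) < 1e-12:
            break  # update is negligible: converged
    return a, b


# Noise-free synthetic data generated with a = 2.0, b = -1.0.
xs = [0.1 * i for i in range(11)]
ys = [2.0 * math.exp(-1.0 * x) for x in xs]
a_est, b_est = gauss_newton_exp(xs, ys, a=1.0, b=0.0)
print(a_est, b_est)
```

Bundle adjustment has the same structure, only with reprojection residuals, poses on SE(3), and a JᵀJ matrix whose sparsity is exploited through the Schur complement.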

Motion Model

Proprioceptive sensor: IMU, Wheel
Odometry (pose)

Observation Model

Exteroceptive sensor: Camera, LiDAR
Landmark (Map)
Joint optimisation, MLE & MAP

Factor Graph Optimisation

Mapping

Point cloud, Occupancy grid mapping, TSDF, Surfel, Voxel map
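
As one example of how such a map is maintained, the sketch below shows the log-odds update commonly used for occupancy grid mapping, applied to a single cell. The inverse sensor model probabilities `p_hit` and `p_miss` are hypothetical values, and the prior is assumed to be 0.5 (log-odds 0).

```python
import math


def logit(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def update_cell(log_odds, hit, p_hit=0.7, p_miss=0.4):
    """Bayesian update of one cell in log-odds form (prior assumed to be 0.5)."""
    return log_odds + logit(p_hit if hit else p_miss)


def probability(log_odds):
    """Convert log-odds back to an occupancy probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


cell = 0.0  # unknown cell: probability 0.5
for observation in (True, True, True):  # three consecutive "occupied" readings
    cell = update_cell(cell, observation)
p_occupied = probability(cell)
print(p_occupied)
```

Working in log-odds turns the Bayesian update into an addition, which keeps the per-cell cost low and avoids numerical problems near probabilities of 0 and 1.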

Sensors

Camera device: Wide/telecentric lens, Lens MTF, CCD/CMOS, Rolling/Global shutter, Exposure/ISO, Stereovision, RGB-D, Structured light, Active IR/ToF
LiDAR → Visual-LiDAR fusion
IMU → VIO
RADAR → Sensor fusion, Extended Kalman filter
Sonar
Multi-sensor calibration: Camera-IMU, Camera-LiDAR

Evaluation

Metrics: ATE (Absolute Trajectory Error), RPE (Relative Pose Error)
Datasets: KITTI, TUM RGB-D, EuRoC
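
A minimal sketch of the ATE computation is shown below. It assumes the estimated and ground-truth trajectories are already time-associated and expressed in the same frame; a full evaluation first aligns them with a rigid or similarity transform, which is omitted here. The trajectories are made-up values.

```python
import math


def ate_rmse(estimated, ground_truth):
    """Root-mean-square translational error between two aligned trajectories.

    Both inputs are equally long lists of (x, y, z) positions.
    """
    if not estimated or len(estimated) != len(ground_truth):
        raise ValueError("trajectories must be non-empty and of equal length")
    squared_errors = [
        sum((e - g) ** 2 for e, g in zip(pose_est, pose_gt))
        for pose_est, pose_gt in zip(estimated, ground_truth)
    ]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy example: the estimate drifts sideways by 0.1 m per step.
ground_truth = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
estimated = [(0.0, 0.0, 0.0), (1.0, 0.1, 0.0), (2.0, 0.2, 0.0)]
ate = ate_rmse(estimated, ground_truth)
print(ate)
```

ATE summarizes global consistency, whereas RPE compares relative motions over a fixed interval and is therefore better suited to measuring drift.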

Next Levels

Monocular SLAM · VIO/VINS · Stereo SLAM · Visual-LiDAR Fusion · RGB-D SLAM · Collaborative SLAM · Deep SLAM/Localization

Level 3: Monocular Visual-SLAM

Key Concepts

VO vs SLAM — VO is local (no loop closure), SLAM includes global map + loop closure
Scale ambiguity — Fundamental limitation of monocular SLAM; absolute scale is unrecoverable from images alone
Covisibility graph — Shared map point visibility between keyframes; core data structure in ORB-SLAM
Visual Place Recognition (VPR) — Recognising previously visited places for loop closure
Self-supervised depth — Learning monocular depth without ground truth (Monodepth2, Godard 2019)
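
Scale ambiguity can be demonstrated numerically: if the whole scene and all camera positions are multiplied by the same factor, every image measurement stays the same. The sketch below assumes cameras with identity rotation and normalized image coordinates; the points, camera centres and scale factor are hypothetical.

```python
def project(point_world, cam_center):
    """Normalized image coordinates of a world point seen from cam_center.

    The camera is assumed to have identity rotation.
    """
    x, y, z = (p - c for p, c in zip(point_world, cam_center))
    return (x / z, y / z)


def scaled(vec, s):
    return tuple(s * v for v in vec)


points = [(0.3, -0.2, 4.0), (-1.0, 0.5, 6.0)]
cam_centers = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)]
s = 3.0  # arbitrary global scale factor

original = [project(p, c) for c in cam_centers for p in points]
rescaled = [project(scaled(p, s), scaled(c, s)) for c in cam_centers for p in points]

max_diff = max(abs(a - b)
               for o, r in zip(original, rescaled)
               for a, b in zip(o, r))
print(max_diff)
```

Because both configurations produce the same images, a monocular system needs extra information (an IMU, a stereo baseline, a known object size, or a learned depth prior) to recover metric scale.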

Feature-based SLAM

System	Author/Year	Key Concepts
Visual Odometry	Nister 2004	Fundamental matrix, Triangulation, VO (local-only, no loop closure)
MonoSLAM	Davison 2007	First real-time monocular SLAM, EKF-based, single camera, sparse 3D map, probabilistic feature initialization
PTAM	Klein & Murray 2007	FAST feature, Tracking, Frontend/Backend separation, Parallel threads, Keyframe, Mapping, Bundle adjustment, Manual initialisation
Visual-SLAM why filter?	Strasdat 2012	Bundle adjustment, Scale-aware BA, Motion-only BA
ORB-SLAM	Mur-Artal 2015	ORB keypoint, Automatic initialisation (Homography vs Fundamental selection), Tracking thread, Sliding-window BA, Local mapping, Large-scale, Loop closure, Bag of visual words, Global optimisation, Covisibility graph, Map point management (culling, merging)
Pop-up SLAM	Yang 2016	Line/Plane features
PL-SLAM	Pumarola 2017	Point/Line features
ORB-SLAM2	Mur-Artal 2017	→ Stereo SLAM, → RGB-D SLAM
CubeSLAM	Yang 2019	Monocular 3D cuboid detection + SLAM, 9-DoF object representation
OpenVSLAM	Sumikura 2019	—
Stella-VSLAM	(fork) 2021	OpenVSLAM successor, license reboot
UcoSLAM	Munoz-Salinas 2019	Fiducial markers
DeepFusion	Laidlow 2019	—
ORB-SLAM3	Campos 2020	Monocular + Stereo + VIO, Multi-map, IMU integration
DXSLAM	Li 2020	Deep features for SLAM
PyCuVSLAM	NVIDIA 2026	Python + CUDA GPU-accelerated VSLAM toolkit (cuVSLAM wrapper)

Direct SLAM

System	Author/Year	Key Concepts
DTAM	Newcombe 2011	Dense mapping, Keyframe mapping, GPGPU
LSD-SLAM	Engel 2014	Photometric error minimisation, High gradient pixels/edges, Large scale, Loop closure, Pose graph optimisation
DSO	Engel 2016	Photometric bundle adjustment, Sliding window BA, No loop closure/global optimisation
LDSO	Gao 2018	DSO + Loop closure (BoW-based), addresses DSO's main weakness
CNN-SLAM	Tateno 2017	Depth from LSD-SLAM + deep depth, Semantic label
DVSO	Yang 2018	Deep single image depth estimation, StackNet
Basalt	Usenko 2020	Non-linear recovery (→ primarily VIO, see Level 6)
D3VO	Yang 2020	Deep single image depth estimation, Deep pose, Deep aleatoric uncertainty

Hybrid (Feature + Direct)

System	Author/Year	Key Concepts
SVO	Forster 2014	FAST feature detection, Direct-based feature tracking, Bundle adjustment
SVO2	Forster 2017	Multi-camera/Fisheye, Probabilistic depth estimation, Direct method convergence, Sparse method
Stereo DSO	Wang 2017	→ Stereo SLAM
VI-DSO	Gao 2018	→ VIO/VINS

Learning-based SLAM

System	Author/Year	Key Concepts
DROID-SLAM	Teed 2021	Differentiable BA, dense optical flow, end-to-end learned
TartanVO	Wang 2021	Generalizable visual odometry
DPV-SLAM / DPVO	Teed 2023	DROID-SLAM lightweight, patch-based visual odometry
MAC-VO	Qu 2024	Learning-based VO, metric-aware
VoT	Yugay 2025	Visual Odometry with Transformers

Foundation Model SLAM

System	Author/Year	Key Concepts
DUSt3R	Wang 2024	Pointmap regression from image pairs, no calibration needed
MASt3R	Leroy 2024	DUSt3R + local feature matching
MASt3R-SLAM	Murai 2024	Real-time dense SLAM from MASt3R
VGGT	Wang (Meta) 2025	Feed-forward inference of poses, depths, pointmaps, tracks from N views (CVPR 2025 Best Paper)
VGGT-SLAM	2025	VGGT as frontend for real-time SLAM
VGGT-SLAM 2.0	2026	Improved VGGT-SLAM
VGGT-Geo	2026	Probabilistic geometric fusion of VGGT priors for dense indoor SLAM
IGGT	Li 2026	VGGT + VLM — language-grounded 3D geometry
AMB3R	Wang 2025	MASt3R frontend + Transformer backend for SfM/SLAM
MASt3R-Fusion	WHU 2025	MASt3R-SLAM + IMU + GNSS fusion

SfM Tools

System	Author/Year	Key Concepts
InstantSfM	2025	GPU-accelerated SfM pipeline, 40× faster than COLMAP

Neural Representation SLAM

NeRF-based

System	Author/Year	Key Concepts
iMAP	Sucar 2021	First NeRF-SLAM, single MLP, real-time tracking/mapping
BARF	Lin 2021	Bundle-Adjusting NeRF, coarse-to-fine positional encoding, joint pose+NeRF opt (not full SLAM — pose+NeRF co-optimization)
NICE-SLAM	Zhu & Peng 2022	Hierarchical feature grid (coarse/mid/fine), scalable
Co-SLAM	Wang 2023	Hash grid (Instant-NGP) + coordinate encoding, 5-10× faster than NICE-SLAM
ESLAM	Johari 2023	Tri-plane representation, O(N²) vs O(N³) memory
Point-SLAM	Sandström 2023	Neural point cloud based
NeRF-SLAM	Rosinol 2023	NeRF + classical SLAM pipeline
NICER-SLAM	Zhu 2024	RGB-only NeRF-SLAM (no depth sensor), monocular depth integration
vMAP	Kong 2023	Object-level NeRF-SLAM, per-object neural fields
GO-SLAM	Zhang 2023	Global optimization + NeRF-SLAM, loop closure + global BA

3DGS-based

System	Author/Year	Key Concepts
SplaTAM	Keetha 2024	First 3DGS-SLAM, RGB-D, silhouette-guided densification
MonoGS	Matsuki 2024	Monocular 3DGS-SLAM, depth network + triangulation fusion
GS-ICP SLAM	Yu 2024	Gaussian-to-Gaussian ICP (Mahalanobis distance), geometric tracking
Photo-SLAM	Huang 2024	Explicit geometry + implicit appearance (MLP color), anti-aliasing
RTG-SLAM	2024	Real-time focus, adaptive Gaussian budget, Jetson Orin 25 FPS
EGG-Fusion	ZJU 2025	Gaussian surfel fusion, information-filter-based, real-time 24 FPS
Online-Mono-3DGS (MODP)	2025	ORB-SLAM3 tracking + Hierarchical Gaussian Management
ActiveSplat	Li 2025	Active mapping with 3DGS + Voronoi-based path planning
Open-S3SLAM	2026	Open-set semantic 3DGS SLAM for smartphones (ICRA 2026)
LEGS	2025	Language Embedded Gaussian Splats, real-time language-queryable 3D

Semantic / Language-Grounded SLAM

System	Author/Year	Key Concepts
ConceptFusion	Jatavallabhula (MIT) 2023	CLIP features fused into 3D map, open-vocabulary language queries
LERF	Kerr 2023	Language Embedded Radiance Fields, DINO multi-scale, NeRF + CLIP
OpenScene	Peng (ETH) 2023	Language features back-projected to 3D point clouds
ConceptGraphs	Gu 2023	Open-vocabulary 3D Scene Graph, SAM + CLIP + LLM spatial relations
SpatialLLM	Mao 2025	Point cloud → LLM, structured indoor modeling as Python scripts

Also see: LEGS, Open-S3SLAM (3DGS-based section above); Open-YOLO 3D (Level 5 Object Detection)

Level 4: RGB-D Visual-SLAM

RGB-D Camera Devices

Intel RealSense D series
Microsoft Kinect v1/v2
Azure Kinect DK
Occipital Structure Core
Orbbec Astra

GPGPU Programming

CUDA, OpenGL GLSL

Systems

System	Author/Year	Key Concepts
ICP	Besl & McKay 1992	—
DTAM	Newcombe 2011	—
KinectFusion	Newcombe 2011	GPGPU, Tracking (project depth → 3D, surface normal, coarse-to-fine ICP), Mapping (volumetric integration, TSDF), Robust to small scene changes, Cannot model deformation, Map growth cubic, Room-size only
Double Window Optimisation	Strasdat 2011	—
Kintinuous	Whelan 2012	Volume shift, Geometric, Photometric, dBoW+SURF, Optimisation, Loop closure
RGBD-SLAM-V2	Endres 2013	Tracking (colour image, visual features, depth image, point cloud, transformation), Mapping (OctoMap 2013)
SLAM++	Salas-Moreno 2013	Object-oriented SLAM
DVO	Kerl 2013	Keyframe, Depth, Direct method, Optimisation, Loop closure
RTAB-Map	Labbé 2014	Loop closure, Map merge, Multi-session memory management
MRS-Map	Stuckler 2014	—
ElasticFusion	Whelan 2015	Active: frame-to-model tracking (photometric + geometric), joint optimisation, fused surfel-based model reconstruction · Inactive: local loop closure (model-to-model local surface, submodel separation), global loop closure (randomised fern encoding, non-rigid space deformation)
DynamicFusion	Newcombe 2015	6D motion field, Deformable scene
ORB-SLAM2	Mur-Artal 2016	Bundle adjustment, Sparse reconstruction
BundleFusion	Dai 2016	Local-to-global optimisation, Sparse RGB feature, Coarse global pose estimation, Fine pose refinement (geometric + photometric)
SemanticFusion	McCormac 2016	Deep Learning CNN, Deep Semantic SLAM
InfiniTAM v3	Prisacariu 2017	Tracking (scene raycast, depth image, RGB image), Relocalisation (random ferns), Mapping (TSDF reconstruction, voxel hashing, surfel reconstruction)
Fusion++	McCormac & Clark 2018	Deep Learning CNN, Mask-RCNN instance segmentation, Object-level SLAM, No prior, Object-level TSDF reconstruction
PointFusion / DenseFusion	Xu 2018 / Wang 2019	RGB-D object pose estimation, Tracking, Relocalisation, Loop closure detection
BAD SLAM	Schops 2019	Direct bundle adjustment, Deep Semantic SLAM
RTAB-Map v2	Labbé 2019	RGB-D/LiDAR, Light-source detection (2016)
MoreFusion	Wada & Sucar 2020	DL instance segmentation, Object-level volumetric fusion, Volumetric pose prediction, 3D scene reconstruction, Collision-based refinement, Semantic SLAM, Object pose estimation, CAD object fitting
NodeSLAM	Sucar & Wada 2020	Occupancy VAE, Object-level SLAM (→ also in Level 5 Latent Representation)
Kimera / 3D Dynamic Scene Graph	Rosinol 2020	Kimera-VIO, Kimera-Mesher, Kimera-PGMO, Kimera-Semantics, Kimera-DSG
DSP-SLAM	Wang (UCL) 2021	DeepSDF shape prior + ORB-SLAM2, object-level dense reconstruction (mono/stereo/LiDAR)

Level 5: Applying Deep Learning

Level 5 is organized into four pillars:
A. Frontend: learned perception components replacing hand-crafted modules
B. Backend: learned/certifiable optimization replacing classical solvers
C. Systems: end-to-end deep VO/SLAM pipelines
D. Scene Understanding: semantic, language, and relational reasoning on SLAM maps

A. Deep Frontend — Perception

Feature Detection & Matching

System	Author/Year	Key Concepts
NetVLAD	Arandjelovic 2016	VLAD, place recognition
SuperPoint	DeTone 2017	Homographic Adaptation, Self-supervised, VGG encoder + detector/descriptor heads
HardNet	Mishchuk 2017	Learned local descriptor
R2D2	Revaud 2019	Repeatable + Reliable detector/descriptor, explicit repeatability/reliability maps
KeyNet	Barroso-Laguna 2019	Learned keypoint detector
HF-Net	Sarlin 2019	Global feature, Local feature, Visual localization
SuperGlue	Sarlin 2020	Self/Cross-attention GNN, Sinkhorn optimal assignment, dustbin for outliers
DISK	Tyszkiewicz 2020	Policy gradient (RL) training, match success/failure as reward
Patch NetVLAD	Hausler 2021	Multi-scale patch-level VLAD
LoFTR	Sun 2021	Detector-free, Transformer coarse-to-fine dense matching
LightGlue	Lindenberger 2023	Adaptive depth/width, 5-10× faster than SuperGlue
XFeat	Potje 2024	0.3M params, 1400 FPS (RTX 4090), 64-dim descriptor, embedded-friendly
RoMA	Edstedt 2024	DINOv2 foundation feature + coarse-to-fine dense matching
DeDoDe	Edstedt 2024	Joint detect-and-describe in one stage
RoMA V2	Edstedt 2026	Improved RoMA

Depth Estimation

System	Author/Year	Key Concepts
MonoDepth	Godard 2016	Left-Right photometric consistency, self-supervised
MiDaS	Ranftl 2020	Multi-dataset mixing, scale-and-shift invariant loss, relative depth
DPT	Ranftl 2021	Dense Prediction Transformer (ViT backbone), global context
ZoeDepth	Bhat 2023	Zero-shot metric depth, Metric Bins Module
Metric3D	Yin 2023	Camera intrinsic-conditioned metric depth, Canonical Camera Space
Depth Anything	Yang 2024	62M images, foundation model for monocular depth
Depth Anything V2	Yang 2024	Improved with synthetic data, better edge preservation
Marigold	Ke 2024	Stable Diffusion for depth, fine detail, uncertainty via sampling
Align3r	Melou 2025	Video temporal consistency, DUSt3R-based, CVPR 2025 Highlight
Masked Depth Modeling (LingBot-Depth)	2026	Fixes RGB-D failures on glass/mirrors/metal

Optical Flow & Scene Flow

System	Author/Year	Key Concepts
FlowNet	Dosovitskiy 2015	First end-to-end deep optical flow (FlowNetS / FlowNetC)
FlowNet 2.0	Ilg 2017	Stacked networks, classical-level accuracy
PWC-Net	Sun 2018	Pyramid-Warping-Cost volume, coarse-to-fine, 8.4M params
FlowNet3D	Liu 2019	Point cloud scene flow, PointNet++ based
RAFT	Teed 2020	All-Pairs Correlation + iterative ConvGRU update, ECCV Best Paper
RAFT-3D	Teed 2021	Scene flow (3D motion) from RAFT
FlowFormer	Huang 2022	Transformer on cost volume tokens, global context
SEA-RAFT	2024	Efficient RAFT variant for real-time

Camera Pose Regression & Relocalization

System	Author/Year	Key Concepts
PoseNet	Kendall 2015	CNN-based 6-DoF pose regression (APR), GoogLeNet backbone
DSAC	Brachmann 2017	Differentiable RANSAC, Scene Coordinate Regression (SCR)
DSAC++	Brachmann 2018	Self-supervision, RGB-D support
CNN Pose Regression Limitations	Sattler 2019	Pose regression ≈ image retrieval performance
LM-Reloc	von Stumberg 2020	Deep direct relocalization
DSAC*	Brachmann 2021	Improved learning stability
ACE	Brachmann 2023	Accelerated Coordinate Encoding, 5-min training per scene
ACE Zero	Brachmann 2024	Zero-shot SCR, no pre-built 3D map needed
ACE-G	Brachmann 2024	Generalizable SCR via cross-attention, new scenes without fine-tuning
ACE-SLAM	Tang 2024	Neural implicit real-time SLAM, network weights = map
hloc	Sarlin 2019+	Hierarchical Localization: coarse (NetVLAD) → fine (SuperGlue) pipeline

Object Detection & Segmentation for SLAM

System	Author/Year	Key Concepts
YOLO (v1→v11)	Redmon 2016→2024	Real-time object detection, Ultralytics ecosystem
DETR	Carion 2020	Transformer detection, anchor-free, no NMS
RT-DETR	Lv (Baidu) 2023	Real-time DETR, YOLO-speed + Transformer quality
SAM	Kirillov 2023	Segment Anything, prompt-based, Foundation Model
SAM 2	Meta 2024	Video segmentation, Memory Attention, temporal consistency
Grounding DINO	Liu 2023	Text-prompted detection → SAM pipeline (Grounded SAM)
Open-YOLO 3D	Benseddik 2025	2D open-vocab detection → 3D instance seg, 16× faster

B. Deep Backend — Optimization

Differentiable Bundle Adjustment

System	Author/Year	Key Concepts
BA-Net	Tang 2019	FPN + differentiable LM layer, end-to-end SfM (ICLR)
DROID-SLAM	Teed 2021	Dense optical flow + differentiable dense BA, all-pixels reprojection
DPVO	Teed 2023	Patch-based DROID-SLAM, 30+ FPS real-time
Theseus	Pineda (Meta) 2022	Differentiable nonlinear optimization library (PyTorch)
Lietorch	Teed 2021	Lie group operations for PyTorch (SE(3)/SO(3))

Certifiably Optimal Algorithms

System	Author/Year	Key Concepts
SE-Sync	Rosen 2019	Certifiable pose graph optimization via SDP + Riemannian opt
TEASER++	Yang 2020	Point cloud registration, 90%+ outlier robust, TLS + Max Clique (T-RO/RSS 2020)
GNC	Yang 2020	Graduated Non-Convexity, continuation from convex → robust cost
QUASAR	Yang 2019	Certifiable rotation search (Wahba problem with outliers), SDP relaxation + robust cost

Gaussian Belief Propagation & Graph Processors

System	Author/Year	Key Concepts
FutureMapping 1	Davison 2018	Computational structure of Spatial AI, GBP for SLAM
FutureMapping 2	Ortiz 2019	GBP as core Spatial AI primitive, visual intro to GBP
BA on Graph Processor	Ortiz 2020	Bundle Adjustment on Graphcore IPU, tile-based parallelism
DANCeRS	2023	GBP-based distributed consensus in robot swarms

C. End-to-End Deep VO / SLAM Systems

Self-supervised & Learned VO

System	Author/Year	Key Concepts
DeepVO	Wang 2017	Supervised learning
SfM-Learner	Zhou 2017	Unsupervised, deep depth + deep pose
DeMoN	Ummenhofer 2017	Depth + Motion from two frames, encoder-decoder
UndeepVO	Li 2018	Stereo self-supervised, absolute scale recovery
DeepTAM	Zhou 2018	Deep tracking and mapping, cost volume based
DeepV2D	Teed 2018	Iterative depth from video, differentiable geometry layers
Depth from Video in the Wild	Gordon 2019	Unconstrained video depth, learned camera intrinsics
Neural Ray Surfaces	Vasiljevic 2020	Learned ray surface model, non-pinhole cameras
GradSLAM	Murthy 2020	Differentiable SLAM framework (PyTorch, supports multiple SLAM backends)
DeepSLAM	Wang 2020	TrackingNet, MappingNet, LoopNet
MonoRec	Wimbauer 2021	Self-supervised monocular 3D reconstruction, moving objects
TANDEM	Koestler 2021	Real-time tracking + dense mapping via MVS depth, DSO-based
DROID-SLAM	Teed 2021	Dense BA + correlation, SOTA on TartanAir/EuRoC (→ see Differentiable BA)
DPVO	Teed 2023	Patch-based lightweight DROID (→ see Differentiable BA)

Latent Representation SLAM

System	Author/Year	Key Concepts
CodeSLAM	Bloesch 2018	Depth as 128-dim latent code, photometric BA on codes + poses
SceneCode	Zhi 2019	Depth + semantic in single latent code, cross-modal constraints
DeepFactors	Czarnowski 2020	Probabilistic depth codes + factor graph, GPU 30+ FPS
NodeSLAM	Sucar 2020	Object-level DeepSDF codes, occupancy VAE per object
CodeMapping	Shao 2021	Sparse SLAM + learned dense mapping, hybrid approach

Neural Rendering (reference)

NeRF/3DGS-based SLAM systems → see Level 3: Neural Representation SLAM

System	Author/Year	Key Concepts
NeRF	Mildenhall 2020	Neural Radiance Fields, novel view synthesis (foundational)
DIFIX3D+	2026	Single-step diffusion for 3D reconstruction artifact removal (post-processing)

D. Scene Understanding

Benchmarks & Foundations

System	Author/Year	Key Concepts
EFM3D	Straub (Meta) 2024	Egocentric Foundation Model 3D benchmark, depth/surface/semantic from ego-video

3D Scene Graph

System	Author/Year	Key Concepts
Hydra	Hughes (MIT SPARK) 2022	Real-time hierarchical Scene Graph (mesh→objects→places→rooms→buildings)
Hydra-Multi	Hughes 2023	Distributed multi-robot 3D Scene Graph
Clio	Maggio (MIT SPARK) 2024	Open-set task-driven Scene Graph, CLIP embeddings per node
Khronos	Schmid (MIT SPARK) 2024	Spatio-temporal Scene Graph, dynamic object history tracking
ConceptGraphs	Gu 2023	Open-vocabulary 3D Scene Graph, SAM + CLIP + LLM relations (→ also in L3 Semantic)

Level 6: VIO / VINS

Key Concepts

Tightly-coupled vs Loosely-coupled — Joint vs separate optimization of visual and inertial measurements
Filter-based vs Optimization-based — EKF approaches vs nonlinear optimization (BA)
IMU preintegration — On-manifold IMU integration between keyframes (Forster 2015)
IMU noise model — Bias, random walk, Allan variance
Observability — Yaw and global position are unobservable in VIO
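
To see why the IMU bias must be modelled and estimated, the sketch below dead-reckons a stationary 1D accelerometer whose only signal is a constant bias. It is a deliberately simplified assumption: no noise, no gravity, no orientation. The bias value and time step are hypothetical.

```python
def dead_reckon_position(accel_bias, duration_s, dt=0.01):
    """Integrate a stationary accelerometer reading that contains only a bias.

    Returns the position error accumulated after duration_s seconds.
    """
    velocity = 0.0
    position = 0.0
    steps = int(round(duration_s / dt))
    for _ in range(steps):
        # Constant-acceleration update over one time step.
        position += velocity * dt + 0.5 * accel_bias * dt * dt
        velocity += accel_bias * dt
    return position


drift_10s = dead_reckon_position(accel_bias=0.05, duration_s=10.0)
drift_20s = dead_reckon_position(accel_bias=0.05, duration_s=20.0)
print(drift_10s, drift_20s)
```

Doubling the integration time roughly quadruples the position error (0.5·b·t²), which is why inertial-only navigation degrades quickly and why VIO systems estimate the bias jointly with the pose.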

Foundations

Resource	Author/Year	Key Concepts
Introduction to Inertial Navigation	Woodman 2007	IMU fundamentals, coordinate frames, error sources — essential prerequisite
IMU Preintegration on Manifold	Forster 2015	On-manifold preintegration, bias correction without re-integration
Quaternion kinematics for error-state KF	Sola 2017	Quaternion math, error-state formulation

Filter-based

System	Author/Year	Key Concepts
MSCKF	Mourikis 2007	Multi-State Constraint KF, efficient VIO without landmarks in state
ROVIO	Bloesch 2015	Robocentric VIO, direct photometric tracking + EKF
OpenVINS	Geneva 2020	Open-source MSCKF, modular, extensible

Optimization-based

System	Author/Year	Key Concepts
OKVIS	Leutenegger 2015	Keyframe-based, tightly-coupled, sliding window optimization
VINS-Mono	Qin 2018	Tightly-coupled, relocalization, loop closure, pose graph optimization
VINS-Fusion	Qin 2019	Stereo + GPS fusion extension
MAPLAB	Schneider 2018	Multi-session visual-inertial mapping framework
Kimera-VIO	Rosinol 2020	Fast VIO frontend for Kimera pipeline, structureless vision factors
Basalt	Usenko 2020	Non-linear recovery, visual-inertial odometry + mapping
ORB-SLAM3	Campos 2020	VIO mode, multi-map, IMU initialization
DM-VIO	von Stumberg 2022	Delayed Marginalization VIO (monocular), delayed marginalization
OKVIS2	Leutenegger 2022	Multi-session, improved marginalization
AirVO	Xu 2023	Point-line VIO, illumination-robust
OKVIS2-X	Boche & Leutenegger 2025	Multi-sensor SLAM (Visual+Inertial+Depth+LiDAR+GNSS), dense volumetric occupancy maps, submapping for large-scale (9km+), EuRoC/Hilti22 SOTA

Level 7: Stereo SLAM

Key Concepts

Stereo rectification — Epipolar alignment for efficient disparity search
Disparity vs Depth — d = f·B/Z, baseline determines depth range/accuracy
Scale observability — Stereo provides metric scale (unlike monocular)
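
The relation d = f·B/Z above can be turned around to recover depth from disparity. The sketch below does that and also evaluates the first-order depth uncertainty ΔZ ≈ Z²·Δd/(f·B), which follows from differentiating Z = f·B/d. The focal length, baseline and disparity values are hypothetical, and a rectified stereo pair is assumed.

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Depth Z = f * B / d for a rectified stereo pair."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive")
    return focal_px * baseline_m / disparity_px


def depth_error(depth_m, focal_px, baseline_m, disparity_err_px=1.0):
    """First-order depth error caused by a disparity error of disparity_err_px."""
    return depth_m ** 2 * disparity_err_px / (focal_px * baseline_m)


focal_px, baseline_m = 700.0, 0.12  # hypothetical stereo rig
z = depth_from_disparity(42.0, focal_px, baseline_m)
err_near = depth_error(2.0, focal_px, baseline_m)
err_far = depth_error(10.0, focal_px, baseline_m)
print(z, err_near, err_far)
```

The error grows with the square of the depth, so a wider baseline or longer focal length is needed to keep distant points accurate.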

Systems

System	Author/Year	Key Concepts
S-PTAM	Pire 2017	Stereo PTAM, ROS-compatible, real-time
ORB-SLAM2 (stereo)	Mur-Artal 2016	Stereo + RGB-D modes, loop closure, relocalization
StereoMSCKF	Sun 2018	MSCKF with stereo, efficient for resource-constrained platforms
RTAB-Map	Labbé 2019	Multi-sensor (stereo/RGB-D/LiDAR), memory management, large-scale
ORB-SLAM3 (stereo)	Campos 2020	Multi-map, Atlas, stereo + IMU
Stella-VSLAM	Community 2022	Open-source fork of OpenVSLAM, stereo support
LDSO	Gao 2018	Direct sparse odometry with loop closure (DSO extension; monocular, see Level 3)

Level 8: Collaborative / Multi-Robot SLAM

Key Concepts

Centralized vs Decentralized — Single server vs peer-to-peer map merging
Inter-robot loop closure — Place recognition across robots with different viewpoints
Communication constraints — Bandwidth-limited map sharing, sparse descriptors
Map merging — Aligning submaps from different robots into a global map

Systems

System	Author/Year	Key Concepts
C2TAM	Riazuelo 2014	Cloud-based collaborative monocular SLAM
CCM-SLAM	Schmuck & Chli 2019	Centralized collaborative monocular SLAM, robust to comm failures
DOOR-SLAM	Lajoie 2020	Distributed, outlier-resilient SLAM with pairwise consistency
Kimera-Multi	Tian 2022	Distributed multi-robot metric-semantic SLAM, mesh reconstruction
Swarm-SLAM	Lajoie 2024	Decentralized, sparse, scalable C-SLAM, supports LiDAR/stereo/RGB-D
CoPeD-Advancing	Stathoulopoulos 2024	Multi-robot collaborative perception for autonomous exploration
MAPLAB 2.0	Cramariuc 2023	Multi-session, multi-robot visual-inertial mapping

Level 9: LiDAR & Visual-LiDAR Fusion SLAM

Key Concepts

LiDAR-Visual-Inertial (LVI) — Triple fusion for robust outdoor SLAM
Tightly-coupled LiDAR-camera — Joint optimization of point cloud and visual features
Direct LiDAR-camera alignment — Photometric/geometric alignment without feature extraction
Degradation handling — Graceful fallback when one modality fails (e.g., LiDAR in rain, camera in darkness)
Range image — 2D projection of LiDAR scans for efficient processing (SuMa, RangeNet++)

LiDAR / LiDAR-Inertial SLAM

System	Author/Year	Key Concepts
LOAM	Zhang 2014	LiDAR odometry and mapping (foundational), edge + planar features
SuMa	Behley (Bonn) 2018	Surfel-based LiDAR SLAM, projective ICP on range images
SuMa++	Chen (Bonn) 2019	SuMa + RangeNet++ semantics, semantic ICP weighting, dynamic object filtering
LIO-SAM	Shan 2020	Tightly-coupled LiDAR-inertial, factor graph, GPS fusion
FAST-LIO2	Xu 2022	Direct LiDAR-inertial, ikd-Tree, extremely fast
PIN-SLAM	Pan (Bonn) 2024	Neural point cloud LiDAR SLAM, point-to-SDF registration, elastic map deformation for loop closure

Visual-LiDAR Fusion SLAM

System	Author/Year	Key Concepts
LVI-SAM	Shan 2021	LiDAR-Visual-Inertial via factor graph, LIO-SAM + VINS-Mono
R3LIVE	Lin 2022	Real-time LiDAR-Visual-Inertial, dense RGB point cloud map
R3LIVE++	Lin 2023	Improved R3LIVE with mesh reconstruction
FAST-LIVO	Zheng 2022	FAST-LIO + direct visual odometry, tightly-coupled LVI
FAST-LIVO2	Zheng 2024	Improved, sequential image processing, direct photometric fusion
OKVIS2-X	Boche 2025	Visual+Inertial+Depth+LiDAR+GNSS configurable (also in Level 6)

Resources

Resource	Key Concepts
LiDAR-Visual-Inertial Survey (Zheng 2024)	Comprehensive survey of LVI SLAM systems

Level 10: Event Camera SLAM

Key Concepts

Event cameras (DVS) — Asynchronous per-pixel brightness change detection, μs temporal resolution
Advantages — HDR (140dB+), no motion blur, low latency, low power
Challenges — No absolute intensity, sparse asynchronous output, requires new algorithms
Event representations — Event frames, time surfaces, voxel grids, spike tensors
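
The simplest of these representations, the event frame, can be sketched in a few lines. The example assumes each event is a `(timestamp, x, y, polarity)` tuple with polarity +1 or -1; this tuple layout and the sample events are assumptions for illustration, not the format of any particular camera driver.

```python
def accumulate_events(events, width, height, t_start, t_end):
    """Sum event polarities per pixel over the time window [t_start, t_end)."""
    frame = [[0] * width for _ in range(height)]
    for t, x, y, polarity in events:
        inside_window = t_start <= t < t_end
        inside_image = 0 <= x < width and 0 <= y < height
        if inside_window and inside_image:
            frame[y][x] += 1 if polarity > 0 else -1
    return frame


# Hypothetical events: (timestamp in seconds, x, y, polarity).
events = [
    (0.001, 1, 0, +1),
    (0.002, 1, 0, +1),
    (0.003, 2, 1, -1),
    (0.020, 0, 0, +1),  # falls outside the 10 ms window below
]
frame = accumulate_events(events, width=3, height=2, t_start=0.0, t_end=0.010)
print(frame)
```

Accumulating into frames makes events compatible with conventional image pipelines, at the cost of discarding the fine timing that time surfaces and voxel grids try to preserve.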

Foundations

Resource	Author/Year	Key Concepts
Event-based Vision Survey	Gallego 2020	Comprehensive survey of event camera algorithms
Awesome-Event-based-SLAM	KwanWaiPang	Curated GitHub list of event-based SLAM papers

Systems

System	Author/Year	Key Concepts
EVO	Rebecq 2017	Event-based Visual Odometry, 3D reconstruction from events
ESVO	Zhou 2021	Event-based Stereo Visual Odometry
Ultimate-SLAM	Vidal 2018	Events + frames + IMU fusion
EKLT	Gehrig 2020	Event-based KLT feature tracking
ESVIO	Chen 2023	Event-based Stereo VIO
EDS	Hidalgo-Carrió 2022	Event-aided direct sparse odometry
DEVO	Klenk 2024	Deep event-based visual odometry (DROID-SLAM style)
VIO-GO	2025	Event-based VIO with optimized parameters for HDR scenarios

Level 11: World Models & Spatial AI

World Models

System	Author/Year	Key Concepts
GAIA-1	Wayve 2023	Driving World Model, action-conditioned future scene generation
Sora / DiT	OpenAI 2024	Diffusion Transformer, spacetime patches, emergent 3D understanding
NVIDIA Cosmos	NVIDIA 2025	World Foundation Model platform for Physical AI, synthetic data for AV/robots
World Labs / Marble	Fei-Fei Li 2026	3D world generation from images/video/text ($1B funding)
WorldVLA	Alibaba 2025	Autoregressive action world model, learns physics for action generation
SceneDINO	2025	Feed-forward unsupervised semantic scene completion

Generative 3D

System	Author/Year	Key Concepts
DreamFusion	Poole 2023	Text-to-3D via Score Distillation Sampling (SDS) + NeRF

Vision-Language Models (VLM)

System	Author/Year	Key Concepts
CLIP	Radford (OpenAI) 2021	Contrastive image-text pretraining, 400M pairs, zero-shot
SigLIP	Zhai (Google) 2023	Sigmoid loss CLIP, more efficient, better at small model sizes
BLIP-2	Li (Salesforce) 2023	Q-Former bridges frozen LLM + image encoder
LLaVA	Liu 2023	LLaMA + vision, conversational VLM

Vision-Language-Action Models (VLA)

System	Author/Year	Key Concepts
RT-2	Brohan (DeepMind) 2023	Robot actions as text tokens, emergent generalization
OpenVLA	Kim 2024	Open-source VLA, SigLIP + Llama 7B + Action Head
Navila	2024	Navigation-specialized VLA, SLAM integration for localization

Resources

Resource	Key Concepts
Awesome-Transformer-based-SLAM	Curated GitHub list of Transformer-based SLAM methods

Study Resources

YouTube Lecture Series

Lecture	Instructor	Link
SLAM & Photogrammetry	Cyrill Stachniss (Uni Bonn)	YouTube Playlist
First Principles of Computer Vision	Shree Nayar (Columbia)	YouTube Channel
Multiple View Geometry	Daniel Cremers (TU Munich)	YouTube Playlist

Books

Book	Author	Key Topics
Introduction to Visual SLAM	Xiang Gao et al.	VO, optimization, Lie algebra, backend, loop closure — best entry-level SLAM book
Photogrammetric Computer Vision	Wolfgang Förstner & Bernhard Wrobel	Camera geometry, estimation, 3D reconstruction — mathematically rigorous
Multiple View Geometry in Computer Vision	Richard Hartley & Andrew Zisserman	Epipolar geometry, trifocal tensor, reconstruction — THE bible
Computer Vision: Algorithms and Applications	Richard Szeliski	Feature detection, stereo, motion, 3D — comprehensive reference (2nd ed. free PDF)

Code & Practice

Resource	Link
changh95/slam_lecture_codes	GitHub — Hands-on SLAM lecture code collection

Wrap Up

If you think any of the roadmaps can be improved, please open a PR with your updates or submit an issue. I will also continue to improve this, so you might want to watch/star this repository and revisit it later.

Also, check out my GitHub and blog :smiley_cat:

Contribution

Open a pull request with improvements
Discuss ideas in issues
Spread the word
Reach out to me directly at hyunggi.chang95[at]gmail.com.

Discussion

To discuss any topics or ask questions, please use the issue tab.

License

This project is licensed under the MIT License:

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
