This article describes a method for registering 2D images from cameras with 3D LiDAR point clouds.
Problem Overview
Cross-modal registration between 2D camera images and 3D LiDAR point clouds is a key task in computer vision and robotics. Previous methods estimate 2D-3D correspondences by matching point and pixel patterns learned by neural networks, then apply Perspective-n-Point (PnP) as a post-processing step to estimate the rigid transformation. These approaches struggle to map points and pixels robustly into a shared latent space, because the two have very different characteristics and their patterns are learned in different ways. They also cannot supervise the transformation directly, because PnP is non-differentiable, which leads to unstable registration results. To address these issues, this work proposes learning a structured cross-modal latent space that represents pixel features and 3D features, trained through a differentiable probabilistic PnP solver.
Approach Summary
The method designs a triplet network to learn VoxelPoint-to-Pixel matching, using voxels and points to represent 3D elements and pixels to represent 2D elements. The voxel and pixel branches are based on CNNs that perform convolutions on voxel/pixel grids, and an additional point branch is integrated to recover information lost during voxelization. The framework is trained end-to-end by directly supervising a probabilistic PnP solver. To capture distinctive cross-modal patterns, a novel loss with adaptive weighting is designed to describe cross-modal feature relationships. Experiments on the KITTI and nuScenes datasets show significant improvements over state-of-the-art methods.
Main Contributions
- Propose a novel framework for image-to-point-cloud registration that learns a structured cross-modal latent space with adaptive-weighted optimization and is trained end-to-end through a differentiable PnP solver.
- Represent 3D elements as a combination of voxels and points to bridge the modality gap between point clouds and pixels, and design a triplet network to learn VoxelPoint-to-Pixel matching.
- Demonstrate improved performance through extensive experiments on KITTI and nuScenes.
Method Overview
This section first details the VoxelPoint-to-Pixel matching used to learn the structured cross-modal latent space. It then introduces a novel loss with adaptive weighting for learning distinctive cross-modal patterns. Finally, it describes the differentiable probabilistic PnP solver that enables end-to-end training. The overall method is illustrated in Figure 1.

Figure 1: Overview of the method. Given an uncalibrated image I and point cloud P as input, (a) sparse voxels are generated from the sparse point cloud and a triplet network extracts patterns from three modalities. 2D patterns are represented as pixel features while 3D patterns are represented as a combination of voxel and point features, learned with an adaptive-weighted loss for distinctive 2D-3D cross-modal patterns. (b) Cross-modal feature fusion is used to detect intersection regions in 2D/3D space. (c) Outlier regions are removed based on intersection detection, and 2D-3D correspondences are built by matching 2D-3D features. A probabilistic PnP predicts a distribution over extrinsic poses and the framework is supervised end-to-end with the ground-truth pose distribution.
VoxelPoint-to-Pixel Matching Framework
- The framework uses a triplet network comprising Voxel, Point, and Pixel branches to extract 2D and 3D features.
- Sparse convolution is used in the voxel branch to efficiently capture spatial patterns.
- A point branch inspired by PointNet++ is introduced to recover detailed 3D patterns lost during voxelization.
- The pixel branch is based on a convolutional U-Net to extract global 2D image features.
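As a rough illustration of how sparse voxels could be generated from a point cloud before it enters the voxel branch, the sketch below quantizes points onto an integer grid and keeps only the occupied cells. The function name `voxelize`, the voxel size, and the choice to average the points inside each voxel are assumptions made for this example, not details taken from the method.

```python
import numpy as np

def voxelize(points, voxel_size=0.5):
    """Quantize an (N, 3) point cloud into sparse voxels (illustrative only).

    Returns the integer coordinates of occupied voxels, the mean of the
    points inside each voxel, and, for every point, the index of its voxel.
    """
    coords = np.floor(points / voxel_size).astype(np.int64)
    # Unique rows are the occupied voxels; `inverse` maps points to voxels.
    voxel_coords, inverse = np.unique(coords, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # keep a flat index across NumPy versions
    counts = np.bincount(inverse, minlength=len(voxel_coords))
    centroids = np.zeros((len(voxel_coords), 3))
    np.add.at(centroids, inverse, points)  # sum the points in each voxel
    centroids /= counts[:, None]
    return voxel_coords, centroids, inverse
```

The returned point-to-voxel index is what would let per-point features be combined with the feature of the voxel each point falls into.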
2D-3D Feature Matching
- 3D elements are represented as a combination of voxels and points.
- 2D and 3D features are matched by mapping them into a shared latent space.
- VoxelPoint-to-Pixel matching creates a structured cross-modal latent space that yields a uniform feature distribution.
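The matching step described above can be sketched as a similarity search in the shared latent space. This is a minimal example under stated assumptions: voxel and point features are fused by simple addition, similarity is cosine similarity, and the names `l2_normalize` and `match_pixels_to_points` are hypothetical. The actual fusion and similarity used by the method may differ.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def match_pixels_to_points(pixel_feats, voxel_feats, point_feats, point_to_voxel):
    """Return, for each pixel, the index of the most similar 3D element.

    pixel_feats:    (num_pixels, C) features from the pixel branch
    voxel_feats:    (num_voxels, C) features from the voxel branch
    point_feats:    (num_points, C) features from the point branch
    point_to_voxel: (num_points,) index of the voxel containing each point
    """
    # Assumed fusion: each point inherits its voxel's feature and adds its own.
    feats_3d = voxel_feats[point_to_voxel] + point_feats
    sim = l2_normalize(pixel_feats) @ l2_normalize(feats_3d).T
    return sim.argmax(axis=1), sim
```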
Cross Intersection Detection for Outlier Handling
- Because cameras and LiDAR sensors acquire data differently, each modality contains many outlier regions that have no correspondence in the other.
- Intersection regions are defined as the overlap between the 2D projection of LiDAR points using camera parameters and the reference image.
- A detection strategy predicts the probability that each 2D/3D element lies in the intersection region. This helps remove outlier regions on both modalities before inferring 2D-3D correspondences.
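The intersection region defined above can be computed geometrically when the camera parameters are known, for example to produce labels for the detection strategy. The sketch below assumes a pinhole camera with intrinsic matrix `K` whose last row is `[0, 0, 1]`, and a LiDAR-to-camera transform `(R, t)`; the function names are hypothetical.

```python
import numpy as np

def project_points(points, K, R, t):
    """Project (N, 3) LiDAR points into the image; returns pixel coords and depth."""
    cam = points @ R.T + t                   # LiDAR frame -> camera frame
    uvw = cam @ K.T                          # apply pinhole intrinsics
    depth = uvw[:, 2]
    safe = np.where(depth > 0, depth, 1.0)   # avoid dividing by zero or negatives
    uv = uvw[:, :2] / safe[:, None]
    return uv, depth

def intersection_mask(points, K, R, t, width, height):
    """True for points that lie in front of the camera and project inside the image."""
    uv, depth = project_points(points, K, R, t)
    in_front = depth > 0
    in_image = (
        (uv[:, 0] >= 0) & (uv[:, 0] < width)
        & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    )
    return in_front & in_image, uv
```

Points behind the camera are given a placeholder depth only to keep the division well defined; the mask rejects them regardless of where they appear to land.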

Figure 2: t-SNE visualization of the learned latent space for point-to-pixel (P2P) and voxel-point-to-pixel (VP2P) matching.
Adaptive Weighting Optimization Strategy
The adaptive weighting optimization addresses feature matching between 2D and 3D. Traditional contrastive or triplet losses struggle with 2D-3D feature matching. The proposed adaptive-weighted strategy assigns adaptive weight factors to positive and negative pairs within a set of 2D-3D paired samples, enabling more flexible optimization.
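To make the idea of per-pair adaptive weights concrete, the sketch below uses the formulation of circle loss, in which each positive or negative pair is weighted by how far its similarity is from its optimum. This is a stand-in chosen for illustration: the exact loss used by the method is not given here, and the function name, `margin`, and `gamma` are assumptions.

```python
import numpy as np

def _logsumexp(x):
    """Numerically stable log(sum(exp(x)))."""
    m = np.max(x)
    return m + np.log(np.sum(np.exp(x - m)))

def adaptive_weighted_loss(sim_pos, sim_neg, margin=0.25, gamma=64.0):
    """Circle-loss-style objective over one set of 2D-3D paired samples.

    sim_pos: similarities of matching (positive) 2D-3D pairs
    sim_neg: similarities of non-matching (negative) 2D-3D pairs
    """
    # Adaptive weights: pairs far from their optimum get a larger weight.
    alpha_p = np.maximum(1.0 + margin - sim_pos, 0.0)
    alpha_n = np.maximum(sim_neg + margin, 0.0)
    logit_p = -gamma * alpha_p * (sim_pos - (1.0 - margin))
    logit_n = gamma * alpha_n * (sim_neg - margin)
    # log(1 + sum(exp(logit_n)) * sum(exp(logit_p))), computed stably.
    return float(np.logaddexp(0.0, _logsumexp(logit_p) + _logsumexp(logit_n)))
```

A pair that is already well optimized receives a weight near zero, so the gradient concentrates on the pairs that still violate the margins, which is the flexibility a fixed-weight contrastive or triplet loss lacks.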

Figure 3: Illustration of adaptive weighting optimization.
Differentiable PnP
To establish 2D-3D correspondences, intersection detection first removes outlier regions on both modalities, and nearest-neighbor search in the cross-modal latent space then matches 2D and 3D features. An argmax operation selects the point coordinates with maximum similarity in the latent space. Because argmax is non-differentiable, gradients are obtained via the Gumbel estimator to enable end-to-end training. The probabilistic PnP solver treats its output as a probability distribution over poses rather than a single pose, which addresses the non-differentiability of PnP. Supervision is applied by minimizing the KL divergence between the predicted pose distribution and the ground-truth pose distribution. In addition, an iterative PnP solver based on the Gauss-Newton algorithm computes a refined pose and a pose loss; the iterative Gauss-Newton steps are differentiable and are included in the optimization.
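The KL-divergence supervision can be illustrated with a simplified, discrete version. The sketch assumes that the predicted pose density is proportional to the exponential of the negative reprojection cost, that the ground-truth distribution is concentrated on the ground-truth pose, and that the normalizing integral over poses is approximated by a finite set of candidate poses. All function names are hypothetical, and this is not the solver used by the method.

```python
import numpy as np

def project(points, K, R, t):
    """Pinhole projection of (N, 3) points with pose (R, t) and intrinsics K."""
    uvw = (points @ R.T + t) @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def reprojection_cost(pose, pts3d, pts2d, K):
    """Half the sum of squared reprojection residuals for one pose."""
    R, t = pose
    residual = project(pts3d, K, R, t) - pts2d
    return 0.5 * float(np.sum(residual ** 2))

def kl_pose_loss(gt_pose, candidate_poses, pts3d, pts2d, K):
    """Cost at the ground-truth pose plus the log of the approximate normalizer."""
    cost_gt = reprojection_cost(gt_pose, pts3d, pts2d, K)
    neg_costs = -np.array(
        [reprojection_cost(p, pts3d, pts2d, K) for p in candidate_poses]
    )
    m = neg_costs.max()
    log_norm = m + np.log(np.sum(np.exp(neg_costs - m)))  # stable log-sum-exp
    return cost_gt + log_norm
```

When the candidate set includes the ground-truth pose, this value equals the negative log-probability that the discrete distribution assigns to that pose, so correspondences that agree with the ground truth drive the loss toward zero.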
Experiments
The method is evaluated on the commonly used KITTI and nuScenes benchmarks for image-to-point-cloud registration. On both datasets, images and point clouds are captured simultaneously by 2D cameras and 3D LiDAR.
Quantitative and Qualitative Comparisons
Quantitative: The method shows strong performance on KITTI and nuScenes, with an improvement of about 4x in relative translational error (RTE) over CorrI2P, the most recent prior method. The end-to-end framework with a probabilistic PnP solver learns robust 2D-3D correspondences and yields more accurate pose predictions.
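For reference, registration metrics of this kind can be computed as in the sketch below. It assumes RTE is the Euclidean distance between predicted and ground-truth translations and measures rotation error as the geodesic angle between the two rotations; the benchmark protocol may define the rotation error differently (for example via Euler angles), and the accuracy thresholds are left as parameters because their values are not stated here.

```python
import numpy as np

def relative_translation_error(t_pred, t_gt):
    """Euclidean distance between predicted and ground-truth translations."""
    return float(np.linalg.norm(np.asarray(t_pred) - np.asarray(t_gt)))

def relative_rotation_error_deg(R_pred, R_gt):
    """Angle in degrees of the rotation that takes R_gt to R_pred."""
    R_delta = R_gt.T @ R_pred
    cos_angle = np.clip((np.trace(R_delta) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.degrees(np.arccos(cos_angle)))

def registration_accuracy(rte, rre, rte_thresh, rre_thresh):
    """Fraction of frames whose errors fall under both thresholds."""
    rte, rre = np.asarray(rte), np.asarray(rre)
    return float(np.mean((rte < rte_thresh) & (rre < rre_thresh)))
```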

Qualitative: Visual comparisons in Figure 5 show improved registration accuracy across different road scenarios. In challenging cases, the method more accurately aligns projections of trees and vehicles to corresponding image pixels, whereas some other methods fail.

Figure 5: Visual comparison of image-to-point-cloud registration results on KITTI.
Feature Matching Accuracy
Figure 6 visualizes feature matching by computing bidirectional error maps from matching distances on both modalities. For 2D-to-3D matching, the method searches for the point with maximum similarity for each 2D pixel within the intersection region, then computes the Euclidean distance between the projected matched point and the 2D pixel. Results show the method significantly outperforms CorrI2P for both 2D-to-3D and 3D-to-2D matching. Most matches have minor errors under 2 pixels, indicating the learned shared latent space accurately discriminates cross-modal patterns. Larger errors can appear near image and point cloud edges where perfect intersection detection is difficult.
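The 2D-to-3D error described above can be sketched as follows: each pixel's matched 3D point is projected with the ground-truth calibration, and the error is the pixel distance between that projection and the pixel itself. The function name and argument layout are assumptions for illustration.

```python
import numpy as np

def matching_error_2d_to_3d(pixels_uv, matched_points, K, R, t):
    """Per-pixel matching error in pixels.

    pixels_uv:      (M, 2) pixel coordinates inside the intersection region
    matched_points: (M, 3) 3D point matched to each pixel
    K, R, t:        ground-truth intrinsics and LiDAR-to-camera pose
    """
    uvw = (matched_points @ R.T + t) @ K.T
    projected = uvw[:, :2] / uvw[:, 2:3]
    return np.linalg.norm(projected - pixels_uv, axis=1)
```

Taking `np.mean(errors < 2.0)` over the returned array would give the share of matches under the 2-pixel level mentioned above.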
Runtime Efficiency
Efficiency comparisons were conducted on an NVIDIA RTX 3090 GPU and an Intel Xeon E5-2699 CPU. The method uses fewer parameters while achieving significantly better performance. Network inference plus pose estimation for one frame takes 0.19 seconds, about 50x faster than previous methods.

Ablation Studies
Ablation studies verify the effectiveness of each design choice and the impact of important parameters. Results are reported on KITTI as relative translational error (RTE), relative rotational error (RRE), and registration accuracy (Acc).
Framework design validation: Four variants were evaluated: removing the voxel branch, removing the point branch, replacing the adaptive-weighted loss, and removing the end-to-end supervision driven by the differentiable PnP. The full model outperforms all four variants, demonstrating the effectiveness of each design. Notably, removing the point branch had less impact than removing the voxel branch, indicating that voxels play the more important role in image-to-point-cloud registration.

Input resolution impact: The effects of input image resolution and point cloud density were studied. Higher resolution on both modalities yields better performance, since low-resolution images can lose visual details and low-density point clouds can lose geometric structure. Because higher resolution also costs more computation, the final input settings are chosen to balance performance and efficiency.

Conclusion
This work proposes a framework for image-to-point-cloud registration that uses VoxelPoint-to-Pixel matching and an adaptive-weighted loss to learn a structured cross-modal latent space. Representing 3D elements as a combination of voxels and points helps close the domain gap between point clouds and pixels. By supervising the predicted pose distribution directly through a differentiable PnP solver, the framework is trained end-to-end. Extensive experiments on KITTI and nuScenes demonstrate the method's improved performance.