LHMap-loc: Cross-Modal Monocular Localization Using LiDAR Point Cloud Heat Map

2024-11-26

III-B Offline LHMap Generation Network

As shown in Fig. 2, we use the offline LHMap generation network to compress the pre-built LiDAR point cloud map. This is realized in two stages.

In the first stage, we perform point selection to compress the dense map and pose supervision to refine the generated local map.
To satisfy the requirements of point selection and map compression, the projected LiDAR depth $D_{gt}$ is used to construct an evaluation system for the point cloud. It is calculated as:

$$D_{gt}^{u,v} = z^{gt}, \tag{1}$$
$$(u, v, 1)^{T} = K \cdot (x^{gt}, y^{gt}, z^{gt}, 1)^{T} = K \cdot T_{gt}^{-1} \cdot (x, y, z, 1)^{T}. \tag{2}$$

Here, $(x, y, z) \in P$, $K$ represents the camera intrinsics, and $T_{gt} \in SE(3)$ represents the ground truth camera pose at each frame.
Additionally, the offline RGB image $I_{offline}$ and the projected LiDAR depth $D_{init}$ are used to perform pose supervision.
Based on the initial rough camera pose $T_{init} \in SE(3)$ at each frame, which can be acquired by GPS or visual odometry, $D_{init}$ is calculated as:

$$D_{init}^{u,v} = z^{init}, \tag{3}$$
$$(u, v, 1)^{T} = K \cdot (x^{init}, y^{init}, z^{init}, 1)^{T} = K \cdot T_{init}^{-1} \cdot (x, y, z, 1)^{T}. \tag{4}$$

Here, $(x, y, z)^{T} \in P$.
Both $D_{gt}$ and $D_{init}$ contain only the depth information of the point clouds.
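The projection in Eqs. (1)-(4) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function name `project_depth`, rounding to the nearest pixel, and keeping the nearest point per pixel are assumptions, and the division by depth is the normalization implied by the homogeneous pixel vector $(u, v, 1)^{T}$.

```python
import numpy as np

def project_depth(points, K, T_cam, height, width):
    """Render a sparse depth image from map points (sketch of Eqs. (1)-(4)).

    points : (N, 3) map points (x, y, z) in the world frame
    K      : (3, 3) camera intrinsics
    T_cam  : (4, 4) camera pose, i.e. T_gt or T_init
    """
    # Homogeneous coordinates, then T^{-1} * (x, y, z, 1)^T moves points into the camera frame.
    pts_h = np.hstack([points, np.ones((points.shape[0], 1))])
    cam = (np.linalg.inv(T_cam) @ pts_h.T).T[:, :3]

    # Assumption: discard points behind (or on) the image plane.
    cam = cam[cam[:, 2] > 1e-6]

    # Pinhole projection; dividing by the third component yields (u, v, 1)^T.
    uvw = (K @ cam.T).T
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)
    z = cam[:, 2]

    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    u, v, z = u[inside], v[inside], z[inside]

    # Assumption: when several points hit one pixel, keep the nearest one.
    depth = np.full((height, width), np.inf)
    np.minimum.at(depth, (v, u), z)
    depth[np.isinf(depth)] = 0.0  # 0 marks pixels without a LiDAR point
    return depth
```

Calling the function with $T_{gt}$ gives $D_{gt}$ and calling it with $T_{init}$ gives $D_{init}$; the two images differ only through the pose used.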

Firstly, feature maps $F_I$, $F_{D1}$, and $F_{D2}$ with different scales are extracted from $I_{offline}$, $D_{gt}$, and $D_{init}$, respectively, through convolutional neural networks (CNN). $F_{D1}$, the CNN feature of the projected LiDAR depth $D_{gt}$, is used to generate the heat feature $H_c$.
The points are selected by evaluating their heat values, which are calculated from the heat feature $H_c$. Each element $h_k^{i,j} \in H_c$ ($i \in \{1, 2, \dots, H\}$, $j \in \{1, 2, \dots, W\}$) is used to calculate the heat value $h^{i,j}$ for point cloud evaluation as:

$$h^{i,j} = Mask^{i,j} \cdot \sum_{k=1}^{C} h_k^{i,j}, \tag{5}$$
$$Mask^{i,j} = \begin{cases} 0, & M_{gt}^{i,j} = 0 \\ 1, & M_{gt}^{i,j} \neq 0 \end{cases}. \tag{6}$$

Here, $C$ represents the number of channels of $H_c$.
Subsequently, the points exhibiting the highest heat values are selected to constitute the coarse local LHMap, denoted as $M_c$:

$$M_c^{i,j} = TopN(h^{i,j}), \tag{7}$$
$$TopN(h^{i,j}) = \begin{cases} D_{gt}^{i,j}, & \text{if } h^{i,j} \text{ ranks in the top } N \\ 0, & \text{otherwise} \end{cases}. \tag{8}$$
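A minimal NumPy sketch of the selection in Eqs. (5)-(8) is given below. The function names are hypothetical. The text does not define $M_{gt}$; the sketch assumes it is nonzero exactly where a LiDAR point projects, so the projected depth $D_{gt}$ can be passed for it. The heat feature here is just an array, whereas in the network it is produced by the CNN.

```python
import numpy as np

def heat_values(H_c, M_gt):
    """Eqs. (5)-(6): sum the heat feature over its C channels and mask empty pixels.

    H_c  : (C, H, W) heat feature
    M_gt : (H, W) array, zero where no LiDAR point projects (assumed meaning)
    """
    mask = (M_gt != 0).astype(H_c.dtype)
    return mask * H_c.sum(axis=0)

def top_n_map(h, D_gt, n):
    """Eqs. (7)-(8): keep D_gt at the n pixels with the largest heat value, 0 elsewhere."""
    flat = h.ravel()
    n = min(n, flat.size)
    # Indices of the n largest heat values; the order among them is irrelevant here.
    keep = np.argpartition(-flat, n - 1)[:n]
    M_c = np.zeros(flat.size, dtype=D_gt.dtype)
    M_c[keep] = D_gt.ravel()[keep]
    return M_c.reshape(h.shape)
```

For example, `top_n_map(heat_values(H_c, D_gt), D_gt, n)` returns a depth image of the same size as $D_{gt}$ with at most `n` nonzero pixels, which is the compression the first stage aims for.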

During the generation of $M_c$, pose supervision is adopted to guide the procedure. The pose supervision module takes two inputs: the heat feature $H_c$ and the optical flow embedding $E_D$, which is derived from $F_{D2}$ and $F_I$ based on the iterative optimization structure of PWCNet [29]. Pose supervision is realized by the pose calculation module, detailed in Sec. III-C.

Stage 1 learning alone fails to converge. Therefore, we propose a second stage to refine the LHMap.
In the second stage, we apply $\Delta T = T_{init}^{-1} \cdot T_{gt}$ to the coarse local LHMap $M_c$ to recover the initial localization results.
The initial coarse local LHMap $M_c^{init}$ and the offline RGB image $I_{offline}$ are used for further pose supervision.
Because both stages share the same offline RGB image, they naturally share the same RGB feature maps $F_I$, while the feature maps $F_M$ of the initial coarse local LHMap $M_c^{init}$ are regenerated. Then, the heat feature $H_M$ is generated from $F_M$, and the flow embedding $E_M$ is generated from $F_M$ and $F_I$. Finally, they work together for pose supervision.
As in stage 1, pose supervision is realized by the pose calculation module introduced in Sec. III-C.
In this stage, we regress another 6-DoF pose $q_1, t_1$. Both $q_0, t_0$ from stage 1 and $q_1, t_1$ refine the local LHMap by optimising the heat feature $H_c$.
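A short derivation may clarify why $\Delta T$ has this form. It assumes, consistently with Eqs. (2) and (4), that $T_{gt}$ and $T_{init}$ map camera-frame coordinates to world coordinates. The symbols $P^{gt}$ and $P^{init}$, the homogeneous coordinates of one map point in the two camera frames, are introduced here for illustration only:

$$P^{gt} = T_{gt}^{-1} \cdot (x, y, z, 1)^{T} \;\Rightarrow\; (x, y, z, 1)^{T} = T_{gt} \cdot P^{gt},$$
$$P^{init} = T_{init}^{-1} \cdot (x, y, z, 1)^{T} = T_{init}^{-1} \cdot T_{gt} \cdot P^{gt} = \Delta T \cdot P^{gt}.$$

Under this reading, applying $\Delta T$ re-expresses the points selected with the ground truth pose in the camera frame of the initial rough pose, which is what $M_c^{init}$ requires.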

The output of this network is the LiDAR point cloud Heat Map (LHMap), which combines the refined local LHMaps of all frames. Although the local LHMap contains only depth information, by inverting the projection we can obtain the 3D coordinates $P_k$ at each frame $k$. Given the ground truth camera pose ${}^{w}T_k$ at frame $k$ and the points $P_k$ of frame $k$, we can convert $P_k$ to the world frame:

$${}^{w}P_k = {}^{w}T_k \cdot P_k. \tag{9}$$

Here, ${}^{w}P_k$ represents the points of frame $k$ in the world coordinate system. The LHMap is constructed by uniting all the points ${}^{w}P_k$ through a union operation $\cup_k$:

$$LHMap = \cup_k \, {}^{w}P_k. \tag{10}$$
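The map assembly in Eqs. (9)-(10) can be illustrated with the sketch below. The function names are hypothetical, and realizing the union by rounding coordinates and removing duplicate rows is an assumption (the `decimals` parameter is not from the paper). Note that back-projecting from integer pixel positions recovers each point only up to pixel quantization.

```python
import numpy as np

def backproject(depth, K):
    """Invert the pinhole projection: nonzero depth pixels -> (N, 3) camera-frame points."""
    v, u = np.nonzero(depth)
    z = depth[v, u]
    pix = np.stack([u * z, v * z, z], axis=0)  # z * (u, v, 1)^T, shape (3, N)
    return (np.linalg.inv(K) @ pix).T

def to_world(P_k, T_wk):
    """Eq. (9): transform the points of frame k into the world frame."""
    P_h = np.hstack([P_k, np.ones((P_k.shape[0], 1))])
    return (T_wk @ P_h.T).T[:, :3]

def build_lhmap(local_maps, poses, K, decimals=3):
    """Eq. (10): union of the world-frame points of all frames.

    Assumption: points that agree after rounding to `decimals` count as the same point.
    """
    pts = [to_world(backproject(d, K), T) for d, T in zip(local_maps, poses)]
    rounded = np.round(np.vstack(pts), decimals) + 0.0  # "+ 0.0" turns -0.0 into 0.0
    return np.unique(rounded, axis=0)
```

The result is an (M, 3) array of unique world-frame points, sorted lexicographically by `np.unique`.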

The loss function of the offline heat map generation network is similar to that of CMRNet [16].
Let $q_{gt}$ and $t_{gt}$ represent the ground truth camera pose. The angular distance $\mathcal{L}_q$ between quaternions is used to evaluate the rotation loss, and the smooth $L_1$ loss $\mathcal{L}_t$ is used to evaluate the translation loss. They are defined as:

$$\mathcal{L}_q(q, q_{gt}) = D(q \otimes inv(q_{gt})), \tag{11}$$
$$D(q) = \arctan\left(\sqrt{b^2 + c^2 + d^2},\ |a|\right), \tag{12}$$
$$\mathcal{L}_t(t, t_{gt}) = L_1 smooth(t - t_{gt}), \tag{13}$$

Here, $\{a, b, c, d\}$ are the components of the quaternion $q$ and $\otimes$ is the multiplicative operation between two quaternions.
The pose loss is defined as:

$$\mathcal{L}_p = \mathcal{L}_t + \lambda \mathcal{L}_q, \quad \lambda \geq 1. \tag{14}$$

The pose $t_0, q_0$ regressed by the pose supervision module in stage 1 and the pose $t_1, q_1$ regressed by the pose supervision module in stage 2 are both taken into account for better supervision. Therefore, the total loss is defined as:

$$\mathcal{LOSS}_1 = \alpha \mathcal{L}_{p0} + \beta \mathcal{L}_{p1}, \quad \alpha + \beta = 1. \tag{15}$$
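The losses in Eqs. (11)-(15) can be sketched in NumPy as follows. Several choices are assumptions rather than details from the paper: the two-argument arctangent in Eq. (12) is read as `arctan2`, quaternions are stored as $(a, b, c, d)$ with $a$ the scalar part, the smooth $L_1$ loss uses the common form with threshold `beta = 1` and is summed over the three translation components, and the default values of `lam` and `alpha` are placeholders.

```python
import numpy as np

def quat_mul(q, r):
    """Hamilton product of two quaternions given as (a, b, c, d)."""
    a1, b1, c1, d1 = q
    a2, b2, c2, d2 = r
    return np.array([
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,
    ])

def quat_inv(q):
    """Quaternion inverse: conjugate divided by the squared norm."""
    q = np.asarray(q, dtype=float)
    return q * np.array([1.0, -1.0, -1.0, -1.0]) / np.dot(q, q)

def angular_distance(q):
    """Eq. (12), with the two-argument arctangent read as arctan2."""
    a, b, c, d = q
    return np.arctan2(np.sqrt(b * b + c * c + d * d), abs(a))

def rotation_loss(q, q_gt):
    """Eq. (11)."""
    return angular_distance(quat_mul(q, quat_inv(q_gt)))

def smooth_l1(x, beta=1.0):
    """Element-wise smooth L1: quadratic below beta, linear above."""
    x = np.abs(x)
    return np.where(x < beta, 0.5 * x ** 2 / beta, x - 0.5 * beta)

def translation_loss(t, t_gt):
    """Eq. (13); summing over the components is an assumption."""
    return smooth_l1(np.asarray(t, dtype=float) - np.asarray(t_gt, dtype=float)).sum()

def pose_loss(q, t, q_gt, t_gt, lam=1.0):
    """Eq. (14): L_p = L_t + lambda * L_q."""
    return translation_loss(t, t_gt) + lam * rotation_loss(q, q_gt)

def stage_loss(pose0, pose1, q_gt, t_gt, alpha=0.5, lam=1.0):
    """Eq. (15): weighted sum of the stage-1 and stage-2 pose losses, beta = 1 - alpha."""
    return (alpha * pose_loss(*pose0, q_gt, t_gt, lam)
            + (1.0 - alpha) * pose_loss(*pose1, q_gt, t_gt, lam))
```

With this definition, $D(q)$ equals half the rotation angle encoded by a unit quaternion, so a 90 degree rotation error yields a rotation loss of $\pi / 4$.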

III-C Online Pose Regression Network

This network is used for real-time monocular localization. The inputs are the online RGB image $I_{online}$ and the real-time LHMap $M^r$. $M^r$ is constructed by projecting the points ${}^{t}P$ of each local LHMap stored by the first network onto the image plane according to Eq. (4).

Firstly, feature maps $F_I^r$ and $F_M^r$ are extracted from the two inputs, $I_{online}$ and the real-time local LHMap $M^r$, through a convolutional neural network (CNN).

Then, the feature maps $F_M^r$ are used together with the RGB image feature maps $F_I^r$ to calculate the 2D flow embedding $E_M^r$, and on their own to generate the heat feature $H_M^r$. $E_M^r$ is calculated in the same way as in PWCNet [29]. The heat feature $H_M^r$ enables the pose regression to focus on effective features. Therefore, the supervision of the 2D flow embedding $E_M^r$ and the regression of the 6-DoF pose can achieve better performance. The cost volume $\mathcal{V}$ is then calculated by feeding $H_M^r$ into a softmax layer to generate coefficients and multiplying the coefficients with $E_M^r$:

$$\mathcal{V} = \sum_{h \times w} E_M^r \odot softmax_{h \times w}(H_M^r), \tag{16}$$

where $\odot$ denotes the element-wise product and $softmax_{h \times w}$ means applying softmax over the height and width dimensions of $H_M^r$.

Finally, the cost volume is fed into separate MLPs for pose regression:

$$q_o = MLP_q(\mathcal{V}), \quad t_o = MLP_t(\mathcal{V}). \tag{17}$$

The pose regression is realized by the pose calculation module, as shown in Fig. 3. The resolution of the flow feature may differ from that of the heat feature. Therefore, the flow feature is passed through up-sampling layers to match the resolution of the heat feature before being multiplied with it. The product is then summed over the height and width dimensions before being fed into fully connected layers, denoted as $MLP_q$ and $MLP_t$. The outputs of this network are the 6-DoF poses $t_o, q_o$.
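The heat-weighted pooling of Eq. (16) can be sketched as below. This is an illustration under stated assumptions, not the authors' code: nearest-neighbour resizing stands in for the learned up-sampling layers, the flow embedding and heat feature are assumed to have the same number of channels, and the two MLPs are omitted.

```python
import numpy as np

def spatial_softmax(H_feat):
    """Softmax taken jointly over height and width, separately per channel. H_feat: (C, h, w)."""
    C, h, w = H_feat.shape
    flat = H_feat.reshape(C, h * w)
    flat = flat - flat.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(flat)
    return (e / e.sum(axis=1, keepdims=True)).reshape(C, h, w)

def upsample_nearest(x, h, w):
    """Nearest-neighbour resize (C, h0, w0) -> (C, h, w); a stand-in for learned up-sampling."""
    rows = np.arange(h) * x.shape[1] // h
    cols = np.arange(w) * x.shape[2] // w
    return x[:, rows[:, None], cols[None, :]]

def cost_volume(E_flow, H_heat):
    """Eq. (16): weight the flow embedding by the softmax of the heat feature, then sum over h x w."""
    E_up = upsample_nearest(E_flow, H_heat.shape[1], H_heat.shape[2])
    return (E_up * spatial_softmax(H_heat)).sum(axis=(1, 2))
```

Because the softmax coefficients of each channel sum to one, the result is a weighted spatial average of the flow embedding with one value per channel; a uniform heat feature reduces it to the plain mean. The resulting vector is what would be passed to $MLP_q$ and $MLP_t$ in Eq. (17).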

The loss function used here follows [30]. With two trainable parameters $w_x$ and $w_q$ added, the loss function is defined as:

$$\mathcal{LOSS}_2 = e^{-w_x} \mathcal{L}_t + w_x + e^{-w_q} \mathcal{L}_q + w_q. \tag{18}$$
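To illustrate how the trainable weights in Eq. (18) balance the two terms, the sketch below evaluates the loss for fixed values of $\mathcal{L}_t$ and $\mathcal{L}_q$. The function names are hypothetical. The helper `stationary_weight` is not from the paper: it follows from setting the derivative of $e^{-w}\mathcal{L} + w$ with respect to $w$ to zero, which gives $w = \ln \mathcal{L}$ for $\mathcal{L} > 0$.

```python
import numpy as np

def weighted_pose_loss(L_t, L_q, w_x, w_q):
    """Eq. (18): translation and rotation losses balanced by trainable weights."""
    return np.exp(-w_x) * L_t + w_x + np.exp(-w_q) * L_q + w_q

def stationary_weight(L):
    """Minimizer of exp(-w) * L + w over w for a fixed loss value L > 0."""
    return np.log(L)
```

For fixed losses, each term attains its minimum $1 + \ln \mathcal{L}$ at the stationary weight, so a larger loss term is scaled down more strongly. In training, $w_x$ and $w_q$ would be updated by gradient descent together with the network parameters rather than set in closed form.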

Figure 3: Details of the regression part. The flow embedding $E$ and the up-sampled heat feature $H$ are multiplied to calculate weighted features. The result is fed into $MLP_q$ and $MLP_t$ to regress the 6-DoF poses.