A 3D Point Cloud Classification Method Based on Adaptive Graph Convolution and Global Attention

2025-01-10

1. Introduction

With the continuous advancement of sensor and image matching technologies, three-dimensional (3D) point clouds have found widespread application in various domains. Effective classification of point clouds plays a crucial role in fields such as autonomous driving, robot navigation, augmented reality, and 3D reconstruction. However, due to the irregularity and sparsity inherent in 3D point clouds, classifying them in complex environments is by no means a straightforward task. Furthermore, the density of a point cloud can vary depending on the sampling interval and range of the laser scanner, while severe occlusion between objects during scanning can result in incomplete coverage of object surfaces. These challenges pose significant hurdles in the classification of 3D point clouds.

As previously mentioned, applying standard convolutional neural networks directly to three-dimensional point clouds is infeasible due to their unordered and unstructured nature. Some researchers have started to regularize point clouds in order to draw on the experience of two-dimensional semantic segmentation networks. In the literature [], the authors presented the groundbreaking work PointNet [], which operates directly on irregular point clouds, utilizes shared multi-layer perceptrons (MLPs) to learn point features, and employs symmetric pooling functions to capture global features. Building upon PointNet [], subsequent scholars have proposed a series of point-wise MLP methods such as PointNet++ [], Frustum-PointNet [], PCNN [], DGCNN [], and PointWeb []. However, the use of shared MLPs for extracting 3D point cloud features may not adequately capture local geometric characteristics within the point cloud and overlooks interactions between points. Zhang [] introduced an interpretable point cloud classification learning method, PointHop, which primarily employs spatial partitioning to address the data challenges in unordered point clouds and explores ensemble methods to enhance classification performance. Ben-Shabat [] introduced an intuitive three-dimensional point cloud representation called 3D modified Fisher vectors (3DmFV), using grids to design novel network architectures for real-time point cloud classification. 3D-PointCapsNet [] proposes a 3D point capsule network that preserves the spatial arrangement of the input data and designs a 2D latent space, bringing improvements to several common point-cloud-related tasks.

Nonetheless, the conventional multilayer perceptron (MLP) approach is subject to inherent limitations when addressing global feature interactions between points, owing to the mutual independence of neurons. Moreover, MLPs exhibit suboptimal modeling efficacy for long-range dependencies. The pioneering Transformer model, introduced by Vaswani [], initially achieved remarkable success in natural language processing (NLP). Subsequently, Wang [] introduced the Point-Transformer, which effectively handles variable-length data and global information, resulting in enhanced classification accuracy and generalization capability. Notably, it made significant progress in modeling point-to-point interactions. He [] engineered the PointCloudTransformer, harnessing the Transformer’s self-attention mechanism to capture the global information of point cloud data while employing convolutional neural networks to handle local information, thus achieving highly efficient classification. However, Transformers prove less effective in capturing the topological structural characteristics of point clouds.

To enable each point to capture a broad context and obtain a rich local hierarchy, some scholars have proposed utilizing graph structures for point cloud analysis. GraphCNN [] represents point clouds as graph data based on spatial/feature similarities between points and extends 2D convolution on images to 3D data. To handle unordered point sets with varying neighborhood sizes, standard graph convolution employs shared weight functions for each pair of points to extract corresponding edge features. This results in a fixed/isotropic convolution kernel that is applied to all pairs of points, overlooking their distinct feature correspondences. Intuitively, for points from different semantic parts of a 3D point cloud (such as adjacent points in Figure 1), the convolution kernel should be able to differentiate them and determine their varying contributions. To address this limitation, several dedicated networks have been introduced, including a neighborhood feature pooling-based approach [], attention-based aggregation [], and local-global feature fusion methods []. By assigning appropriate attention weights to neighboring points, these approaches attempt to identify their varying importance during convolution. However, these methods still fundamentally rely on fixed-kernel convolutions, since the attention weights are applied to similar features obtained with a fixed kernel (as indicated by the black arrows in Figure 1b). As illustrated in Figure 1a, standard graph convolution applies a fixed and isotropic kernel (black arrows) to compute features for each point. In Figure 1b, several attention weights are assigned on the basis of these features to determine their importance. In contrast to the previous two, Figure 1c generates an adaptive kernel $\hat{e}_{i}$ that is unique to the learned features of each point.

To address this, we propose a novel deep learning model called Att-AdaptNet (Figure 2). The model features attention-based global feature masking and channel weighting, corresponding to the global attention module and adaptive graph convolution (see Figure 2). The entire end-to-end model takes point clouds of 768 to 1408 points as input for classification learning. There are two primary branches in this model. The first branch focuses on the influence of each local point, producing a global mask at the end of the branch that weights the contribution of each point to the point cloud feature. To capture fine-grained regions of the point cloud, the global features are multiplied by the mask to obtain the final attention-based features. The other branch employs adaptive graph convolution to generate adaptive kernels, replacing the aforementioned isotropic kernel (see Figure 1c). The adaptive kernels achieve adaptivity during the convolution operation itself, as opposed to merely assigning different weights to adjacent points.

The experiments demonstrate that, on the widely used ModelNet40 benchmark dataset, Att-AdaptNet outperforms many existing models. To ensure a fair comparison, following the practice of most deep learning papers, the proposed approach is benchmarked against other models on ModelNet40. The key reason for the superiority of Att-AdaptNet lies in its introduction of attention mechanisms into point cloud feature extraction, where each point plays a unique role in describing the overall structure. Thus, the model assigns individual weights to each point during the feature integration stage, while also emphasizing crucial feature channels representing intrinsic geometric information in high-dimensional space. The main contributions of this paper are summarized as follows:

(1): We propose a novel 3D point cloud classification method, named Att-AdaptNet, based on attention and adaptive graph convolution. This method directly processes raw point clouds and employs attention mechanisms through global feature masking and adaptive graph convolution to focus on feature regions.
(2): We utilize adaptive graph convolution to extract global features from 3D point clouds, effectively and precisely capturing diverse relationships among points from different semantic parts.
(3): Att-AdaptNet is trained and tested on the ModelNet40 benchmark dataset, achieving a classification accuracy of 93.3%. It demonstrates significant improvements in performance compared to other methods.

2. Related Works

Self-attention networks have garnered significant attention for their ability to extract discriminative features of interest, allowing models to identify focal points. Thus far, self-attention-based models have found wide application in tasks such as machine translation, caption generation [], speech recognition [], and adversarial networks [], among others. The self-attention mechanism is designed to enable the network to learn context beyond the receptive field. One of the first successful incorporations of this mechanism into CNNs was the Squeeze-and-Excitation network []. Petar Veličković introduced the graph attention mechanism and constructed the corresponding Graph Attention Network (GAT) []. It primarily utilizes self-attention to obtain attention coefficients, normalizes them, and then linearly combines them with the corresponding feature vectors to produce the final output features. PCAN [] proposed an attention mechanism for local feature aggregation to distinguish positively contributing local features. However, this method mainly employs a point-wise structure to extract local features and does not particularly focus on local geometric structures. GAC [] introduced an attention mechanism based on the PointNet architecture, in which attention weights learned from neighboring points capture discriminative features; this method achieved good performance. Chen et al. [] presented the GAPNet model, which aggregates attention features for each point in the neighborhood using a multi-head attention mechanism and applies stacked MLP layers to capture local geometric features from the original point cloud, achieving promising results. Yang et al. [] developed the Point Attention Transformer (PAT) to model interactions between points, employing parameter-efficient Group Shuffle Attention (GSA) instead of expensive multi-head attention mechanisms.

Influenced by attention mechanisms and pyramid pooling, several methods have been proposed to better capture local geometric information. GGM-Net [] introduced a Graph Geometry Moment convolutional neural network that learns local geometric features from the geometric moment representations of local point sets. AGCN [] avoids the use of a shared spectral kernel and instead assigns a customized Laplacian graph to each sample, providing an objective description of its graph convolution topology. Li [] aimed to extract precise pixel-level attention from high-level features obtained from CNNs. They proposed the Feature Pyramid Attention (FPA) module, which effectively increases the receptive field and aids in the classification of small objects by embedding context features of different scales in a pixel prediction framework based on FCN. PyramNet [] designed two new operators, the Graph Embedding Module (GEM) and the Pyramid Attention Network (PAN). GEM projects point clouds onto graphs and utilizes covariance matrices to explore relationships between points, enhancing the model’s ability to represent local features. PAN assigns strong semantic features to each point, preserving fine-grained geometric features as much as possible. Wang et al. [] introduced GACNN, an end-to-end encoder-decoder network that captures multi-scale features of point clouds, achieving more accurate point cloud classification.

3. Model Construction

In recent years, deep neural networks have emerged as a primary tool for image analysis. Deep learning, due to its capacity for large-scale learning, has also gained popularity in 3D point cloud classification. Since the introduction of PointNet [], recent works have focused on extracting global features of point sets by grouping and aggregating the features of all individual points. However, these approaches are limited to detecting structural differences between different objects. Therefore, this paper proposes a novel deep learning model called Att-AdaptNet.

3.1. Adaptive Graph Convolution Module

The adaptive graph convolution is an extension of graph convolution, and the configuration of the adaptive convolution module in this paper is the same as that in AdaptConvNet []. The structure of this module is illustrated in Figure 3. Let $X = \{x_{i} \mid i = 1, 2, \dots, n\} \in ℝ^{n \times 3}$ be the input point cloud, with corresponding features defined as $F = \{f_{i} \mid i = 1, 2, \dots, n\} \in ℝ^{n \times d}$. Here, $x_{i}$ represents the ($x$, $y$, $z$) coordinates of the $i$-th point; in general, it can be augmented with vectors of other attributes such as normals and colors. Then, a graph including self-loops is constructed by considering the k-nearest neighbors (kNN) of each point, resulting in a directed graph $G(V, E)$, where $V = \{1, 2, \dots, n\}$ and $E \subseteq V \times V$ represents the set of edges. Given the input feature dimension $d$, the AdaptConv [] layer aims to generate a new set of $M$-dimensional features with the same number of points while attempting to reflect local geometric features more accurately than previous graph convolutions.
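
The graph construction described above can be sketched in a few lines. This is only an illustrative brute-force version using the Python standard library; the function name `knn_graph` and the toy coordinates are assumptions, and the sketch relies on each point being at distance zero from itself (no exact duplicate points) so that the self-loop is always among its own k nearest neighbors.

```python
import math

def knn_graph(points, k):
    """Return directed edges (i, j) from each point i to its k nearest points.

    Point i is at distance 0 from itself, so it is always selected first,
    which produces the self-loops mentioned in the text (assuming no
    exact duplicate points).
    """
    edges = []
    for i, p in enumerate(points):
        # Sort all indices by Euclidean distance to point i; ties broken by index.
        order = sorted(range(len(points)),
                       key=lambda j: (math.dist(p, points[j]), j))
        edges.extend((i, j) for j in order[:k])
    return edges

# Toy point cloud with n = 4 points (hypothetical values).
points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (5.0, 5.0, 5.0)]
edges = knn_graph(points, k=2)
```

The brute-force search costs O(n²) distance evaluations per cloud, which is acceptable for a sketch; a practical implementation would typically use a batched distance matrix on the GPU or a spatial index.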

The adaptive kernel, denoted as ${\hat{e}}_{i j m}$, is generated from the input features $Δ f_{i j}$ of a pair of points on the edge. It is then convolved with the corresponding spatial input $Δ x_{i j}$ to produce the corresponding edge feature $h_{i j m}$. All dimensions of $h_{i j m}$ are concatenated to produce the edge feature $h_{i j}$, which is finally pooled to output the feature $f_{i}^{'}$ of the central point. What sets AdaptConv apart from other graph convolutions is that the convolution kernel for each pair of points is unique. Here, $x_{i}$ represents the central point in the graph convolution, and $\mathcal{N}(i) = \{j : (i, j) \in E\}$ is the set of points in its neighborhood. Due to the irregularity of point clouds, previous methods often used a fixed kernel function for the neighbors of $x_{i}$ to capture the geometric information of the patch. However, different neighborhoods reflect different features of $x_{i}$, especially when $x_{i}$ is located in prominent regions such as corners or edges. A fixed kernel may therefore lead graph convolution to generate geometric representations that are not well suited for classification.

Therefore, this paper aims to capture the unique relationship between each pair of points using an adaptive kernel. For each channel of the output $M$-dimensional features, AdaptConv dynamically generates a kernel based on the point features $(f_{i}, f_{j})$, as shown in Equation (1):

$\hat{e}_{i j m} = g_{m} (Δ f_{i j}), \quad j \in \mathcal{N}(i)$

(1)

Here, $m = 1, 2, \dots, M$ indexes one of the $M$ output dimensions, each corresponding to a single filter defined in AdaptConv. To combine the global shape structure and the feature differences captured in the local neighborhood [], this paper defines $Δ f_{i j} = [f_{i}, f_{j} - f_{i}]$ as the input feature for the adaptive kernel, where $[\cdot, \cdot]$ denotes the concatenation operation and $g (\cdot)$ is a feature mapping function, for which a multi-layer perceptron is used.

Similar to the computation of 2D convolution, which takes $d$ input channels and their respective filter weights to obtain one of the $M$ output dimensions, convolution is applied between the adaptive kernel and the corresponding points $(x_{i}, x_{j})$, as shown in Equation (2):

$h_{i j m} = σ \langle \hat{e}_{i j m}, Δ x_{i j} \rangle$

(2)

In Equation (2), $Δ x_{i j}$ is defined as $[x_{i}, x_{j} - x_{i}]$, $\langle \cdot, \cdot \rangle$ denotes the inner product of two vectors, and $h_{i j m} \in ℝ$ is obtained after applying a non-linear activation function $σ$. As shown in Figure 3, the $m$-th adaptive kernel ${\hat{e}}_{i j m}$ is combined with the spatial relation $Δ x_{i j}$ of the corresponding point $x_{j} \in ℝ^{3}$. The size of the kernel must match in the inner product, which means the feature mapping is $g_{m} : ℝ^{2 d} \to ℝ^{6}$. This allows spatial positions in the input space to be effectively incorporated into each layer and combined with the features extracted dynamically by the kernel. The $h_{i j m}$ from each channel are stacked together, generating the edge feature $h_{i j} = [h_{i j 1}, h_{i j 2}, \dots, h_{i j M}] \in ℝ^{M}$ between the points $(x_{i}, x_{j})$. Finally, the output feature of the central point is defined by applying an aggregation function to all edge features in the neighborhood:

$f_{i}^{'} = \max_{j \in \mathcal{N}(i)} h_{i j}$

(3)

In Equation (3), max represents a channel-wise maximum pooling function. To summarize, the convolutional weights for AdaptConv are defined by Equation (4):

$θ = (g_{1}, g_{2}, \dots, g_{M})$

(4)

In this experiment, AdaptConv generates an adaptive kernel for each pair of points based on their respective features $(f_{i}, f_{j})$. This kernel, denoted as ${\hat{e}}_{i j m}$, is then applied to the point pair $(x_{i}, x_{j})$ to describe their spatial relationship in the input space. In other cases, the input can be $x_{i} \in ℝ^{E}$, which includes additional dimensions representing other valuable point attributes, such as point normals and colors. By modifying the adaptive kernel to $g_{m} : ℝ^{2 d} \to ℝ^{2 E}$, AdaptConv can capture relationships between feature dimensions and spatial coordinates from different domains. In this paper’s experiments, spatial positions are used as the default input to the convolution. As another option, $Δ f_{i j}$ can be employed instead of $Δ x_{i j}$, in which case the adaptive kernel of a pair of points is designed to establish relationships between their current features $(f_{i}, f_{j})$ at each layer. This allows the kernel to adapt to the features from the previous layer, extracting feature relationships. It is a more direct solution, similar to other convolutional operators, as it generates a new set of learned features from the features of the previous layer in the network.
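
To make the data flow of the adaptive convolution concrete, the following is a minimal sketch for a single central point: a kernel is generated from $[f_{i}, f_{j} - f_{i}]$, applied to $[x_{i}, x_{j} - x_{i}]$ through an inner product and an activation, and the resulting edge features are max-pooled over the neighborhood. Several simplifications are assumptions made only for illustration: each mapping $g_{m}$ is reduced to a single linear layer rather than an MLP, $σ$ is taken to be ReLU, and the names `linear` and `adapt_conv_point` are hypothetical.

```python
import random

def linear(weights, bias, vec):
    """One dense layer: weights is a list of rows, one per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b
            for row, b in zip(weights, bias)]

def adapt_conv_point(i, neighbors, xyz, feats, kernels):
    """Output feature f_i' of point i.

    kernels[m] = (W, b) is a single linear layer standing in for the MLP g_m
    (an assumption); it maps the 2d-dim input [f_i, f_j - f_i] to a 6-dim kernel.
    """
    edge_feats = []
    for j in neighbors:
        delta_f = feats[i] + [fj - fi for fi, fj in zip(feats[i], feats[j])]
        delta_x = xyz[i] + [xj - xi for xi, xj in zip(xyz[i], xyz[j])]
        h_ij = []
        for W, b in kernels:
            e_ijm = linear(W, b, delta_f)                     # adaptive kernel
            dot = sum(e * x for e, x in zip(e_ijm, delta_x))  # inner product
            h_ij.append(max(0.0, dot))                        # sigma = ReLU (assumed)
        edge_feats.append(h_ij)
    # Channel-wise max pooling over the neighbourhood.
    return [max(col) for col in zip(*edge_feats)]

random.seed(0)
d, M = 3, 4
xyz = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
feats = [list(p) for p in xyz]   # first layer: features are the coordinates
kernels = [([[random.uniform(-1, 1) for _ in range(2 * d)] for _ in range(6)],
            [0.0] * 6) for _ in range(M)]
f0 = adapt_conv_point(0, [0, 1, 2], xyz, feats, kernels)
```

Because the kernel is recomputed for every pair $(i, j)$, two neighbors at the same spatial offset but with different features yield different edge responses, which is the property that distinguishes this operator from a fixed-kernel graph convolution.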

After two layers of AdaptConv and two layers of graph convolution, the output of the final layer is passed through a shared MLP ($h_{θ}^{g}$) and an SE-1D block to obtain the global feature representation $g$. The computation is given in Equation (5):

$g = F_{SE} (h_{θ}^{g} (f_{i}^{'})) \in ℝ^{n \times C^{out}}$

(5)
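
The text names an SE-1D block without detailing it. The sketch below shows the usual squeeze-and-excitation pattern applied along the point dimension, as one plausible reading: average over points, reduce, apply ReLU, expand, apply a sigmoid gate, and rescale each channel. The function name `se_1d`, the bias-free layers, and the weight shapes are assumptions, not the configuration used in the paper.

```python
import math

def se_1d(feats, w_reduce, w_expand):
    """Channel re-weighting of an n x C feature matrix.

    w_reduce: (C/r) rows of length C; w_expand: C rows of length C/r.
    Biases are omitted for brevity (an assumption).
    """
    n = len(feats)
    # Squeeze: average each channel over all n points.
    squeeze = [sum(col) / n for col in zip(*feats)]
    # Excitation: bottleneck with ReLU, then sigmoid gates in (0, 1).
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeeze)))
              for row in w_reduce]
    gates = [1.0 / (1.0 + math.exp(-sum(w * h for w, h in zip(row, hidden))))
             for row in w_expand]
    # Scale: multiply every point's channels by the shared gates.
    return [[v * g for v, g in zip(row, gates)] for row in feats]

feats = [[1.0, 2.0], [3.0, 4.0]]          # n = 2 points, C = 2 channels
scaled = se_1d(feats, w_reduce=[[0.0, 0.0]], w_expand=[[0.0], [0.0]])
```

With all-zero weights every gate equals 0.5, so the toy call simply halves the input; trained weights would instead emphasize informative channels.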

3.2. Global Attention

For each point $x_{i}$, a subset is defined with $x_{i}$ as the center, denoted $x_{c}$, and the k − 1 closest points other than the center are selected. Thus, the kNN query for $x_{c}$ can be expressed as shown in Equation (6):

$F (x_{c}) = \{x_{j} \mid ∥ x_{j} - x_{c} ∥_{2} \leq ∥ x_{k} - x_{c} ∥_{2}\} \in ℝ^{k \times C}$

(6)

where $x_{k}$ represents the $k$-th closest point to $x_{c}$, found using a kNN query. The grouped input can then be represented as shown in Equation (7):

$\{F (x_{i}) \mid x_{i} \in X\} \in ℝ^{n \times k \times C}$

(7)

The input to this module differs from that of the AdaptConv module. The Global Attention Module takes additional geometric features, and this augmented input is represented as shown in Equation (8):

$x_{i}^{input} = \{x_{i}, x_{j}, x_{j} - x_{i}, ∥ x_{j} - x_{i} ∥_{2}\} \in ℝ^{k \times 10}$

(8)

where $x_{i} \in X$, $∥ \cdot ∥_{2}$ denotes the Euclidean distance, and $k$ is the number of points in the set. The structure of the Global Attention Module is depicted in Figure 4.
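
The ten input channels in Equation (8) come from concatenating the center coordinates (3), the neighbor coordinates (3), their difference (3), and the Euclidean distance (1). A minimal sketch of this grouping for one center point follows; the function name `attention_input` and the sample coordinates are illustrative assumptions.

```python
import math

def attention_input(center, neighbors):
    """Build the k x 10 grouped input for one centre point x_i."""
    rows = []
    for nb in neighbors:
        diff = [b - a for a, b in zip(center, nb)]     # x_j - x_i
        dist = math.sqrt(sum(c * c for c in diff))     # ||x_j - x_i||_2
        rows.append(list(center) + list(nb) + diff + [dist])
    return rows

# k = 2 neighbours: the centre itself and one point at distance 5.
rows = attention_input((0.0, 0.0, 0.0), [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)])
```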

In this module, similar to channel attention in SENet [], two 1 × 1 2D convolutional layers are used to reduce the dimensionality of the grouped features (the input to this module), and a sigmoid function is employed to generate a soft attention mask. For a specific point cluster $F (x_{i})$ centered at $x_{i}$, the importance of $x_{i}$ is defined by Equation (9):

$x_{i}^{GA} = \max_{j \in [1, k]} \mathrm{sigmoid} (h_{θ} (x_{i}^{input})) \in ℝ^{1 \times 1}$

(9)

where the output channel of $h_{θ}$ is 1, and the activation function sigmoid is defined as $\frac{1}{1 + e^{- x}} \in (0, 1)$. Finally, the module outputs the learned soft mask $x^{GA} = \{x_{i}^{GA} \mid i \in [1, n]\}$.
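
As a rough illustration of Equation (9), the importance of one point can be computed by scoring each of its k grouped rows, squashing the scores with a sigmoid, and keeping the maximum. In this sketch the learned mapping $h_{θ}$ is collapsed into a single linear map with one output channel, which is an assumed stand-in for the two 1 × 1 convolutions; `point_importance` and the toy weights are hypothetical.

```python
import math

def sigmoid(x):
    """Logistic function 1 / (1 + e^(-x)), with values in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def point_importance(group_rows, weights, bias):
    """Max over the k neighbours of sigmoid(h(row)) for one centre point.

    h is a single linear map with one output channel (an assumption
    replacing the two 1 x 1 convolutions described in the text).
    """
    scores = [sigmoid(sum(w * v for w, v in zip(weights, row)) + bias)
              for row in group_rows]
    return max(scores)

group = [[0.0] * 10, [1.0] * 10]   # k = 2 rows of the 10-dim grouped input
importance = point_importance(group, weights=[0.1] * 10, bias=0.0)
```

A 1 × 1 convolution over the grouped tensor acts independently on each (point, neighbor) pair, which is why a per-row linear map is an adequate stand-in here.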

The reason for designing a global attention mechanism is straightforward. Given that each object class possesses distinct feature patterns that may include subtle parts such as guitar strings or airplane wings, these patterns can be overlooked during the aggregation process, which extracts numerous features. Hence, there is a need to measure the importance of each group $F (x_{i})$, denoted as $x_{i}^{GA}$, and to weight the global feature $g$ using the learned soft mask $x_{i}^{GA}$.

Furthermore, the reason for incorporating additional geometric information (namely, $∥ x_{j} - x_{i} ∥_{2}$) into the global attention module is to expedite and enhance the learning of the global soft mask $x_{i}^{GA}$. While MLPs can theoretically approximate any nonlinear function, including high-order information such as the squared Euclidean distance (second order: $∥ x_{j} - x_{i} ∥_{2}^{2}$), the literature suggests that models with high-order convolutional filters $(ω_{1} x + ω_{2} x^{2} + ω_{3} x^{3})$ can achieve higher classification accuracy on several benchmarks []. To the same end, the proposed model also includes additional geometric information (namely, $∥ x_{j} - x_{i} ∥^{2}$) to assist the shared MLP in effectively discovering feature patterns and determining the importance of each input point $x_{i}$, denoted as $x_{i}^{GA}$.

3.3. The Structure of Att-AdaptNet

After obtaining the mask $x^{GA}$ from the Global Attention Module and the global features, this paper performs element-wise multiplication on them and generates new global features using the ReLU activation function. Following the principles of PointNet for 3D point cloud classification, most models use max-pooling rather than average-pooling layers. Intuitively, max-pooling should be superior to average-pooling, as the strongest activation might represent the most prominent feature of a class. However, the results of average-pooling can also reflect important class features; otherwise, models using average pooling would yield unreasonable results. To gather more valuable information, the experiment aggregates all points of the regularized global features using both max-pooling and average-pooling simultaneously. The outputs of the average-pooling and max-pooling layers are concatenated into a complete classification vector of dimension 2048. Finally, a 3-layer MLP is employed to output the classification scores, where C, C/r, and C represent the dimensions of the three layers in the MLP, with r being a reduction factor that reduces parameter complexity, as illustrated in Figure 5.
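
The fusion step described above (mask weighting, ReLU, then concatenated max- and average-pooling) can be sketched as follows. The function name `masked_global_descriptor` and the toy values are assumptions; the per-point feature width of 1024 is inferred from the stated output dimension of 2048 rather than given explicitly at this point in the text.

```python
def masked_global_descriptor(global_feats, mask):
    """Fuse per-point global features with the soft attention mask.

    global_feats: n rows of C channels; mask: n importance values in (0, 1).
    Returns a vector of length 2C (2048 if C = 1024, an inferred value).
    """
    # Element-wise weighting by the mask, followed by ReLU.
    weighted = [[max(0.0, m * v) for v in row]
                for row, m in zip(global_feats, mask)]
    n = len(weighted)
    max_pool = [max(col) for col in zip(*weighted)]       # strongest response
    avg_pool = [sum(col) / n for col in zip(*weighted)]   # mean response
    return max_pool + avg_pool

# Toy example with n = 2 points and C = 2 channels.
descriptor = masked_global_descriptor([[1.0, -2.0], [3.0, 4.0]], [0.5, 1.0])
```

Concatenating both pooled vectors keeps the most salient activation per channel while retaining a summary of how that channel responds across the whole shape.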

4. Experimental Results and Analysis

To assess the effectiveness and robustness of the Att-AdaptNet network presented in this paper, a comprehensive set of experiments and corresponding analyses is conducted in this section. First, the proposed Att-AdaptNet network for 3D point cloud classification is validated on the ModelNet40 dataset by comparing it with other 3D point cloud classification methods on this benchmark. Subsequently, the details of the Att-AdaptNet architecture are analyzed. Various experiments with different model parameter settings are conducted to determine the configuration that yields the best results.

4.1. Datasets

In this study, Att-AdaptNet is evaluated using the publicly available ModelNet40 3D point cloud dataset. This dataset comprises 12,311 meshed CAD models from 40 different categories, with 9843 models allocated for training and 2468 models designated for testing. A uniform sampling approach is employed to extract 768 points from each object. Only the (x, y, z) coordinates of these sampled points are used as input data. Figure 6 provides illustrative examples from the ModelNet40 dataset.

4.2. Experimental Environment and Parameter Configuration

The Att-AdaptNet architecture, as illustrated in Figure 4, dynamically recomputes the graph based on feature similarity at each layer, with a fixed neighborhood size of 20 for all layers. This method incorporates shortcut connections and aggregates multi-scale features using a shared fully connected layer (1024). The global features are obtained using the max-pooling function. Detailed experimental settings are presented in Table 1.

4.3. Analysis of Different ‘k’ Values

In the Adaptive Graph Convolution Module, the neighborhood size (k) is a critical parameter for extracting local geometric features. In this section, we conduct experiments to investigate the influence of different values of ‘k’ on classification accuracy using the ModelNet40 dataset. Table 2 displays the accuracy of the model for ‘k’ values of 5, 10, 15, 20, 25, and 30. Figure 7 provides more detail, illustrating the variation in the model’s overall and average accuracy as ‘k’ ranges from 5 to 30. In Figure 7, the purple points represent central points, while the red points denote the points surrounding them. The attention of central points to their surrounding points is depicted for different values of ‘k’. For example, when k = 5, the central point focuses on the nearest 5 points in its vicinity. Similarly, with k = 10, the central point pays attention to the surrounding 10 points.

As shown in Table 2, the results are notably better when k is set to 20 than for other values, indicating that the algorithm performs optimally with k = 20. Reducing the number of neighboring points decreases the computational complexity of the algorithm; however, because of the limited receptive field, this reduction negatively impacts the algorithm’s performance. Conversely, larger values of k introduce more noise into the neighborhood. Since local information becomes diluted within larger neighborhoods, the learning of local geometric features is hampered. Consequently, increasing k does not lead to improved performance. Even when k is reduced to 10, the network still achieves relatively good results. Nevertheless, Figure 8 shows that the model performs best when ‘k’ is 20, so ‘k’ = 20 is used in the subsequent hyperparameter experiments.

4.4. Analysis of Different Point Cloud Numbers

The performance of deep learning models is often correlated with the number of features in the data, generally exhibiting a positive relationship. However, for a given dataset and model, this relationship tends to first increase and then slightly decrease as the number of individual data features grows. Therefore, identifying the optimal data size is crucial to fully realize the model’s performance. This paper assesses the robustness of the Att-AdaptNet model on the ModelNet40 dataset. Sparse point clouds with 256, 384, 512, 768, 896, 1024, 1152, 1280, and 1408 points are employed as input to explore the optimal point count. During testing, the neighborhood size for all networks is fixed at k = 20.

Figure 9 shows point clouds with different numbers of points. The results of these experiments are presented in Figure 10, which illustrates the robustness of Att-AdaptNet across different point cloud densities. Notably, even with a point count as low as 256, its classification performance surpasses that of PointNet, achieving an overall accuracy of 91.53% and an average accuracy of 88.17%.
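
For readers unfamiliar with the two metrics, the sketch below shows how they are commonly computed on ModelNet40: overall accuracy is the fraction of correctly classified samples, while average accuracy is assumed here to mean the unweighted mean of per-class accuracies. That interpretation, the function name, and the toy labels are assumptions, since the text does not define the metrics formally.

```python
from collections import defaultdict

def overall_and_mean_class_accuracy(y_true, y_pred):
    """Return (overall accuracy, mean of per-class accuracies)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    overall = sum(correct.values()) / len(y_true)
    # Each class contributes equally, regardless of how many samples it has.
    mean_class = sum(correct[c] / total[c] for c in total) / len(total)
    return overall, mean_class

# Three samples of class 0 (all correct) and one of class 1 (misclassified).
oa, macc = overall_and_mean_class_accuracy([0, 0, 0, 1], [0, 0, 0, 2])
```

Because the test split is not balanced across the 40 categories, the two numbers can differ by several points, which is consistent with the gap between the overall and average accuracies reported above.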

Analysis reveals that as the number of points increases up to 1024, both the overall and average accuracy of Att-AdaptNet rise. With 256 points in the cloud, the model’s overall accuracy hovers around 90%. When the point count reaches 384, 512, and 768, the overall accuracy consistently exceeds 90% in the middle to later stages of training, peaking at 93.57%. Notably, when the model processes 1024 points, its performance is fully realized, achieving the highest overall and average accuracy of 93.81% and 90.80%, respectively. However, when the number of points used for training exceeds 1024, there is a slight decline in performance: at 1152, 1280, and 1408 points, the accuracies are 93.62%, 93.60%, and 93.58%, respectively. In other words, Att-AdaptNet reaches performance saturation at an input size of 1024 points, and an excessive number of features can introduce noise. This phenomenon is attributed to the existence of a performance saturation point in deep learning models. When the number of features becomes excessive, the model may overfit, adapting too closely to the training data and consequently losing its ability to generalize. Additionally, increasing the number of features raises the computational complexity of the model, prolongs training time, and potentially requires more data to mitigate overfitting.

Here, we construct a simple model to empirically validate the existence of performance saturation points. We use single-layer and double-layer dense networks as the models and randomly generate 10,000 data points with lengths ranging from 20 to 1000. The experimental results are illustrated in Figure 11.

4.5. The Impact of Perceptron Layer Depth on Model Performance

In the global attention module, attention masks are generated by a multilayer perceptron in conjunction with normalization layers. The capacity of the model is, to some extent, contingent upon the number of layers in the perceptron, meaning that the model’s fitting ability is influenced by the perceptron depth. Thus, this paper selects 3, 4, 5, and 6 as the layer counts for the perceptron to determine the optimal configuration. The experimental outcomes are presented in Figure 12.

As shown in Figure 12, when the number of MLP layers is three, the overall accuracy reaches 93.81% and the average accuracy is 90.80%. However, as the number of layers increases, the overall accuracy gradually decreases, dropping to 93.16% with 6 layers. The average accuracy, on the other hand, remains relatively stable, albeit with a slight downward trend. This phenomenon can be attributed to the increased complexity of the model structure due to the additional MLP layers, leading to potential underfitting during training. This prevents the model from fully learning the data distribution and thus limits its performance. Figure 13 illustrates the relationship between model complexity and performance. When deep learning models become excessively complex, implying a larger number of trainable parameters and deeper gradient backpropagation, they may experience underfitting and thereby fail to generalize well to new data.

4.6. Effectiveness of the Proposed Algorithm in 3D Point Cloud Classification

To validate the efficacy of Att-AdaptNet, this paper compares it with other representative point cloud classification models under identical experimental conditions on the ModelNet40 dataset. The evaluation is based on overall classification accuracy and average classification accuracy for 3D point cloud shapes. Brief information on each model is presented as follows.

(1): PointNet: It is composed of multi-layer perceptrons (MLPs), max-pooling layers, and fully connected layers, and can directly process point clouds and extract spatial features for classification tasks.
(2): PointNet++: It builds upon the original PointNet architecture, introducing a hierarchical neural network and using set abstraction layers to capture local structures at multiple scales, which enables more effective processing of spatially distributed data in point clouds.
(3): PCNN: The framework consists of two operators, extension and restriction, which map point cloud functions to volumetric functions and vice versa. A point cloud convolution is defined by pulling back the Euclidean volumetric convolution via this extension-restriction mechanism.
(4): GGM-Net: The central component of GGM-Net is feature extraction through geometric moments, a process known as GGM convolution. It learns point-specific features and local characteristics from the first- and second-order geometric moments of a point and its immediate neighbors, and then integrates the learned features additively.
(5): GAPNet: Local geometric representations are learned by embedding a graph attention mechanism within stacked MLP layers.
(6): FatNet: It presents a new neural network layer, known as the FAT layer, designed to integrate both global point-based and local edge-based features, thereby producing more effective embedding representations.
(7): CT-BLOCK: The CT-block integrates two distinct branches: the ‘C’ branch, for convolution, and the ‘T’ branch, for the transformer. The convolution branch executes convolutions on gathered neighboring points to derive local features. Concurrently, the transformer branch applies an offset-attention mechanism to the entire point cloud to extract global features.
(8): DI-PointCNN: The feature extractor obtains high-dimensional features, while the feature comparator aggregates homogeneous point clouds and disperses heterogeneous point clouds in the feature space. The feature analyzer then completes the task.
(9): DGCNN: A novel neural network module named EdgeConv is proposed, which incorporates local neighborhood information and can be stacked to learn global shape attributes. In a multi-layer system, affinities in the feature space capture semantic features that may span long distances in the original embedding.
(10): AGCNN: A graph-based neural network with an attention pooling strategy, termed AGNet, is proposed, which extracts local feature information by constructing topological structures.
(11): Point-Transformer: The Point Transformer model introduces dot-product and point convolution operations, overcoming the limitations of traditional 3D CNNs in processing point cloud data, and offers enhanced flexibility and scalability.
(12): UFO-Net: An efficient local feature learning module is employed as a bridging technique to connect diverse feature extraction modules. UFO-Net uses multiple stacked blocks to better capture the feature representations of point clouds.
(13): APES: An attention-based, non-generative point cloud edge sampling method (APES), inspired by the Canny edge detection algorithm for images and aided by attention mechanisms.
(14): ULIP + PointNet++: ULIP employs a pre-trained vision-language model that has already learned a common visual and textual space through training on a vast number of image-text pairs. Subsequently, ULIP uses a small set of automatically synthesized triplets to learn a 3D representation space aligned with the common image-text space.

Table 3 presents a comparison of our model with other state-of-the-art (SOTA) models. It is evident that, with the widespread application of deep learning to point cloud tasks, point cloud classification performance has improved significantly over time. Initially, point cloud data was processed using multi-layer perceptrons, whereas in recent years different sampling methods have been utilized. PointNet and PointNet++ marked the beginning, achieving overall accuracies of 89.2% and 91.9%, respectively. Subsequent models such as PCNN, GGM-Net, GAPNet, and FatNet achieved better results, and recent models such as UFO-Net and APES have reached overall accuracies of 93.5% and above. Att-AdaptNet also demonstrates excellent performance, with an overall accuracy of 93.8% and an average accuracy of 90.8%.
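The two metrics reported in this comparison can be computed as in the following minimal sketch. It assumes that overall accuracy is the fraction of all test shapes classified correctly and that average accuracy is the unweighted mean of the per-class accuracies, which is the usual convention; the function names and the toy labels are illustrative and are not taken from the experiments.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of all samples whose predicted label equals the true label."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def mean_class_accuracy(y_true, y_pred):
    """Unweighted mean of the per-class accuracies (per-class recall)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    total = defaultdict(int)    # number of samples per true class
    correct = defaultdict(int)  # correctly classified samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy example: a frequent class classified perfectly, a rare class half right.
y_true = ["chair", "chair", "chair", "chair", "lamp", "lamp"]
y_pred = ["chair", "chair", "chair", "chair", "lamp", "chair"]
oa = overall_accuracy(y_true, y_pred)
macc = mean_class_accuracy(y_true, y_pred)
print(f"overall accuracy: {oa:.4f}, average accuracy: {macc:.4f}")
```

In the toy example the overall accuracy (5/6) is higher than the average accuracy (0.75) because the error falls on the smaller class. When classes have unequal sample counts, the two metrics can differ in this way, so both are usually reported.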

4.7. The Effects of Various Attention Mechanisms

To validate the performance gain from the global attention mechanism proposed in this paper, this section uses self-attention and multi-head attention as reference mechanisms. A version of the model without global attention is also included for comparison. The experimental results are presented in Table 4.

Table 4 reveals that self-attention has only a subtle impact on AdaptNet, whereas multi-head attention has an adverse effect. This is attributed to the fact that, in the multi-head attention module, the number of parameters increases multiplicatively with the number of heads, which is unfavorable for experiments without massive amounts of data. The model may then learn insufficiently and fail to realize its full potential. In contrast, the global attention mechanism, with its simple linear structure and fewer parameters, demonstrates its advantages. It effectively learns and complements AdaptNet, thus achieving commendable performance on the ModelNet40 dataset.
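The growth in parameters with the number of heads can be illustrated with a short sketch. It assumes a multi-head attention variant in which every head keeps its own full-width query, key, and value projections, followed by one output projection over the concatenated heads; this layout, the function name `per_head_full_width_params`, and the feature width of 64 are assumptions made for illustration and are not taken from the implementation evaluated in Table 4. Implementations that instead split the feature width across heads behave differently.

```python
def per_head_full_width_params(d_model, num_heads):
    """Parameter count when each head has its own d_model x d_model Q, K, V maps.

    This is an assumed layout: three biased linear projections per head,
    plus one biased output projection applied to the concatenated heads.
    """
    if d_model < 1 or num_heads < 1:
        raise ValueError("d_model and num_heads must be positive")
    # Q, K, and V projections, each d_model -> d_model, for every head.
    qkv = 3 * num_heads * (d_model * d_model + d_model)
    # Output projection from the concatenated heads back to d_model.
    out = (num_heads * d_model) * d_model + d_model
    return qkv + out


# Hypothetical feature width of 64 channels.
head_counts = (1, 2, 4, 8)
params = {h: per_head_full_width_params(64, h) for h in head_counts}
for h, count in params.items():
    print(f"{h} heads: {count} parameters")
```

Under this assumption the parameter count grows almost in proportion to the number of heads, which is consistent with the explanation above that a multi-head module is harder to train well on a dataset of limited size.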

4.8. Ablation Experiments

To validate the superiority of the Att-AdaptNet model, an ablation experiment was conducted. Given the relatively simple modular structure of the model, three configurations were chosen for the ablation study: Adapt, Adapt-MLP, and Att-Adapt. The results are presented in Table 5, which clearly shows that Att-AdaptNet outperforms the other two configurations.