Dense gradient-based features (DEGRAF) for computationally efficient and invariant feature extraction in real-time applications

We propose a computationally efficient approach for the extraction of dense gradient-based features based on the use of localized intensity-weighted centroids within the image. Whilst prior work concentrates on sparse feature derivations or computationally expensive dense scene sensing, we show that Dense Gradient-based Features (DeGraF) can be derived via initial multi-scale difference of Gaussian preprocessing, weighted centroid gradient calculation and either local saliency (DeGraF-α) or signal-to-noise inspired (DeGraF-β) final stage filtering. Of these two variants, the signal-to-noise based approach (DeGraF-β) is shown to perform admirably against the state of the art in terms of feature density, computational efficiency and feature stability. Our approach is evaluated under a range of environmental conditions typical of automotive sensing applications with strong feature density requirements.

Against the backdrop of the standardized set of feature stability metrics proposed in [8], the wider applicability of newer sparse feature point contenders has narrowed to variants that either abandoned strong orientational invariance (e.g. BRIEF [15]), concentrated on computationally robust sparse matching (FREAK [17], BRISK [18]) or instead improved stability by introducing additional computation (e.g. ASIFT [14]). Despite their merits, the rise of autonomous platform driven applications [19] and of sensing tasks such as real-time scene classification [20], visual odometry [21], monocular depth estimation [22], scene flow [23] and the like instead gave rise to the need for dense features (Figure 1). The immediate response of simply using dense feature grids (e.g. dense SIFT [24]) fell foul of the recent trend in feature point optimization: improved feature stability at the expense of computational cost versus reduced computational cost at the expense of feature stability. Such applications instead require real-time, high-density features that are themselves stable to the narrower subset of metric conditions typical of the automotive application genre (e.g. limited camera rotation, high image noise, extreme illumination changes).
Whilst contemporary work aimed to address this issue via dense scene mapping [23, 22] (computationally enabled by GPU) or a move to scene understanding via 3D stereo sensing [25, 26], here we specifically consider applicability to future low-carbon (possibly electric), long-duration autonomous vehicles, where Size, Weight and (computational) Power (SWaP) requirements will heavily govern the design process. Using the wide range of metrics proposed by Mikolajczyk et al. [8], our evaluation (Section 3) shows that there is a gap for a computationally efficient feature detection approach that produces high-density invariant features (unlike the FREAK / BRISK contenders [18, 17]). To these ends, we present a computationally lightweight feature detection approach that is shown to be highly invariant (stable) under conditions typical of both automotive and wider autonomous sensing.

DENSE GRADIENT-BASED FEATURES (DEGRAF)
A novel feature detection approach, denoted as DeGraF (Dense Gradient Features), is presented based on the calculation of gradients using intensity-weighted centroids [27].

Gradients from Centroids (GraCe)
Firstly, we identify a novel method for calculating dense image gradients using intensity-weighted centroids as an alternative to conventional gradient operators. Within a given image region, a positive centroid, C_pos, is computed as the intensity-weighted average of pixel positions, normalized by the total region intensity S_pos (Eqn. 1), with a corresponding negative centroid, C_neg, computed over the inverted intensities. As Figure 2 illustrates, this approach has several advantages over conventional Sobel, Gabor and Lagrange gradients [28] in that it is invariant to region size, offers sub-pixel accuracy and has significantly higher noise resistance. Computationally it remains more efficient than the Gabor and Lagrange approaches and only marginally more complex than Sobel mask convolution.
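As a concrete illustration, the centroid-based gradient for a single cell can be sketched as follows. This is a minimal NumPy sketch under the definitions above; the function name `grace_gradient` and the exact normalisation details are our own assumptions, not the paper's implementation.

```python
import numpy as np

def grace_gradient(patch):
    """Gradient of one cell as the vector from the negative to the
    positive intensity-weighted centroid (C_neg -> C_pos)."""
    I = patch.astype(np.float64) + 1.0   # shift intensities to 1..256 to avoid zeros
    m = I.max()
    inv = (m + 1.0) - I                  # inverted intensities (m - I), kept >= 1
    ys, xs = np.mgrid[0:patch.shape[0], 0:patch.shape[1]]
    s_pos, s_neg = I.sum(), inv.sum()    # total (inverted) region intensity
    c_pos = (np.sum(xs * I) / s_pos, np.sum(ys * I) / s_pos)
    c_neg = (np.sum(xs * inv) / s_neg, np.sum(ys * inv) / s_neg)
    dx, dy = c_pos[0] - c_neg[0], c_pos[1] - c_neg[1]
    return dx, dy
```

For a cell whose right half is brighter, the positive centroid shifts right and the negative centroid left, giving dx > 0; sub-pixel accuracy follows directly from the centroids being real-valued.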

From Gradients to Features
Dense gradient-based features (DeGraF) are extracted, based on our GraCe gradient map, using a three step process.
Difference of Gaussian (DoG) is used as an (optional) illumination invariant image representation (akin to [31]). This is derived using an Inverted Gaussian Di-pyramid concept (Fig. 3). A di-pyramid conceptually comprises two pyramids symmetrically placed base-to-base, whereas an inverted di-pyramid comprises two pyramids symmetrically placed peak-to-peak (Fig. 3). The original image is used to perform bottom-up construction of a classical (down-sampling) Gaussian pyramid, D, constructed with n levels (Fig. 3, lower pyramid). Subsequently, the peak of this pyramid, the n-th level D_n, is used to perform top-down construction of a second (up-sampling) Gaussian pyramid, U, with base U_0, from a starting point of U_n = D_n (Fig. 3, upper inverted pyramid).
The choice of gradient matrix parameters depends on the required computational performance, feature density and noise resistance. Figure 4 illustrates the varying gradient vectors obtained with different cell sizes and overlaps. Feature extraction is the subsequent process of down-selecting those gradient vectors that are suitable for sequential frame-to-frame tracking. Global feature extraction using α-gradients uses an approach similar to CenSurE [12] and SIFT [4], where local minima and extrema are identified, but here within the gradient matrix derived by GraCe (Section 2.1). An α-gradient is defined as being locally salient, with either a consistently stronger or weaker gradient magnitude in comparison to its immediate spatial neighbours in G. Local feature extraction using β-gradients is an alternative to extracting a salient subset of gradients and instead looks to identify (and discard) the subset of gradients where the Signal-to-Noise Ratio (SNR = 20 log10(RMS_signal / RMS_noise) ≈ 20 log10(μ_sig / σ_bg)) is low. Returning to Eqn. 1, we can see that our positive and negative centroids, C_pos and C_neg, give rise to measures of both the maximal signal, S_pos (≈ μ_sig), and the minimal signal, S_neg (≈ σ_bg), when the gradient matrix calculation is itself performed on the DoG image (high μ_sig; low σ_bg, see Fig. 3). By extension, with the DoG image itself being akin to a second-order saliency derivative of the original input, this also holds for the input image itself. To these ends, we can define gradient quality metrics such that gradient vectors, C_neg→C_pos, with corresponding magnitude r less than a given threshold, r < τ_r, are discarded, as are those with a low centroid ratio, R, measuring the signal-to-noise ratio of the two centroids (Eqn. 4).
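The inverted di-pyramid construction can be sketched with plain NumPy. This is a minimal illustration rather than the paper's implementation: the 5-tap kernel, the periodic ('wrap') boundary handling and the ×4 up-sampling gain are all implementation assumptions.

```python
import numpy as np

K = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0  # classic 5-tap pyramid kernel

def _blur(img):
    # Separable 5-tap Gaussian blur; periodic ('wrap') padding is an
    # implementation assumption, not taken from the paper.
    pad = np.pad(img, 2, mode='wrap')
    rows = np.apply_along_axis(lambda r: np.convolve(r, K, mode='valid'), 1, pad)
    return np.apply_along_axis(lambda c: np.convolve(c, K, mode='valid'), 0, rows)

def pyr_down(img):
    # One level of the down-sampling pyramid D: blur, then drop every other pixel.
    return _blur(img)[::2, ::2]

def pyr_up(img, shape):
    # One level of the up-sampling pyramid U: zero-insert, blur, and rescale
    # (x4 compensates the energy lost to the inserted zeros).
    out = np.zeros(shape)
    out[::2, ::2] = img
    return _blur(out) * 4.0

def dog_dipyramid(img, n=3):
    """I_DoG = |U_0 - D_0|: down-sample n levels from the original image,
    then up-sample back to the original resolution and difference."""
    d = img.astype(np.float64)
    shapes = []
    for _ in range(n):
        shapes.append(d.shape)
        d = pyr_down(d)
    u = d  # U_n = D_n: peak-to-peak join of the inverted di-pyramid
    for s in reversed(shapes):
        u = pyr_up(u, s)
    return np.abs(img.astype(np.float64) - u)
```

For a constant image the round trip is exact, so I_DoG vanishes; image structure such as an intensity step survives the down/up cycle only in blurred form and therefore yields a strong DoG response, which is what makes I_DoG a saliency-like representation.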
Using either feature extraction approach, the proposed DeGraF approach derives stable features from a gradient matrix computed using intensity-weighted centroids. This enforces uniformity of feature distribution across the image, with density configurable via the parameters of G (see Figure 4). The approach is primarily designed for deriving noise-resistant, high-density features within environments where conventional feature extraction approaches fail. GraCe (Section 2.1) introduces a novel way of handling noise by separately estimating the signal-to-noise ratio for each gradient. In this context, a gradient is defined as the combination of a positive and a negative centroid. Since the weaker of the two centroids is more vulnerable to noise, only the dominant centroid is considered; the weaker centroid is then redefined as the anti-symmetric opposite of the dominant centroid (Section 2.1). This approach to calculating gradients is the key contributing factor in the subsequent performance of DeGraF features. Furthermore, extracting DeGraF features from a DoG image adds additional illumination invariance, with the possibility of extension across multiple DoG pyramid levels for higher feature density and scale invariance. Crucially, dense DeGraF features retain low and tunable computational complexity, facilitating real-time computation. Furthermore, the grid-wise computational approach readily facilitates parallelization (multi-core, GPU, FPGA and alike).
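As an illustration of this tunable density, the number of overlapping cells (and hence gradients) follows from the image size, cell size and overlap. The sketch below uses the usual sliding-window stride convention (cell stride = cell size minus overlap); this convention is our assumption, not necessarily the paper's exact gradient matrix equation.

```python
def grid_cells(width, height, w_c, h_c, dx, dy):
    """Enumerate top-left corners of overlapping w_c x h_c cells.

    Neighbouring cells overlap by dx pixels horizontally and dy pixels
    vertically, i.e. the stride is (w_c - dx, h_c - dy).
    """
    sx, sy = w_c - dx, h_c - dy
    xs = range(0, width - w_c + 1, sx)
    ys = range(0, height - h_c + 1, sy)
    return [(x, y) for y in ys for x in xs]
```

With 640×480 images, 16×16 cells and an 8-pixel overlap this yields 79 × 59 = 4661 cells, i.e. feature density is controlled directly by the cell dimensions and overlaps.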
Quantitative evaluation is based on six statistical measures: feature density (as % of image resolution, Table 1), tracking accuracy (as % of total features tracked successfully between subsequent frames, Table 2), feature matching repeatability under variable illumination (% error in illumination-variant feature matching, Table 2), feature matching repeatability under variable rotation (% error in rotational feature matching, Table 3), feature matching repeatability under additive Gaussian noise (% error in feature matching introduced, Table 2) and execution time (mean per frame (milliseconds) / frame-rate (frames per second, fps): single-core 2.4 GHz CPU on 640×480 resolution images, Table 1). These are a variant on the criteria of Mikolajczyk et al. [11], adapted to the requirements of the feature-sparse environments typical of automotive visual sensing applications. For example, an on-vehicle camera is expected to vibrate or rotate slightly but this will not result in significant affine transformations as in the general case of [11]. Our statistical evaluation is performed on the Enpeda EISATS dataset [32, 23], facilitating accurate ground truth data independent of scene noise. Table 1 shows comparative feature density and execution time, where we can see that DeGraF-β produces the highest density (to be expected, since the entire gradient matrix is used to produce one feature per positive centroid) and the second lowest execution time. The second highest feature density is shown by DeGraF-α, which is comparable to AGAST [13] in terms of density and execution time. Other feature detectors produced lower feature density responses, with FAST, DeGraF-β and CenSurE providing the lowest execution times.
Feature tracking accuracy is tested within an automotive context with the use of artificially added camera vibration, of amplitude ±v pixels, in the vertical direction. Pyramidal Lucas-Kanade tracking [33] is then used to track features between frames, with the mean feature tracking error (Eqn. 5) defined in terms of the predefined vibration amplitude v, the measured displacement s of each feature, the number of detected features k and the total number of frames n in the test image sequence. Table 2 shows the results, based on Eqn. 5, and shows that most prior approaches demonstrated similar performance under vibration, with both DeGraF-α and DeGraF-β offering superior performance. Illumination invariance is evaluated using image sequences with variable brightness settings. Given a set of features detected in an original image, the detection error is derived by measuring the repeatability of features in images adjusted for a 25%, 50%, 75% and 100% higher illumination level. The resulting repeatability error (Eqn. 6) is based on the difference in magnitude between the union of all the spatial features detected before (A) and after (B) the change in image conditions and their spatial intersection (based on position). Table 2 shows the mean repeatability error caused by illumination variance, normalized as a percentage of all features. DeGraF-β outperforms all other methodologies with a consistently low error, while DeGraF-α shows comparable performance at higher brightness levels.
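The repeatability measure can be read as the fraction of features that fail to reappear: with A and B the feature sets before and after the change, error = 100 · (|A ∪ B| − |A ∩ B|) / |A ∪ B|. Since Eqn. 6 itself is not reproduced above, the following is an interpretation under that reading, with a positional matching tolerance added as an assumption.

```python
def repeatability_error(A, B, tol=1.0):
    """Percentage repeatability error between feature sets A and B
    (lists of (x, y) positions): 100 * (|A u B| - |A n B|) / |A u B|.
    Two features are deemed the same if within `tol` pixels
    (an assumed matching convention)."""
    matched = 0
    used = set()
    for (ax, ay) in A:
        for j, (bx, by) in enumerate(B):
            if j not in used and (ax - bx) ** 2 + (ay - by) ** 2 <= tol * tol:
                used.add(j)
                matched += 1
                break
    union = len(A) + len(B) - matched
    return 100.0 * (union - matched) / union if union else 0.0
```

Identical feature sets give 0% error; completely disjoint sets give 100%.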
Notably, ORB has the third lowest error rate and is the only other approach apart from DeGraF that is based on intensity centroids, albeit at low density. Our noise stability evaluation adds noise to each image, with n% of pixels having Gaussian-distributed (σ = 1) noise added, and feature repeatability is measured as per Eqn. 6. Table 2 clearly illustrates that DeGraF-β outperforms all other approaches by a significant margin. Comparing DeGraF-β to the second best approach (ORB) shows a performance gap of between 28% and 62% as image noise incidence increases from 5% to 20%. Again, the fact that ORB and DeGraF-α are second and third in the performance ranking is notable, since they are the only other approaches that use intensity-weighted centroids for feature extraction.
Overall we see strong relative performance from both the DeGraF-α and DeGraF-β methods, corroborated by the performance of the ORB intensity-weighted centroid approach, with DeGraF-β outperforming other state-of-the-art methods on the standard feature stability and repeatability tests [11] against ground truth. Qualitative evaluation is shown in Figures 1 and 4.

CONCLUSION
This work introduces the DeGraF (Dense Gradient-based Features) approach for the efficient computation of dense scene features based on localized intensity centroids. The approach is shown to significantly outperform a range of state-of-the-art approaches in the field ([3, 4, 9, 12, 6, 5, 15, 16]) in terms of feature density, computational efficiency (translating to high frame-rates) and overall feature stability under variation in frame-to-frame tracking, orientation, noise and illumination. Future work will consider further aspects of recent advances in analogous sparse features [17, 18].

Fig. 2: Comparison of gradient calculation operators (A-D, [28]).

Here the negative centroid is defined as the weighted average of the inverted pixel values, m − I(i, j), where m = max_(i,j) I(i, j) (Eqn. 1), with pixel values normalized (1 → 256) to avoid division by zero. To avoid instability within the resulting gradient vector, C_neg→C_pos, in noisy image regions, we ensure S_pos > S_neg. If this does not hold, we redefine both symmetrical centroids, C_pos and C_neg, such that C_pos = C_neg (from Eqn. 1) and C_neg is the symmetric point about the spatial centroid I(x_c, y_c), so as to ensure S_pos > S_neg:

C_neg(x_neg, y_neg) = (2x_c − x_pos, 2y_c − y_pos)    (2)

This correction dramatically increases the accuracy under noise of both the resulting orientation, φ = tan⁻¹(dy / dx), and magnitude, r = √(dx² + dy²), calculated from dx = x_pos − x_neg and dy = y_pos − y_neg in 2D image space for the resulting gradient vector C_neg→C_pos.

Within the inverted di-pyramid arrangement (Section 2.2, Fig. 3), calculating the absolute difference between the two pyramid bases gives a DoG image, I_DoG = |U_0 − D_0|, as an intermediate saliency-based scene representation. Gradient matrix calculation is performed using a w_G × h_G dimension grid, G, overlain onto the DoG (or original) image. A GraCe gradient vector, C_neg→C_pos, is calculated for each grid cell C, with cell dimensions w_C × h_C. Cells overlap by δ_x pixels horizontally and δ_y pixels vertically, such that the dimensions of the resulting gradient matrix G are determined by the image size, cell size and overlap.
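The centroid correction of Eqn. 2 can be sketched directly. The function name `correct_centroids` is illustrative, and the swap-then-reflect reading follows the text above.

```python
import math

def correct_centroids(c_pos, c_neg, s_pos, s_neg, centre):
    """Enforce S_pos > S_neg: if not, take the stronger (negative) centroid
    as the new C_pos and reflect it about the spatial centre (x_c, y_c) to
    obtain the new C_neg (Eqn. 2)."""
    if s_pos <= s_neg:
        c_pos = c_neg
        xc, yc = centre
        c_neg = (2.0 * xc - c_pos[0], 2.0 * yc - c_pos[1])
    dx, dy = c_pos[0] - c_neg[0], c_pos[1] - c_neg[1]
    phi = math.atan2(dy, dx)   # orientation
    r = math.hypot(dx, dy)     # magnitude
    return c_pos, c_neg, phi, r
```

Because the reflected C_neg is exactly anti-symmetric to the dominant centroid, the gradient direction is preserved while the noise-sensitive weaker centroid is discarded.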

Table 1: Feature density, execution time and frame-rate.