Real-time classification of vehicles by type within infrared imagery

Real-time classification of vehicles into sub-category types poses a significant challenge within infra-red imagery due to the high levels of intra-class variation in thermal vehicle signatures caused by aspects of design, current operating duration and ambient thermal conditions. Despite these challenges, infra-red sensing offers significant generalized target object detection advantages in terms of all-weather operation and invariance to visual camouflage techniques. This work investigates the accuracy of a number of real-time object classification approaches for this task within the wider context of an existing initial object detection and tracking framework. Specifically we evaluate the use of traditional feature-driven bag of visual words and histogram of oriented gradient classification approaches against modern convolutional neural network architectures. Furthermore, we use classical photogrammetry, within the context of current target detection and classification techniques, as a means of approximating 3D target position within the scene based on this vehicle type classification. Based on photogrammetric estimation of target position, we then illustrate the use of regular Kalman filter based tracking operating on actual 3D vehicle trajectories. Results are presented using a conventional thermal-band infra-red (IR) sensor arrangement where targets are tracked over a range of evaluation scenarios.


INTRODUCTION
We address the problem of the real-time classification of vehicles into sub-category types within infra-red imagery. This poses a significant challenge due to the high levels of intra-class variation in thermal vehicle signatures caused by aspects of design, current operating duration and ambient thermal conditions. However, all-weather operation, invariance to visual camouflage techniques and accepted suitability for the analogous task of pedestrian detection make sensing within thermal-band infra-red (IR) imagery very attractive.
Within the context of automated visual surveillance from infra-red imagery, our prior work on pedestrians [1,2] demonstrated that reasonable performance can practically be achieved through the combined use of infra-red imagery (thermal-band, spectral range: 8-12µm) and the application of real-time photogrammetry. A key advantage of such thermal-band infra-red (IR) imagery for pedestrian localization is robust detection of human shape signatures within the scene [3-5]. As such, the principles of photogrammetry can be used to recover 3D pedestrian position within the scene based on a known camera projection model and an assumption that variance in human height is in fact quite small (statistically supported by [6,7]). In [1] we experimentally investigated the accuracy of classical photogrammetry, within the context of current target detection and classification techniques [3-5], as a means of recovering the true 3D position of pedestrian targets within the scene. A real-time approach for the detection, classification and localization of pedestrian targets via thermal-band (infra-red) sensing was presented with supporting statistical evidence underpinning the key photogrammetric assumptions. Subsequent work in [2] explicitly addressed the remaining issue of correcting for pedestrian posture variation within this localization context. By contrast, here we present an approach for the automatic classification of vehicles by sub-type, such that a similar photogrammetric localization and tracking strategy can be employed. Identifying vehicle sub-type is a key governing factor in determining the suitable height assumption for use in such photogrammetric localization (supported by prior work of [8]), in addition to providing a higher granularity of target reporting within a deployed multi-sensor network.
Overall, despite extensive work in ground-based sensor networks [9-12], the use of photogrammetry within this context has received only limited attention [1,13,14]. The visible-band work of [14] uses a similar approach within a Bayesian 3D tracking framework but does not explicitly address issues of accuracy or its use within a detection filtering framework such as [1].
Prior work on vehicle type classification is dominated by work in visible-band imagery [15], where colour and texture features most often provide the primary conduit to classification by vehicle type [16,17] and often make/model [18]. Recent work uses a range of feature driven classification approaches [16,17] and the topic is well established within the domain of urban traffic surveillance [15]. Within this context, the consistency of vehicle appearance (shape outline, colour, texture), albeit under varying illumination conditions, is a contrasting challenge to the thermal variations within our task. However, in many contexts it is desirable to perform both pedestrian detection, to which thermal-band sensing is highly suited, and vehicle detection/classification from the same deployed all-weather sensor, operating passively with strong invariance to visual camouflage, within a wider automated surveillance sensor network [19].
Prior work explicitly dealing with thermal-band (IR) imagery within an automated surveillance context is presently largely focused upon pedestrian detection [3,5,20-22] and tracking [23,24]. The work presented in this paper is a direct extension of [1,2] that demonstrated photogrammetric pedestrian localization within thermal-band imagery incorporating a lightweight tracking solution akin to that of [4]. Building directly on the framework presented in [1], here we present a method that additionally facilitates the passive localization of vehicles within thermal-band (IR) imagery based on prior classification of vehicle type. Specifically we evaluate a range of feature detector/descriptor combinations with a traditional feature-driven bag of visual words architecture (akin to [1]), the use of histogram of oriented gradient features within a similar classification framework (building on [2]) and finally two modern convolutional neural network (CNN) architectures (AlexNet [25], GoogLeNet [26]).
A number of classification approaches are compared for this challenging subcategory classification task with results presented within a wider context of photogrammetric target localization and tracking as an enabler to spatio-temporal target reporting in an operational context [19,27].

APPROACH
Our approach is illustrated against the backdrop of classical two stage automated visual surveillance [1]. First we detect initial candidate regions within the scene (Section 2.1), thus facilitating efficient feature extraction over isolated scene regions, to which an identified target type is assigned via secondary object classification (Section 2.2) [3].

Candidate Region Detection
In order to facilitate overall real-time performance, initial candidate region detection identifies isolated regions of interest within the scene, facilitating localized feature extraction and classification. By leveraging the stationary position of our sensor, this is achieved using a combination of two adaptive background modeling approaches [28,29] working in parallel to produce a single robust foreground model over varying environmental conditions and notably within varying ambient thermal/infra-red illumination conditions within complex, cluttered environments.
Within the first model, a Mixture of Gaussian (MoG) based adaptive background model, each image pixel is modeled as a set of Gaussian distributions, commonly termed a Gaussian mixture model, that capture both noise related and periodic (i.e. vibration, movement) changes in pixel intensity at each and every location within the image over time [28,30]. This background model is adaptively updated with each frame received and each pixel is probabilistically evaluated as being either part of the scene foreground or background following this methodology. The second model comprises the use of Bayesian classification in a closed feedback loop with Kalman filtered predictions of foreground component position [29]. Within this model, each pixel is similarly probabilistically classified as either foreground or background but this is further reinforced via Kalman predictions for the positions of foreground objects (i.e. connected component foreground regions [31]) present in the previous time-step. This object-aware model significantly aids in the recovery of fast moving foreground objects under varying illumination conditions such as the thermal gradients inherent within infra-red imagery. Overall this combined approach provides a slowly-adapting background model in the traditional sense [28], that can be robust to rapid illumination gradients, whilst similarly providing foreground consistency to fast moving scene objects [29]. The binary output of each foreground model, based on a probabilistic classification threshold, is combined conjunctively to provide robust detection of both static and active scene objects. For illustrative examples and further discussion the reader is directed to [1].
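The conjunctive combination of the two foreground models can be sketched in a few lines. This is a minimal illustration only: the function name, the nested-list representation of the per-pixel probability maps and the 0.5 thresholds are assumptions made for the example and are not taken from the original work.

```python
def combine_foreground(prob_a, prob_b, thresh_a=0.5, thresh_b=0.5):
    """Threshold two per-pixel foreground probability maps and AND them.

    prob_a, prob_b: equally sized 2D lists of foreground probabilities,
    one per background model (e.g. MoG and the Bayesian/Kalman model).
    The thresholds are illustrative values, not those used in the paper.
    """
    combined = []
    for row_a, row_b in zip(prob_a, prob_b):
        combined.append([
            1 if (pa >= thresh_a and pb >= thresh_b) else 0
            for pa, pb in zip(row_a, row_b)
        ])
    return combined


# Toy 2x2 probability maps (hypothetical values).
mog_model = [[0.9, 0.2], [0.7, 0.6]]
bayes_model = [[0.8, 0.9], [0.4, 0.7]]
mask = combine_foreground(mog_model, bayes_model)
print(mask)
```

A pixel is kept as foreground only where both models agree, which is what makes the combined mask conservative with respect to either model alone.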

Vehicle Classification
We evaluate several variations for the initial vehicle target classification (i.e. vehicle vs. non-vehicle) and subsequent type classification as one of type = {car, SUV, LGV, HGV} such that car covers city/small/family type saloon cars, Sports Utility Vehicle (SUV) covers conventional (4 × 4) capable (including pickups) and similarly styled vehicles, Light Goods Vehicle (LGV) covers vans (including small wheelbase trucks) and Heavy Goods Vehicle (HGV) covers articulated trucks (lorries). Based on our detected candidate region (from Section 2.1), we specifically evaluate the use of sparse feature point descriptors using a bag of visual words encoding (Section 2.2.1), the use of dense Histogram of Oriented Gradient (HOG) features (Section 2.2.2) and an end-to-end deep Convolutional Neural Network (CNN) based on the use of transfer learning [25,32].

Bag of Visual Words
Following a bag of visual words (or code-book) methodology [33], which has been empirically shown to be suited towards thermal infra-red imagery [1,3,22,34], we evaluate a number of feature point detection and descriptor approaches as multi-dimensional features. Image representations based on local feature descriptors are widely applied in image classification and object recognition frameworks due to their robustness to partial occlusion and variations in object layout and viewpoint. Distinctive features of objects are detected at interest point locations which generally correspond to local maxima of a saliency measure calculated at each location in an image. The intensity patterns around these interest points are encoded using a descriptor vector. The most widely followed work in the area of local feature extraction has been Lowe's method of the Scale Invariant Feature Transform (SIFT) [35], which introduced a feature descriptor that is invariant to translation, scale and rotation and robust to image noise. Bay et al.'s later work [36] proposed the Speeded Up Robust Features (SURF) algorithm for feature detection and description that is loosely based on SIFT. The computational cost associated with SIFT is dramatically reduced without significant deterioration in performance (as used in prior work on infra-red pedestrian detection [1-3, 22, 34]).
More recently, research in this area has led to industrious efforts to optimize sparse feature stability against computational performance, leading to a range of local feature and detector variants. A standalone feature detector, FAST (Features from Accelerated Segment Test) [37], provides a significant number of candidate points for extraction while maintaining low computational cost. The detector-extractor frameworks BRIEF (Binary Robust Independent Elementary Features) [38] and BRISK (Binary Robust Invariant Scalable Key-points) [39] offer integer-space representations, avoiding the floating point operations of earlier SURF/SIFT variants, for faster extraction and subsequent computation on embedded platforms. ORB [40] (Oriented FAST and Rotated BRIEF) extends such methods to address issues of rotation invariance. A recent pairing of floating-point and integer space feature frameworks, KAZE [41] and AKAZE [42], aims to improve uniqueness and robustness of features by describing them based on a non-linear model of an image. More recently FREAK (Fast Retina Key-point) [43], following from the earlier DAISY [44], represent feature extractors specifically inspired by retinal sampling in the human visual system, originally designed for multiple image matching (i.e. image registration, stereo matching and the like).
Following the bag of visual words methodology, we perform feature extraction and clustering over all of the example training imagery (for all object classes) to produce a set of general feature descriptor clusters that characterise the overall feature space. Commonly this set of feature clusters is referred to as a code-book or vocabulary as it is subsequently used to encode the features detected on specific object instances (vehicle or non-vehicle) as fixed length vectors for input to both the initial off-line classifier training and on-line classification phase of such machine learning driven classification approaches. Here we perform clustering using the common-place k-means clustering algorithm in N-dimensional space (e.g. SURF feature descriptor length, N = 128 [36]) into k_v clusters. A given object instance is encoded as a fixed length vector based on the membership of the features detected within the object to a given feature cluster based on nearest neighbour (hard) cluster assignment. Essentially the original variable number of features detected over each training image or candidate region is encoded as a histogram of fixed length k_v, representing the membership of these features to each of these clusters. This fixed length distribution of features forms a feature vector that is then used to differentiate between labeled instances of a given class based on a trained classifier. Specifically we evaluate a range of such feature point detection and descriptor approaches of varying complexity for this task (namely: FREAK [43], DAISY [44], BRISK [39], ORB [40], KAZE [41], AKAZE [42]) against the mainstay of prior work in the field (i.e. SIFT [35] / SURF [36]) with the default parameter settings from the original works. From this bag of visual words encoding of feature descriptors, we have an overall feature vector, $\vec{v}_{BoVW}$, of dimension k_v (the number of visual code words used in our earlier bag of visual words vocabulary) which forms the input to our classification approach (Section 2.2.3).
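The hard-assignment encoding step described above can be sketched as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, the code-book is assumed to have been produced already by k-means, and the toy 2-dimensional descriptors stand in for real N-dimensional descriptors. The optional normalisation is an added convenience that the source does not specify.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length descriptors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def encode_bovw(descriptors, codebook, normalise=False):
    """Encode a variable number of descriptors as a fixed-length histogram.

    Each descriptor votes for its nearest code word (hard assignment), so
    the output always has len(codebook) entries, i.e. k_v bins, however
    many features were detected in the candidate region.
    """
    histogram = [0] * len(codebook)
    for d in descriptors:
        nearest = min(range(len(codebook)),
                      key=lambda i: squared_distance(d, codebook[i]))
        histogram[nearest] += 1
    if normalise:
        total = sum(histogram)
        return [h / total for h in histogram] if total else [0.0] * len(codebook)
    return histogram


# Toy example: k_v = 3 code words in a 2-D descriptor space (hypothetical).
codebook = [[0, 0], [10, 10], [0, 10]]
descriptors = [[1, 1], [9, 9], [0, 1], [1, 9]]
print(encode_bovw(descriptors, codebook))
```

Because the histogram length depends only on the code-book size, regions with very different numbers of detected features all map to vectors that a single classifier can consume.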

Histogram of Oriented Gradient
The Histogram of Oriented Gradient (HOG) feature descriptor [45] is computed over each detected candidate region (Section 2.1) and used as an input to subsequent classification (Section 2.2.3).

Feature Classification
Support Vector Machine (SVM) [46] classifiers are trained using each of these feature vector representations (Sections 2.2.1 and 2.2.2) over a corpus of exemplar imagery (see training examples in Figure 2). We use 7122 images for initial vehicle target classification (i.e. vehicle vs. non-vehicle) and 3410 images for subsequent type classification as one of type = {car, SUV, LGV, HGV}, using a randomized 66% to 33% split between training and validation sets. The distribution of vehicle types within the data set used is {car = 2158, SUV = 315, LGV = 431, HGV = 506}, representing a basic a priori likelihood of occurrence. SVM are trained using a Radial Basis Function (RBF) kernel {SVM_RBF} with a grid search over kernel parameter, γ = 2^x : x ∈ {−15, 3}, and model fitting cost, c = 2^x : x ∈ {5, 15}, using k-fold cross validation (k = 5). The results for the best performing parameter set are reported for each feature configuration in Section 3. Two stages of classification are performed on each feature vector: primary classification as {vehicle, non-vehicle}, with non-vehicle encompassing both pedestrians (as per [1,2]) and other scene objects, then secondary vehicle type classification as one of type = {car, SUV, LGV, HGV}.

Convolutional Neural Network
Motivated by the work of [25] and current trends in convolutional neural networks (CNN), we evaluate a full CNN pipeline for this task. Unlike traditional feature-driven approaches (Sections 2.2.1 & 2.2.2) that rely on a secondary stage of generic classification (Section 2.2.3) (so called "shallow architectures"), we employ a CNN approach for the entire feature extraction, representation and classification process (denoted as "deep architectures"). More specifically, with the use of a transfer learning approach [47], we optimize the CNN structures designed by Krizhevsky et al. [25] and Szegedy et al. [26] by fine-tuning their convolutional and fully-connected layers for the full end-to-end feature extraction to classification pipeline within this problem domain.
Unlike traditional neural networks with conventionally one or two hidden layers, modern CNN can include many more hidden layers [26,48,49] comprising varying characteristics: convolutional layers (feature extraction), fully connected layers (intermediate representation), pooling layers (dimensionality reduction) and non-linear operators (sigmoid, hyperbolic functions and rectified linear units). This complexity of parameterization, and hence representational capacity, makes CNN susceptible to over-fitting in the traditional sense. To overcome this issue, a number of techniques are employed to ensure generality of the learned parameterization of the target problem. Within the network, convolutional layers are usually interleaved with pooling layers which down-sample the current representation (image) and hence reduce the number of parameters in addition to improving overall computational efficiency. Furthermore, drop out, whereby hidden neurons are randomly removed during the training process, and shared weights are used to avoid over-fitting such that performance dependence on individual network elements is reduced in favor of collective error reduction. In addition, with the use of the generalized technique called transfer learning, initial CNN parameterization (training) towards a generalized object classification task can then be further optimized (fine tuned) towards a domain specific classification task.
Presently, such CNN are designed manually, with the resulting network parameterization trained using a stochastic gradient descent approach with varying parameters such as batch size, weight decay, momentum and learning rate over a huge data set (typically 10^6 in size). Current state of the art CNN models, such as those designed by Krizhevsky et al. [25], Zeiler and Fergus [50], Szegedy et al. [26] and Simonyan and Zisserman [49], are trained on a huge data set such as ImageNet [51], which contains approximately one million data samples and 1000 distinct class labels. However, the limited applicability of such training and parameter optimization techniques to problems where such large data sets are not available gives rise to the concept of transfer learning [52,53]. The work of [54] illustrated that each hidden layer in a CNN has distinct feature representation related characteristics, among which the lower layers provide general feature extraction capabilities (akin to Gabor filters and the like), whilst higher layers carry information that is increasingly more specific to the original classification task. This finding facilitates the verbatim re-use of the generalized feature extraction and representation of the lower layers in a CNN, whilst higher layers are fine tuned towards secondary problem domains with related characteristics to the original. Using this paradigm, we can leverage the a priori CNN parameterization of an existing fully trained network, on a generic 1000+ object class problem (from [55]), as a starting point for optimization towards the specific problem domain of limited vehicle type classification. Instead of designing a new CNN with random parameter initialization, we adopt a pre-trained CNN and fine tune its parameterization towards our specific classification domain. Specifically, we make use of the CNN configuration designed by Krizhevsky et al. [25], having 5 convolutional layers and 3 fully-connected layers with ∼60 million parameters and ∼650,000 neurons, trained over the ImageNet data set on an image classification problem in the ILSVRC-2012 competition (denoted as AlexNet). We also employ the network structure proposed by Szegedy et al. [26], which won the ILSVRC 2014 competition (denoted as GoogLeNet). This second network is designed using many more layers (22) but with 12 times fewer network parameters compared to AlexNet to reduce the computational complexity of training a wide and deep network, while achieving promising performance results. Their approach is to first convolve each input by 1 × 1, 3 × 3, 5 × 5 filters in parallel (named as the inception module) to perform dimensionality reduction before being fed into subsequent more computationally expensive convolutional layers.
From this point we then perform the fine-tuning (transfer learning) approach to both networks to train over the infra-red vehicle type data set (as detailed in Section 2.2.3) using backpropagation via stochastic gradient descent [25].

Photogrammetric Position Estimation
Firstly, we present a brief recap of our baseline localization approach as presented in [1] and subsequently show how this can be extended to address type variation within detected vehicle targets.
Based on automated detection (Section 2.2), target position is initially known within "sensor space" (i.e. pixel position within the image). Consequently, target position is estimated based on the principles of photogrammetry together with knowledge of the perspective transform under which targets are imaged and an assumption on the physical (real-world) dimension of a target in one plane [1]. All targets are imaged under a standard perspective projection [31] as follows:

\[ x = f\,\frac{X}{Z}, \qquad y = f\,\frac{Y}{Z} \tag{1} \]

where real-world object position, (X, Y, Z), in 3D scene co-ordinate space is imaged at image pixel position, (x, y), in pixel co-ordinate space for a given camera focal length, f. We assume both positions are the centroid of the object with (x, y) being the centre of the bounding box, of the image sub-region, for a target (object) detected in the scene (Section 2.1, e.g. Figure 1).
With knowledge of the camera focal length, f, the original object (target) position, (X, Y, Z), can be recovered based on (assumed) knowledge of either object width, X, or object height, Y (i.e. the difference in minimum and maximum positions in each of these dimensions for the object). From the bounds of the detected targets (Section 2.2) we can readily recover the corresponding object width, x, and object height, y, in the image. Based on this knowledge, rearranging and substituting into Eqn. 1 we can recover the depth (distance to target, Z) of the object position as follows:

\[ Z = f'\,\frac{Y}{y} \tag{2} \]

Knowing Z via Eqn. 2, we can now substitute back into Eqn. 1 and, with knowledge of the object centroid in the image, (x, y), we can recover both X and Y, resulting in full recovery of real-world target position, (X, Y, Z), relative to the camera. In Eqn. 2, f' represents focal length, f, translated from standard units, mm, to focal length measured in pixels:

\[ f' = \frac{\mathit{width}_{image} \times f}{\mathit{width}_{sensor}} \tag{3} \]

where width_image represents the width of the image (pixels) and width_sensor represents the camera digital (CCD) sensor width (mm).
Crucially, if we now assume a fixed width, X, or height, Y, for our object we can recover complete 3D scene position relative to the camera. For vehicle targets we can assume an average height for a given vehicle type determined from earlier vehicle type classification (as projected vehicle height, y, does not vary with viewing angle of the vehicle in the plane). Despite commonly held beliefs, empirical study has shown height variation within a given type classification of vehicle to be minimal [8]. In this study we use Y = {height_car, height_SUV, height_LGV, height_HGV} for {height_car = 1.5 m, height_SUV = 1.8 m, height_LGV = 2.1 m, height_HGV = 2.9 m} based on statistical evaluation of a moderate pool of vehicles. Following in a similar vein to the argument presented in [1] with regard to human height for pedestrians, this translates into a Z position error, attributable to vehicle height variation within a given type class, that is within GPS error tolerances (±5m, [56]) for at least ranges up to 60m from the sensor.
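The localization step can be illustrated with a short sketch that applies the pinhole relations described above (depth from an assumed real-world height and the observed bounding box height, then back-substitution for X and Y). The per-type heights are those quoted in the text; everything else is an assumption for illustration, including the function names, the camera parameters (25 mm lens, 16 mm sensor width, 640 × 480 image) and the choice of the image centre as the principal point.

```python
# Assumed average vehicle heights (metres), as quoted in the text.
VEHICLE_HEIGHT_M = {"car": 1.5, "SUV": 1.8, "LGV": 2.1, "HGV": 2.9}


def focal_length_pixels(f_mm, image_width_px, sensor_width_mm):
    """Convert focal length from mm to pixels: f' = width_image * f / width_sensor."""
    return image_width_px * f_mm / sensor_width_mm


def locate_target(bbox, vehicle_type, f_px, image_size):
    """Estimate (X, Y, Z) in metres, relative to the camera, from a bounding box.

    bbox: (left, top, width, height) in pixels, origin at image top-left.
    The principal point is assumed to be the image centre (an assumption).
    """
    left, top, w, h = bbox
    img_w, img_h = image_size
    # Bounding box centroid relative to the image centre, y pointing up.
    x = (left + w / 2.0) - img_w / 2.0
    y = img_h / 2.0 - (top + h / 2.0)
    # Depth from assumed real-world height and observed pixel height.
    Z = f_px * VEHICLE_HEIGHT_M[vehicle_type] / h
    # Back-substitute into the perspective projection for X and Y.
    X = x * Z / f_px
    Y = y * Z / f_px
    return X, Y, Z


# Hypothetical camera: 25 mm lens, 16 mm wide sensor, 640x480 image.
f_px = focal_length_pixels(25.0, 640, 16.0)
bbox = (370, 215, 120, 50)
for vehicle_type in ("car", "SUV"):
    print(vehicle_type, locate_target(bbox, vehicle_type, f_px, (640, 480)))
```

With these illustrative numbers the same 50 pixel high detection is placed at roughly 30 m if classified as a car but roughly 36 m if classified as an SUV, which shows why the type classification directly governs the recovered depth.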

3D Tracking
Unlike conventional tracking approaches that track 2D position, (x, y), within the image itself [57], our photogrammetric recovery of target position within the scene, (X, Y, Z) (Section 2.3) facilitates 3D tracking within scene space.This can be accomplished as tracking "within the plane" based on horizontal target position within the scene, X, and distance to target, Z, or full 3D scene space tracking including target elevation (vertical position), Y.
For each candidate region identified as a new foreground object (Section 2.1), we initially create a new 2D track-let based on localized frame to frame connectivity derived from sparse optic flow [58,59]. If one of the frame samples for this object is subsequently classified as vehicle (via the approach outlined in Section 2.2), this target transitions from a 2D tracked instance within image space to a 3D tracked vehicle within scene space. The tracked position, based on photogrammetric position recovery (Section 2.3), can then be propagated over earlier instances of the same object, similarly transitioning the motion history of this instance from 2D image position to 3D scene position. If an identified foreground object is not classified as being a vehicle, its tracking remains within 2D image space until either its spatio-temporal filtered classification returns a vehicle classification (as per [1,3]) or it leaves the scene. Tracking within 3D scene space is performed using Kalman filter based tracking [60] on either a state vector comprising position and velocity "within the plane", s = (X, Z, v_X, v_Z)^T, or within R^3 scene space, s = (X, Y, Z, v_X, v_Y, v_Z)^T. Scene and measurement noise within the Kalman formulation are estimated empirically.
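A constant-velocity Kalman filter over the in-plane state can be sketched as below. This is a simplified illustration, not the tracker used in the paper: it assumes a constant-velocity motion model and noise that is independent between axes, under which the filter on s = (X, Z, v_X, v_Z)^T decouples into one small position/velocity filter per axis. The class name and all noise values are hypothetical (the text states only that noise terms are estimated empirically).

```python
class AxisKalman:
    """Constant-velocity Kalman filter for one scene axis (position, velocity)."""

    def __init__(self, position, q_pos=0.01, q_vel=0.1, r=1.0):
        self.p, self.v = position, 0.0          # state estimate
        self.P = [[r, 0.0], [0.0, 100.0]]       # covariance: velocity unknown
        self.q_pos, self.q_vel, self.r = q_pos, q_vel, r

    def predict(self, dt):
        """Propagate state and covariance: P <- F P F^T + Q, F = [[1, dt], [0, 1]]."""
        (a, b), (c, d) = self.P
        self.p += self.v * dt
        self.P = [[a + dt * (b + c) + dt * dt * d + self.q_pos, b + dt * d],
                  [c + dt * d, d + self.q_vel]]

    def update(self, z):
        """Correct with a position-only measurement z (H = [1, 0])."""
        (a, b), (c, d) = self.P
        s = a + self.r                          # innovation covariance
        k0, k1 = a / s, c / s                   # Kalman gain
        innovation = z - self.p
        self.p += k0 * innovation
        self.v += k1 * innovation
        self.P = [[(1 - k0) * a, (1 - k0) * b],
                  [c - k1 * a, d - k1 * b]]


# Track a target approaching at 1 m per frame with a fixed lateral offset.
x_filter, z_filter = AxisKalman(3.0), AxisKalman(60.0)
for frame in range(1, 21):
    for kf, measurement in ((x_filter, 3.0), (z_filter, 60.0 - frame)):
        kf.predict(dt=1.0)
        kf.update(measurement)
print((x_filter.p, z_filter.p, x_filter.v, z_filter.v))
```

Extending the sketch to full scene-space tracking under the same independence assumption only requires a third filter instance for Y; correlated noise between axes would instead need the full matrix form.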

EVALUATION
Our results are presented using both quantitative measures of classification accuracy (Tables 1 and 2) and qualitative assessment of classification performance over a range of exemplar scenarios (Figures 1, 4). All evaluation imagery is captured using an un-cooled infra-red camera (Thermoteknix Miricle 307k, spectral range: 8-12µm) with statistical performance measured using a validation test set of 2351 vehicle/non-vehicle images and 1126 vehicle sub-type images drawn from the same variation and environmental conditions as used for training (random 33% validation, as detailed in Section 2.2.3). Evaluation was performed around a variety of urban/industrial (cluttered) and suburban environments as part of work carried out in [19]. Within the feature detector, descriptor and classification variants outlined, we consider the comparison of True Positive Rate (TP) and False Positive Rate (FP) (as percentages) together with the Precision (P), Accuracy (A) and F-score (F) (harmonic mean of precision and true positive rate) for primary vehicle target classification (Table 1), and mean average precision (mAP) (mean of precision across all possible class labels) in addition to both mean accuracy (A) and F-score (F) for secondary vehicle type classification (type = {car, SUV, LGV, HGV}, Table 2).
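For reference, the measures named above can be computed from raw counts as in the following sketch. The function names and the example counts are hypothetical and are not results from this evaluation; the mean of per-class precision follows the definition of mAP given in the text, with the confusion matrix assumed to have true classes as rows and predicted classes as columns.

```python
def binary_metrics(tp, fp, tn, fn):
    """TP rate, FP rate, precision, accuracy and F-score from raw counts."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    # F-score: harmonic mean of precision and true positive rate.
    f_score = 2 * precision * tpr / (precision + tpr)
    return {"TP": tpr, "FP": fpr, "P": precision, "A": accuracy, "F": f_score}


def mean_class_precision(confusion):
    """Mean of per-class precision; confusion[i][j] = true class i predicted as j."""
    n = len(confusion)
    precisions = []
    for j in range(n):
        predicted_as_j = sum(confusion[i][j] for i in range(n))
        precisions.append(confusion[j][j] / predicted_as_j if predicted_as_j else 0.0)
    return sum(precisions) / n


# Hypothetical counts, for illustration only.
print(binary_metrics(tp=90, fp=5, tn=95, fn=10))
print(mean_class_precision([[8, 2], [1, 9]]))
```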
From Table 1 we can see that CNN offer the best performance for primary vehicle classification (GoogLeNet, F-score of 0.993 and FP of only 1.0%, followed closely by AlexNet with slightly higher FP, 1.3%). Traditional HOG with SVM classification also gives very strong results (F-score of 0.98, FP of 2.2%), with the best bag of visual words approach (FAST feature detection with (slow) SIFT feature descriptor) coming in 4% lower across all vocabulary sizes (k_v) explored. The next best bag of visual words approach, FAST feature detection with DAISY feature descriptor, gives a 2.5% lower score despite the density of DAISY features. Variation in vocabulary size generally appears to make negligible difference to performance.
From Table 2 we can see that the more difficult task of recognizing vehicle sub-types leads to a greater spread of performance between varying approaches. Again, we see that CNN offer the best performance (GoogLeNet, mAP of 0.94 / accuracy of 0.95) but that traditional HOG with SVM classification (mAP of 0.94 / accuracy of 0.93 / F-score of 0.88) outperforms the CNN AlexNet architecture (mAP of 0.85). However, all three approaches (GoogLeNet, HOG-SVM and AlexNet) significantly outperform the best bag of visual words approach (FAST feature detection with SIFT feature descriptor, mAP of 0.78). Within this bag of visual words approach (and some others) we can see that increasing vocabulary size appears to make a notable difference to performance. The normalized inter-class confusion matrices presented in Figure 3 show the greatest cross-label confusion for the {SUV, LGV} type vehicles against the car vehicle type and additionally between the LGV and HGV vehicle types, with the CNN approach (Figure 3 left) notably outperforming the HOG-SVM combination (Figure 3 right) in these cases.
Overall we see the prevalence of dense features (i.e. CNN, HOG) over the traditional bag of visual words approaches for these two classification tasks, with the best performing bag of visual words approach also using FAST feature detection, which is known to produce a higher density of feature points within the image. Within the two stage automated visual surveillance framework used here (Section 2.1), with features extracted only within the isolated candidate regions of the scene, all are achievable within the bounds of real-time operation [19,27].
This quantitative statistical evaluation (Tables 1 and 2) is further supported by the qualitative results presented in Figures 1, 4-6, which illustrate extracts from vehicle type classification using HOG features with SVM classification and subsequent tracking sequences (using only the CPU computation available with the deployed sensor nodes [27]). These images are sequentially sub-sampled from the test scenarios with tracking and spatio-temporal detection performed as outlined in [1]. Within each sub-figure (Figures 1, 4-6 A-H) we present the detected vehicle(s) using a bounding box, associated 2D image projection of the track (A-H insets, right), the planar view of the {Y/Z} tracked position relative to the camera (A-H insets, left) and the resulting temporally filtered vehicle type classification distribution (A-H, inset bottom).
From Figures 1 and 4 we can see that the accuracy and continuity of the {Y/Z} position localization of the vehicle from standard photogrammetric techniques [1] (shown in A-H left, Figures 1 & 4) is consistent over varying vehicle types. Variation in vehicle viewing angle to the sensor in Figure 1 (e.g. transitions A→B, C→F and G→H) and Figure 4 (e.g. transitions A→D, E→F) shows no significant erroneous jumps in the spatial locality of the vehicle target when the planar view of the {Y/Z} tracked position history is considered. This is further illustrated in Figures 5 and 6, where we see two sequences of consistent HGV type vehicle tracking from differing viewpoints (Figure 5, transitions A→E, F→H) and consistent tracking of a larger HGV over an extended distance including a change in viewpoint (Figure 6, transitions A→E, F). As shown in Figure 4 (transition E→F, G) and Figure 6 (G, H), vehicle type misclassification (confusion) largely occurs between the {car, SUV, LGV} vehicle types, dependent on viewpoint and distance to target in the scene. Intra-class variation between these classes is clearly visible in the vehicle configurations of Figure 4 (transition E→F, G) and Figure 6 (G, H), where the configuration of the vehicle type is either ambiguous due to viewpoint (Figure 4) or atypical of any such vehicle type (Figure 6). Overall, our use of vehicle classification by type is shown to facilitate effective compensation for variations in both vehicle dimension and viewing angle for the purposes of photogrammetric based localization (Figures 1, 4-6). Under evaluation conditions, local GPS accuracy was found to be ±5 m, based on a consumer GPS unit [56], and secondary verification of vehicle position from a concurrently deployed active range sensor [61] (as part of [27]) showed the photogrammetric localization recovered to be within this bound in the majority of test cases.

[Results table fragment; column headings not recovered. Bag of visual words rows (SVM RBF, kv = 500, 1000, 1500, 2000): values not recovered. HOG [45] with SVM RBF: 97.9, 2.2, 0.98, 0.98, 0.98. CNN (AlexNet [25], GoogLeNet [26]), in extraction order: 99.9, 1.3, 0.99, 0.99, 0.99 and 99.7, 1.0, 0.99, 0.99, 0.99.]
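To make the role of vehicle type in photogrammetric localization concrete, the following is a minimal sketch assuming a simple pinhole camera model in which range follows from the ratio of an assumed real-world vehicle height to its bounding box height in pixels. The nominal heights, focal length, principal point and function names are illustrative assumptions and are not taken from the paper or from [1].

```python
import math

# Assumed nominal vehicle heights in metres per type (illustrative values only).
NOMINAL_HEIGHT_M = {"car": 1.5, "SUV": 1.8, "LGV": 2.3, "HGV": 3.8}


def localize(vehicle_type, box_height_px, box_centre_u_px,
             focal_px, principal_u_px):
    """Estimate planar position from a bounding box under a pinhole model.

    Returns (lateral offset, range) in metres: range = f * H / h, and the
    lateral offset follows from the horizontal pixel offset at that range.
    """
    if box_height_px <= 0:
        raise ValueError("bounding box height must be positive")
    range_m = focal_px * NOMINAL_HEIGHT_M[vehicle_type] / box_height_px
    lateral_m = (box_centre_u_px - principal_u_px) * range_m / focal_px
    return lateral_m, range_m


def within_bound(estimate, reference, bound_m=5.0):
    """Check an estimate against a reference position, e.g. a ±5 m GPS bound."""
    error = math.hypot(estimate[0] - reference[0], estimate[1] - reference[1])
    return error <= bound_m


# Usage: the same 30 px tall detection localized under two type hypotheses.
as_car = localize("car", 30, 420, focal_px=800, principal_u_px=320)
as_hgv = localize("HGV", 30, 420, focal_px=800, principal_u_px=320)
reference = (6.0, 43.0)  # hypothetical ground-truth position in metres
car_ok = within_bound(as_car, reference)
hgv_ok = within_bound(as_hgv, reference)
```

Under these assumptions the range estimate scales linearly with the assumed vehicle height, so a type error between classes of very different size moves the estimate well outside the reference bound, while confusion between similarly sized types has a smaller effect.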

CONCLUSIONS
Overall we have shown that the use of Convolutional Neural Network (CNN) or Histogram of Oriented Gradient (HOG) feature based classification facilitates the most effective determination of vehicle type to enable improved 3D localization and tracking within infra-red imagery based on the principles of photogrammetry. This directly advances the generality of prior work in the field on pedestrian localization in the presence of posture variation [1][2][3] by additionally facilitating vehicle localization from the same infra-red sensing modality within a deployed sensor network [19]. Within the context of passive target localization in infra-red thermal imagery, and the general use of passive sensing for geolocated target tracking in wide-area sensor networks [19], this work similarly extends the argument in favour of passive sensor utilization within the bounds of acceptable accuracy. This is supported by a strong statistical evaluation over a number of variations on current state of the art classification approaches, with CNN and HOG features outperforming traditional bag of visual words based approaches for this task. This work further strengthens the application of generalized target tracking within 3D scene-space, which facilitates the ready disambiguation of multiple target tracking scenarios using low-complexity approaches with reduced computational overheads [1]. Our approach is demonstrated over multiple scenarios in cluttered environments, where vehicle type classification is clearly illustrated as an enabler of the passive localization of vehicles.
Future work will investigate the extension of this approach to the recovery of vehicle and pedestrian interactions to inform human/vehicle activity classification [4,59,62], and also its applicability within the context of mobile platform navigation [63][64][65][66], driver assistance systems [67,68] and multi-platform, multi-modal wide-area search and surveillance tasks [5,69,70].

Figure 1. Examples of real-time vehicle detection, type classification and tracking in infra-red imagery with associated geo-referenced 3D track (based on using HOG features with SVM classification).

Figure 2. Training data examples: vehicle types {car, SUV, LGV, HGV}.

Figure 3. Normalized inter-class confusion matrices for vehicle type classification for both CNN (GoogLeNet, left) and HOG features with SVM classification (right).

Figure 4. Examples of real-time vehicle detection, type classification and tracking in infra-red imagery with associated geo-referenced 3D track (based on using HOG features with SVM classification).

Figure 5. Examples of real-time vehicle detection, type classification and tracking in infra-red imagery with associated geo-referenced 3D track (based on using HOG features with SVM classification).

Figure 6. Examples of real-time vehicle detection, type classification and tracking in infra-red imagery with associated geo-referenced 3D track (based on using HOG features with SVM classification).

Table 1. Results of feature and classification variants for primary vehicle classification.