Evaluation of a Dual Convolutional Neural Network Architecture for Object-wise Anomaly Detection in Cluttered X-ray Security Imagery

X-ray baggage security screening is widely used to maintain aviation and transport security. Of particular interest is automated security X-ray analysis for particular classes of object such as electronics, electrical items and liquids. However, manual inspection of such items is challenging when dealing with potentially anomalous items. Here we present a dual convolutional neural network (CNN) architecture for automatic anomaly detection within complex security X-ray imagery. We leverage recent advances in region-based (R-CNN) and mask-based (Mask R-CNN) CNN architectures, together with detection architectures such as RetinaNet, to provide object localisation variants for specific object classes of interest. Subsequently, leveraging a range of established CNN object and fine-grained category classification approaches, we formulate within-object anomaly detection as a two-class problem (anomalous or benign). Whilst the best performing object localisation method achieves 97.9% mean average precision (mAP) over a six-class X-ray object detection problem, subsequent two-class anomaly/benign classification achieves 66% accuracy for within-object anomaly detection. Overall, this performance illustrates both the challenge and promise of object-wise anomaly detection within the context of cluttered X-ray security imagery.


I. INTRODUCTION
X-ray baggage security screening is widely used to maintain aviation and transport security, itself posing a significant image-based screening task for human operators reviewing compact, cluttered and highly varying baggage contents within limited time-scales. With both increased passenger throughput in the global travel network and an increasing focus on broader aspects of extended border security (e.g. freight, shipping, postal), this poses both a challenging and timely automated image classification task.
To facilitate effective screening, threat detection via scanned X-ray imagery is increasingly employed to provide a non-intrusive, internal view of scanned baggage, freight, and postal items. This produces colour-mapped X-ray images which correspond to the material properties detected via the dual-energy X-ray scanning process [1]. While current automatic threat detection within X-ray security screening concentrates on material discrimination for explosive-related threats [1], a growing body of work illustrates the potential of CNN architectures for broader object-based threat detection [2]-[4]. In both cases, threat detection performance must be characterised by high detection and low false alarm rates for operational viability. Within this context, of particular interest are electronics, electrical items and liquids [5]. Not only do these items come in many evolving variants, but they are additionally packed in complex and cluttered surroundings, leading to a complex X-ray image interpretation problem.
Whilst existing security scanners use dual-energy X-ray for materials discrimination, and highlight specific image regions matching existing threat material profiles [6], [7], the detection of generalized anomalies within complex items remains challenging [8] (e.g. Fig. 1).
Within machine learning, anomaly detection involves learning a pattern or distribution of normality for a given data source and thus detecting significant deviations from this norm [9]. Anomaly detection is an area of significant interest within computer vision, spanning biomedical imaging [10] to video surveillance [11]. In our consideration of X-ray security imagery, we are looking for abnormalities that indicate concealment or subterfuge whilst working against a real-world adversary who may evolve their strategy to avoid detection. Such anomalies may present (or conceal) themselves within appearance space in the form of an unusual shape, texture or material density (i.e. dual-energy X-ray colour) [12]. Alternatively, they may present themselves in a semantic form, through the appearance of unfamiliar objects, either globally or locally, within the X-ray image [13].
Considering the notable challenge of detecting such subtle anomalies globally within the image, we instead follow a human-like approach to illustrate an automated pipeline for the object-wise screening of such items: first focus on locating the object within the scene (image), then determine whether or not the object is anomalous.
By leveraging recent advances in object detection and classification in X-ray security imagery [2]-[4], we propose a dual CNN architecture to firstly isolate liquid and electrical objects by type and subsequently screen them for abnormalities. The main contribution of this work is a dual CNN architecture for object-wise anomaly detection within cluttered X-ray security imagery, which leverages state-of-the-art joint object detection and segmentation [14] for first stage object localisation and subsequently considers second stage anomaly detection as a simple two-class CNN classification problem.

II. RELATED WORK
There has been a steady increase in research work considering object-based threat detection in X-ray baggage security imagery. Rogers et al. [1] perform a comprehensive review of the field, including baggage and cargo imagery. In this section, we focus on supervised learning for automated threat detection and anomaly detection within X-ray security imagery.

A. Automated Threat Detection in X-ray Imagery
Early work on X-ray security images is based on handcrafted features such as Bag-of-Visual-Words (BoVW), applied together with a classifier such as a Support Vector Machine (SVM), achieving a performance of 0.7 recall, 0.29 precision, and 0.57 average precision [15]. Turcsany et al. [16] extend the approach by using BoVW with a SURF descriptor and an SVM classifier, yielding 0.99 true positive and 0.04 false positive rates. Subsequently, BoVW is further evaluated for single and dual-view X-ray images [17], with optimal average precision achieved for firearms (0.95) and laptops (0.98). The various feature point descriptors within BoVW are explored thoroughly in the work of [18], where the best combination achieves 94.0% accuracy on a two-class firearm detection problem using an SVM classifier.
Recent CNN-based deep learning architectures [19]-[22] have significantly improved object detection in X-ray security imaging [4], [23], [24]. Earlier work on CNNs in X-ray imaging [2] explores the use of transfer learning from another network trained on a classification task. Experiments show that a CNN with transfer learning achieves superior performance, with a 99.26% true positive rate and a 4.08% false positive rate, by fine-tuning the network alone. Broader experimentation in [4] empirically demonstrates the superiority of fine-tuned CNNs over classical machine learning algorithms.

B. Automated Anomaly Detection in X-ray Imagery
Sterchi et al. [25] show that security officers are better able to detect abnormalities when they focus on verifying each object in the bag as benign rather than concentrating on threat items. By the same analogy, the anomaly detection algorithms proposed in the field are trained on benign samples to learn what is normal and tested on both normal and abnormal images to detect threats.
Prior work on appearance and semantic anomaly detection has considered unique feature representation as a critical component for detection within cluttered X-ray imagery [26]. Early work on anomaly detection in X-ray security imagery [27] implements block-wise correlation analysis between two temporally aligned scanned X-ray images. More recently [28], anomalous X-ray items within freight containers have been detected using auto-encoder networks, and additionally via the use of convolutional neural network (CNN) extracted features as a learned representation of normality across stream-of-commerce parcel X-ray images [26]. Andrews et al. [29] propose representational learning for anomaly detection within cargo container imagery. In a similar vein, the work of [30] focuses on the use of a novel adversarial training architecture to detect anomalies based on high reconstruction errors produced by a generator network adversarially trained on non-anomalous (benign) stream-of-commerce X-ray imagery only. In follow-up work, [31] proposes another unsupervised anomaly detection approach, whereby a skip-connected layer design allows training on much higher resolution images, and optimising the latent space within the discriminator network leads to significantly better results.
By contrast, here we consider a two-stage approach that first isolates potential objects of interest within the X-ray security image, as an object detection and classification problem (Section III-A), prior to secondary anomaly detection via application of CNN based image classification (Section III-B).

III. PROPOSED APPROACH
Our dual CNN architecture performs two stages of analysis: (a) primary object detection within the X-ray image (Section III-A); and (b) secondary classification of each detected object via a two-class, {anomaly, benign}, classification formulation (Section III-B). An overview of our dual CNN architecture is shown in Fig. 2.
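The two-stage flow described above can be sketched in a few lines of Python. This is only an illustrative outline under stated assumptions, not the authors' implementation: the `Detection` type, the `dual_cnn_screen` function, the `min_score` cut-off and the toy detector and classifier are all hypothetical stand-ins for trained networks.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixel coordinates
Image = List[List[float]]        # toy image: rows of pixel values


@dataclass
class Detection:
    label: str    # object class from the first stage, e.g. "laptop"
    box: Box      # localised region within the full image
    score: float  # detector confidence


def crop(image: Image, box: Box) -> Image:
    """Cut the detected region out of the full image."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def dual_cnn_screen(
    image: Image,
    detector: Callable[[Image], Sequence[Detection]],
    classifier: Callable[[Image], str],
    min_score: float = 0.5,  # hypothetical confidence cut-off
) -> List[Tuple[Detection, str]]:
    """Stage 1: localise objects. Stage 2: label each crop anomaly/benign."""
    results = []
    for det in detector(image):
        if det.score < min_score:
            continue  # ignore low-confidence detections
        results.append((det, classifier(crop(image, det.box))))
    return results


# Stand-in models so the sketch runs without any trained network.
def toy_detector(image: Image) -> List[Detection]:
    return [Detection("laptop", (1, 1, 3, 3), 0.9),
            Detection("bottle", (0, 0, 2, 2), 0.2)]


def toy_classifier(patch: Image) -> str:
    # Flags a patch as anomalous when any pixel is unusually dense.
    return "anomaly" if max(max(r) for r in patch) > 0.8 else "benign"


image = [[0.1, 0.1, 0.1, 0.1],
         [0.1, 0.2, 0.9, 0.1],
         [0.1, 0.2, 0.2, 0.1],
         [0.1, 0.1, 0.1, 0.1]]
screened = dual_cnn_screen(image, toy_detector, toy_classifier)
```

The point of the structure is that the second-stage classifier only ever sees the cropped object, never the surrounding clutter.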

A. Detection Strategy
We consider a number of contemporary CNN frameworks for our primary object detection strategy to explore their applicability and performance for generalised object detection within the context of X-ray security imagery. Namely, we consider Faster R-CNN [32], Mask R-CNN [14] and RetinaNet [33], with internal architectures as illustrated in Fig. 3. These are evaluated over a six-class object detection and localisation problem comprising {bottle, hairdryer, iron, toaster, mobile, laptop} items packed within cluttered X-ray security baggage imagery.
Faster R-CNN is based on a two-stage internal architecture [32], as shown in Fig. 3(A). The first stage consists of a Region Proposal Network (RPN) that proposes regions of interest to a secondary classification stage. The RPN consists of convolutional layers that generate a set of anchors with different scales and aspect ratios, and predict their bounding box coordinates together with a probability score denoting whether the region is an object of interest or background. Anchors are generated by using a fixed set of nine standard axis-aligned bounding boxes, in three different aspect ratios and three scales, which are defined at every location of the feature maps. These features are then fed into objectness classification and bounding box regression layers. Within the second stage, the objectness classification layer classifies whether a given region proposal is an object or a background region, while a bounding box regression layer predicts object localisation at the end of the overall detection process.
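To make the anchor scheme concrete, the following is a minimal NumPy sketch of generating nine anchors (three scales by three aspect ratios) at every feature-map location. The specific scales (128, 256, 512), ratios (0.5, 1, 2) and stride of 16 are assumptions chosen for illustration; the text above does not state the values used in these experiments.

```python
import numpy as np


def base_anchors(scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return a (9, 4) array of (x1, y1, x2, y2) boxes centred on the origin.

    Each anchor has area scale**2 and height/width equal to ratio.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            w = scale / np.sqrt(ratio)
            h = scale * np.sqrt(ratio)
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)


def tile_anchors(feat_h, feat_w, stride=16):
    """Place the nine base anchors at every feature-map location."""
    base = base_anchors()
    # Centre of each feature-map cell in input-image pixel coordinates.
    xs = (np.arange(feat_w) + 0.5) * stride
    ys = (np.arange(feat_h) + 0.5) * stride
    cx, cy = np.meshgrid(xs, ys)
    centres = np.stack([cx, cy, cx, cy], axis=-1).reshape(-1, 1, 4)
    # Broadcast (locations, 1, 4) + (1, 9, 4) -> (locations, 9, 4).
    return (centres + base[None, :, :]).reshape(-1, 4)


anchors = tile_anchors(feat_h=2, feat_w=3)  # 2 * 3 locations * 9 anchors
```

In a full detector, the RPN would score each of these boxes for objectness and regress offsets to refine them.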
Mask R-CNN is an extension of the Faster R-CNN architecture for combined object localisation and instance segmentation of image objects [14]. Mask R-CNN similarly relies on region proposals which are generated via a region proposal network. Mask R-CNN follows the Faster R-CNN model of a feature extractor followed by this region proposal network, followed by an operation known as ROI-Pooling to produce standard-sized outputs suitable for input into a secondary classifier. The main differences between Mask R-CNN and Faster R-CNN lie in three factors. Firstly, Mask R-CNN replaces the ROI-Pooling operation used in Faster R-CNN with an operation called ROI-Align that allows very accurate instance segmentation masks to be constructed. Secondly, Mask R-CNN adds a network head (a small fully convolutional neural network) to produce the desired instance segmentation, as in Fig. 3(B). Finally, segmentation and classification label predictions are decoupled; the mask network head predicts the instance segmentation independently from the network head predicting the classification label for the object that is being segmented.
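The difference between the two pooling operations can be illustrated with a single feature lookup. The sketch below is a deliberate simplification and not the Mask R-CNN implementation: it contrasts a quantised lookup (in the spirit of ROI-Pooling) with bilinear interpolation at an exact fractional coordinate (in the spirit of ROI-Align). The function names are hypothetical, and a real ROI-Align layer samples several such points per output bin and aggregates them.

```python
import numpy as np


def roi_pool_cell(feature, y, x):
    """ROI-Pooling style lookup: round the coordinate down to a grid cell."""
    return feature[int(np.floor(y)), int(np.floor(x))]


def roi_align_sample(feature, y, x):
    """ROI-Align style lookup: bilinear interpolation at the exact (y, x)."""
    h, w = feature.shape
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)  # clamp at the border
    dy, dx = y - y0, x - x0
    top = (1 - dx) * feature[y0, x0] + dx * feature[y0, x1]
    bottom = (1 - dx) * feature[y1, x0] + dx * feature[y1, x1]
    return (1 - dy) * top + dy * bottom


feature = np.array([[0.0, 1.0],
                    [2.0, 3.0]])
pooled = roi_pool_cell(feature, 0.5, 0.5)      # snaps to cell (0, 0)
aligned = roi_align_sample(feature, 0.5, 0.5)  # blends all four neighbours
```

Avoiding the quantisation step keeps the sampled features spatially aligned with the input, which matters for pixel-level mask prediction.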
RetinaNet is a one-stage object detector proposed by Lin et al. [33], who identify class imbalance as the critical reason why the performance of single-stage detector architectures such as YOLO [34] and SSD [35] lags behind that of two-stage detector architectures such as Faster R-CNN and Mask R-CNN. To improve performance, RetinaNet employs a novel loss function called Focal Loss, which down-weights the loss from easily classified samples so that training focuses on hard samples despite the class imbalance. Using a one-stage network architecture with Focal Loss, RetinaNet achieves state-of-the-art performance in terms of accuracy and running time. Fig. 3(C) depicts the overall architecture of RetinaNet, which is composed of a backbone network and two subnetworks. The backbone network is responsible for computing a convolutional feature map using the Feature Pyramid Network (FPN) over an entire input image. Subsequently, the first subnet performs label classification on the backbone's output, while the second subnet performs convolutional bounding box regression (i.e. localisation). The focal loss is applied as the loss function, as shown in Fig. 3(C).
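For reference, the binary form of the focal loss from Lin et al. is FL(p_t) = -α_t (1 - p_t)^γ log(p_t), where p_t is the predicted probability of the true class. The NumPy sketch below evaluates it per sample; the defaults γ = 2 and α = 0.25 are the commonly quoted values from that work and are an assumption here, since the settings used in these experiments are not stated.

```python
import numpy as np


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss per sample.

    probs:   predicted probability of the positive class, in (0, 1)
    targets: ground-truth labels, 1 for object and 0 for background
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets)
    # p_t is the probability the model assigns to the true class.
    p_t = np.where(targets == 1, probs, 1.0 - probs)
    alpha_t = np.where(targets == 1, alpha, 1.0 - alpha)
    # (1 - p_t) ** gamma shrinks the loss of well-classified samples.
    return -alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)


# An easy background sample (p = 0.05) versus a hard one (p = 0.7).
easy, hard = focal_loss([0.05, 0.7], [0, 0])
# With gamma = 0 and alpha = 0.5 the loss is half the usual cross-entropy.
ce_half = focal_loss([0.7], [1], gamma=0.0, alpha=0.5)[0]
```

The easy background sample contributes far less loss than the hard one, which is the mechanism that stops the many trivial background anchors from dominating training.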

B. Classification Strategy
After detecting the candidate objects of interest within X-ray security imagery based on our detection strategy (Section III-A), our secondary classification strategy determines whether the object localised within the image is {anomaly, benign} as a two-class classification problem.
We specifically adopt fine-grained classification for anomaly detection, where we define benign and anomalous as sub-categories (sub-classes) of the primary object type detected (Section III-A). Within the literature, fine-grained classification usually aims to distinguish subordinate visual categories of a main object class, such as natural categories like species of birds [40], [41], dogs [42] and plants [43].
In the case of our {anomaly, benign} classification problem, the key to successful fine-grained classification lies in developing an automated method to accurately identify informative regions in an anomalous item, and whether each such region belongs to an anomalous region or benign region of the overall object.
However, labelling the discriminative regions requires significant manual annotation and is therefore difficult to scale effectively. To avoid this issue, we specifically utilise fine-grained classification that learns a discriminative filter bank within a CNN framework in an end-to-end manner, without the need for explicit additional object annotation [39]. This approach enhances mid-level representational learning within the CNN architecture by learning a set of convolution filters such that each is initialised and discriminatively trained in order to capture highly discriminative sub-image patches. Based on the VGG-16 network architecture [37], filters are additionally added at the 10th convolutional layer, representing image patches as small as 92 × 92 with a stride of 8 [39].

IV. EXPERIMENTAL SETUP
In this section, we introduce the dataset, evaluation criteria and CNN training details used in this work.

A. Dataset
We construct our dataset using single-view conventional X-ray imagery with associated false colour materials mapping from dual-energy [44]. Our X-ray images consist of benign and anomalous items, such as a laptop, mobile, toaster, iron, hairdryer and bottle. To introduce anomalies, we insert marzipan, screws, metal plates, knives and the like inside these objects, as depicted in Fig. 1. All X-ray imagery is gathered locally using a Gilardoni dual-energy X-ray scanner (FEP ME 640 AMX, [45]).
Each of the anomalous items is placed inside various cluttered baggage items, which cover the full range and dimensions found in aviation cabin baggage, ensuring the set of bags is a good representation of such items typically presented at aviation security checkpoints. In total, scanning these bags yields 3534 X-ray images.

B. Evaluation Criteria
For object detection, the performance of the models is evaluated by mean average precision (mAP), as used in the seminal object detection benchmark work of [46]. In order to calculate mAP, we calculate the intersection over union of the ground truth and detected bounding box for each detection as:
\[ \Psi(B_{gt_i}, B_{dt_i}) = \frac{\mathrm{Area}(B_{gt_i} \cap B_{dt_i})}{\mathrm{Area}(B_{gt_i} \cup B_{dt_i})} \]
where $B_{gt_i}$ and $B_{dt_i}$ are the ground truth and detected bounding box for detection $i$, respectively. Assuming each detection is unique, and denoting this ratio as $\Psi(B_{gt_i}, B_{dt_i})$, we then threshold it over the range $\theta = 0.50 : 0.05 : 0.95$, giving the logical $b_i$, where:
\[ b_i = \begin{cases} 1, & \Psi(B_{gt_i}, B_{dt_i}) > \theta \\ 0, & \text{otherwise} \end{cases} \]
Given both true positive and false positive counts as $t_i$ and $f_i$, where:
\[ t_i = \sum_{j \le i} b_j, \qquad f_i = \sum_{j \le i} (1 - b_j) \]
the precision $p_i$ and recall $r_i$ curves can be calculated as:
\[ p_i = \frac{t_i}{t_i + f_i}, \qquad r_i = \frac{t_i}{n_p} \]
where $n_p$ is the number of positive samples. We can calculate average precision (AP) based on the area under the curve of precision versus recall:
\[ AP = \int_0^1 p(r)\, dr \]
Subsequently, we can get the value of mAP by averaging the AP values over all $C$ classes:
\[ mAP = \frac{1}{C} \sum_{c=1}^{C} AP_c \]
For anomaly detection via classification, our model performances are evaluated in terms of Accuracy (A), Precision (P), Recall (R), F-score (F1%), True Positive (TP%), and False Positive (FP%).
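The detection metrics described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the benchmark's reference implementation: detections are assumed to be already sorted by descending confidence and matched one-to-one to ground truth boxes, and the area under the precision-recall curve is approximated by a simple sum of rectangles. The function names are hypothetical.

```python
import numpy as np


def iou(box_gt, box_dt):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_gt[0], box_dt[0]), max(box_gt[1], box_dt[1])
    ix2, iy2 = min(box_gt[2], box_dt[2]), min(box_gt[3], box_dt[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_gt = (box_gt[2] - box_gt[0]) * (box_gt[3] - box_gt[1])
    area_dt = (box_dt[2] - box_dt[0]) * (box_dt[3] - box_dt[1])
    return inter / (area_gt + area_dt - inter)


def average_precision(ious, n_pos, theta=0.5):
    """AP for one class from per-detection IoUs (sorted by confidence)."""
    b = (np.asarray(ious) > theta).astype(float)  # 1 if detection i matches
    t = np.cumsum(b)                               # cumulative true positives
    f = np.cumsum(1.0 - b)                         # cumulative false positives
    precision = t / (t + f)
    recall = t / n_pos
    # Area under the precision-recall curve as a sum of rectangles.
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))


def mean_average_precision(ap_per_class):
    """Average the per-class AP values."""
    return float(np.mean(list(ap_per_class.values())))


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))
ap = average_precision([0.9, 0.3, 0.8], n_pos=2)
m_ap = mean_average_precision({"laptop": 1.0, "bottle": 0.5})
```

To follow the thresholding range in the text, `average_precision` would be called once per value of `theta` from 0.50 to 0.95 in steps of 0.05 and the results averaged.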

C. Training Details
In our experiments, we use the CNN implementation of [47] for our primary object detection approach. Our models are trained on a GTX 1080Ti GPU, optimised by Stochastic Gradient Descent (SGD) with a weight decay of 0.0001, a learning rate of 0.0 and termination at 180k iterations.
ResNet 50 and ResNet 101 [20] are used as network backbones for detection. When using ResNet 101 as the backbone, we drop the learning rate by half and double the length of the training schedule to facilitate training within the memory footprint of the available GPU. We split the dataset into training (60%), validation (20%) and test (20%) sets such that each split has a similar class distribution. We also apply scaling and horizontal flipping to each sample to augment the dataset during training. All experiments are initialised with ImageNet pre-trained weights for their respective model [48]. For Faster R-CNN and Mask R-CNN, the batch size is set to 512 for the RPN.
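A class-balanced 60/20/20 split of the kind described above could be produced as in the following sketch. It is one plausible approach using only the Python standard library, not the procedure actually used for this dataset; the `stratified_split` function, the fixed seed and the toy label list are assumptions for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split sample indices into train/val/test with similar class balance."""
    # Group sample indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, val, test = [], [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_val = round(fractions[1] * len(indices))
        train += indices[:n_train]
        val += indices[n_train:n_train + n_val]
        test += indices[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


labels = ["laptop"] * 10 + ["bottle"] * 5
train, val, test = stratified_split(labels)
```

Splitting each class separately is what keeps the class proportions similar across the three sets.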

V. RESULTS AND DISCUSSION
Results are presented for each stage of our dual CNN architecture: (a) primary object detection (Table I); and (b) secondary classification of each detected object via two-class, {anomaly, benign}, classification (Table II).

A. Object Detection Results
Table I presents object detection results for six object types (classes) using Faster R-CNN, Mask R-CNN and RetinaNet applied to the X-ray security imagery dataset outlined in Section IV-A.The highlighted AP/mAP signifies the maximal results obtained for each object class.
However, there are cases where certain CNN-based detectors fail to detect objects; e.g., RetinaNet fails to detect a laptop (as illustrated in Fig. 4C), while Faster R-CNN and Mask R-CNN detect the laptop with high confidence (Figs. 4A and 4B). Overall, the Mask R-CNN architecture gives superior performance, with a ResNet 101 backbone network giving superior performance across all architectures. Mask R-CNN with ResNet 101 yields the highest mAP of 97.9%, which establishes a new benchmark for object instance segmentation within X-ray security imagery.

B. Anomaly Detection Results
Table II presents a side-by-side comparison of our secondary anomaly detection strategy (Section III-B) operating on objects that have been pre-localized (segmented) via our earlier object detection approach (dual CNN, Table II, upper) against a simplistic approach of processing the full X-ray image containing the object without any prior object localization (full image, Table II, lower). Our use of the proposed fine-grained classification approach is additionally presented for the former case, where it represents a fine-grained sub-object classification problem.
From Table II we can observe superior performance in terms of statistical accuracy (A), precision (P), recall (R) and true positive (TP) with the dual CNN architecture that offers pre-localization of the objects from the image. This observation demonstrates that the dual CNN model can satisfactorily leverage the mutual benefits of the two complementary networks. By having object localisation in the first stage, it effectively makes the feature representation more meaningful (i.e. focused) for the secondary {anomaly, benign} classification task.
However, despite this success, we additionally note a high number of false positives (FP) across Table II. Interestingly, the lowest FP comes from a full image rather than a dual CNN processing pipeline (albeit with lower overall performance than the others).
Qualitative examples of the detection and classification of the various objects are presented in Fig. 5. Our approach benefits from the performance on some classes which are usually easy to distinguish by size and shape (e.g. hairdryer), while performance is lower on smaller items, as shown in Table I.
Overall, these results illustrate the challenge of anomaly detection within X-ray security imagery at both the image and object level.

VI. CONCLUSION
In this work, we evaluate the effectiveness of a dual CNN architecture for anomaly detection across multiple item classes, {bottle, hairdryer, iron, toaster, mobile, laptop}, in cluttered X-ray security imagery. We focus on two sub-problems: firstly, leveraging recent advances in object detection to provide object localisation of threat items and, secondly, leveraging established CNN and fine-grained classification to determine whether objects are anomalous or benign. Experimentation demonstrates that fine-tuning of Mask R-CNN with ResNet 101 for X-ray imagery yields 97.9% mAP for the first stage of object detection. However, while experimental results on secondary anomaly detection via a two-class classification problem, {anomaly, benign}, show the benefits of a dual CNN architecture (TP: 76.86%, Accuracy: 66%), false positive detection remains a significant issue (FP: 10%). Overall, this illustrates that the challenges of considering anomaly detection as an object-wise classification problem, even with recent advances in object detection within X-ray security imagery [4], remain significant when considering existing non-specialised CNN architectures for this task.

Fig. 1. X-ray security imagery of exemplar electronics items with a highlighted (red box) concealed anomalous region in (A) laptop and (B) toaster.

Fig. 2. Our dual CNN architecture for object-wise anomaly detection in complex X-ray security imagery.

Fig. 4. Exemplar image cases where RetinaNet (C) fails to detect an object (laptop in blue dashed box) in X-ray image, while Faster R-CNN (A) and Mask R-CNN (B) are able to detect the object.

Fig. 5. Examples of detection and classification of anomalous objects in X-ray security imagery using Faster R-CNN, Mask R-CNN and RetinaNet.

TABLE I. OBJECT DETECTION RESULTS FOR FASTER R-CNN, MASK R-CNN AND RETINANET FOR DUAL CNN ARCHITECTURE. CLASS NAMES INDICATE THE CORRESPONDING AVERAGE PRECISION (AP) OF EACH CLASS, AND MAP INDICATES THE MEAN AVERAGE PRECISION OVER THE CLASSES.

TABLE II. ANOMALY CLASSIFICATION VIA VARYING CNN ARCHITECTURES (SQUEEZENET, VGG, AND RESNET) WITH AND WITHOUT THE PRE-LOCALIZATION OFFERED BY THE PROPOSED DUAL CNN ARCHITECTURE.