Enhanced detection of movement onset in EEG through deep oversampling

A deep learning approach for oversampling of electroencephalography (EEG) recorded during self-paced hand movement is investigated for the purpose of improving EEG classification in general and the detection of movement onset during online Brain-Computer Interfaces in particular. Learning from self-paced EEG data is challenging mainly due to the highly imbalanced nature of the data, which reduces the generalisation power of the classification model. Oversampling of the movement class enhances the overall accuracy of an onset detection system by over 17% (p < 0.05) when tested on 12 subjects. Modelling the data using a deep neural network not only helps oversample the movement class but can also help build a subject-independent model of movement. In this work we present initial results on the applicability of this model.


I. INTRODUCTION
The Brain-Computer Interface (BCI) is an alternative communication medium between human and machine where direct brain signals are used to control devices in the surrounding environment [1], [2]. This technology has a wide range of applications, including assistive living [3], [4], communication with locked-in patients [5], car control [6], and gaming [7], [8].
A BCI user can perform several well-studied mental tasks (e.g. imagining a limb movement) [9], [10], [11] to induce changes in brain activity detectable via a non-invasive imaging technique such as electroencephalography (EEG). Such a system should be able to distinguish between the EEG patterns produced by these mental tasks within a time frame suitable for control of an external device, e.g. a wheelchair or game controller. One approach is based on motor imagery, where the subject imagines moving their limbs [12], [13], [14]. Motor imagery tasks are commonly applied in BCI due to their spatial separability and the widespread understanding of the underlying physiological properties. Event-related desynchronization/synchronization (ERD/ERS) studies [13], [14] demonstrated that motor imagery tasks within a synchronous paradigm (i.e. the timing is controlled by the system) go through three consecutive phases: preparation, execution and after execution [15].
Previous research on ERD/ERS has shown that during real movements relevant EEG activity can be found in both contralateral and ipsilateral hemispheres, but in the case of imagined movements only the contralateral hemisphere is activated [14]. This justifies the use of real movements to test new methods, because the experiments are easier to conduct and the labelling is much more reliable in the self-paced configuration (i.e. when the timing of the system is controlled by the user).
Early BCI researchers faced the challenging problem of knowing when to switch the system on and off and how to distinguish idle from active states. In [16] the first brain-actuated switch was presented for self-paced BCIs, using wavelet features and an LVQ network. An unsupervised approach to onset detection was presented in [17] using Gaussian mixture models. An onset detection system was used in [18] to predict the intention of performing a movement for subjects who had a prosthetic arm.
For onset detection to be practical the false positive rate, i.e. the percentage of incorrectly classified onsets, must be as low as possible to increase the reliability of the system, especially when safety is an issue. This is particularly difficult due to the highly imbalanced nature of self-paced recorded data. To overcome this issue, researchers either use a synchronous, cued protocol to record training data, where equal time windows are given for both baseline and motor activity, or downsample the majority class by taking windows of baseline data equal to the movement windows [18], [19]. The downside of the latter approach is that downsampling inherently reduces the information available for learning the baseline. Alternatively, in [20] we modelled the temporal information as a means to better understand the temporal dynamics of EEG during self-paced imagined or real movements. However, even with the enhanced classification accuracy of EEG achieved through temporal modelling, the problem of bias towards the baseline persists.
To keep our terms consistent, we refer to the recorded EEG data as "samples", while "events" are the time windows when movement happens. In self-paced onset detection we are interested in the accuracy of detecting these events with a minimum of false positives, i.e. instances when an onset event is wrongly detected. Hence the assessment is based on the performance of the system in detecting events, rather than the accuracy of classifying samples. In Section II-E we discuss how the predicted samples are processed to detect events.

A. Learning from imbalanced data
Imbalanced datasets are those where one class is overrepresented in relation to the other class(es). This is usually due to intrinsic factors of the dataset [21] (e.g. rare medical conditions, difficult and expensive acquisition of data from one class). In [22], the authors argued that dataset complexity is the major factor behind the deterioration of classification accuracy, and that it is exacerbated by inter-class imbalance. Data complexity is a loosely defined term that comprises inter-class overlap, lack of representative data, non-linear boundaries, time-variant data, and others. EEG-driven BCI data is notorious for having the above-mentioned characteristics of complexity [23], [24]. The problem is especially challenging when operating in a self-paced paradigm, where obtaining an equal number of action (e.g. imagined movement) and baseline windows is almost impossible. Figure 1 demonstrates the challenge of classifying self-paced BCI data with overlapping, imbalanced classes.
An added challenge in BCI data classification is the high dimensionality of the extracted features (in many cases hundreds of features) in comparison to the available samples from the minority class (usually generated by tens of events). This leads to poor generalisation of the learning algorithm, especially when it is presented with imbalanced data sets, leading to over-fitting. Feature selection and dimensionality reduction can be used to mitigate the effect of high dimensionality [25].
Tackling the problem of imbalanced data is a growing research field within machine learning [22], [26]. Intuitively speaking, the problem can be solved either by finding a way to equalise the number of samples of all the classes or by introducing a new cost function for the learning algorithm that takes the imbalance of the data into consideration. Cost-sensitive methods include AdaBoost-motivated methods [27], decision trees [28], neural networks [29], and feature selection with an imbalance-sensitive cost measure [30], [31]. Sampling, however, is arguably the most commonly used method to enhance accuracy with imbalanced data [32], [33]. Sampling can be performed by randomly over-sampling the minority class or under-sampling the majority class. In [19] the baseline was under-sampled by taking a window of data preceding the movement window and of equal size to it. The Synthetic Minority Over-sampling Technique (SMOTE) [34], [35] and its variants are among the most commonly used methods in the literature; SMOTE is briefly described in the next section.
In this work we address the imbalance of self-paced data using oversampling of the active class. To achieve this goal a Generative Moment Matching Network (GMMN) [36] is used. GMMN is a deep generative model of the data that is built by minimising the difference between the distributions of the generated and the original data. The model is utilised to synthesize independent samples via a single feedforward pass through the layers of the neural network. The use of GMMN is advantageous not only for oversampling but also as a tool to build a subject-independent model of EEG. In this work we present tentative results of this approach and discuss its future use. The methods are tested on EEG data recorded during self-paced real finger movement, collected from 12 subjects. Electromyography (EMG) data, which records the muscle activity, is used to provide accurate labels to better quantify the performance of the different methods.
The next section briefly describes the two oversampling methods and the classification method used here. The experimental design and data pre-processing are described in Section III. The results are presented in Section IV, while Section V concludes the paper.

II. METHODS
To circumvent the problem of imbalanced data, and before classifying the data into baseline and movement, the movement data is oversampled using an unsupervised deep generative neural network. To compare with a non-generative oversampling model, we use SMOTE. To compare with a cost-sensitive method, we use a feature selection based approach. All the methods use the same linear discriminant analysis (LDA) based classifier, and are described in the following.

A. GMMN
The motivation behind using deep learning is to evaluate whether we can build a model of the minority class that could be used to synthesize minority data. To build such a model, unsupervised deep learning can be used, as it is capable of learning manifolds where there is a high density of data rather than maximising the margin among classes [37]. Generative models have the ability to evaluate the generalisation in the feature space. In [36] a generative network for unsupervised deep learning, the generative moment matching network (GMMN), was proposed. GMMN uses a feedforward neural network to create a mapping from an easy-to-sample distribution space to the data space. GMMN starts with a simple prior on the top hidden units of the neural network, making it easy to draw samples. Samples from the prior are propagated through the network in a deterministic manner to produce a sample of the data as the output of the network. In contrast to the complicated Markov Chain Monte Carlo (MCMC) methods required by Restricted Boltzmann Machines (RBM) [38], [39], samples can easily be drawn from a GMMN. Also, unlike the recently developed Generative Adversarial Networks (GAN) [40], GMMNs are trained on a straightforward loss function using backpropagation.
GMMN relies on a statistical hypothesis testing framework: maximum mean discrepancy (MMD) [41]. By training the model to minimise this discrepancy, we can match all moments of the model distribution to those of the distribution of the modelled data. A kernel is used to simplify the loss function, keeping the training efficient.
The top hidden layer $\mathbf{h} \in \mathbb{R}^H$ contains $H$ hidden units with a simple prior, e.g. uniform, on each unit independently:

$$p(\mathbf{h}) = \prod_{j=1}^{H} U(h_j) \qquad (1)$$

where $U(h_j)$ is a uniform distribution. $\mathbf{h}$ is then passed through the neural network and deterministically mapped to a vector $\mathbf{d} \in \mathbb{R}^D$ in the data space:

$$\mathbf{d} = f(\mathbf{h}, \mathbf{w}) \qquad (2)$$

where $f$ is the mapping function representing the neural network and $\mathbf{w}$ is the network parameters. The network can contain a number of nonlinear layers (e.g. ReLU, sigmoid). Given the prior $p(\mathbf{h})$ and the mapping $f(\mathbf{h}, \mathbf{w})$, a new sampled set in the data space can be generated. The advantage of GMMN is that training the parameters of the network can be done using standard backpropagation to minimise MMD as an objective. Using a Gaussian kernel the objective function can be written as:

$$\mathcal{L}_{MMD^2} = \frac{1}{N^2}\sum_{i=1}^{N}\sum_{i'=1}^{N} k(x_i, x_{i'}) - \frac{2}{NM}\sum_{i=1}^{N}\sum_{l=1}^{M} k(x_i, y_l) + \frac{1}{M^2}\sum_{l=1}^{M}\sum_{l'=1}^{M} k(y_l, y_{l'}) \qquad (3)$$

where $x_i$ is the generated sampled data and $y_l$ is the original training data. $N$ is the number of generated samples and $M$ is the number of original data samples. $k$ is the Gaussian kernel:

$$k(x, y) = \exp\left(-\frac{1}{2\sigma}\lVert x - y \rVert^2\right) \qquad (4)$$

and $\sigma$ is the bandwidth parameter. The gradient of the objective function can be calculated analytically and hence can easily be backpropagated to update the weights of the network.
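To make the objective concrete, the following is a minimal sketch of a biased estimate of the squared MMD between a generated set and an original set. It is not the implementation used in this work: the kernel form exp(-||a - b||^2 / (2 * sigma)) is an assumption taken from the GMMN convention, and the toy 2-D points and the function names are purely illustrative.

```python
import math

def gaussian_kernel(a, b, sigma):
    """Gaussian kernel exp(-||a - b||^2 / (2 * sigma)); this exact form is an assumption."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * sigma))

def mean_kernel(set_a, set_b, sigma):
    """Average kernel value over all pairs drawn from the two sets."""
    total = sum(gaussian_kernel(a, b, sigma) for a in set_a for b in set_b)
    return total / (len(set_a) * len(set_b))

def mmd_squared(generated, original, sigma):
    """Biased estimate of the squared MMD between generated and original samples."""
    return (mean_kernel(generated, generated, sigma)
            - 2.0 * mean_kernel(generated, original, sigma)
            + mean_kernel(original, original, sigma))

# Toy 2-D data (hypothetical values, not EEG features).
original = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1]]
close = [[0.05, 0.0], [0.0, 0.05], [-0.05, -0.05]]
far = [[3.0, 3.0], [3.1, 2.9], [2.9, 3.1]]
mmd_close = mmd_squared(close, original, sigma=3.0)
mmd_far = mmd_squared(far, original, sigma=3.0)
```

A generated set that overlaps the original data yields a value near zero, while a distant set yields a much larger value; training drives the network output towards the former.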
In this study a two-layer ReLU network was built. The first layer contained 200 nodes, while the second layer had 150. σ was set to 3 and 5 for the first and second layers respectively. The maximum number of iterations was set to 10000 and a mini-batch size of 100 was used.
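The single feedforward pass used to draw a synthetic sample can be sketched as follows. This is an illustration only: the weights are random placeholders rather than trained parameters, and the prior dimension (10), the uniform range [-1, 1], the linear output layer and the output size (90, matching the feature count reported later) are assumptions that the text does not specify.

```python
import random

def make_layer(n_in, n_out, rng):
    """Random placeholder weights and zero biases (a trained GMMN would supply these)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def dense(weights, bias, x, relu):
    """Affine map followed by an optional ReLU nonlinearity."""
    out = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, v) for v in out] if relu else out

def sample_gmmn(layers, prior_dim, rng):
    """Draw h from an independent uniform prior and map it deterministically to data space."""
    x = [rng.uniform(-1.0, 1.0) for _ in range(prior_dim)]
    for index, (weights, bias) in enumerate(layers):
        is_hidden = index < len(layers) - 1  # ReLU on hidden layers, linear output
        x = dense(weights, bias, x, relu=is_hidden)
    return x

rng = random.Random(0)
sizes = [10, 200, 150, 90]  # prior dim (assumed), two hidden layers, output dim (assumed)
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]
synthetic = [sample_gmmn(layers, sizes[0], rng) for _ in range(5)]
```

Each call produces one independent synthetic feature vector, so any desired amount of oversampling is obtained by repeating the pass.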

B. SMOTE
The Synthetic Minority Over-sampling Technique (SMOTE) is a simple and very effective approach to over-sampling which has proven to be superior in many applications [22], [34], [35]. The minority class is over-sampled by creating samples in the feature space between each minority class sample and its k nearest neighbours of the same class, along the line segments joining any or all of the neighbours. Depending on the desired amount of over-sampling, a subset of the neighbours is randomly selected, e.g. to achieve 300% oversampling 3 nearest neighbours are randomly chosen. Synthetic samples are generated as follows: take the difference between the feature vector (sample) under consideration and its selected nearest neighbour. This difference is multiplied by a random number between 0 and 1, and then added to the feature vector, i.e. a sample point is interpolated between the sample point and its neighbour. This causes the selection of a random point along the line segment between two related samples. The effectiveness of this approach is credited to the fact that it forces the decision region of the minority class, within the decision trees framework, to become more general. The synthetic samples allow the classifier to create larger and less specific decision regions [34].
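The interpolation step described above can be sketched in a few lines. This is a simplified illustration, not the reference SMOTE implementation: the neighbour for each synthetic point is drawn at random (with replacement) from the k nearest neighbours, and the toy 2-D minority set, the function names and the parameter values are hypothetical.

```python
import random

def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def smote(minority, n_per_sample, k, rng):
    """Create n_per_sample synthetic points per minority sample by interpolation."""
    synthetic = []
    for i, x in enumerate(minority):
        others = [s for j, s in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda s: squared_distance(x, s))[:k]
        for _ in range(n_per_sample):
            neighbour = rng.choice(neighbours)  # pick one of the k nearest neighbours
            gap = rng.random()                  # random number in [0, 1)
            # Move from x towards the neighbour by a random fraction of their difference.
            synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, neighbour)])
    return synthetic

# Toy minority class on the corners of the unit square (hypothetical data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
synthetic = smote(minority, n_per_sample=3, k=2, rng=random.Random(0))
```

With three synthetic points per sample this corresponds to 300% oversampling; every generated point lies on a segment joining two existing minority samples.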

C. Feature Selection
Sequential Forward Floating Search (SFFS) was used to select up to 10 features [25]. The method starts by using only one feature and selecting the feature that results in the highest value of the F1-measure (see Section II-F). Once this feature is selected the method is repeated to find the second feature which, in combination with the previously selected feature, produces the largest F1-measure. Then a pruning step is performed, where features are removed one at a time from the selected set to check whether the evaluation measure is enhanced. Expansion and pruning are iterated until the maximum number of features is selected or a finite number of cycles has been executed.
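The expansion and pruning loop can be sketched as below. This is a simplified reading of the procedure rather than the exact algorithm of [25]: the `score` callable stands in for the cross-validated F1-measure of the classifier, and the toy scoring function, which rewards two "useful" features and penalises subset size, is entirely hypothetical.

```python
def sffs(n_features, score, max_features=10, max_cycles=100):
    """Simplified sequential forward floating search over feature indices."""
    selected = []
    for _ in range(max_cycles):
        if len(selected) >= max_features:
            break
        candidates = [f for f in range(n_features) if f not in selected]
        if not candidates:
            break
        # Expansion: add the feature giving the best score with the current set.
        newest = max(candidates, key=lambda f: score(selected + [f]))
        selected = selected + [newest]
        current = score(selected)
        # Pruning: drop earlier features one at a time while the score improves.
        while len(selected) > 2:
            removable = [f for f in selected if f != newest]
            worst = max(removable, key=lambda f: score([s for s in selected if s != f]))
            reduced = [s for s in selected if s != worst]
            if score(reduced) <= current:
                break
            selected, current = reduced, score(reduced)
    return selected

def toy_score(subset):
    """Hypothetical stand-in for the F1-measure: features 0 and 2 are informative."""
    useful = {0, 2}
    return sum(1.0 for f in subset if f in useful) - 0.1 * len(subset)

chosen = sffs(n_features=5, score=toy_score, max_features=2)
```

In practice `score` would train and evaluate the LDA classifier on the candidate subset, which makes each cycle considerably more expensive than in this toy setting.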

D. Classification
The data is assumed independent in time during the training of an LDA classifier, but the data sequence is maintained during testing. Over-sampling is only applied to the movement data during training. The generated samples are added to the original data and a 10-fold cross-validation is performed with the condition of having samples of both classes in each fold. LDA is used as it is one of the most commonly used classifiers for BCI [5], [7].

E. Post Processing
Regardless of the sampling algorithm, or lack thereof, the output of the LDA classifier is smoothed using a 5-sample temporal window. The class of the window is selected using majority voting. To detect onset events, i.e. moving from baseline to movement, another larger overlapping decision window is used. Due to the variability in the duration for which a subject sustains continuous movement vs baseline, these decision windows were optimised per subject to increase the number of available events. An onset is detected if, within one decision window, there is a continuous run of samples classified as baseline spanning at least 40% of the window, followed by a continuous run of samples classified as movement spanning at least 40% of the window. If an onset is detected, a 2-second debounce/refractory window is applied during which no decision is made. This is consistent with the nature of EEG and our understanding of the neuromotor system, and it reduces the false positives.
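The post-processing chain can be sketched as follows, under several assumptions that the text leaves open: the smoothing window is taken to be causal (trailing), the decision window size and step are hypothetical values (they were optimised per subject in this work), and the 2-second refractory period is expressed as 50 samples on the basis of the 25 Hz rate reported in Section III-B. Labels are 0 for baseline and 1 for movement.

```python
import math

def majority_smooth(labels, width=5):
    """Replace each label by the majority vote over a trailing window of `width` samples."""
    smoothed = []
    for i in range(len(labels)):
        window = labels[max(0, i - width + 1): i + 1]
        smoothed.append(1 if 2 * sum(window) > len(window) else 0)
    return smoothed

def find_onset(window, fraction=0.4):
    """Return the index where a baseline run is followed by a movement run, else None."""
    need = math.ceil(fraction * len(window))
    for start in range(need, len(window) - need + 1):
        baseline_run = all(v == 0 for v in window[start - need:start])
        movement_run = all(v == 1 for v in window[start:start + need])
        if baseline_run and movement_run:
            return start
    return None

def detect_onsets(labels, window_size, step, refractory):
    """Slide an overlapping decision window and apply a refractory period after each onset."""
    onsets, i = [], 0
    while i + window_size <= len(labels):
        position = find_onset(labels[i:i + window_size])
        if position is None:
            i += step
        else:
            onsets.append(i + position)
            i += position + refractory  # no decisions during the refractory window
    return onsets

# Toy label sequence with two movement periods (hypothetical, noise-free).
clean = [0] * 20 + [1] * 20 + [0] * 60 + [1] * 20
onsets = detect_onsets(clean, window_size=20, step=5, refractory=50)
smoothed = majority_smooth([0, 0, 0, 1, 0, 0, 0])
```

In the toy sequence the two transitions from baseline to movement are reported once each, and the isolated movement label in the short second example is removed by the majority vote.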

F. Evaluation
The evaluation was conducted by 10-fold cross-validation. The number of training/testing events varied depending on the total number of events per participant; however, the overall number of samples is the same.
To take the imbalance of the data into consideration on the level of events, we use the standard F1-measure and true-false difference (TF) [42].
Let E be the number of onsets, and let TP, FP, and FN be the numbers of true-positive, false-positive, and false-negative detections respectively, combined from all the folds. The F1-measure is defined as:

$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$

and TF is defined as in [42].
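As a small worked example, the event-level precision, recall and F1-measure can be computed from the pooled counts as shown below. The counts are made up for illustration, and TF is deliberately left out because its formula is not reproduced here.

```python
def precision_recall_f1(tp, fp, fn):
    """Event-level precision, recall and F1 from counts pooled over all folds."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return precision, recall, f1

# Hypothetical counts: 8 correct detections, 2 false alarms, 2 missed onsets.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

With these counts precision, recall and F1 all come to 0.8, the balanced case that corresponds to a point on the diagonal of a precision vs recall plot.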

III. DATA COLLECTION

A. Subjects and Motor Task
Data was recorded from 12 right-handed subjects (three female) aged 23 to 28. Subjects 3 and 8 were experienced users of a BCI system based on self-paced movement. Subjects 6, 9, and 11 had previous experience in online BCI experiments; the remaining subjects were naive to BCI systems. As the protocol used here was un-cued, the number of trials performed within each run was variable. Each subject performed three runs in a single session. A run lasted 610 seconds. After a five-second waiting period a fixation cross appeared on the screen. The fixation cross remained on the screen for 10 minutes, during which EEG data was acquired. A five-second post waiting period was used to give the user some time to relax. Each subject performed 4 sessions (12 runs).
Within each run subjects were instructed to perform self-paced flexion/extension of the left index finger whilst the fixation cross was visible. Subjects were requested to perform the movement for between 5 and 10 seconds and to rest for at least 10 seconds between movements. Instructions were given to concentrate on the fixation cross as much as possible during each run. After each run EMG recordings were assessed to ensure subjects understood the requirements and could moderate their actions accordingly.

B. Data Acquisition
Five bipolar EEG channels were recorded over the motor cortex at locations C3, C1, Cz, C2 and C4 as illustrated in Figure 2. EMG was recorded from the flexors of the right forearm. A right mastoid reference channel was used. Signals were acquired using a Guger Technologies g.BSamp. EMG and EEG were acquired at 256 Hz and later downsampled to 25 Hz. EMG was used to record muscle activity for establishing correct onset and offset time points of self-paced movements. This allows training data to be correctly labelled according to the real movement activities.
No artifact rejection or EOG correction was employed, as visual inspection did not find significant artifacts in the recorded EEG signals. In addition, the filtering applied before feature extraction (common average reference and band-pass filtering) can play a role in removing some artifacts.

C. Feature Extraction
A common average reference is used to reduce the common noise. Similar to previous work [43], narrow power band features were extracted per channel. The μ, β, and lower γ bands are divided into even finer bands, so that the feature selection method can be applied more efficiently. 90 features were used in total. For SMOTE and GMMN all the features are used, while feature selection is applied for comparison as described in Section II-C.

Fig. 2. Layout of the electrodes used to record the data using the standard 10-20 system. Image adapted from [44].

IV. RESULTS
Figure 3 presents the precision vs recall results of the three compared methods applied to the 12 datasets as discussed above. If a method performs well for both classes, precision and recall should have comparable values, i.e. lie around the diagonal line. Each participant is represented by a unique marker shape so that the results of applying the methods to their data can be compared. Results of oversampling using GMMN are represented in red, LDA with feature selection (termed No Over-sampling) in blue, and oversampling with SMOTE in green. The results show a high correlation between precision and recall for GMMN (0.9798, p < 0.05), SMOTE (0.9448, p < 0.05), and No Over-sampling (0.9343, p < 0.05).
The F1 results in Figure 4 show a relative advantage of the oversampling methods compared to no sampling. Most importantly for onset detection, TF shows a significant improvement with oversampling, with a t-test yielding p < 0.05 for both GMMN and SMOTE against No Over-sampling.
The results suggest that the advantage of oversampling is in its ability to help sustain a continuous quality of the output, which results in higher onset detection accuracy after temporal smoothing as described above. This is further clarified in Figure 5. The figure shows the accuracy of the movement and baseline classes. If a method performs similarly for both classes, the symbols would be expected to lie on the diagonal line. Any deviation from it is interpreted as a bias towards either class. It is clear that without over-sampling there is a strong bias towards the majority baseline class compared to the over-sampling methods. The dotted lines represent the chance level.
Figure 6 provides evidence that the increase in accuracy is mostly the result of the over-sampling. Looking at the correlation between the imbalance in the data (measured as the ratio of the number of movement samples to the number of baseline samples, with smaller values reflecting higher imbalance) and the enhancement of the accuracy of the movement class by over-sampling, there is a very strong negative correlation (-0.9671, p < 0.05) between them. This means that the more imbalanced the data, the greater the benefit from over-sampling. In the same figure, there is a weaker negative correlation between TF and the imbalance measure (-0.5581, p < 0.05), which is most likely due to the post-processing steps that help reduce the false positives.
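The imbalance measure and the correlation used in this analysis can be sketched as below. The helper names and the numbers are hypothetical and are not the values from the study; they only illustrate how a negative correlation arises when smaller ratios (more imbalance) coincide with larger accuracy gains.

```python
import math

def imbalance_ratio(labels):
    """Ratio of movement samples (1) to baseline samples (0); smaller means more imbalanced."""
    movement = sum(1 for v in labels if v == 1)
    baseline = sum(1 for v in labels if v == 0)
    return movement / baseline

def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(var_x * var_y)

# Hypothetical per-subject values: imbalance ratio and gain in movement-class accuracy.
ratios = [0.1, 0.2, 0.3]
gains = [30.0, 20.0, 10.0]
correlation = pearson(ratios, gains)
example_ratio = imbalance_ratio([1, 0, 0, 0, 0])
```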

V. DISCUSSION AND CONCLUSION
The work here presents a novel approach to solving the problem of imbalanced data in onset detection from real hand movement, as a tool to enhance onset detection in self-paced brain-computer interfaces. Unsupervised deep learning using a generative model is used to model the minority class, movement, and synthetic samples are then generated. The samples are then used to build an LDA classifier from the now balanced data set, allowing for higher classification accuracy.
The results are compared with those obtained using a non-generative over-sampling method, SMOTE, with comparable accuracies. Another alternative to over-sampling is to perform feature selection with the F1-measure as a cost function, termed LDA in the figures above. Feature selection performed worse than over-sampling, especially when using TF, a custom-designed metric for onset detection which accounts for any bias towards either class. Statistical t-tests confirm these conclusions.
Although SMOTE and GMMN perform similarly on the data, GMMN has the advantage of building a model of the data. SMOTE, on the other hand, only considers local topography. This gives GMMN the advantage of supporting subject-independent models, which could help address what is referred to as BCI illiteracy [45]. Enabling people to use BCI with minimal or no training is one of the biggest challenges to the wide adoption of BCI. Only a few studies have tried to tackle this issue, through building a "feature" bank that is used to reduce the amount of training necessary for a new user [46]. As a proof of concept we present here some preliminary results of using GMMN to build a subject-independent model. The model is trained on data combined from 11 subjects and tested on the remaining subject in a cross-validation scheme. Over-sampling, classification, and post-processing are carried out as described above. Figure 7 compares the subject-independent GMMN results to those obtained using LDA. The results clearly show that without the GMMN model the TF accuracy is well below 50% for most subjects, while GMMN consistently performs significantly above chance (t-test, p < 0.05). More work would be necessary to better explore the subject-independent model and test it on an online system; however, the results provide a strong incentive for the use of deep generative models in BCI.

Fig. 7. TF results of using the subject-independent model. The x-axis shows the results obtained using GMMN and the y-axis the LDA results.
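The subject-independent evaluation scheme, training on 11 subjects and testing on the remaining one, amounts to a leave-one-subject-out split. A minimal sketch of generating these splits is given below; the function name is hypothetical and the subject identifiers 1 to 12 are simply an assumed labelling of the 12 participants.

```python
def leave_one_subject_out(subject_ids):
    """Yield (training subjects, held-out subject) pairs, one per subject."""
    for held_out in subject_ids:
        training = [s for s in subject_ids if s != held_out]
        yield training, held_out

subjects = list(range(1, 13))  # assumed labelling of the 12 participants
splits = list(leave_one_subject_out(subjects))
```

For each split, the generative model and the classifier would be fitted on the pooled training subjects only, so that no data from the held-out subject influences the model.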

Fig. 1. Sample of self-paced EEG data projected using PCA onto a two-dimensional space. The figure demonstrates two challenges of classifying self-paced BCI data: I) the overlap between the two classes and II) the imbalance in the number of samples between the classes, apparent in the histograms.

Fig. 3. Precision and Recall values for the three methods under comparison. Results from each subject are plotted with a unique shape. Colors represent the methods.

Fig. 4. F1 and TF values for the three methods under comparison. Results from each subject are plotted with a unique shape. Colors represent the methods.

Fig. 5. Cross-validated classification results of both movement and baseline classes without the post-processing steps.

Fig. 6. The correlation between the imbalance measure (the ratio of movement samples to baseline samples) and the enhancement of TF (in red) and of the accuracy of the movement class (in blue).