Can We Assess Mental Health through Social Media and Smart Devices? Addressing Bias in Methodology and Evaluation

Predicting mental health from smartphone and social media data on a longitudinal basis has recently attracted great interest, with very promising results being reported across many studies. Such approaches have the potential to revolutionise mental health assessment, if their development and evaluation follow a real world deployment setting. In this work we take a closer look at state-of-the-art approaches, using different mental health datasets and indicators, different feature sources and multiple simulations, in order to assess their ability to generalise. We demonstrate that under a pragmatic evaluation framework, none of the approaches deliver or even approach the reported performances. In fact, we show that current state-of-the-art approaches can barely outperform the most naïve baselines in the real-world setting, posing serious questions not only about their deployment ability, but also about the contribution of the derived features for the mental health assessment task and how to make better use of such data in the future.


Introduction
Establishing the right indicators of mental well-being is a grand challenge posed by the World Health Organisation [7]. Poor mental health is highly correlated with low motivation, lack of satisfaction, low productivity and a negative economic impact [20]. The current approach is to combine census data at the population level [19], thus failing to capture well-being on an individual basis. The latter is only possible via self-reporting on the basis of established psychological scales, which are hard to acquire consistently on a longitudinal basis, and they capture long-term aggregates instead of the current state of the individual.
The widespread use of smartphones and social media offers new ways of assessing mental well-being, and recent research [1,2,3,5,9,10,13,14,22,23,26] has started exploring the effectiveness of these modalities for automatically assessing the mental health of a subject, reporting very high accuracy. What is typically done in these studies is to use features based on the subjects' smartphone logs and social media, to predict some self-reported mental health index (e.g., "wellbeing", "depression" and others), which is provided either on a Likert scale or on the basis of a psychological questionnaire (e.g., PHQ-8 [12], PANAS [29], WEMWBS [25] and others).
arXiv:1807.07351v1 [cs.CY] 19 Jul 2018
Most of these studies are longitudinal, where data about individuals is collected over a period of time and predictions of mental health are made over a sliding time window. Having such longitudinal studies is highly desirable, as it can allow fine-grained monitoring of mental health. However, a crucial question is what constitutes an appropriate evaluation framework, in order for such approaches to be employable in a real world setting. Generalisation to previously unobserved users can only be assessed via leave-N-users-out cross-validation setups, where typically, N is equal to one (LOUOCV, see Table 1). However, due to the small number of subjects that are available, such generalisation is hard to achieve by any approach [13]. Alternatively, personalised models [3,13] for every individual can be evaluated via a within-subject, leave-N-instances-out cross-validation (for N=1, LOIOCV), where an instance for a user $u$ at time $i$ is defined as a $\{X_{ui}, y_{ui}\}$ tuple of {features(u, i), mental-health-score(u, i)}. In a real world setting, a LOIOCV model is trained on some user-specific instances, aiming to predict her mental health state at some future time points. Again, however, the limited number of instances for every user makes such models unable to generalise well. In order to overcome these issues, previous work [2,5,9,10,22,26] has combined the instances $\{X_{u_j i}, y_{u_j i}\}$ from different individuals $u_j$ and performed evaluation using randomised cross validation (MIXED). While such approaches can attain optimistic performance, the corresponding models fail to generalise to the general population and also fail to ensure effective personalised assessment of the mental health state of a single individual. In this paper we demonstrate the challenges that current state-of-the-art models face, when tested in a real-world setting.
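The three evaluation setups can be made concrete with a small sketch. The code below is illustrative only and is not taken from any of the cited works: the function names and the toy `users` list are assumptions, and each function simply yields (train, test) index lists for instances that are tagged with a user identifier.

```python
from collections import defaultdict
import random

def louocv_splits(users):
    """Leave-one-user-out: each fold holds out every instance of one user."""
    for held_out in sorted(set(users)):
        train = [i for i, u in enumerate(users) if u != held_out]
        test = [i for i, u in enumerate(users) if u == held_out]
        yield train, test

def loiocv_splits(users):
    """Within-subject leave-one-instance-out: train and test share one user."""
    by_user = defaultdict(list)
    for i, u in enumerate(users):
        by_user[u].append(i)
    for u in sorted(by_user):
        idx = by_user[u]
        for held_out in idx:
            yield [i for i in idx if i != held_out], [held_out]

def mixed_splits(users, k=5, seed=0):
    """Randomised k-fold over the pooled instances, ignoring user identity."""
    order = list(range(len(users)))
    random.Random(seed).shuffle(order)
    folds = [order[f::k] for f in range(k)]
    for f in range(k):
        train = [i for g in range(k) if g != f for i in folds[g]]
        yield train, folds[f]

def shares_user(splits, users):
    """True if any fold has a user present in both its train and test part."""
    return any({users[i] for i in tr} & {users[i] for i in te}
               for tr, te in splits)

# Toy data: three hypothetical users with six instances each.
users = ["a"] * 6 + ["b"] * 6 + ["c"] * 6
louo_overlap = shares_user(louocv_splits(users), users)
mixed_overlap = shares_user(mixed_splits(users, k=5), users)
loio_folds = list(loiocv_splits(users))
```

With six instances per user and test folds of at most four instances, every MIXED test fold necessarily contains users that also appear in its training part, whereas LOUOCV never does.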
We work on two longitudinal datasets with four mental health targets, using different features derived from a wide range of heterogeneous sources. Following the state-of-the-art experimental methods and evaluation settings, we achieve very promising results, regardless of the features we employ and the mental health target we aim to predict. However, when tested under a pragmatic setting, the performance of these models drops heavily, failing to outperform the most naïve (from a modelling perspective) baselines: majority voting, random classifiers, models trained on the identity of the user, etc. This poses serious questions about the contribution of the features derived from social media, smartphones and sensors for the task of automatically assessing well-being on a longitudinal basis. Our goal is to flesh out, study and discuss such limitations through extensive experimentation across multiple settings, and to propose a pragmatic evaluation and model-building framework for future research in this domain.

Related Work
Research in assessing mental health on a longitudinal basis aims to make use of relevant features extracted from various modalities, in order to train models for automatically predicting a user's mental state (target), either in a classification or a regression manner [1,2,3,9,10,13,26]. Examples of state-of-the-art work in this domain are listed in Table 2, along with the number of subjects that were used and the method upon which evaluation took place. Most approaches have used the "MIXED" approach to evaluate models [1,2,5,9,10,22,26], which, as we will show, is vulnerable to bias, due to the danger of recognising the user in the test set and thus simply inferring her average mood score. LOIOCV approaches that have not ensured that their train/test sets are independent are also vulnerable to bias in a realistic setting [3,13]. From the works listed in Table 2, only Suhara et al. [23] achieve unbiased results with respect to model generalisability; however, the features employed for their prediction task are derived from self-reported questionnaires of the subjects and not by automatic means.

Problem Statement
We first describe three major problems stemming from unrealistic construction and evaluation of mental health assessment models and then we briefly present the state-of-the-art in each case, which we followed in our experiments.

P1 Training on past values of the target variable: This issue arises when the past N mood scores of a user are required to predict his/her next mood score in an autoregressive manner. Since such an approach would require the previous N scores of past mood forms, it would limit its ability to generalise without the need of manual user input on a continuous basis. This makes it impractical for a real-world scenario. Most importantly, it is difficult to measure the contribution of the features towards the prediction task, unless the model is evaluated using target feature ablation. For demonstration purposes, we have followed the experimental setup by LiKamWa et al. [13], which is one of the leading works in this field.
P2 Inferring test set labels: When training personalised models (LOIOCV) in a longitudinal study, it is important to make sure that there are no overlapping instances across consecutive time windows. Some past works have extracted features $\{f(t - N), ..., f(t)\}$ over N days, in order to predict the score t on day N + 1 [3,13]. Such approaches are biased if there are overlapping days of train/test data. To illustrate this problem we have followed the approach by Canzian and Musolesi [3], as one of the pioneering works on predicting depression with GPS traces, on a longitudinal basis.
P3 Predicting users instead of mood scores: Most approaches merge all the instances from different subjects, in an attempt to build user-agnostic models in a randomised cross-validation framework [2,9,10,26]. This is problematic, especially when dealing with a small number of subjects, whose behaviour (as captured through their data) and mental health scores differ on an individual basis. Such approaches are in danger of "predicting" the user in the test set, since her (test set) features might be highly correlated with her features in the training set, and thus infer her average well-being score, based on the corresponding observations of the training set. Such approaches cannot guarantee that they will generalise on either a population-wide (LOUOCV) or a personalised (LOIOCV) level. In order to examine this effect in both a regression and a classification setting, we have followed the experimental framework by Tsakalidis et al. [26] and Jaques et al. [9].

P1: Training on past values of the target (LOIOCV, LOUOCV)
LiKamWa et al. [13] collected smartphone data from 32 subjects over a period of two months. The subjects were asked to self-report their "pleasure" and "activeness" scores at least four times a day, following a Likert scale (1 to 5), and the average daily scores served as the two targets. The authors aggregated various features on social interactions (e.g., number of emails sent to frequently interacting contacts) and routine activities (e.g., browsing and location history) derived from the smartphones of the participants. These features were extracted over a period of three days, along with the two most recent scores on activeness and pleasure. The issue that naturally arises is that such a method cannot generalise to new subjects in the LOUOCV setup, as it requires their last two days of self-assessed scores. Moreover, in the LOIOCV setup, the approach is limited in a real world setting, since it requires the previous mental health scores by the subject to provide an estimate of her current state. Even in this case though, the feature extraction should be based on past information only: under LOIOCV in [13], the current mood score we aim at predicting is also used as a feature in the (time-wise) subsequent two instances of the training data. Experiments in [13] are conducted under LOIOCV and LOUOCV, using Multiple Linear Regression (LR) with Sequential Feature Selection (in LOUOCV, the past two pairs of target labels of the test user are still used as features). In order to better examine the effectiveness of the features for the task, the same model can be tested without any ground-truth data as input. Nevertheless, a simplistic model predicting the per-subject average outperforms their LR in the LOUOCV approach, which poses the question of whether the smartphone-derived features can be used effectively to create a generalisable model that can assess the mental health of unobserved users.
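The leakage of the held-out target into the training features under LOIOCV can be traced with a short sketch. This is a simplified illustration under stated assumptions, not the pipeline of [13]: instances are indexed by day, each instance is assumed to use the scores of the two preceding days as features, and the helper names and `daily_scores` values are hypothetical.

```python
def autoregressive_instances(scores, n_past=2):
    """Describe each instance by the day indices of its past-score features."""
    return [
        {"t": t, "past": list(range(t - n_past, t))}
        for t in range(n_past, len(scores))
    ]

def leaking_instances(instances, held_out_t):
    """Training instances whose past-score features include the test target."""
    return [inst["t"] for inst in instances
            if inst["t"] != held_out_t and held_out_t in inst["past"]]

daily_scores = [3, 4, 2, 5, 3, 4, 2]      # made-up daily mood scores
instances = autoregressive_instances(daily_scores)
leaks = leaking_instances(instances, held_out_t=3)
```

When day 3 is held out, the instances for days 4 and 5 remain in the training set and both carry the day-3 score as an input feature.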
Finally, the same model tested in the LOIOCV setup achieves the lowest error; however, this is trained not only on target scores overlapping with the test set, but also on features derived over a period of three days, introducing further potential bias, as discussed in the following.

P2: Inferring Test Labels (LOIOCV)
Canzian and Musolesi [3] extracted mobility metrics from 28 subjects to predict their depressive state, as derived from their daily self-reported PHQ-8 questionnaires. A 14-day moving average filter is first applied to the PHQ-8 scores and the mean value of the same day (e.g. Monday) is subtracted from the normalised scores, to avoid cyclic trends. This normalisation makes the target score $s_t$ on day $t$ dependent on the past $\{s_{t-14}, ..., s_{t-1}\}$ scores. The normalised PHQ-8 scores are then converted into two classes, with the instances deviating more than one standard deviation above the mean score of a subject being assigned to the class "1" ("0", otherwise). The features are extracted over various time windows (looking at $T_{HIST} = \{0, ..., 14\}$ days before the completion of a mood form) and personalised model learning and evaluation are performed for every $T_{HIST}$ separately, using a LOIOCV framework.
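A rough sketch of this kind of target construction is given below. It is an approximation under several assumptions that the description above does not settle: the moving average is taken as a trailing window that includes the current day, one score per consecutive day is assumed so that the day index modulo 7 stands in for the day of the week, and the population standard deviation is used for the threshold. The `phq8` series is made up.

```python
from statistics import mean, pstdev

def moving_average(scores, window=14):
    """Trailing mean over the current day and up to window-1 preceding days."""
    out = []
    for t in range(len(scores)):
        start = max(0, t - window + 1)
        out.append(mean(scores[start:t + 1]))
    return out

def remove_weekday_mean(values):
    """Subtract the mean of all values that fall on the same day of the week."""
    by_day = {}
    for t, v in enumerate(values):
        by_day.setdefault(t % 7, []).append(v)
    return [v - mean(by_day[t % 7]) for t, v in enumerate(values)]

def binarise(values):
    """Class 1 if a value exceeds the subject's mean by more than one std."""
    threshold = mean(values) + pstdev(values)
    return [1 if v > threshold else 0 for v in values]

# Made-up daily scores for one subject: a stable period, a worse one, recovery.
phq8 = [5] * 20 + [15] * 8 + [5] * 14
smoothed = moving_average(phq8)
labels = binarise(remove_weekday_mean(smoothed))
largest_daily_change = max(abs(a - b) for a, b in zip(smoothed, smoothed[1:]))
```

Even though the raw scores jump by 10 points from one day to the next, the smoothed target moves by at most 10/14 per day, so the targets of neighbouring days are strongly tied to each other.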
What is notable is that the results improve significantly when features are extracted from a wider $T_{HIST}$ window. This could imply that the depressive state of an individual can be detected with a high accuracy if we look back at her history. However, by training and testing a model on instances whose features are derived from the same days, there is a high risk of over-fitting the model to the timestamp of the day in which the mood form was completed. In the worst-case scenario, there will be an instance in the train set whose features (e.g. total covered distance) are derived from the 14 days, 13 of which will also be used for the instance in the test set. Additionally, the target values of these two instances will also be highly correlated due to the moving average filter, making the task artificially easy for large $T_{HIST}$ and not applicable in a real-world setting.
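The worst-case overlap can be checked with a few lines of arithmetic. The sketch assumes that a mood form completed on a given day draws its features from the $T_{HIST}$ days strictly before it; the function names are hypothetical.

```python
def feature_days(form_day, t_hist):
    """Days contributing to the features of a mood form completed on form_day."""
    return set(range(form_day - t_hist, form_day))

def shared_days(day_a, day_b, t_hist):
    """Number of days used by the feature windows of both mood forms."""
    return len(feature_days(day_a, t_hist) & feature_days(day_b, t_hist))

overlap_14 = shared_days(30, 31, t_hist=14)   # consecutive forms, 14-day window
overlap_1 = shared_days(30, 31, t_hist=1)     # consecutive forms, 1-day window
```

Two mood forms completed on consecutive days share 13 of their 14 feature days, while a one-day window produces no overlap at all.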
While we focus on the approach in [3], a similar approach with respect to feature extraction was also followed in LiKamWa et al. [13] and Bogomolov et al. [2], extracting features from the past 2 and 2 to 5 days, respectively.

P3: Predicting Users (LOUOCV)
Tsakalidis et al. [26] monitored the behaviour of 19 individuals over four months. The subjects were asked to complete two psychological scales [25,29] on a daily basis, leading to three target scores (positive, negative, mental well-being); various features from smartphones (e.g., time spent on the preferred locations) and textual features (e.g., ngrams) were extracted passively over the 24 hours preceding a mood form timestamp. Model training and evaluation was performed in a randomised (MIXED) cross-validation setup, leading to high accuracy ($R^2 = 0.76$). However, a case demonstrating the potential user bias is when the models are trained on the textual sources: initially the highest $R^2$ (0.22) is achieved when a model is applied to the mental-wellbeing target; by normalising the textual features on a per-user basis, the $R^2$ increases to 0.65. While this is likely to happen because the vocabulary used by different users is normalised, there is also the danger of over-fitting the trained model to the identity of the user. To examine this potential, the LOIOCV/LOUOCV setups need to be studied alongside the MIXED validation approach, with and without the per-user feature normalisation step.
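For reference, a per-user normalisation step might look like the sketch below. The exact procedure in [26] is not specified here, so this is an assumption: one feature column is z-scored separately within each user's instances, using statistics computed over all of that user's instances. Whether those statistics should instead come from training instances only is a separate design choice.

```python
from collections import defaultdict
from statistics import mean, pstdev

def normalise_per_user(values, users):
    """Z-score one feature column separately within each user's instances."""
    by_user = defaultdict(list)
    for v, u in zip(values, users):
        by_user[u].append(v)
    # Fall back to a unit scale when a user's values are all identical.
    stats = {u: (mean(vs), pstdev(vs) or 1.0) for u, vs in by_user.items()}
    return [(v - stats[u][0]) / stats[u][1] for v, u in zip(values, users)]

# Two hypothetical users whose raw feature values sit on very different scales.
users = ["a", "a", "a", "b", "b", "b"]
feature = [1.0, 2.0, 3.0, 101.0, 102.0, 103.0]
normalised = normalise_per_user(feature, users)
```

After this step the two users' values coincide, so the column only carries each instance's position relative to that user's own range.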
A similar issue is encountered in Jaques et al. [9] who monitored 68 subjects over a period of a month. Four types of features were extracted from surveys and smart devices carried by subjects. Self-reported scores on a daily basis served as the ground truth. The authors labelled the instances with the top 30% of all the scores as "happy" and the lowest 30% as "sad" and randomly separated them into training, validation and test sets, leading to the same user bias issue. Since different users exhibit different mood scores on average [26], by selecting instances from the top and bottom scores, one might end up separating users and converting the mood prediction task into a user identification one. A more suitable task could have been to try to predict the highest and lowest scores of every individual separately, either in a LOIOCV or in a LOUOCV setup.
While we focus on the works of Tsakalidis et al. [26] and Jaques et al. [9], similar experimental setups were also followed in [10], using the median of scores to separate the instances and performing five-fold cross-validation, and by Bogomolov et al. in [2], working on a user-agnostic validation setting on 117 subjects to predict their happiness levels, and in [1], for the stress level classification task.

Datasets
By definition, the aforementioned issues are feature-, dataset- and target-independent (albeit the magnitude of the effects may vary). To illustrate this, we run a series of experiments employing two datasets, with different feature sources and four different mental health targets.
Dataset 1: We employed the dataset obtained by Tsakalidis et al. [26], a pioneering dataset which contains a mix of longitudinal textual and mobile phone usage data for 30 subjects. From a textual perspective, this dataset consists of social media posts (1,854/5,167 Facebook/Twitter posts) and private messages (64,221/132/47,043 Facebook/Twitter/SMS messages) sent by the subjects. For our ground truth, we use the {positive, negative, mental well-being} mood scores (in the ranges of , , , respectively) derived from self-assessed psychological scales during the study period.

Dataset 2: We employed the StudentLife dataset [28], which contains a wealth of information derived from the smartphones of 48 students during a 10-week period. Such information includes samples of the detected activity of the subject, timestamps of detected conversations, audio mode of the smartphone, status of the smartphone (e.g., charging, locked), etc. For our target, we used the self-reported stress levels of the students (range [0-4]), which were provided several times a day. For the approach in LiKamWa et al. [13], we considered the average daily stress level of a student as our ground-truth, as in the original paper; for the rest, we used all of the stress scores and extracted features based on some time interval preceding their completion, as described next, in section 4.3 (see footnote 4).

Task Description
We studied the major issues in the following experimental settings (see Table 3):
P1: Using Past Labels: We followed the experimental setting in [13] (see section 3.1): we treated our task as a regression problem and used Mean Squared Error (MSE) and classification accuracy (see footnote 5) for evaluation. We trained a Linear Regression (LR) model and performed feature selection using Sequential Feature Selection under the LOIOCV and LOUOCV setups; feature extraction is performed over the previous 3 days preceding the completion of a mood form. For comparison, we use the same baselines as in [13]: Model A always predicts the average mood score for a certain user (AVG); Model B predicts the last entered scores (LAST); Model C makes a prediction using the LR model trained on the ground-truth features only (-feat). We also include Model D, trained on non-target features only (-mood) in an unbiased LOUOCV setting.
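The two simplest baselines are easy to state in code. The sketch below is illustrative and rests on assumptions: AVG is computed here as the mean of a user's other scores (leaving the predicted day out), LAST copies the score of the previous day, and the `scores` list is made up.

```python
from statistics import mean

def mse(preds, targets):
    """Mean squared error between predictions and targets."""
    return mean((p - t) ** 2 for p, t in zip(preds, targets))

def avg_baseline(scores):
    """Model A (AVG): predict the user's mean over all other days."""
    return [mean(scores[:i] + scores[i + 1:]) for i in range(len(scores))]

def last_baseline(scores):
    """Model B (LAST): the prediction for day i+1 is the score of day i."""
    return scores[:-1]

scores = [3, 3, 4, 4, 3, 3]     # made-up daily scores of one user
avg_mse = mse(avg_baseline(scores), scores)
last_mse = mse(last_baseline(scores), scores[1:])
```

Both baselines use nothing but the user's own past or typical scores, which is why any feature-based model needs to beat them before its features can be credited with predictive value.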
P2: Inferring Test Labels: We followed the experimental setting presented in [3]. We process our ground-truth in the same way as the original paper (see section 3.2) and thus treat our task as a binary classification problem. We use an SVM classifier with an RBF kernel, using grid search for parameter optimisation, and perform evaluation using specificity and sensitivity. We run experiments in the LOIOCV and LOUOCV settings, performing feature extraction at different time windows ($T_{HIST} = \{1, ..., 14\}$). In order to better demonstrate the problem that arises here, we use the previous label classifier (LAST) and the SVM classifier to which we feed only the mood timestamp as a feature (DATE) for comparison. Finally, we replace our features with completely random data and train the same SVM with $T_{HIST} = 14$ by keeping the same ground truth, performing 100 experiments and reporting averages of sensitivity and specificity (RAND).
Footnote 4: For P3, this creates the P2 cross-correlation issue in the MIXED/LOIOCV settings. For this reason, we ran the experiments by considering only the last entered score in a given day as our target. We did not witness any major differences that would alter our conclusions.
Footnote 5: Accuracy is defined in [13] as follows: 5 classes are assumed (e.g., [0, ..., 4]) and the squared error e between the centre of a class and the point halfway towards the next class is calculated (e.g., 0.25). If the squared error of a test instance is smaller than e, then it is considered as having been classified correctly.
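The accuracy measure borrowed from [13] for the P1 experiments can be written down directly. This is a minimal sketch of that definition; the function name and the `class_width` parameter are assumptions introduced for illustration.

```python
def class_accuracy(preds, targets, class_width=1.0):
    """Share of predictions whose squared error is below the threshold e."""
    e = (class_width / 2) ** 2      # 0.25 for unit-spaced classes such as 0..4
    hits = sum(1 for p, t in zip(preds, targets) if (p - t) ** 2 < e)
    return hits / len(targets)

# Made-up regression outputs against integer class labels.
acc = class_accuracy([2.4, 2.6, 0.1, 3.5], [2, 2, 0, 4])
```

A prediction exactly halfway between two classes (3.5 against 4) has a squared error equal to e and is therefore not counted as correct under the strict inequality.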

P3: Predicting Users: We followed the evaluation settings of two past works (see section 3.3), with the only difference being the use of 5-fold CV instead of a train/dev/test split that was used in [9]. The features of every instance are extracted from the past day before the completion of a mood form. In Experiment 1 we follow the setup in [26]: we perform 5-fold CV (MIXED) using SVM (SVR with an RBF kernel) and evaluate performance based on $R^2$ and RMSE. We compare the performance when tested under the LOIOCV/LOUOCV setups, with and without the per-user feature normalisation step. We also compare the performance of the MIXED setting, when our model is trained on the one-hot-encoded user id only. In Experiment 2 we follow the setup in [9]: we label the instances as "high" ("low"), if they belong to the top-30% (bottom-30%) of mood score values ("UNIQ", for "unique", setup). We train an SVM classifier in 5-fold CV using accuracy for evaluation and compare performance in the LOIOCV and LOUOCV settings. In order to further examine user bias, we perform the same experiments, this time by labelling the instances on a per-user basis ("PERS", for "personalised", setup), aiming to predict the per-user high/low mood days.
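The difference between the two labelling schemes of Experiment 2 can be sketched as follows. The helper names and the toy scores are hypothetical, ties at the cut-off are not handled specially, and instances outside the top and bottom 30% are simply left unlabelled.

```python
def top_bottom_labels(scores, fraction=0.3):
    """Label the highest share of scores 'high', the lowest 'low', rest None."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    k = round(len(scores) * fraction)
    labels = [None] * len(scores)
    for i in order[:k]:
        labels[i] = "low"
    for i in order[len(order) - k:]:
        labels[i] = "high"
    return labels

def uniq_labels(scores, users):
    """UNIQ: one pair of thresholds computed over the scores of all users."""
    return top_bottom_labels(scores)

def pers_labels(scores, users):
    """PERS: thresholds computed separately for every user."""
    labels = [None] * len(scores)
    for u in set(users):
        idx = [i for i, v in enumerate(users) if v == u]
        for i, lab in zip(idx, top_bottom_labels([scores[i] for i in idx])):
            labels[i] = lab
    return labels

def classes_of(labels, users, user):
    """Set of labels assigned to one user's instances (ignoring unlabelled)."""
    return {lab for lab, u in zip(labels, users) if u == user and lab}

# Two made-up users whose scores occupy different ranges.
users = ["a"] * 10 + ["b"] * 10
scores = list(range(1, 11)) + list(range(21, 31))
uniq = uniq_labels(scores, users)
pers = pers_labels(scores, users)
```

In this toy case UNIQ assigns every labelled instance of user "a" to "low" and every labelled instance of user "b" to "high", so knowing the user is enough to know the class; PERS gives each user both classes.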

Features
For Dataset 1, we first defined a "user snippet" as the concatenation of all texts generated by a user within a set time interval, such that the maximum time difference between two consecutive document timestamps is less than 20 minutes. We performed some standard noise reduction steps (converted text to lowercase, replaced URLs/user mentions and performed language identification and tokenisation [6]). Given a mood form and a set of snippets produced by a user before the completion of a mood form, we extracted some commonly used feature sets for every snippet written in English [26], which were used in all experiments. To ensure sufficient data density, we excluded users for whom we had overall fewer than 25 snippets on the days before the completion of the mood form or fewer than 40 mood forms overall, leading to 27 users and 2,368 mood forms. For Dataset 2, we extracted the features presented in Table 4. We only kept the users that had at least 10 self-reported stress questionnaires, leading to 44 users and 2,146 instances. For our random experiments used in P2, in Dataset 1 we replaced the text representation of every snippet with random noise (µ = 0, σ = 1) of the same feature dimensionality; in Dataset 2, we replaced the actual inferred value of every activity/audio sample with a random inference class; we also replaced each of the detected conversation samples and samples detected in a dark environment/locked/charging, with a random number (<100, uniformly distributed) indicating the number of pseudo-detected samples.
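The snippet definition amounts to a simple grouping rule over message timestamps. The sketch below assumes timestamps expressed in minutes and groups them only; the concatenation of the corresponding texts is left out, and the function name is hypothetical.

```python
def build_snippets(timestamps, max_gap_minutes=20):
    """Group message timestamps (in minutes) into snippets by time gap."""
    snippets = []
    for ts in sorted(timestamps):
        if snippets and ts - snippets[-1][-1] < max_gap_minutes:
            snippets[-1].append(ts)      # close enough to the previous message
        else:
            snippets.append([ts])        # gap too large: start a new snippet
    return snippets

# Made-up message times for one user, in minutes since some reference point.
message_minutes = [0, 5, 19, 60, 70, 200]
snippets = build_snippets(message_minutes)
```

The six messages fall into three snippets: the first three are chained by gaps below 20 minutes, the next two form a second snippet, and the last one stands alone.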
(a) duration of the snippet; (b) binary ngrams (n = 1, 2); (c) cosine similarity between the words of the document and the 200 topics obtained by [21]; (d) functions over word embeddings dimensions [24] (mean, max, min, median, stdev, 1st/3rd quartile); (e) lexicons [8,11,16,17,18,30]: for lexicons providing binary values (pos/neg), we counted the number of ngrams matching each class and for those with score values, we used the counts and the total summation of the corresponding scores.

P1: Using Past Labels
Table 5 presents the results on the basis of the methodology by LiKamWa et al. [13], along with the average scores reported in [13]. Note that the range of the mood scores varies on a per-target basis; hence, the reported results of different models should be compared among each other when tested on the same target. As in [13], always predicting the average score (AVG) for an unseen user performs better than applying a LR model trained on other users in a LOUOCV setting. If the same LR model used in LOUOCV is trained without using the previously self-reported ground-truth scores (Model D, -mood), its performance drops further. This showcases that personalised models are needed for more accurate mental health assessment (note that the AVG baseline is, in fact, a personalised baseline) and that there is no evidence that we can employ effective models in real-world applications to predict the mental health of previously unseen individuals, based on this setting.
Table 5. P1: Results following the approach in [13].
The accuracy of LR under LOIOCV is higher, except for the "stress" target, where the performance is comparable to LOUOCV and lower than the AVG baseline. However, the problem in LOIOCV is the fact that the features are extracted based on the past three days, thus creating a temporal cross-correlation in our input space. If a similar correlation exists in the output space (target), then we end up in danger of overfitting our model to the training examples that are temporally close to the test instance. This type of bias is essentially present if we force a temporal correlation in the output space, as studied next.

P2: Inferring Test Labels
The charts in Fig. 1 (top) show the results by following the LOIOCV approach from Canzian and Musolesi [3]. The pattern that these metrics take is consistent and quite similar to the original paper: specificity remains at high values, while sensitivity increases as we increase the time window from which we extract our features. The charts on the bottom in Fig. 1 show the corresponding results in the LOUOCV setting. Here, such a generalisation is not feasible, since the increases in sensitivity are accompanied by sharp drops in the specificity scores.
The arising issue though lies in the LOIOCV setting. By training and testing on the same days (for $T_{HIST} > 1$), the kernel matrix takes high values for cells which are highly correlated with respect to time, making the evaluation of the contribution of the features difficult. To support this statement, we train the same model under LOIOCV, using only the mood form completion date (Unix epoch) as a feature. The results are very similar to those achieved by training on $T_{HIST} = 14$ (see Table 6). We also include the results of another naïve classifier (LAST), predicting always the last observed score in the training set, which again achieves similar results. The clearest demonstration of the problem though is by comparing the results of the RAND against the FEAT classifier, which shows that under the proposed evaluation setup we can achieve similar performance if we replace our inputs with random data, clearly demonstrating the temporal bias that can lead to over-optimistic results, even in the LOIOCV setting.
Table 6. P2: Performance (sensitivity/specificity) of the SVM classifier trained over 14 days of smartphone/social media features (FEAT) compared against 3 naïve baselines.
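The effect of temporal proximity can be illustrated with a much simpler stand-in for the DATE classifier. The sketch below is not the SVM used in the experiments: it is a hypothetical nearest-neighbour rule that, under LOIOCV, copies the label of the training instance closest in time to the held-out one, applied to made-up labels that change slowly over time.

```python
def nearest_in_time_loiocv(days, labels):
    """LOIOCV accuracy of copying the label of the temporally closest instance."""
    hits = 0
    for i, day in enumerate(days):
        others = [j for j in range(len(days)) if j != i]
        nearest = min(others, key=lambda j: abs(days[j] - day))
        hits += labels[nearest] == labels[i]
    return hits / len(days)

days = list(range(40))
labels = [0] * 15 + [1] * 10 + [0] * 15     # labels change slowly over time
accuracy = nearest_in_time_loiocv(days, labels)
```

Without any behavioural feature, this rule is wrong only on the two days where the label switches, which mirrors how a model with access to the timestamp alone can look accurate when targets are temporally smooth.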

P3: Predicting Users
Experiment 1: Table 7 shows the results based on the evaluation setup of Tsakalidis et al. [26]. In the MIXED cases, the pattern is consistent with [26], indicating that normalising the features on a per-user basis yields better results, when dealing with sparse textual features (positive, negative, wellbeing targets). The explanation of this effect lies within the danger of predicting the user's identity instead of her mood scores. This is why the per-user normalisation does not have any effect for the stress target, since for that we are using dense features derived from smartphones: the vocabulary used by the subjects for the other targets is more indicative of their identity. In order to further support this statement, we trained the SVR model using only the one-hot encoded user id as a feature, without any textual features. Our results yielded $R^2$ = {0.64, 0.50, 0.66} and RMSE = {5.50, 5.32, 6.50} for the {positive, negative, wellbeing} targets, clearly demonstrating the user bias in the MIXED setting.
The RMSEs in LOIOCV are the lowest, since different individuals exhibit different ranges of mental health scores. Nevertheless, $R^2$ is slightly negative, implying again that the average predictor for a single user provides a better estimate for her mental health score. Note that while the predictions across all individuals seem to be very accurate (see Fig. 2), by separating them on a per-user basis, we end up with a negative $R^2$.
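A small worked example shows how the same predictions can give a high pooled $R^2$ and a negative per-user $R^2$. The numbers are made up for illustration: two hypothetical users fluctuate around different levels, and the "model" outputs a constant, slightly off guess of each user's level.

```python
from statistics import mean

def r_squared(preds, targets):
    """R^2 = 1 - sum((pred - y)^2) / sum((y - mean(y))^2)."""
    y_bar = mean(targets)
    ss_res = sum((p - y) ** 2 for p, y in zip(preds, targets))
    ss_tot = sum((y - y_bar) ** 2 for y in targets)
    return 1 - ss_res / ss_tot

# Two made-up users whose mood scores fluctuate around different levels.
user_a = [10, 12, 11, 13, 14]
user_b = [30, 32, 31, 33, 34]
# Constant predictions that only reflect which user an instance belongs to.
preds_a = [11] * 5
preds_b = [33] * 5

pooled_r2 = r_squared(preds_a + preds_b, user_a + user_b)
r2_a = r_squared(preds_a, user_a)
r2_b = r_squared(preds_b, user_b)
```

Pooled over both users the predictions explain about 97% of the variance, because most of that variance is the gap between the two users; within each user the same predictions do worse than that user's own mean and the $R^2$ is -0.5.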
In the unbiased LOUOCV setting the results are, again, very poor. The reason for the high differences observed between the three settings is provided by the $R^2$ formula itself, $R^2 = 1 - \frac{\sum_i (pred_i - y_i)^2}{\sum_i (y_i - \bar{y})^2}$. In the MIXED case, we train and test on the same users, while $\bar{y}$ is calculated as the mean of the mood scores across all users, whereas in the LOIOCV/LOUOCV cases, $\bar{y}$ is calculated for every user separately. In MIXED, by identifying who the user is, we have a rough estimate of her mood score, which is by itself a good predictor, if it is compared with the average predictor across all mood scores of all users. Thus, the effect of the features in this setting cannot be assessed with certainty.
Table 7. P3: Results following the evaluation setup in [26] (MIXED), along with the results obtained in the LOIOCV and LOUOCV settings with (+) and without (-) per-user input normalisation.
Experiment 2: Table 8 displays our results based on Jaques et al. [9] (see section 3.3). The average accuracy on the "UNIQ" setup is higher by 14% compared to the majority classifier in MIXED. The LOIOCV setting also yields very promising results (mean accuracy: 81.17%). As in all previous cases, in LOUOCV our models fail to outperform the majority classifier. A closer look at the LOIOCV and MIXED results though reveals the user bias issue that is responsible for the high accuracy. For example, 33% of the users had all of their "positive" scores binned into one class, as these subjects were exhibiting higher (or lower) mental health scores throughout the experiment, whereas another 33% of the subjects had 85% of their instances classified into one class. By recognising the user, we can achieve high accuracy in the MIXED setting; in the LOIOCV, the majority classifier can also achieve at least 85% accuracy for 18/27 users.
Table 8. P3: Accuracy by following the evaluation setup in [9] (MIXED), along with the results obtained in LOIOCV & LOUOCV.
In the "PERS" setup, we removed the user bias, by separating the two classes on a per-user basis. The results now drop heavily even in the two previously well-performing settings and can barely outperform the majority classifier. Note that the task in Experiment 2 is relatively easier, since we are trying to classify instances into two classes which are well-distinguished from each other from a psychological point of view. However, by removing the user bias, the contribution of the user-generated features to this task becomes once again unclear.

Proposal for Future Directions
Our results emphasise the difficulty of automatically predicting individuals' mental health scores in a real-world setting and demonstrate the dangers due to flaws in the experimental setup. Our findings do not imply that the presented issues will manifest themselves to the same degree in different datasets; for example, the danger of predicting the user in the MIXED setting is higher when using the texts of 27 users rather than sensor-based features of more users [1,2,9,22]. Nevertheless, it is crucial to establish appropriate evaluation settings to avoid providing false alarms to users, if our aim is to build systems that can be deployed in practice. To this end, we propose model building and evaluation under the following:
- LOUOCV: By definition, training should be performed strictly on features and target data derived from a sample of users and tested on a completely new user, since using target data from the unseen user as features violates the independence hypothesis. A model trained in this setting should achieve consistently better results on the unseen user compared to the naïve (from a modelling perspective) model that always predicts his/her average score.
- LOIOCV: By definition, the models trained under this setting should not violate the i.i.d. hypothesis. We have demonstrated that the temporal dependence between instances in the train and test set can provide over-optimistic results. A model trained in this setting should consistently outperform naïve, yet competitive, baseline methods, such as the last-entered mood score predictor, the user's average mood predictor and the auto-regressive model.
Models that can be effectively applied in any of the above settings could revolutionise the mental health assessment process while providing us, in an unbiased setting, with great insights into the types of behaviour that affect our mental wellbeing. On the other hand, positive results in the MIXED setting cannot guarantee model performance in a real-world setting in either LOUOCV or LOIOCV, even if they are compared against the user average baseline [4].
Transfer learning approaches can provide significant help in the LOUOCV setting. However, these assume that single-domain models have been effectively learned beforehand, whereas all of our single-user (LOIOCV) experiments provided negative results. Better feature engineering through latent feature representations may prove to be beneficial. While different users exhibit different behaviours, these behaviours may follow similar patterns in a latent space. Such representations have seen great success in recent years in the field of natural language processing [15], where the aim is to capture latent similarities between seemingly diverse concepts and represent every feature based on its context. Finally, working with larger datasets can help in providing more data to train on, but also in assessing the model's ability to generalise in a more realistic setting.

Conclusion
Assessing mental health with digital media is a task which could have great impact on monitoring of mental well-being and personalised health. In the current paper, we have followed past experimental settings to evaluate the contribution of various features to the task of automatically predicting different mental health indices of an individual. We find that under an unbiased, real-world setting, the performance of state-of-the-art models drops significantly, making the contribution of the features impossible to assess. Crucially, this holds for both cases of creating a model that can be applied to previously unobserved users (LOUOCV) and a personalised model that is learned for every user individually (LOIOCV).
Our major goal for the future is to achieve positive results in the LOUOCV setting. To overcome the problem of having only few instances from a diversely behaving small group of subjects, transfer learning techniques on latent feature representations could be beneficial. A successful model in this setting would not only provide us with insights on what types of behaviour affect mental state, but could also be employed in a real-world system without the danger of providing false alarms to its users.