Measuring Hidden Bias within Face Recognition via Racial Phenotypes

Recent work reports disparate performance for intersectional racial groups across face recognition tasks: face verification and identification. However, the definition of those racial groups has a significant impact on the underlying findings of such racial bias analyses. Previous studies define these groups based on either demographic information (e.g. African, Asian) or skin tone (e.g. lighter or darker skin). The use of such sensitive or broad group definitions has disadvantages both for bias investigation and for the design of subsequent counter-bias solutions. By contrast, this study introduces an alternative racial bias analysis methodology for face recognition based on facial phenotype attributes: the set of observable characteristics of an individual face, where a race-related facial phenotype is specific to the human face and correlated with the racial profile of the subject. We propose categorical test cases to investigate the individual influence of those attributes on bias within face recognition tasks. We compare our phenotype-based grouping methodology with previous grouping strategies and show that phenotype-based groupings uncover hidden bias without reliance upon any potentially protected attributes or ill-defined grouping strategies. Furthermore, we contribute corresponding phenotype attribute category labels for two face recognition benchmarks: RFW for face verification and the VGGFace2 test set for face identification.


Introduction
An increasing number of automated face recognition systems have been deployed by companies, nonprofits and governments to make autonomous decisions for millions of users [1]. Such wide-scale adoption within real-world scenarios brings with it valid concerns about the potential abuse of face recognition due to the presence of data and algorithmic bias [2,3]. The most common issue pertaining to such bias arises in racial groups [4]. Subsequently, the research community has focused on methods that rely on demographic or skin type group annotations drawn from public face recognition benchmark datasets [5,6], measuring bias via algorithmic performance on such predefined groupings. However, current grouping annotations and related bias evaluation strategies may lead to unintended negative implications, each of which we now detail to illustrate our motivation clearly.
Ambiguous Definition of Race: The historical and biological definitions of race vary, and racial context is not fixed over time [7]. Such ambiguity becomes more problematic in the face recognition literature, as many researchers do not provide any background about their racial categorisation design process [8]. However, racial groupings are critical to the effective evolution of face recognition methodologies as they often represent the all-important means of quantitative evaluation. As in any recognition task, poorly defined groupings result in skewed mean and standard deviation measures of relative performance due to ill-posed boundary conditions on group membership, which can cause a given example to justifiably transit from one group to another.
Privacy of Protected Attributes: Auditing benchmark datasets can cause potential privacy and consent violations for dataset subjects [9]. For example, exposing demographic origin may place the members of a group under threat, leading to the potential for racial profiling and associated targeting [10]. As information on racial or ethnic origin is sensitive [11], researchers should either avoid revealing such sensitive data or provide an appropriate context for its use [9].
Confined Groupings: Skin or racial grouping strategies such as binary {light vs. dark; black vs. white} divisions limit the scope of any racial bias study, as they fail to capture the full extent of the bias problem, which must also consider multi-racial and less stereotypical members of such groups. Instead, [12,13] use Fitzpatrick skin type groupings to evaluate racial bias, but any such skin-tone-based racial grouping contains multidimensional traits including nose, hair type, eyes and lips [14]. Leveraging all such traits together instead enables improved interpretation and derivation of approaches to address racial bias.
Racial Appearance Bias: Maddox [15] explains racial appearance bias as a negative disposition toward phenotypic variations in facial appearance. He also discusses how race-conscious social policies may fail to address racial biases in the treatment and outcomes of disadvantaged groups [16]. Many studies show that individuals with a more stereotypical racial appearance suffer poorer outcomes than those with a less stereotypical appearance for their race [16,25,26]. On the other hand, a better understanding of the role of phenotypic variation complements solutions to racial bias [15]. By phenotype, we mean the set of observable characteristics of an individual face, where a race-related facial phenotype is specific to the human face and correlated with the racial profile of the subject.
Accordingly, we propose using race-related facial phenotype characteristics within face recognition to investigate racial bias. We categorise representative racial characteristics of the face and explore the impact of each phenotype attribute: skin type, eyelid type, nose shape, lip shape, hair colour and hair type. We annotate these attributes for two publicly available face datasets: VGGFace2 (test set) and RFW. We assess the impact of both attribute-based and subgroup-based evaluation on the racial bias of face recognition tasks. We use two different training setups for face verification to compare performance disparities between imbalanced and racially balanced training datasets. We compare our phenotype-based evaluation strategy with race- and skin-type-based grouping evaluation, and show that our strategy provides a more elaborate view of bias without revealing any potentially protected or ill-defined information.
This study presents a new evaluation strategy using facial phenotype attributes to investigate and measure racial bias with greater granularity within face recognition tasks. Our key contributions are as follows:
• we propose a new evaluation strategy that uses facial phenotype attributes rather than race labels to measure racial bias within both attribute-based and subgroup-based performance of state-of-the-art face recognition algorithms;
• we contribute additional facial phenotype attribute labels for the VGGFace2 (face identification) and RFW (face verification) benchmark face datasets;
• we uncover potentially hidden sources of bias within the evaluation of racial groups, supported by quantitative evidence.

Related Work
Automated facial recognition encompasses two different tasks: face identification and face verification. For both tasks, studies present approaches that achieve strong overall performance on public benchmark datasets [22,27], whilst the racial diversity within these datasets is often limited, biased and overlooked [28]. Consequently, numerous studies audit publicly available face datasets to demonstrate dataset bias in face recognition [19,29,30]. However, the group definitions in use vary, and this lack of consensus makes it significantly harder to tackle bias collaboratively due to an inconsistent problem definition across the literature.
We show leading benchmark face datasets and the grouping strategies they use to tackle racial bias in Table 1. As a subset of MS-Celeb-1M, the RFW dataset [19] measures the racial performance of face verification on four different racial groups: {African, Asian, Indian, Caucasian}. FairFace [29] is another dataset, drawn from the YFCC-100M Flickr dataset, providing additional group labels {Middle East, Latino} to evaluate bias on wider groupings. UTKFace [31] is a large-scale face dataset with five ethnicity categories for a variety of tasks, such as face detection, age estimation and age progression/regression. More recently, the Casual Conversations dataset [23], collected from vendor data, contains 45K videos with corresponding Fitzpatrick skin type labels. We subsequently categorise studies in the literature according to the grouping strategies they adopt and explain each category below.
Racial Groupings: Although the definition of race carries considerable complexity and ambiguity, an increasing number of studies adopt various racial groupings and show performance disparities among them [4,32]. The underlying reasons for such disparate results are summarised into two categories by [33]: both the distribution of data across pre-defined racial groups and how bias is measured play a major role in the results. However, the majority of racial bias research rarely details how racial groups are determined or how racial bias evaluation metrics are designed [8]. Furthermore, [34] showed that non-explicit racial factors (accessories, hairstyles or facial anomalies) conflate with explicit racial factors (skin tone, nose shape or eye shape), and that both strongly affect recognition performance, arguing that each factor must be investigated in order to achieve robust, fair and explainable face recognition solutions. Such needs contradict the use of racial groups, which remain too coarse to support elaborate explanations [35].
Skin Type Groupings: Various studies [12,21,23] measure racial bias in face recognition using either the Fitzpatrick skin types [36] or binary skin groups instead of racial groups. Skin type labels are obtained mostly from crowd-sourcing [30], neural network-based classifiers [37,38] or professional human annotators. Merler [30] presents additional human-interpretable quantitative measures of intrinsic facial features along with subjective annotations. Although [39] show that Fitzpatrick skin type classification from uncontrolled imagery is a challenging task, [38] achieve a high correlation with ground-truth labels using consistent reference points in a human-level annotation interface.
Another issue is the impact of skin tone on racial bias. Cook [40] shows that measured skin reflectance across binary skin groups had the greatest net effect on the average biometric performance of face recognition. On the contrary, [41] claims this does not hold for continuous Fitzpatrick skin type groupings. Furthermore, skin tone alone is insufficient for analysing racial bias, as there is no clear evidence that skin tone is the primary driver of disparate false match rates [41]. Accordingly, many studies [42,43] suggest examining other race-related facial attributes, including lip, eye and face shape, in order to measure racial bias within face recognition.
In order to address these issues, we propose a phenotype-based evaluation strategy for racial bias within face recognition. We provide facial phenotype attributes for the public benchmark datasets VGGFace2 and RFW in Table 1. We explain the phenotype-based attribute category selection process and our annotation framework in Sections 4 and 5, respectively. We elaborate on our findings by measuring the performance of two state-of-the-art algorithms using imbalanced and racially balanced training sets. Firstly, we analyse bias for each phenotype attribute category (attribute-based). Secondly, we produce different appearance-based joint distributions of face subjects and assess algorithm performance on subject groupings (subgroup-based). We provide detailed experiments in Section 6.

Ethical Considerations
Intent: This work intends to provide a novel racial bias analysis methodology via facial phenotype attributes for face recognition. The proposed strategy avoids the need for researchers to use potentially protected or ill-defined subject attributes and instead introduces racial phenotype attributes to explore racial bias in face recognition.
Denotation of Facial Phenotypes: We denote race-related phenotype attributes according to the studies of [44,45] in order to have descriptive naming whilst avoiding causing any unintended offence to individuals.
Use of VGGFace2 and RFW: We conduct our experiments on two different face datasets which are publicly available for research use only. The reader is directed to the original source publications and the associated research organisations for access to these datasets. We make supplementary facial attribute labels for these datasets available in order to facilitate the use of our proposed evaluation strategy by other researchers, with the aim of furthering our stated intent above.

Racial Phenotypes on Face Images
In this section, we explain the categorisation of facial phenotype attributes for face recognition. Quine [46] presents three possible definitions of the race concept: genetic variation between humans, morphological attributes, and genetically determined psychological characteristics. The morphological attributes are of primary interest for resolving racial bias in face recognition. For morphological attributes, studies [47,48] focus on the impact of human phenotype characteristics on race estimation, categorising the attributes by considering biological traits. The study of Shades of Race [44] investigates the marginal effects of phenotypic characteristics including skin tone, lips, nose, hair and body type on racial categorisation. Zhuang [49] considers 21 anthropometric measurements, such as face width and length, nose breadth and length, and eye corner points, and finds statistically significant differences in facial measurements between four racial/ethnic groups: {Caucasian, Hispanic, African, other (mainly Asian)}.
We adapt such groupings and measurements to face recognition by considering two limitations. Firstly, effectively evaluating face recognition tasks requires tightly cropped (e.g. 112 × 112 px), low-quality images containing occlusion, shadows and illumination variations for both the training and test stages. This makes phenotype attribute detection more difficult on face dataset images than on real-world human faces. Secondly, broader categorisation increases the number of potential groupings, making bias evaluation inefficient for face recognition systems. Correspondingly, we use six primary attributes to define the phenotype groupings for our study: skin type, eyelid type, nose shape, lip shape, hair type and hair colour. This yields 21 different attribute categories under the six primary attributes, as listed in Table 2 [44] along with normalised standard deviations σ/µ.
We choose the Fitzpatrick skin types [36] for skin tone as they provide more granularity, {Type 1, Type 2, Type 3, Type 4, Type 5, Type 6}, than binary skin tone groupings, {lighter skin-tone, darker skin-tone}. The appearance of the human eye has been grouped by its position, shape and settings in many cosmetic industry guidelines [50]; however, these have neither a scientific background nor a solid relation to race. Instead, we look for epicanthal folds and check eyelid differences, as this is a more distinctive attribute for racial bias [51]. We acknowledge that a single attribute category can be observed in multiple race groups; however, our main concern is identifying the most observable and convenient racial phenotype attributes on images in order to evaluate bias (see Table 2).
For the appearance of the nose, we use two categories, wide and narrow, by examining nasal breadth [49]. Hair texture is labelled into eight categories using the frequency of twists, waves, and the curve diameter metric of [52]. Here we group these eight categories into three main hair texture types, straight, wavy and curly, in addition to bald. Despite being the most readily artificially manipulable attribute, we retain hair colour as it is related to skin tone [53]; the hair colour categories we use are red, grey, black, blonde and brown (see Table 2).

Annotation of Racial Phenotypes
In Section 4, we explained how we define racial phenotype attributes and their categories. Before the annotation process, we chose well-established face recognition datasets to validate our proposed methodology. For the face verification task, we choose the RFW dataset [19] as it provides a relatively broad racial distribution of subjects, where each subject has 3-5 images. For face identification, we use the VGGFace2 closed test set [5], which contains at least 300 images per subject. For both datasets, we design an annotation interface to make the annotation process both user-friendly and robust. We present multiple sample images of a subject to avoid incorrect annotation caused by challenging samples such as grey-scale images, facial makeup and poor scene illumination. Each subject is presented with attribute category selectors next to a set of face images within the annotation interface. Subsequently, an annotator experienced in morphological differences among races annotates each subject using the interface.
We obtain annotations for 11,654 subjects from the RFW and VGGFace2 benchmark datasets. Each annotation took 10-20 seconds, and the overall annotation took 12 days (i.e. the annotator working at a maximum of 6 hours per day with regular breaks). The result of this annotation process, the phenotype attribute distributions for the RFW and VGGFace2 benchmark datasets, is shown in Figure 1 left/right, respectively. We also present the normalised standard deviations (coefficient of variation), σ/µ, among the attribute categories of the benchmark datasets in Table 2, to show the level of imbalance within these categories. For both datasets, we observe that the dominant phenotype attribute categories are Skin Type 3, Straight Hair, Narrow Nose, Other (non-monolid) Eyes and Small Lips, which correlates with the dominant presence of Caucasian faces evident in Figure 1.
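The imbalance measure used above is the coefficient of variation (σ/µ) over the per-category subject counts of each attribute. A minimal sketch, with hypothetical counts rather than the actual dataset figures:

```python
import numpy as np

def coefficient_of_variation(counts):
    """Normalised standard deviation (sigma / mu) of category counts.

    Higher values indicate a more imbalanced distribution of subjects
    across the categories of a single attribute.
    """
    counts = np.asarray(counts, dtype=float)
    return counts.std() / counts.mean()

# Hypothetical subject counts for a six-category attribute
# (e.g. skin type); illustrative values, not the RFW figures.
skin_type_counts = [50, 900, 4200, 2500, 1200, 300]
print(f"sigma/mu = {coefficient_of_variation(skin_type_counts):.2f}")
```

A perfectly balanced attribute gives σ/µ = 0, so the value is directly comparable across attributes with different category counts.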

Experimental Results and Discussion
In this section, we analyse the performance of our phenotype-based grouping methodology for face recognition tasks. We provide a public reference implementation, dataset reference links and pre-trained models.

Training Setups
Setup 1 (Imbalanced Training Data): We train ArcFace [54] with a ResNet100 backbone [55] on the VGGFace2 benchmark dataset, which contains 8631 subjects whose distribution is racially imbalanced. Our specific choice of VGGFace2 here is motivated by investigating the impact of imbalanced training data, which includes data bias, on our proposed evaluation strategy. Setup 2 (Racially Balanced Training Data): We use a ResNet34 [55] backbone architecture with the Softmax loss [56] trained on the BUPT-Balanced benchmark dataset [20], which contains 28000 face subjects. BUPT-Balanced has a racially balanced distribution among four groups, {African, Asian, Indian, Caucasian}, with 7000 face subjects each. The primary purpose of setup 2 is to assess the impact of a racially balanced training dataset on the bias results obtained using our proposed phenotype-based methodology, comparing how much a racially balanced training dataset reduces the performance differences observed with setup 1.

Face Verification
Face verification, also known as one-to-one verification, is the task of comparing two facial images to estimate whether they belong to the same individual. We follow two pairing strategies to explore the impact of a single attribute (attribute-based) and of appearance-based facial groups (subgroup-based) on face verification performance. Attribute-based pairing: Firstly, we generate pairs from images sharing the same attribute category, for example facial images of people who all have monolid eyes. Consequently, we compare individual attribute performance using both training setups for face verification.
For attribute-based face verification, we randomly select 20k positive and 20k negative pairs from all possible pairs of each attribute. We calculate the cosine similarity of the feature encodings of all selected negative and positive pairs to obtain the most challenging pairs. Subsequently, we select the 3000 most similar pairs from the negative samples and the 3000 least similar pairs from the positive samples for each attribute category in Table 3. Since the Type 1 category of the skin type attribute and the red hair category of the hair colour attribute do not have enough samples to generate 6000 pairs, we instead produce 602 pairs (301 positive, 301 negative) for Type 1 and 1200 pairs (600 positive, 600 negative) for red hair.
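The hard-pair selection described above can be sketched as follows, assuming precomputed feature embeddings for each side of every candidate pair; the function names are illustrative and this is not the paper's reference implementation:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between corresponding rows of a and b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return (a * b).sum(axis=1)

def select_hard_pairs(emb_a, emb_b, is_positive, k=3000):
    """Keep the k most similar negative pairs (likely false matches)
    and the k least similar positive pairs (likely false non-matches).

    emb_a, emb_b : (n, d) embeddings of the two images in each pair
    is_positive  : (n,) boolean mask marking genuine (same-identity) pairs
    Returns indices of the hard positive and hard negative pairs.
    """
    sims = cosine_similarity(emb_a, emb_b)
    neg = np.where(~is_positive)[0]
    pos = np.where(is_positive)[0]
    hard_neg = neg[np.argsort(sims[neg])[::-1][:k]]  # highest similarity
    hard_pos = pos[np.argsort(sims[pos])[:k]]        # lowest similarity
    return hard_pos, hard_neg
```

Selecting the hardest pairs in this way concentrates the evaluation on the decision boundary, where disparities between attribute categories are most visible.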
In this way, we measure face verification accuracy for each facial attribute. We use both training setups to show how the standard deviation (σ) of accuracies changes between balanced and imbalanced training data. We present the attribute-based sample groups in Table 3, with the standard deviation of accuracies both excluding the red hair and Type 1 attribute accuracies (σ*) and including them (σ). It is clear from Table 3 that for both setup 1 (imbalanced training data) and setup 2 (racially balanced training data), accuracy is lower for monolid eyes, black hair, full lips and wide nose than for other eyes, blonde hair, small lips and narrow nose, respectively. We also find a slight correlation between darker skin tones and higher false matching rates when we pair within the same attribute categories (Supplementary Table S1). Moreover, although the imbalanced training setup results in a larger performance difference (σ), the difference between the two setups is small, meaning that a racially balanced dataset distribution alone is not enough to overcome performance bias.
Additionally, NIST [4] suggests providing false matching rates for pairing combinations between each grouping in the dataset, as this is necessary for real-world scenarios. Therefore, we pair each attribute category with all other attribute categories to assess cross-attribute pairing performance, and evaluate the false matching rates between every attribute category pair combination in Figure 2. We randomly generate 10000 pairs for each category pairing; in total, we have 441 (21 × 21) pairings. For example, each cross-attribute pairing consists of 10000 pairs between blonde hair and monolid eyes, Type 3 and wide nose, or wavy hair and full lips, etc. As a result, Figure 2 clearly shows that Type 5, Type 6 and monolid eye pairings have higher false matching rates than all other attribute categories using training setup 1. Consequently, the impact of dark skin tones on performance increases for cross-attribute pairings compared to attribute-based pairings.
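A cross-attribute false match rate matrix of the kind plotted in Figure 2 can be sketched as below; the threshold, category names and data layout are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def false_match_rate(similarities, threshold):
    """Fraction of impostor (different-identity) pairs scoring at or
    above the decision threshold, i.e. accepted in error."""
    return (np.asarray(similarities) >= threshold).mean()

def fmr_matrix(neg_sims_by_pairing, categories, threshold):
    """Build a len(categories) x len(categories) matrix of log10(FMR)
    from impostor similarity scores keyed by category pairing.

    Cells with no impostor pairs, or an FMR of zero, are left as NaN
    since they have no finite value on a log scale."""
    n = len(categories)
    m = np.full((n, n), np.nan)
    for i, ci in enumerate(categories):
        for j, cj in enumerate(categories):
            sims = neg_sims_by_pairing.get((ci, cj))
            if sims is not None and len(sims):
                fmr = false_match_rate(sims, threshold)
                if fmr > 0:
                    m[i, j] = np.log10(fmr)
    return m
```

On this log scale, more negative cells correspond to lower false match rates, so the higher-FMR pairings (e.g. darker skin tones, monolid eyes) stand out as cells closer to zero.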
Subgroup-based pairing: Secondly, we create various subgroups with different phenotypic attribute combinations in the dataset. For example, one such subgroup consists of subjects with skin Type 3, monolid eyes, straight hair, a wide nose and small lips. Our main purpose with such pairing is to show the effect of a single attribute change over a group: for instance, what would change if only the skin became darker while all other attributes remained the same? Furthermore, we generate all possible subgroups with different phenotypic attribute category combinations to investigate subgroup-based performance. However, we need to limit the number of subgroups such that we can present our results efficiently. We first remove the hair colour attribute, as it is the race-relevant attribute that individuals can most readily modify via styling. We then merge the skin types into three groups, denoted {1,2} for Types 1 and 2, {3,4} for Types 3 and 4, and {5,6} for Types 5 and 6. Lastly, we remove subgroups with few or no samples in the test set, which comprise 3% of all samples. In Table 4, we show the performance of each subgroup alongside its proportion of the original test dataset. To evaluate performance, we generate 6000 pairs (3k positive and 3k negative) from all possible pairs of subgroups that have enough samples; for the rest, we generate as many equal numbers of negative and positive pairs as availability facilitates. From our observations of Table 4, we can conclude that groups which have one of the attributes wide nose, full lips or monolid eyes always have lower accuracy than the corresponding groups with a narrow nose, small lips or other eye types (when the rest of the attributes are the same). Furthermore, whilst the average accuracy of subgroups with Type {5,6} skin is 86.97%, that of subgroups with Type {1,2} skin is 92.56%, although this notably includes the effects of other attributes. Such an imbalanced subgroup distribution causes many different evaluation and analysis problems: it lacks sufficient interpretation in the test phase, as there are minorities in the global populace with dark skin and monolid eyes, or other less common variations, for which benchmark datasets do not contain enough representation. An improved evaluation dataset would be one able to cover more phenotype combinations, such that its distribution is an unbiased representation of the global populace. Lastly, we estimate such disparities among
different grouping strategies using training setup 2. We take the racial groupings {African, Asian, Indian, Caucasian} and binary skin tone groupings {lighter skin-tone, darker skin-tone}, as they are very common grouping strategies in the literature, and compare them with our phenotype-based grouping strategy. In Figure 3, we show how the accuracy and its standard deviation differ between sub-groups under the three strategies. Higher variation reveals hidden bias, which may be missed by narrow, erroneous racial or binary skin tone grouping strategies. The phenotype-based grouping strategy provides a more granular observation of the variability in performance (i.e. a higher standard deviation) and hence a more resolute measure of performance bias.
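Comparing grouping strategies by the spread of their per-group accuracies, as in Figure 3, reduces to a standard-deviation computation over each strategy's groups. A sketch with hypothetical accuracy values (not the paper's results):

```python
import numpy as np

# Hypothetical per-group verification accuracies (percent) for three
# grouping strategies; illustrative values only.
strategies = {
    "racial":    {"African": 93.0, "Asian": 94.5, "Indian": 95.0,
                  "Caucasian": 97.0},
    "skin-tone": {"lighter": 96.0, "darker": 93.5},
    "phenotype": {"monolid eyes": 90.0, "wide nose": 91.5, "full lips": 92.0,
                  "type 5-6 skin": 89.0, "narrow nose": 96.5,
                  "small lips": 96.0},
}

for name, groups in strategies.items():
    acc = np.array(list(groups.values()))
    # A larger standard deviation across groups indicates more
    # (otherwise hidden) measured bias under that grouping strategy.
    print(f"{name:10s} mean={acc.mean():.2f} std={acc.std():.2f}")
```

With finer-grained phenotype groups, the same underlying performance disparities yield a larger standard deviation than coarse racial or binary skin tone groupings, which average away within-group variation.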

Face Identification
Face identification, also known as one-to-many matching, is the task of searching for a face across a facial database. There are two scenarios for face identification applications, depending on whether a queried face is enrolled in the database or not: open-set identification assumes the database does not necessarily contain the queried face, while closed-set identification always looks for a match in the database. In this study, we apply closed-set identification using the test set of the VGGFace2 benchmark dataset under the originally proposed protocol [5], and we extract image features using training setup 1 [54]. We apply a 5-fold train-test split where we sample 50 images from each subject as the test set and use the rest as the training set. We train a standard linear SVM on the extracted feature representations and predict the identities of the test samples. Our results are shown in Table 5, where we observe that the standard deviation (σ) is much smaller than in the earlier attribute-based face verification results of Table 3. This shows that closed-set face identification does not exhibit the same level of bias as we find for face verification. However, in this experiment, we are unable to keep the same proportion for each attribute, and we did not measure open-set face identification. As suggested in [4], future work should design and apply open-set tests for face identification on better-distributed benchmark datasets to measure bias extensively.
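The closed-set identification protocol described above (per-subject held-out images, a linear SVM on fixed embeddings) can be sketched with scikit-learn; the data here is a synthetic stand-in for face embeddings, and all names and sizes are illustrative:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in for face embeddings: 5 subjects x 80 images x 64 dims,
# each subject's images clustered around its own centre.
n_subjects, n_images, dim = 5, 80, 64
centres = rng.normal(size=(n_subjects, dim))
features = np.concatenate(
    [centres[s] + 0.1 * rng.normal(size=(n_images, dim))
     for s in range(n_subjects)]
)
labels = np.repeat(np.arange(n_subjects), n_images)

# Hold out a fixed fraction of images per subject for testing,
# mirroring the per-subject split in the protocol above.
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.25, stratify=labels, random_state=0
)

clf = LinearSVC().fit(X_train, y_train)
accuracy = clf.score(X_test, y_test)
print(f"closed-set identification accuracy: {accuracy:.2f}")
```

Because the gallery of identities is fixed and fully enrolled, closed-set identification reduces to a multi-class classification problem over the embedding space, which is why a linear SVM suffices.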

Conclusion
We propose a new evaluation strategy using facial phenotype attributes to assess racial bias in face recognition tasks. We present experimental results showing the impact of each phenotype attribute using two different training setups, with imbalanced and racially balanced training sets. We also provide different pairing strategies for face verification to draw attention to the importance of pairing for comprehensive evaluation. We observe apparent performance differences between race-related phenotype attribute categories and subgroups for both training setups, and uncover considerably larger performance disparities among phenotype attributes than among racial groups. We demonstrate that a phenotype-based evaluation strategy reveals racial bias comprehensively whilst avoiding exposing potentially protected or ill-defined attributes. Future work will focus on improving facial appearance variation using generative models to provide more balanced and realistic test scenario distributions.

Measuring Hidden Bias within Face Recognition via Racial Phenotypes -Supplementary Material
Seyma Yucer1, Furkan Tektas3, Noura Al Moubayed1 and Toby P. Breckon

We present attribute-based face verification scores, including False Non-Match Rate (FNMR), False Match Rate (FMR) and F1 score, in Table S1. We use the same pairings and protocol [27] presented in Section 6.2 for Table 3.
Whilst the F1 scores correlate with the Table 3 accuracies, for the imbalanced training setup 1 the false match rate is higher for attributes such as Monolid Eyes, Type 6/5/4/3 skin, Wide Nose and Full Lips than for the other categories under the same attribute. Moreover, we observe that the balanced training setup 2 improves the FMR while increasing the FNMR for the attribute categories with higher accuracies and F1 scores.

Figure 2. False matching rates (FMR) of cross-attribute pairings for 21 attribute categories using training setup 1. Each cell depicts FMR on a logarithmic scale, log10(FMR), with more negative values encoding lower (superior) false match rates.

Table 1. Publicly available face datasets for different types of facial analysis tasks and their grouping strategies to address racial bias.
*Casual Conversations dataset provides videos.
Figure 1. The distribution of facial phenotype attributes of the RFW (left) and VGGFace2 Test (right) datasets (y-axis: percentage of subjects).

Table 2. Facial phenotype attributes and their categorisation based on [44].

Table 3. Attribute-based face verification performance on RFW. σ represents the standard deviation of all attribute category accuracies, including red hair and Type 1; σ* represents the standard deviation excluding them.

Table 4. Subgroup-based face verification performance of RFW using training setup 1, sorted by descending order of accuracy.

Figure 3. Accuracy variations for three grouping strategies. The standard deviation of the groupings reflects the amount of measured bias. Racial grouping {African, Asian, Caucasian, Indian} accuracies are obtained from [20]. Binary skin tone {lighter skin-tone, darker skin-tone} accuracies are the average accuracy of Type 1-3 and Type 4-6 skin tones, respectively.

Table 5. Face identification performance on the VGGFace2 test set using a standard linear SVM and features from training setup 1, sorted by descending order of accuracy.

Table S1. Attribute-based face verification F1, FNMR and TMR scores of the RFW dataset on both training setups.