Deconfounding Causal Inference for Zero-Shot Action Recognition

Zero-shot action recognition (ZSAR) aims to recognize unseen action categories in the test set without corresponding training examples. Most existing zero-shot methods follow the feature generation framework to transfer knowledge from seen action categories to model the feature distribution of unseen categories. However, due to the complexity and diversity of actions, it remains challenging to generate the unseen feature distribution, especially in the cross-dataset scenario, where there is a potentially larger domain shift. This article proposes a Deconfounding Causal GAN (DeCalGAN) for generating unseen action video features with the following technical contributions: 1) Our model unifies compositional ZSAR with traditional visual-semantic models to incorporate local object information with global semantic information for feature generation. 2) A GAN-based architecture is proposed for causal inference and unseen distribution discovery. 3) A deconfounding module is proposed to refine representations of local object and global semantic information confounders in the training data. Action descriptions and random object features after causal inference are then used to discover unseen distributions of novel actions in different datasets. Our extensive experiments on Cross-Dataset Zero-Shot Action Recognition (CD-ZSAR) demonstrate substantial improvement on the UCF101 and HMDB51 standard benchmarks for this problem.


I. INTRODUCTION
Action recognition, also known as video recognition, is a fundamental problem in video understanding. Over the last decade, video action recognition has attracted increasing research attention with the emergence of high-quality large-scale action recognition datasets. Recently, a wide range of popular and successful model architectures have been designed for action recognition tasks. However, these methods require a large amount of training data for each action class, which requires costly and laborious annotation of videos, and the trained model does not generalize to unseen action categories. It is infeasible and extremely expensive to annotate action videos given the ever-increasing need for new categories. To solve this problem, zero-shot action recognition has recently drawn considerable interest for its ability to identify unseen action categories without labeled examples.
Junyan Wang, Maurice Pagnucco, and Yang Song are with the School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia (e-mail: junyan.wang@unsw.edu.au; morri@unsw.edu.au; yang.song1@unsw.edu.au).
Digital Object Identifier 10.1109/TMM.2023.3318300
Existing studies of ZSAR have mainly focused on inner-dataset seen/unseen splits due to the requirement of human-defined domain attributes [1], [2]. This setting is not very practical, since a new dataset could require re-training, as different datasets might exhibit cross-domain issues. Moreover, regardless of the type of side information adopted, the generalization capability of these approaches could be lacking due to the higher degree of domain shift across datasets. Recently, a more realistic cross-dataset zero-shot action recognition (CD-ZSAR) task [3] was proposed, which aims to make large-scale pretrained models transfer seamlessly to unseen classes across new datasets; our work therefore focuses on the CD-ZSAR scenario.
One of the main challenges of CD-ZSAR is the weak knowledge representation. Early research [4], [5] in zero-shot learning focused on developing a compatibility model, and most of these methods are attribute-based. In CD-ZSAR, it is infeasible to design a universal attribute space that is applicable to every new task and dataset. Therefore, word embedding is currently the most efficient side information for CD-ZSAR. Also, videos are highly complex, containing both spatial and temporal information, and hence it is difficult to apply an automatic word-embedding model to represent the global semantic knowledge of a class. Recent studies have investigated how object information [6], [7] or semantic embedding [8] performs in action recognition, and these studies have demonstrated successful outcomes, with object information and semantic embedding representing spatial and temporal information, respectively. However, the basic video-based backbone is ineffective in learning different domain knowledge in the zero-shot setting. Another main challenge is the unseen distribution. Recently, thanks to advances in generative adversarial networks (GANs), many approaches have been proposed to directly generate unseen samples in zero-shot tasks [9], [10], [11]. However, although a GAN is able to generate data from the distribution of the training dataset, it cannot expand the original distribution without seeing novel samples. As shown in Fig. 1, part of the feature distribution of unseen action videos differs from the training data, which means that unseen action videos generated by a basic GAN model can hardly represent the unseen distribution.
The above challenges motivate us to design a new framework for cross-dataset zero-shot action recognition with two sub-tasks, i.e., compositional generation and distribution inference. In general, the distribution of video data is more complex than that of image-level data. Instead of directly generating video data, we focus on generating lower-dimensional features extracted by a conventional backbone. Firstly, an action video contains both spatial and temporal information, such as characters, movements, and interactions. Weak knowledge representations such as word embedding can be compensated by compositional knowledge, which consists of local object information obtained with pretrained detectors and global semantic information obtained with Elaborative Description (ED) [12], as shown in Fig. 2. The second task aims to generate unseen action representations that can effectively infer the unseen distribution. We design a Deconfounding Causal GAN (DeCalGAN) framework with the following insights: 1) We propose a novel approach for generating compositional features from dual channels, i.e., Elaborative Descriptions (ED) and object detection, based on causal inference. Causal inference has been shown to be useful in compositional zero-shot learning, as it can identify the true causal relationships between variables [13]. Our approach builds a structured causality-inspired generative model that captures the causal relationships between features and actions. Specifically, we use a conditional causal graph to infer action features based on their corresponding semantic and object representations. 2) One of the main challenges in representation learning from videos is the presence of confounding factors that can arise from the diverse range of latent information. To address this challenge, we propose a deconfounding module that can handle the interference between global semantic and local object features. This is particularly important since each class can have an unlimited number of possible compositions of objects, and distinguishing between confounded object feature dimensions and semantic feature dimensions is critical. By ensuring that each factor is kept independent, our generative model can accurately infer unseen distributions. 3) Our proposed approach achieves zero-shot recognition by generating unseen action features based on random object information and EDs of test actions. Our method outperforms existing approaches on various benchmarks, demonstrating the effectiveness of our causal inference-based generative model for compositional feature generation. This work provides a promising direction for addressing the challenge of zero-shot recognition in video analysis.
Our contributions are summarized as follows:
• To the best of our knowledge, we present the first causal inference approach to address the unseen distribution problem for cross-dataset zero-shot action recognition (CD-ZSAR), and we propose a GAN architecture as a new paradigm for causal inference.
• The proposed Deconfounding Causal GAN (DeCalGAN) consists of a reconstruction module and a deconfounding module that can make confounding features learned from the source domain generalize better to the unseen distribution in the test domain.
• The proposed DeCalGAN is introduced to unify compositional and generative frameworks to tackle the challenging CD-ZSAR problem. Local object information and global semantic descriptions can jointly generate missing distributions across different datasets and achieve state-of-the-art performance.

II. RELATED WORK
Action recognition has drawn a significant amount of attention from the computer vision community in the past few years [14], [15], [16], [17], [18], [19]. Some attempts have been made to design an efficient method by combining a lightweight temporal module with a conventional 2D CNN-based backbone [16], [20]. For example, Li et al. [16] proposed a Temporal Excitation and Aggregation (TEA) block, including a motion excitation module and a multiple temporal aggregation module, specifically designed to capture both short- and long-range temporal evolution. Recent research shows that pure 3D CNNs outperform 2D ones on large-scale benchmarks [21], as 3D CNNs can jointly capture the spatio-temporal features in a unified framework. However, most approaches rely on specific large-scale training video datasets with annotated samples per action class. In this work, we focus on zero-shot action recognition, in which raw test data is unavailable.

A. Zero-Shot Action Recognition
Many zero-shot action recognition methods have been proposed recently [1], [6], [8], [22], [23]. An initial work [1] used a set of manually defined attributes to describe the spatio-temporal evolution of actions in a video. Other early attempts [8], [23] follow a standard strategy, which first extracts visual features from videos and then trains a joint model that maps the visual embedding to a semantic embedding space. The work of [23] explores word vectors as a shared semantic space to embed labels and videos for zero-shot action recognition. The work of [6] proposed a spatial-aware object embedding for zero-shot action localization and classification. Besides, the work of [24] devises a simple semantic transfer scheme that embeds semantic relatedness information between seen and unseen classes to composite unseen visual prototypes. However, previous studies have typically focused on inner-dataset seen/unseen splits. A recent work [22] proposed to train a 3D CNN to predict the word embedding of labels, as end-to-end training for CD-ZSAR. In this work, we also follow the cross-dataset protocol of [22] and apply causal inference to generate unseen class representations.
Fig. 2. Illustrations of Elaborative Description for the "tai chi", "fencing", "diving", "skiing", "yoga" and "bowling" actions from the UCF101 dataset.

B. Causal Inference
Causality [25], [26] has inspired computer vision researchers to design new methodologies for various tasks such as image recognition [27] and domain adaptation [28], [29]. The work of [30] learns a conditional-GAN model jointly with a causal model of label distribution. In contrast, our proposed DeCalGAN jointly learns semantic and object components by causal inference. In addition, [13] formalizes causal inference as a problem of finding the most likely intervention, while another method [31] explicitly promotes the dependency between all primitives and their compositions in the learned graph embedding. Recently, [32] developed a Deconfounded Cross-modal Matching (DCM) method to remove the confounding effects of moment location in the video moment retrieval task. In this work, our proposed method incorporates adversarial training for deconfounding compositional confounders to better generalize to the unseen distribution.
To achieve this, we follow the evaluation protocol in [22], using nearest-neighbor search in a semantic class embedding space. However, it is still difficult for a zero-shot learning classifier to generate representative embeddings for the unseen test classes, due to the weak knowledge representation and unseen distribution challenges. Thus, we propose the Deconfounding Causal GAN (DeCalGAN) for unseen action generation to enhance the classifier. Its details are introduced below, and the overall framework is shown in Fig. 3.

A. Revisiting Causality
We first give a brief introduction to causality, on which our proposed DeCalGAN is based. In this work, we apply structural causal models (SCMs) [25], which contain structural equations and directed acyclic graphs.
Definition 1: A structural causal model is a triple $M = (V, U, F)$, where $U$ is a set of exogenous variables, $V$ denotes a set of endogenous variables and $F$ is a group of deterministic functions.
Concretely, exogenous variables lie outside the model, and we do not care about their causes; each endogenous variable in the model is the child of at least one exogenous variable. Also, exogenous variables cannot be children of other variables, especially endogenous variables. If we know the value of each exogenous variable, we can completely determine the value of each endogenous variable by using the functions in $F$.
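To make Definition 1 concrete, the following is a minimal sketch of a toy SCM in Python. The variable names (`u_s`, `u_o`, `u_x`, `s`, `o`, `x`) and the structural functions are hypothetical choices for illustration and are not the model used in this work; the sketch only shows that fixing the exogenous values determines every endogenous value.

```python
import random

# Hypothetical toy SCM M = (V, U, F); names and functions are illustrative only.
def sample_exogenous(rng):
    """Draw the exogenous variables U, whose causes lie outside the model."""
    return {name: rng.gauss(0.0, 1.0) for name in ("u_s", "u_o", "u_x")}

# F: one deterministic function per endogenous variable, in topological order.
STRUCTURAL_EQUATIONS = {
    "s": lambda v, u: u["u_s"],
    "o": lambda v, u: u["u_o"],
    "x": lambda v, u: 2.0 * v["s"] - v["o"] + 0.1 * u["u_x"],
}

def evaluate(u, equations=STRUCTURAL_EQUATIONS):
    """Given values for U, compute every endogenous variable in V."""
    v = {}
    for name, f in equations.items():
        v[name] = f(v, u)
    return v

u = sample_exogenous(random.Random(0))
v1 = evaluate(u)
v2 = evaluate(u)  # same exogenous values, hence identical endogenous values
```

Because the functions in `STRUCTURAL_EQUATIONS` are deterministic, all randomness enters through `sample_exogenous`, which mirrors the role of $U$ in the definition.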
Causal Graph: A causal model $M$ has a corresponding causal graph $G$. Nodes in the graph represent $V$ and $U$ in the SCM, and edges in the graph represent the functions $F$. This means that if a variable $X$ is a child node of $Y$, then $Y$ is a direct cause of $X$. And if $X$ is a parent node of $Y$, then $X$ is a potential cause of $Y$.
Confounding: The common cause in a pseudo-correlation is known as a confounder, also called a bias. The pseudo-correlation caused by confounders is mixed with the real causal effect, which is the case of confounding. One of the goals of causal inference is to eliminate the bias caused by confounding and find the true causal relationship.
Do-operator: Conditioning on a variable changes our view of the variable, while intervention changes the variable itself. In a causal model $M$, the intervention $do(X = x)$ is performed by replacing the original function $X = f_x(P_x, U_x)$ with $X = x$, where $f_x$ represents a deterministic function, $P_x \subseteq V \setminus V_i$ and $X \in V$; that is, the intervention deletes all edges pointing to the variable. Thus, the intervention changes the distribution of the original data but leaves unchanged the conditional distribution of every other variable given its parents.
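The difference between conditioning and intervening under confounding can be checked on a small discrete example. The sketch below is illustrative only: the graph (a confounder `z` with `z -> x`, `z -> y`, `x -> y`), the probabilities in `P_U`, and all function names are assumptions chosen so that the expectations can be enumerated exactly.

```python
from itertools import product

# Hypothetical toy SCM with a confounder z: z -> x, z -> y, x -> y.
P_U = {"u_z": {0: 0.5, 1: 0.5}, "u_x": {0: 0.75, 1: 0.25}}

EQUATIONS = {
    "z": lambda v, u: u["u_z"],
    "x": lambda v, u: v["z"] ^ u["u_x"],   # x copies z, flipped with prob. 0.25
    "y": lambda v, u: v["x"] + 2 * v["z"],
}

def do(equations, **fixed):
    """Replace the structural function of each intervened variable by a constant."""
    new = dict(equations)
    for name, value in fixed.items():
        new[name] = lambda v, u, value=value: value
    return new

def worlds(equations):
    """Enumerate exogenous settings with their probability and endogenous values."""
    names = list(P_U)
    for values in product(*(P_U[n] for n in names)):
        u = dict(zip(names, values))
        prob = 1.0
        for n in names:
            prob *= P_U[n][u[n]]
        v = {}
        for name, f in equations.items():
            v[name] = f(v, u)
        yield prob, v

def expectation(equations, target, given=None):
    """E[target | given] under the model defined by `equations`."""
    given = given or {}
    num = den = 0.0
    for prob, v in worlds(equations):
        if all(v[k] == val for k, val in given.items()):
            num += prob * v[target]
            den += prob
    return num / den

conditioned = expectation(EQUATIONS, "y", given={"x": 1})  # E[y | x = 1]
intervened = expectation(do(EQUATIONS, x=1), "y")          # E[y | do(x = 1)]
```

In this toy model the two quantities differ because observing `x = 1` also makes `z = 1` more likely, whereas `do(x = 1)` cuts the edge `z -> x` and leaves the distribution of `z` untouched.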

B. Deconfounding Causal GAN
Although there has been significant research investigating zero-shot learning, learning visual-semantic embedding remains a challenging issue in video-based tasks. In the CD-ZSAR scenario, a key challenge is to represent the unseen actions that do not exist in the training dataset. We consider that action information is composed of semantic and object features, where semantic information can be used to describe the action progress itself, and object information plays a crucial role in identifying explicit action categories, such as sports and makeup. Since we have action videos and their class labels from the training dataset, we can extract semantic and object information through existing recognition and detection methods. Compared to existing methods that directly learn from seen actions or indirectly learn object/semantic information to recognize unseen actions, our proposed generative model learns the compositional semantic and object representation. To achieve effective learning of such compositional information, we apply causal inference in our adversarial learning method.
In our method, we approach the action recognition problem as modeling video features caused by real-world entities, and we consider two "elementary factors", "Semantic" $s \in S$ and "Object" $o \in O$, that are independent in the training data. Thus, our model is designed to estimate $p(h|s, o)$, the likelihood of the feature vector $h$ of a video, conditioned on a tuple $(s, o)$ of semantic-object features. Although we consider the combination of semantic and object information capable of inferring the action class, a video contains much more information than just images, which makes it difficult to learn a comprehensive video representation. For example, action information is also characterized by speed difference, action interaction, trajectory, and so on. Therefore, we propose to apply the adversarial training mechanism for realistic and diverse generation of video representations. In this work, we define our idea as a simple causal graph $O \rightarrow X \leftarrow S$. Based on the observation [30] ... presented in Fig. 4.
Compositional Generation: For seen compositional learning in the training dataset, we apply the Elaborative Description [12] to each action label, giving $y_{ed}$, and extract semantic information with a language model as $s = F_{sem}(y_{ed})$; we apply object detection to obtain Top-k object information from the action video $x$ as $o = F_{obj}(x)$. The adversarial objective follows WGAN with gradient penalty [33] and is defined on the generated video feature $\tilde{h}_x = G(s, o, N_x)$, where $N_x$ denotes the noise sampled from the Gaussian distribution $\mathcal{N}(0, 1)$, and $G$ and $D$ represent the video feature generator and discriminator, respectively. $\hat{h}_x$ is sampled along straight lines between the real feature $h_x$ and the generated feature $\tilde{h}_x$.
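For reference, the standard critic objective of WGAN with gradient penalty [33] can be written in this section's notation as below. This is a sketch under stated assumptions rather than the exact objective of this work: the loss name $L_{WGAN}$, the penalty weight $\lambda$, and the choice of showing the discriminator without explicit conditioning on $(s, o)$ are all assumptions.

```latex
L_{WGAN} = \mathbb{E}\big[D(\tilde{h}_x)\big] - \mathbb{E}\big[D(h_x)\big]
         + \lambda\, \mathbb{E}\Big[\big(\lVert \nabla_{\hat{h}_x} D(\hat{h}_x) \rVert_2 - 1\big)^2\Big],
\qquad
\hat{h}_x = \epsilon\, h_x + (1 - \epsilon)\, \tilde{h}_x, \quad \epsilon \sim U[0, 1]
```

The last expression is the "straight line" sampling between the real and generated features; the penalty term pushes the gradient norm of the discriminator toward 1 at those interpolated points.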
As discussed in previous work, object information and semantic embedding can effectively capture spatial and temporal information. To extract this information, we employ Faster R-CNN [34] for object feature extraction and BERT [35] for semantic feature extraction. Both $E_s$ and $E_o$ are implemented as three-layer multi-layer perceptrons (MLPs) with 512 dimensions. Additionally, we randomly sample noise from a Gaussian distribution $\mathcal{N}(0, 1)$, represented with 768 dimensions. The generated video feature $\tilde{h}_x$ is then discriminated from the video feature extracted by ResNet(2+1)D_18 [36]. To achieve this, we utilize a fully connected layer-based discriminator.
As our model is designed to estimate $p(h_x|s, o)$, this generative model has two representation distribution spaces, the semantic space $\Phi_s \in \mathbb{R}^{d_s}$ and the object space $\Phi_o \in \mathbb{R}^{d_o}$, which might be confounded when estimating the video distribution. Therefore, the above structure only constructs the causal correlation, i.e., the "conditioning on" operation, but does not solve the confounder problem. The conventional definition of a confounder considers only statistical implications and ignores the actual causal structure, whereas a confounder is a concept tied to the real causal structure. To this end, we propose a deconfounding module that overrides the joint distribution to enforce $s$ and $o$ to specific values and propagate them through the causal graph.
Deconfounding Module: With deconfounding, the intervention changes the joint distribution of nodes in the proposed causal graph $G$. Inspired by [13], we then reconstruct the latent semantic and object features as $h_{\hat{s}}$ and $h_{\hat{o}}$ from the generated video representation $\tilde{h}_x$ by two feed-forward networks $E_{\hat{s}}$ and $E_{\hat{o}}$, as $h_{\hat{s}} = E_{\hat{s}}(\tilde{h}_x)$ and $h_{\hat{o}} = E_{\hat{o}}(\tilde{h}_x)$. We expect that the reconstructed features $h_{\hat{s}}$ and $h_{\hat{o}}$ maintain approximately the same independence relations and belong to the same independence space as the original features. To this end, with the video feature generator $G$, the factors $s$ and $o$ are inferred by minimizing the reconstruction loss $L_{rec}$, in which $h_x$ denotes the video feature extracted from the given video by the action recognition network. In this causal graph, $\Phi_s$ and $\Phi_o$ are parent nodes of the video feature $h_x$. The reconstructed distributions $\Phi_{\hat{s}}$ and $\Phi_{\hat{o}}$ are estimated from $h_x$ and thus are child nodes of $h_x$, as shown in Fig. 5. Therefore, they do not immediately follow the conditional relations that $\Phi_s$ and $\Phi_o$ obey. Since the semantic and object representations $h_s$ and $h_o$ are latent and unobserved, they may confound true signals in the generative process. Even though the semantic and object representations are not obviously independent of each other in the representation view, we can still perform causal inference if we can find the factors that jointly affect the semantic and object representations and exclude them.
To address the challenge of confounding factors between semantic and object representations, we propose a deconfounding module, denoted as $De$, which compares the original semantic and object feature distribution with the reconstructed distribution. To measure the distance between these two distributions, we employ Multimodal Low-rank Bilinear pooling (MLB) [37] as the distribution score, which has been shown to be effective in multi-modality tasks. The MLB score of two given features $h_1$ and $h_2$ applies $\sigma$ to the Hadamard product $\odot$ of the two features after projection by $U$ and $V$, where $\sigma$ is a linear function producing a one-dimensional score and both $U$ and $V$ are learnable parameters. Unlike concatenating features, we compare the object and semantic modalities and learn to adaptively weigh them. Our objective is to minimize the distance between similar distributions and increase the distance between dissimilar ones. We achieve this with the loss function $L_{de}$ of the deconfounding module: minimizing this loss makes the deconfounding module assign a higher score to pairs from the same distribution and a lower score to pairs from different distributions. Thus, the independence relations of causality described above are encouraged.
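The MLB-style score described above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the projection convention ($U^\top h_1$ and $V^\top h_2$), the parameter names `w` and `b` for the linear map $\sigma$, and the omission of any non-linearity are assumptions, since the exact formula is not reproduced in this text.

```python
def project(mat, vec):
    """Compute mat^T vec, with mat stored as a list of rows of shape (d_in, d_out)."""
    d_out = len(mat[0])
    return [sum(mat[i][j] * vec[i] for i in range(len(vec))) for j in range(d_out)]

def mlb_score(h1, h2, U, V, w, b=0.0):
    """Project both inputs, take the Hadamard product, map linearly to a scalar."""
    p1 = project(U, h1)
    p2 = project(V, h2)
    joint = [a * c for a, c in zip(p1, p2)]           # Hadamard product
    return sum(wi * ji for wi, ji in zip(w, joint)) + b  # linear sigma

# Toy usage with identity projections (hypothetical values).
identity = [[1.0, 0.0], [0.0, 1.0]]
score = mlb_score([1.0, 2.0], [3.0, 4.0], identity, identity, w=[1.0, 1.0])
```

With identity projections and unit weights the score reduces to the dot product of the two inputs, which gives a quick sanity check; in a trained module `U`, `V`, `w`, and `b` would be learned.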
As confounding factors between the training semantic and object sources are latent and unobserved, we assume that the distance between two different distribution spaces can quantify the degree of confounding. We suggest that the closer the semantic space $\Phi_s$ is to the object space $\Phi_o$, the higher the correlation probability between the distributions, indicating a higher probability of confounding. Thus, we model the relationship between $\Phi_s$ and $\Phi_o$ through this distance. To this end, the designed deconfounding module is expected to maximize the distance, and the experimental results help validate our hypothesis.
Zero-shot Recognition: For the unseen compositional generation, we utilize the above causal inference to generate unseen action representations for zero-shot recognition. According to the proposed deconfounding assumption, we can then apply the "Do-intervention" that overrides the joint distribution to enforce $s$ and $o$ to specific values and propagate them through the causal graph. With this propagation, an intervention can change the joint distribution of nodes in the causal graph, and thus an unseen action representation is generated according to a new joint distribution.
In action recognition tasks, we observe that the objects involved are similar, e.g., humans. Meanwhile, we cannot obtain the object details from the target dataset in the CD-ZSAR setting, and the label distributions of the training and target datasets are different, as they have "non-overlapping classes". Thus, to achieve the distribution shift, we randomly iterate over all combinations of object variables from the training data and EDs of test actions to generate the unseen action video features $\tilde{h}_x$ with the generator $G$, where $\tilde{s}$ and $\tilde{o}$ represent the ED of a test action and a random object variable, respectively. With the proposed deconfounding module, we consider that the generated video feature belongs to the given test action class. Finally, after obtaining the unseen video features, we utilize the generated video representation $\tilde{h}_x$ and a classifier network $R$ to obtain the test class embedding.
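The pairing of test-action semantics with random training objects can be sketched as follows. Everything here is hypothetical scaffolding: `toy_generator` is a stand-in for the trained generator $G$ (not the network of this work), the feature values are made up, and random sampling of objects is used in place of an exhaustive pass over all combinations.

```python
import random

def generate_unseen_features(generator, test_semantics, train_objects,
                             per_class, noise_dim, rng):
    """Pair each unseen class's semantic feature with random training objects."""
    generated = {}
    for label, s in test_semantics.items():
        feats = []
        for _ in range(per_class):
            o = rng.choice(train_objects)                       # random object feature
            noise = [rng.gauss(0.0, 1.0) for _ in range(noise_dim)]
            feats.append(generator(s, o, noise))
        generated[label] = feats                                # labeled with the test class
    return generated

def toy_generator(s, o, noise):
    """Stand-in for the trained G; only mixes its inputs element-wise."""
    return [si + oi + 0.01 * ni for si, oi, ni in zip(s, o, noise)]

rng = random.Random(0)
test_semantics = {"fencing": [0.2, 0.1, 0.0], "yoga": [0.0, 0.3, 0.5]}
train_objects = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
unseen = generate_unseen_features(toy_generator, test_semantics, train_objects,
                                  per_class=4, noise_dim=3, rng=rng)
```

Each generated feature inherits the label of the test action whose description was used, which is how the generated set can later supervise the classifier on unseen classes.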

C. Causal Training Strategy
In this work, we aim to train an effective classifier $R$ that can classify unseen test action classes. The overall training and inference of CD-ZSAR can be described by two pipelines.
Basic Pipeline: Given a training set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ consisting of pairs of a video $x$ and its class label $y$, a zero-shot learning classifier needs to generalize to unseen test classes. Following the work of [22], we achieve this with nearest-neighbor search in a semantic class embedding space. To do this, we first apply a backbone action recognition network to extract the video feature $h_x$, and then use the classifier $R$ to infer the corresponding semantic embedding. The final recognition model $M(\cdot)$ classifies $x$ as the nearest neighbor in the set of embeddings of the classes:
$$M(x) = \arg\min_{y \in D_{test}} \cos\big(R(h_x), F_{W2V}(y)\big) \tag{8}$$
where cos is the cosine distance and the semantic embedding is computed using the Word2Vec function $F_{W2V}$. Given a video-class pair $(x, y)$ from the training dataset, the classifier $R$ is optimized by minimizing the classifier loss $L_{cls}$, where $L_{cls}$ denotes the overall loss and $\tilde{h}_x$ is generated by the proposed DeCalGAN.
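The nearest-neighbor rule in the basic pipeline can be sketched as below. The class names and embedding values are made up for illustration, and `predicted` stands in for the output of the classifier $R$; real embeddings would come from a Word2Vec model. The sketch assumes all vectors are non-zero.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def classify(pred_embedding, class_embeddings):
    """Return the class whose embedding is nearest to the predicted embedding."""
    return min(class_embeddings,
               key=lambda name: cosine_distance(pred_embedding, class_embeddings[name]))

# Hypothetical test-class embeddings and a predicted embedding R(h_x).
class_embeddings = {"archery": [1.0, 0.0, 0.0], "bowling": [0.0, 1.0, 0.0]}
predicted = [0.8, 0.3, 0.1]
label = classify(predicted, class_embeddings)
```

Because only the set of class embeddings changes between datasets, the same trained classifier can be evaluated on new label sets without re-training.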
Causal Inference: To improve the feature representation capability for unseen action distributions, we extend the basic pipeline by incorporating DeCalGAN. In the proposed DeCalGAN, the semantic and object features $h_s$ and $h_o$ are extracted using $E_s$ and $E_o$, respectively, from the video-class pair $(x, y)$ in the training dataset. As we utilize adversarial training, the generator network $G$ is used to obtain the generated video feature $\tilde{h}_x$, and the discriminator network $D$ is used to discriminate real and fake video features. By incorporating the deconfounding operation with the proposed deconfounding loss and reconstruction loss, the overall generator training loss is a weighted combination of the individual losses, where $\lambda_1$, $\lambda_2$, and $\lambda_3$ are hyperparameters that control the weight of each loss. Recalling the bias problem in ZSL with generative models, the synthesized unseen samples could be unexpectedly close to the real seen ones. This would significantly decrease the classification performance for unseen classes. Thus, after training the causal generator, we infer the unseen video features from the unseen semantic feature $h_s$, extracted from the ED of unseen action labels, and randomly selected object features $h_o$ from the seen dataset. Note that ED is only applied for extracting semantic information.

IV. EXPERIMENTS
In this section, we present our experimental results on two public datasets: UCF101 [38] and HMDB51 [39]. We compare our approach with other state-of-the-art methods, and an in-depth ablation analysis is provided to better understand our method. We also discuss the limitations and potential future work in this task.

A. Experimental Setup
Datasets: We employ Kinetics-700 [40] as the source dataset for compositional generation and basic pipeline training, which is the most widely adopted benchmark, covering a wide range of human activities. Kinetics-700 was released in 2019 and has 700 classes with over 500 K videos sourced from YouTube. For target datasets to test the zero-shot classifier, there are two commonly used public datasets in zero-shot action recognition: 1) UCF101 is composed of real action videos focused on sports from YouTube, containing 13320 video clips distributed among 101 classes; and, 2) HMDB51 contains 6766 videos divided into 51 human action categories focused on sports and daily activities from commercial videos.
Training Protocol: We first make sure that $D_{train}$ and $D_{test}$ have "non-overlapping classes". The simple solution of just removing identical class names does not work, because two classes with slightly different names can easily refer to the same concept. Thus, a distance between class names is needed. Equipped with such a metric, we can make sure that training and test classes are not too similar. Following the work of [22], we apply the cosine distance between class names as our non-overlapping metric and compare it against a similarity threshold $\tau \in \mathbb{R}$, where cos indicates the cosine distance. This is consistent with the use of the cosine distance in the zero-shot learning setting, as we do in (8). In order for the training and test datasets to contain disjoint video sources, we remove classes from Kinetics-700 whose cosine distance to any class in UCF101 and HMDB51 in the word embedding space is smaller than 0.05. This results in a subset of 663 classes as the training set to train our models.
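The class-filtering step can be sketched as follows. The class names and two-dimensional embeddings are made up for illustration; in practice the embeddings would be Word2Vec vectors of the class names. The sketch keeps a training class only if its cosine distance to every test class is at least the threshold, and it assumes non-zero vectors.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; assumes both vectors are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.sqrt(sum(x * x for x in a)) *
                        math.sqrt(sum(y * y for y in b)))

def filter_training_classes(train_emb, test_emb, tau=0.05):
    """Drop training classes closer than tau to any test class."""
    kept = []
    for name, emb in train_emb.items():
        nearest = min(cosine_distance(emb, t) for t in test_emb.values())
        if nearest >= tau:
            kept.append(name)
    return kept

# Hypothetical embeddings: the first training class nearly duplicates a test class.
train_emb = {"surfing water": [1.0, 0.0], "knitting": [0.0, 1.0]}
test_emb = {"surfing": [0.999, 0.01]}
kept = filter_training_classes(train_emb, test_emb)
```

A name-based comparison would keep "surfing water" because the strings differ, which is the failure case the embedding distance is meant to catch.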
Evaluation Protocol: We test our framework using two evaluation protocols. The first one is compatible with previous work and the second one emulates a true ZSL setting. Both evaluation protocols apply the same model to both the UCF101 and HMDB51 datasets. 1) For a fair comparison with previous work, we randomly choose half of the classes of each test dataset for evaluation, i.e., 50 for UCF101 and 25 for HMDB51, and average the results for each test dataset over ten repetitions. 2) Top-1 and Top-5 accuracies (%) are used to evaluate the classifier on all 101 UCF classes and 51 HMDB classes, which is more restrictive than the evaluation protocol of the previous methods [8], [41].
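The first evaluation protocol can be sketched as below. The function name, the `predict(x, candidates)` interface, and the toy data are hypothetical; the sketch assumes every class has at least one test sample and that there are at least two classes.

```python
import random

def evaluate_random_halves(samples, class_names, predict, n_trials=10, seed=0):
    """Average accuracy (%) over random half-subsets of the test classes."""
    rng = random.Random(seed)
    ordered = sorted(class_names)
    accuracies = []
    for _ in range(n_trials):
        # Integer division gives 50 of 101 classes and 25 of 51 classes.
        subset = set(rng.sample(ordered, len(ordered) // 2))
        chosen = [(x, y) for x, y in samples if y in subset]
        correct = sum(1 for x, y in chosen if predict(x, subset) == y)
        accuracies.append(100.0 * correct / len(chosen))
    return sum(accuracies) / len(accuracies)

# Toy usage with an oracle predictor (hypothetical data).
classes = ["c%d" % i for i in range(5)]
samples = [(i, classes[i % 5]) for i in range(50)]
oracle = lambda x, candidates: classes[x % 5]
mean_acc = evaluate_random_halves(samples, classes, oracle)
```

Passing the candidate subset to `predict` matters for nearest-neighbor classifiers, since the prediction is restricted to the classes selected in each trial.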

B. Implementation Details
In our experiments, we first utilize R(2+1)D_18 pretrained on Kinetics-400 [42] as our base model. Then we use the classifier $R$ to infer the corresponding semantic embedding of dimension 16 × 300, where 16 denotes the batch size. The shortest side of each frame is resized to 128 pixels, and we crop a random 112 × 112 patch during training and the center patch during inference. The video clips are 16 frames long, and we choose them following the standard protocol established by Wang et al. [14]. The feature size of all MLP blocks is 512 and the classifier $R$ is a linear regression model with 512 × 300 nodes. According to the standard protocol [22], we average multi-word class names by Word2Vec (Python implementation in gensim [43]) into dimension 300. To minimize all losses, we apply the Adam optimizer with learning rates decaying from $1 \times 10^{-3}$ to $1 \times 10^{-4}$ for the classifier, and from $1 \times 10^{-4}$ to $1 \times 10^{-5}$ for the generator and discriminator. All experiments are performed on 8 × Nvidia Tesla P100 GPUs.
Object Detection & Word Embedding: Following the work of [6], we apply Faster R-CNN [34], pretrained on the MS-COCO dataset [44], for detection of local objects, which consist of the person class and 79 objects, such as snowboard, human and horse. We obtain roughly 50 detections for each object per frame and extract the top 8 objects, with a controlled experiment to select the best number of Top-k. In Fig. 6, we present an example of using object detection models to extract object features. Here, we select the person and the basketball as object information. We also follow the standard protocol in computing the semantic embedding of action names by using a pretrained Word2Vec (W2V) model. In rare cases where a word is not available in the pretrained W2V model (for example, 'rubiks' or 'photobombing'), we manually change the word following the work of [22]. The pretrained model produces a 300-dimensional representation for each word. If an action class name contains multiple words $c = [c_1, \ldots, c_N]$, we average the embeddings as $c = \frac{1}{N} \sum_{i=1}^{N} F_{W2V}(c_i) \in \mathbb{R}^{300}$.
Elaborative Description: Examples of Elaborative Descriptions (ED) in Kinetics-700 [42], UCF101 [38] and HMDB51 [39] are shown in Table I. Chen et al. [12] collected EDs for action classes by first automatically crawling candidate sentences that describe action classes from the Internet, and then manually selecting or modifying a minimum set of candidate sentences as the EDs. In the first crawling step, they utilized Wikipedia and online dictionaries. In the second cleaning step, they presented candidate sentences and a video exemplar on a webpage to annotators. As the BERT model has demonstrated excellent capability in implicitly encoding commonsense knowledge, we apply the BERT representation as our semantic information source. In this work, we denote $d = \{w_1, \ldots, w_{N_d}\}$ as the ED for action $y$, where each $w_i$ is a constituent word. The goal of the pretrained BERT model is to extract semantic features $s \in \mathbb{R}^{K}$ of dimension $K$.
Denote $s_i \in \mathbb{R}^{768}$ as the hidden state from the last layer of BERT for word $w_i$. We apply average pooling to obtain the sentence-level information $s = \frac{1}{N_d} \sum_{i=1}^{N_d} s_i$. We then use an MLP model as the semantic encoder E to translate $s$ into the joint semantic feature space.
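Both the multi-word class-name embedding and the sentence-level ED representation rely on the same operation, an element-wise mean over per-word vectors. The sketch below illustrates it in plain Python with toy 4-dimensional vectors standing in for the 300-dimensional Word2Vec or 768-dimensional BERT vectors; the function name `mean_pool` and the toy values are illustrative assumptions, not taken from the paper.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must share one dimension")
    n = len(vectors)
    return [sum(v[j] for v in vectors) / n for j in range(dim)]

# Toy per-word vectors (hypothetical values, for illustration only).
toy_vectors = {
    "play": [1.0, 0.0, 2.0, 0.0],
    "basketball": [3.0, 2.0, 0.0, 4.0],
}

# Class name with N = 2 words: the class embedding is the mean of the word vectors.
class_embedding = mean_pool([toy_vectors[w] for w in "play basketball".split()])
print(class_embedding)  # [2.0, 1.0, 1.0, 2.0]
```

With real models, the per-word vectors would come from the pretrained Word2Vec lookup or from the last BERT layer, and the pooled vector would then be passed to the semantic encoder.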

C. Comparison With State-of-The-Art
We compare our model with both inner-dataset and cross-dataset inductive zero-shot learning methods, with results shown in Table II. Inner-dataset methods use different training and test classes from the same dataset, whereas cross-dataset methods take training and test classes from different datasets.
Inner-dataset Methods: We can observe from Table II that our DeCalGAN achieves a large improvement over inner-dataset methods. Even though our method is applied in the more difficult cross-dataset setting, the results indicate that the essential features can be obtained more effectively with recent state-of-the-art backbones and a large-scale dataset. Compared with O2A [8], TS-GCN [8], and TARN [41], we can observe that incorporating global semantic information and local object information improves action recognition, and that our DeCalGAN-enhanced model can effectively infer key cross-domain information from spatio-temporal features. Note that for class labels, our approach follows the conventional training protocol of using word embeddings of class names in the final recognition, which contain less semantic information than ED [12], whereas the work of [12] applies ED for both feature learning and zero-shot recognition. Therefore, even though our method utilizes ED for semantic feature learning, our setting is more challenging than that of ED [12]. Moreover, our training and test datasets are different, which leads to cross-domain issues, yet our model still obtains competitive performance on the UCF101 dataset and the highest performance on the HMDB51 dataset.
Cross-dataset Methods: We compare our proposed method with cross-dataset methods using the same protocol. As shown in Table II, both E2E and our proposed method outperform the universal-based method URL. This finding suggests that a video-based backbone, such as R(2+1)D, is more effective than an image-based backbone, such as ResNet, in video-level zero-shot learning tasks. We attribute this to the fact that a video-based backbone can capture motion information more effectively than an image-based backbone. Our proposed DeCalGAN approach outperforms the E2E method, indicating that a generative model that incorporates both global semantic and local object factors can enable the classifier to learn more action information. Furthermore, the improved performance demonstrates the effectiveness of using causal inference conditioned on joint semantic and object information to generate unseen action representations.
Per-category Improvement Analysis: As shown in Fig. 7, the per-category analysis reveals an average improvement of 5.1%. Most of these categories exhibit a wide range of actions and substantial variations, making their improvement highly dependent on semantic reasoning over the global spatial context. For instance, the "Skiing" action is characterized by a long duration and is highly related to object priors and semantic reasoning. Overall, the improvements over the baseline are mainly attributed to the inclusion of both global semantic information and local object information in the causal generation process.

D. Ablation Study
The success of our DeCalGAN can be attributed to both the framework design and the technical improvement in each component. To analyze the effect of each component in DeCalGAN, we construct the following ablation models: 1) the basic E2E model without DeCalGAN; 2) the "w/o semantic" model without the semantic factor; 3) the "w/o object" model without the object factor; 4) the "w/o intervention" model without the reconstruction operation and deconfounding loss; 5) the "w/o deconfounding" model without the deconfounding loss; 6) the "Word2Vec" model, which uses Word2Vec embeddings in place of the BERT model for extracting semantic information from ED; 7) the "BERT" model, which uses the pretrained BERT model for extracting semantic information from ED; and 8) the "CLIP" model, which uses the pretrained CLIP model in place of the BERT model for extracting semantic information from ED. All the ablation studies below are carried out using the second evaluation protocol.
Factor Effects: Comparing the results of "w/o semantic" and "w/o object", we can see that the model with semantic information performs better, which indicates that the semantic description contains more action information than the object embedding alone. Meanwhile, comparing the results of E2E and "w/o object", there is no obvious change in performance, which indicates that a generative model that applies only a single factor without any additional side information does not help the classifier infer the unseen distribution. Comparing the results of the full model with both "w/o semantic" and "w/o object", we observe a notable performance increase on both datasets. We believe that both kinds of information are important for unknown actions, and that causal inference may not work if a single factor is used.
Deconfounding Effects: Comparing the "full model" with "w/o intervention", we observe a large increase in performance. We think this indicates that the object-semantic joint distribution contains latent confounders that obscure the true signals for generating unseen action representations, and that our proposed deconfounding module is able to remove some of these confounders. Moreover, "w/o deconfounding loss" performs better than "w/o intervention", which also supports that our proposed deconfounding loss can deconfound confounders in our causal inference setting, whereas the reconstruction loss alone cannot remove confounder effects effectively.
Semantic Representation: We also evaluated different methods for extracting semantic information from Elaborative Descriptions (ED). Recent works on open-vocabulary learning have started using multi-modality pretrained models such as CLIP [53] for the extraction of semantic features. In this ablation analysis, we also apply the pretrained CLIP model as another semantic feature extractor, employing 'a video of {category}' as the text prompt. The results, shown in Table IV, demonstrate that the choice of semantic representation does not much influence the overall performance of our proposed framework when text-only embedding techniques are used. With CLIP, however, a substantial performance improvement is observed. Nevertheless, using multi-modality pretrained models could lead to unfair comparisons with other zero-shot learning approaches, since the text embedding integrates visual information into the semantic representation. Therefore, we selected BERT as our semantic representation learning approach, which has been shown to be effective in capturing semantic information.

Top-k Objects: We conducted experiments to investigate the impact of the number of Top-k objects on the quality of generated unseen action features. Results show that the best performance occurs when k = 3 for the UCF101 dataset and k = 2 for the HMDB51 dataset. Our findings suggest that inferring actions in UCF101 requires more object information than in HMDB51. Additionally, our results indicate that including too many object features can make it more challenging for our deconfounding module to remove confounding effects effectively, which highlights the importance of selecting a suitable number of objects in compositional zero-shot learning.
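The Top-k selection step can be illustrated with a short sketch. The paper does not state how detections are ranked, so the code below assumes they are ordered by detector confidence and that only the best detection per object label is kept; the function name `top_k_objects` and the sample detections are hypothetical.

```python
def top_k_objects(detections, k):
    """Keep the k highest-scoring object labels from (label, score) detections.

    Assumption: ranking is by detector confidence, one entry per label.
    """
    best = {}
    for label, score in detections:
        # Remember only the most confident detection for each label.
        if score > best.get(label, float("-inf")):
            best[label] = score
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical detections for one frame of a "play basketball" clip.
frame = [
    ("person", 0.98),
    ("sports ball", 0.91),
    ("person", 0.75),
    ("bench", 0.40),
    ("car", 0.12),
]
print(top_k_objects(frame, 2))  # [('person', 0.98), ('sports ball', 0.91)]
```

A smaller k keeps only the most reliable objects, which matches the observation above that too many object features make confounders harder to remove.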
Feature Channel Selection: To explore the best representation of semantic and object information, we conduct an ablation study of different feature dimensions, as shown in Fig. 8. We can observe that a feature dimension of 512 achieves the highest performance. This indicates that networks with few channels fail to learn semantic and object information effectively and might lose factor information that is important for action feature generation. Meanwhile, networks with many channels might learn more confounding information, making it difficult for our proposed deconfounding module to remove latent confounders.
Loss Weight Selection: When training the causal generator, we try to obtain optimal performance by tuning the training loss weight hyperparameters $\lambda_1$, $\lambda_2$ and $\lambda_3$ according to (10). According to Table V, the performance is best when the ratio is 1:0.3:0.6. This indicates that the deconfounding module can be more effective when the generator is well trained. Also, when the deconfounding loss weight becomes relatively small, the performance drops, which implies that our deconfounding module is important and necessary.
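As a rough illustration of how such a weight ratio enters training, the sketch below combines three loss terms with the best-performing ratio 1:0.3:0.6. It assumes that (10) is a plain weighted sum of three terms; the individual loss values are made-up numbers, and the mapping of each weight to a specific loss is deliberately left generic because it is defined by (10) rather than by this passage.

```python
def weighted_loss(losses, weights=(1.0, 0.3, 0.6)):
    """Combine loss terms as sum_i lambda_i * L_i (assumed form of the objective)."""
    if len(losses) != len(weights):
        raise ValueError("one weight per loss term is required")
    return sum(w * l for w, l in zip(weights, losses))

# Hypothetical per-batch loss values for the three terms weighted by
# lambda_1, lambda_2 and lambda_3.
batch_losses = (0.8, 1.5, 0.5)
total = weighted_loss(batch_losses)
print(round(total, 4))
```

Changing only the third weight while holding the others fixed is one simple way to reproduce the kind of sweep reported in Table V.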

E. Qualitative Evaluation Via t-SNE Visualization
We employ t-SNE visualization to compare E2E with our proposed DeCalGAN approach. We randomly select 20 samples from each of eight actions and visualize the extracted 300-dimensional features with t-SNE, as shown in Fig. 9. The visualization reveals that the E2E feature distribution is sparser than that of DeCalGAN, and that all samples lie closer to their cluster centers when our proposed approach is applied. This observation suggests that leveraging both local object information and global semantic descriptions can jointly benefit causal inference, and that the deconfounding module enables features learned from the source domain to generalize better to the unseen distribution in the test domain. However, the basic video-based backbone struggles to distinguish complex actions, such as "rope climbing", which may mislead our proposed causal generator. In future work, we will explore methods to enhance the video representations to better capture the complexity of human actions.

V. CONCLUSION
This article proposed a DeCalGAN model to address the problems of weak knowledge representation and unseen distribution in CD-ZSAR. Class word embedding is enhanced by local object information, which unifies the compositional and traditional visual-semantic frameworks. A deconfounding module is proposed to refine global semantic and local object features through reconstruction and deconfounding constraints. The proposed method is able to transfer a large-scale model pretrained on Kinetics-700 to two ZSAR benchmark datasets, UCF101 and HMDB51. Extensive results validate that DeCalGAN can successfully infer novel samples with unseen distributions in new datasets.

Fig. 1. Illustration of cross-dataset zero-shot action recognition and our proposed causal inference application. With semantic information "S" and object information "O", the proposed DeCalGAN can generate unseen video representations by causal inference. Orange balls represent videos in the training dataset and purple ones denote videos in the test dataset.
III. METHODOLOGY
In cross-dataset zero-shot action recognition (CD-ZSAR), let $D = \{(x_1, y_1), \ldots, (x_N, y_N)\} \subseteq X \times Y$ denote the training dataset, which consists of pairs of action videos $x$ and their class labels $y$, where $N$ is the number of videos. Each label $y \in \{1, \ldots, C\}$ is one of the $C$ discrete training classes. Given a target dataset $D_t$ that does not overlap with $D$ ($Y \cap Y_t = \emptyset$), we first train a classification model on $D$ and then test it on $D_t$.
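The cross-dataset setting can be made concrete with a small sketch: the training and test label sets must be disjoint, and a test video is labelled by comparing its predicted semantic embedding with the embeddings of the unseen class names. The nearest-neighbour rule with cosine similarity is a common choice in zero-shot recognition and is an assumption here, as are the toy 3-dimensional embeddings and the helper names `cosine` and `classify`.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def classify(predicted, class_embeddings):
    """Return the class whose embedding is most similar to the prediction."""
    return max(class_embeddings, key=lambda name: cosine(predicted, class_embeddings[name]))

# Training classes (from D) and unseen test classes (from D_t) with toy embeddings.
train_classes = {"playing basketball", "skiing"}
test_embeddings = {
    "rope climbing": [0.9, 0.1, 0.0],
    "archery": [0.0, 0.2, 0.9],
}

# The zero-shot condition: no test class appears among the training classes.
assert train_classes.isdisjoint(test_embeddings)

# A hypothetical embedding regressed by the classifier for one test video.
print(classify([0.8, 0.3, 0.1], test_embeddings))  # rope climbing
```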

Fig. 3. Architecture of our DeCalGAN enhanced ZSAR model. The bottom part shows our proposed DeCalGAN training process, which incorporates global semantic information and local object information. The top-left part shows our proposed causal inference pipeline and the top-right part shows the basic pipeline following the work of [22].
In the GAN training framework, generator neural network connections can be arranged to reflect the causal graph structure. The generative model can be denoted as $O = f_O(E_O)$, $S = f_S(E_S)$ and $X = f_Z(O, S, E_Z)$, where $f_O$, $f_S$, $f_Z$ represent the corresponding generative methods, and $E_O$, $E_S$, $E_Z$ are independent variables. Therefore, the essential parts of DeCalGAN are divided into two components: compositional generation and the deconfounding module.
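The structure of this generative model can be sketched as a toy structural causal model in plain Python. In DeCalGAN the three functions are neural networks; the simple arithmetic maps below are stand-ins chosen only to show the dependency pattern, and all names other than `f_O`, `f_S` and `f_Z` are hypothetical.

```python
import random

def sample_noise(dim, rng):
    """Draw an independent Gaussian noise vector (E_O, E_S or E_Z)."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def f_O(e_o):
    """Object factor O depends only on its own noise E_O (toy map)."""
    return [2.0 * e for e in e_o]

def f_S(e_s):
    """Semantic factor S depends only on its own noise E_S (toy map)."""
    return [e + 1.0 for e in e_s]

def f_Z(o, s, e_z):
    """Video feature X depends on both parents O and S plus noise E_Z (toy map)."""
    return [oi + si + 0.1 * ei for oi, si, ei in zip(o, s, e_z)]

def sample_video_feature(dim=4, seed=0):
    """Sample (O, S, X) by following the causal graph O -> X <- S."""
    rng = random.Random(seed)
    o = f_O(sample_noise(dim, rng))
    s = f_S(sample_noise(dim, rng))
    x = f_Z(o, s, sample_noise(dim, rng))
    return o, s, x

o, s, x = sample_video_feature()
print(len(o), len(s), len(x))  # 4 4 4
```

Because O and S are produced from separate noise sources, intervening on one factor (for example, swapping in a different object vector) leaves the other unchanged, which is the property a causal generator is meant to exploit.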

Fig. 4. Details of the DeCalGAN architecture, which consists of a causal generator and a causal discriminator. The causal generator is detailed at the top and the discriminator network is detailed at the bottom.

Fig. 5. Causal graph and illustration of the deconfounding procedure. The top-right graph represents our designed causal graph and the scissors denote the deconfounding operation. Green and blue distributions represent the semantic space $\Phi_s$ and the object space $\Phi_o$, respectively.

Fig. 6. Example of using object region features as inputs for the "play basketball" class in the Kinetics-700 dataset.

Fig. 8. Comparison results among (a) different numbers of selected objects and (b) different feature weight selections on the UCF101 and HMDB51 datasets. Both follow evaluation protocol (2).

Fig. 9. t-SNE visualization of video representations extracted by the E2E baseline and our DeCalGAN for eight video actions from the UCF101 dataset.
Manuscript received 26 July 2022; revised 16 April 2023 and 22 June 2023; accepted 8 September 2023. Date of publication 22 September 2023; date of current version 23 February 2024. The associate editor coordinating the review of this manuscript and approving it for publication was Professor Zheng-Jun Zha. (Corresponding author: Yang Song.)

TABLE I
EXAMPLES OF ELABORATIVE DESCRIPTIONS (ED) FOR ACTION CLASSES IN KINETICS-700, UCF101, AND HMDB51 DATASETS

TABLE II
AVERAGE TOP-1 ACCURACY (%) OF STATE-OF-THE-ART ZERO-SHOT LEARNING METHODS ON THE UCF AND HMDB BENCHMARKS

TABLE V
ABLATION STUDY OF LOSS WEIGHT SELECTION ON UCF101 AND HMDB51 DATASETS, FOLLOWING EVALUATION PROTOCOL (2)