1Wangxuan Institute of Computer Technology, Peking University 2School of EECS, Peking University
We present Masked Conditional Diffusion (MacDiff), a unified framework for human skeleton modeling that learns powerful representations for both discriminative and generative downstream tasks. We theoretically demonstrate the advantage of our framework over prevalent representation learning paradigms. MacDiff achieves state-of-the-art performance on large-scale representation learning benchmarks. Remarkably, by leveraging diffusion-based data augmentation with MacDiff for fine-tuning, we significantly improve action recognition performance with scarce labeled data.
Self-supervised learning has proved effective for skeleton-based human action understanding. However, previous works either rely on contrastive learning, which suffers from the false-negative problem, or are based on reconstruction, which learns too many unessential low-level cues, leading to limited representations for downstream tasks. Recently, great advances have been made in generative learning, which is naturally a challenging yet meaningful pretext task for modeling the general underlying data distribution. However, the representation learning capacity of generative models is under-explored, especially for skeletons with spatial sparsity and temporal redundancy. To this end, we propose Masked Conditional Diffusion (MacDiff) as a unified framework for human skeleton modeling. For the first time, we leverage diffusion models as effective skeleton representation learners. Specifically, we train a diffusion decoder conditioned on the representations extracted by a semantic encoder. Random masking is applied to the encoder inputs to introduce an information bottleneck and remove the redundancy of skeletons. Furthermore, we theoretically demonstrate that our generative objective involves a contrastive learning objective that aligns the masked and noisy views. Meanwhile, it also enforces the representation to complement the noisy view, leading to better generalization performance. MacDiff achieves state-of-the-art performance on representation learning benchmarks while maintaining competence for generative tasks. Moreover, we leverage the diffusion model for data augmentation, significantly enhancing fine-tuning performance in scenarios with scarce labeled data.
We train a diffusion decoder conditioned on the representations extracted by a semantic encoder.
(I) We embed skeletons into tokens and employ random masking. The global representation
is obtained by pooling the local representations extracted by the semantic encoder.
(II) We sample the noisy skeleton $x_t$ following the diffusion process $q(x_t|x_0)$. The diffusion decoder predicts the noise $\epsilon$
from $x_t$, guided by the learned representation $z$. The pre-trained encoder can be used independently in downstream discriminative tasks.
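The two-stage procedure above can be sketched in a few lines. This is a minimal, illustrative sketch only: `encoder` and `decoder` are placeholder callables, and the mask ratio and linear beta schedule are assumptions rather than the paper's exact settings.

```python
import numpy as np

def macdiff_train_loss(encoder, decoder, x0, mask_ratio=0.75, T=1000, rng=None):
    """Evaluate one masked-conditional-diffusion training objective (sketch)."""
    rng = rng if rng is not None else np.random.default_rng()
    B, N, D = x0.shape                        # batch, skeleton tokens, feature dim

    # (I) Random masking: keep only a subset of tokens as encoder input,
    #     then pool the local representations into a global one, z.
    keep = max(1, int(N * (1.0 - mask_ratio)))
    idx = rng.permutation(N)[:keep]
    z = encoder(x0[:, idx]).mean(axis=1)

    # (II) Forward process q(x_t | x_0): x_t = sqrt(a_bar_t) x_0 + sqrt(1 - a_bar_t) eps.
    betas = np.linspace(1e-4, 0.02, T)
    alpha_bar = np.cumprod(1.0 - betas)
    t = int(rng.integers(0, T))
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

    # The diffusion decoder predicts the noise from x_t, guided by z.
    eps_pred = decoder(x_t, t, z)
    return float(np.mean((eps_pred - eps) ** 2))
```

In a real implementation the encoder and decoder would be neural networks trained jointly by backpropagating through this loss.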
Self-supervised Evaluation Protocols. MacDiff outperforms contrastive learning and reconstruction-based methods under both linear evaluation and transfer learning evaluation. In linear evaluation, a linear classifier is attached to the frozen encoder and trained for action recognition. In transfer learning evaluation, we pre-train the encoder on a source dataset and perform linear evaluation on a target dataset.
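As a concrete illustration of the linear evaluation protocol, the sketch below fits a linear classifier on frozen encoder features. It uses closed-form ridge regression to one-hot targets as a lightweight stand-in for the usual logistic-regression probe; the function name and regularization strength are illustrative.

```python
import numpy as np

def linear_probe(z_train, y_train, z_test, num_classes, lam=1e-2):
    """Linear evaluation sketch: fit a linear classifier on frozen features."""
    Y = np.eye(num_classes)[y_train]                      # one-hot action labels
    A = z_train.T @ z_train + lam * np.eye(z_train.shape[1])
    W = np.linalg.solve(A, z_train.T @ Y)                 # closed-form ridge fit
    return (z_test @ W).argmax(axis=1)                    # predicted action classes
```

The encoder itself is never updated in this protocol, so the probe's accuracy directly measures the quality of the pre-trained representations.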
Semi-supervised Fine-tuning Evaluation. In semi-supervised fine-tuning, the pre-trained encoder and a classifier are trained with a small proportion of labeled data. With the proposed diffusion-based data augmentation, our method brings significant performance gains over state-of-the-art methods and our baseline.
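One way to realize diffusion-based augmentation is sketched below: perturb a labeled skeleton to an intermediate diffusion step, then denoise it back with the conditional decoder to obtain a semantically consistent variant. This SDEdit-style procedure is an assumption for illustration, not necessarily the paper's exact augmentation scheme; `encoder`, `decoder`, and `t_aug` are placeholders.

```python
import numpy as np

def diffusion_augment(encoder, decoder, x0, t_aug=200, T=1000, rng=None):
    """Generate an augmented skeleton by partial noising + denoising (sketch)."""
    rng = rng if rng is not None else np.random.default_rng()
    betas = np.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)

    # Condition on the representation of the original (unmasked) sample.
    z = encoder(x0).mean(axis=1)

    # Diffuse the labeled sample forward to an intermediate step t_aug.
    x_t = (np.sqrt(alpha_bar[t_aug]) * x0
           + np.sqrt(1.0 - alpha_bar[t_aug]) * rng.standard_normal(x0.shape))

    # DDPM-style ancestral denoising from t_aug back to a clean sample.
    for t in range(t_aug, 0, -1):
        eps = decoder(x_t, t, z)
        mean = (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])
        noise = rng.standard_normal(x0.shape) if t > 1 else 0.0
        x_t = mean + np.sqrt(betas[t]) * noise
    return x_t
```

Because the denoising is guided by the representation z of the original sample, the augmented skeleton can inherit its action label during fine-tuning.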
Generative Tasks. Additionally, we demonstrate that MacDiff is capable of generative tasks as a diffusion model, such as motion generation and motion reconstruction.
@inproceedings{wu2024macdiff,
title={MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion},
author={Lehong Wu and Lilang Lin and Jiahang Zhang and Yiyang Ma and Jiaying Liu},
booktitle={ECCV},
year={2024},
}
If you have any questions, please feel free to contact us: