MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion

ECCV 2024


Lehong Wu1,2    Lilang Lin1    Jiahang Zhang1    Yiyang Ma1    Jiaying Liu1   

1Wangxuan Institute of Computer Technology, Peking University    2School of EECS, Peking University   



We present Masked Conditional Diffusion (MacDiff), a unified framework for human skeleton modeling, which learns powerful representations for both discriminative and generative downstream tasks. We theoretically demonstrate the advantage of our framework over prevalent representation learning paradigms. MacDiff achieves state-of-the-art performance on large-scale representation learning benchmarks. Remarkably, by leveraging diffusion-based data augmentation with MacDiff for fine-tuning, we significantly improve action recognition performance when labeled data is scarce.



Abstract


Self-supervised learning has proved effective for skeleton-based human action understanding. However, previous works either rely on contrastive learning, which suffers from false-negative problems, or are based on reconstruction, which learns too many inessential low-level cues, leading to limited representations for downstream tasks. Recently, great advances have been made in generative learning, which is naturally a challenging yet meaningful pretext task for modeling the general underlying data distributions. However, the representation learning capacity of generative models is under-explored, especially for skeletons, which exhibit spatial sparsity and temporal redundancy. To this end, we propose Masked Conditional Diffusion (MacDiff) as a unified framework for human skeleton modeling. For the first time, we leverage diffusion models as effective skeleton representation learners. Specifically, we train a diffusion decoder conditioned on the representations extracted by a semantic encoder. Random masking is applied to the encoder inputs to introduce an information bottleneck and remove the redundancy of skeletons. Furthermore, we theoretically demonstrate that our generative objective involves the contrastive learning objective, which aligns the masked and noisy views. Meanwhile, it also enforces the representation to complement the noisy view, leading to better generalization performance. MacDiff achieves state-of-the-art performance on representation learning benchmarks while maintaining competence in generative tasks. Moreover, we leverage the diffusion model for data augmentation, significantly enhancing fine-tuning performance in scenarios with scarce labeled data.


Method



We train a diffusion decoder conditioned on the representations extracted by a semantic encoder.
(I) We embed skeletons into tokens and employ random masking. The global representation is obtained by pooling the local representations extracted by the semantic encoder.
(II) We sample the noisy skeleton $x_t$ following the diffusion process $q(x_t|x_0)$. The diffusion decoder predicts the noise $\epsilon$ from $x_t$, guided by the learned representation $z$. The pre-trained encoder can be utilized independently in downstream discriminative tasks.
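
The two steps above can be sketched roughly as follows. This is a minimal NumPy illustration, not the released implementation: the linear noise schedule, the mask ratio, the token and feature sizes, and the stand-in encoder and decoder (`W_enc`, `W_x`, `W_z`) are all assumptions made only so that the example runs end to end.

```python
import numpy as np

def random_mask(tokens, mask_ratio, rng):
    """Keep a random subset of tokens; return the kept tokens and their indices."""
    n = tokens.shape[0]
    n_keep = max(1, int(round(n * (1.0 - mask_ratio))))
    keep_idx = np.sort(rng.permutation(n)[:n_keep])
    return tokens[keep_idx], keep_idx

def global_representation(local_repr):
    """Average-pool per-token (local) representations into one global vector z."""
    return local_repr.mean(axis=0)

def q_sample(x0, t, alpha_bar, rng):
    """Draw x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

def noise_prediction_loss(eps_pred, eps):
    """Mean squared error between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))

rng = np.random.default_rng(0)

# Assumed linear beta schedule with T steps (a common DDPM-style default).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

# Hypothetical sizes: 64 skeleton tokens, each a 16-dim embedding.
x0 = rng.standard_normal((64, 16))

# (I) Mask the encoder input, encode the visible tokens, pool to get z.
visible, keep_idx = random_mask(x0, mask_ratio=0.9, rng=rng)
W_enc = rng.standard_normal((16, 32)) * 0.1   # stand-in for the semantic encoder
local_repr = np.tanh(visible @ W_enc)
z = global_representation(local_repr)

# (II) Diffuse the full skeleton and predict the noise, conditioned on z.
t = int(rng.integers(0, T))
xt, eps = q_sample(x0, t, alpha_bar, rng)
W_x = rng.standard_normal((16, 16)) * 0.1     # stand-in for the diffusion decoder
W_z = rng.standard_normal((32, 16)) * 0.1
eps_pred = xt @ W_x + z @ W_z                 # z is broadcast over all tokens
loss = noise_prediction_loss(eps_pred, eps)
```

In this sketch the encoder only ever sees the visible tokens, while the decoder sees the full noisy skeleton plus the pooled vector `z`, which mirrors the bottleneck role that masking plays in the description above.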


Experimental Results


Self-supervised Evaluation Protocols. MacDiff outperforms contrastive learning and reconstruction-based methods on linear evaluation and transfer learning evaluation. In linear evaluation, a linear classifier is attached to the frozen encoder and trained for action recognition. In transfer learning evaluation, we pretrain the encoder on a source dataset and perform linear evaluation on a target dataset.
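
As a rough illustration of the linear evaluation protocol, the sketch below trains a softmax classifier on top of fixed features with plain gradient descent. The features here are synthetic stand-ins for encoder outputs, and the function names, learning rate, and number of epochs are hypothetical choices rather than the settings used in the paper.

```python
import numpy as np

def train_linear_probe(features, labels, num_classes, lr=0.1, epochs=200):
    """Fit a softmax linear classifier on fixed features (the encoder is never updated)."""
    n, d = features.shape
    W = np.zeros((d, num_classes))
    b = np.zeros(num_classes)
    onehot = np.eye(num_classes)[labels]
    for _ in range(epochs):
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)
        grad = (probs - onehot) / n                   # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (features.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def predict(features, W, b):
    """Return the arg-max class for each feature vector."""
    return np.argmax(features @ W + b, axis=1)

rng = np.random.default_rng(0)

# Synthetic "frozen encoder outputs": 3 well-separated classes in 8 dimensions.
centers = rng.standard_normal((3, 8)) * 4.0
labels = rng.integers(0, 3, size=300)
features = centers[labels] + rng.standard_normal((300, 8))

W, b = train_linear_probe(features, labels, num_classes=3)
accuracy = float(np.mean(predict(features, W, b) == labels))
```

Only `W` and `b` are learned, so the resulting accuracy reflects how linearly separable the fixed representations are. For the transfer setting, the same probe would be trained on features of the target dataset computed by an encoder pretrained on the source dataset.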


Semi-supervised Fine-tuning Evaluation. In semi-supervised fine-tuning, the pretrained encoder and the classifier are trained jointly with a small proportion of labeled data. With the proposed diffusion-based data augmentation, our method yields a significant performance gain over state-of-the-art methods and our baseline.


Generative Tasks. Additionally, we demonstrate that, as a diffusion model, MacDiff can also perform generative tasks such as motion generation and motion reconstruction.



Citation


        
        @inproceedings{wu2024macdiff,
          title={MacDiff: Unified Skeleton Modeling with Masked Conditional Diffusion},
          author={Lehong Wu and Lilang Lin and Jiahang Zhang and Yiyang Ma and Jiaying Liu},
          booktitle={ECCV},
          year={2024},
        }
      

Contact


If you have any questions, please feel free to contact us:

  • Lehong Wu: aladonwlh@stu.pku.edu.cn
  • Team Page: STRUCT