LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation

University of Maryland, College Park
ECCV 2024

LEIA can generate unseen articulated states of objects with multiple moving parts, using only the start and end states as input!

Abstract

Neural Radiance Fields (NeRFs) have revolutionized the reconstruction of static scenes and objects in 3D, offering unprecedented quality. However, extending NeRFs to model dynamic objects or object articulations remains a challenging problem. Previous works have tackled this issue by focusing on part-level reconstruction and motion estimation for objects, but they often rely on heuristics regarding the number of moving parts or object categories, which can limit their practical use. In this work, we introduce LEIA, a novel approach for representing dynamic 3D objects. Our method involves observing the object at distinct time steps or "states" and conditioning a hypernetwork on the current state, using this to parameterize our NeRF. This approach allows us to learn a view-invariant latent representation for each state. We further demonstrate that by interpolating between these states, we can generate novel articulation configurations in 3D space that were previously unseen. Our experimental results highlight the effectiveness of our method in articulating objects in a manner that is independent of the viewing angle and joint configuration. Notably, our approach outperforms previous methods that rely on motion information for articulation registration.

How it works

Overview of our method. We take multi-view images of the object in different states as input. A learnable latent dictionary based on an autoencoder learns an embedding per state id. The latent embedding is fed to a hypernetwork, which modulates and generates the weights of the NeRF so that it reconstructs the given state. At inference time, we perform a weighted interpolation of the learned latents to generate a corresponding new intermediate state.
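To make the pipeline above concrete, here is a minimal PyTorch sketch of the idea: a per-state latent dictionary, a hypernetwork that generates NeRF layer weights from a state latent, and weighted interpolation of latents at inference time. All module names, layer sizes, and the interpolation weight alpha are illustrative assumptions, not the actual LEIA implementation.

    import torch
    import torch.nn as nn

    class LatentDictionary(nn.Module):
        """One learnable embedding per observed articulation state."""
        def __init__(self, num_states: int, latent_dim: int = 128):
            super().__init__()
            self.embeddings = nn.Embedding(num_states, latent_dim)

        def forward(self, state_id: torch.Tensor) -> torch.Tensor:
            return self.embeddings(state_id)

    class HyperLinear(nn.Module):
        """A linear layer whose weights and bias are generated from a state latent."""
        def __init__(self, latent_dim: int, in_dim: int, out_dim: int):
            super().__init__()
            self.in_dim, self.out_dim = in_dim, out_dim
            self.weight_gen = nn.Linear(latent_dim, in_dim * out_dim)
            self.bias_gen = nn.Linear(latent_dim, out_dim)

        def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
            W = self.weight_gen(z).view(self.out_dim, self.in_dim)
            b = self.bias_gen(z)
            return x @ W.t() + b

    class HyperNeRF(nn.Module):
        """Tiny NeRF-style MLP whose layers are modulated by the state latent."""
        def __init__(self, latent_dim: int = 128, hidden: int = 64):
            super().__init__()
            self.l1 = HyperLinear(latent_dim, 3, hidden)      # xyz -> hidden
            self.l2 = HyperLinear(latent_dim, hidden, hidden)
            self.out = HyperLinear(latent_dim, hidden, 4)     # hidden -> (rgb, sigma)

        def forward(self, xyz: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
            h = torch.relu(self.l1(xyz, z))
            h = torch.relu(self.l2(h, z))
            return self.out(h, z)

    # Training: condition the NeRF on the latent of the state the images come from.
    latents = LatentDictionary(num_states=2)
    nerf = HyperNeRF()
    z_start = latents(torch.tensor(0))
    z_end = latents(torch.tensor(1))

    # Inference: a weighted interpolation of the learned latents yields an
    # unseen intermediate articulation state.
    alpha = 0.5
    z_mid = (1 - alpha) * z_start + alpha * z_end
    pred = nerf(torch.rand(1024, 3), z_mid)   # (rgb, sigma) at the interpolated state

In this sketch the interpolated latent is simply passed through the same hypernetwork, so novel states require no retraining; only the conditioning latent changes.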

Results

We evaluate our method on both synthetic and real-world data. We show that our method can generate novel states for articulated objects that were not seen during training. We demonstrate its effectiveness in capturing complex articulations and show that it outperforms previous methods that rely on motion information for articulation registration. We also show that our method handles both single and multiple articulations, as well as combinations of motion types.

Real World Data Results

The following results show LEIA applied to a real-world storage object whose images we collected ourselves. We show the start state and the motion generated by LEIA.

Synthetic Data Results

The following results show LEIA applied to objects from PartNet-Mobility, a synthetic dataset. We show the start and end states, the ground-truth motion, and the motion generated by LEIA.

     State 1           State 2           GT Motion       LEIA Motion

washing machine, motion: revolute


dishwasher, motion: prismatic

storage, motion: prismatic

storage, motion: revolute multi part

storage, motion: prismatic multi part

storage, motion: prismatic and revolute multi part

sunglasses, motion: revolute multi part

box, motion: prismatic and revolute multi part

BibTeX


@misc{swaminathan2024leialatentviewinvariantembeddings,
  title={LEIA: Latent View-invariant Embeddings for Implicit 3D Articulation},
  author={Archana Swaminathan and Anubhav Gupta and Kamal Gupta and Shishira R. Maiya and Vatsal Agarwal and Abhinav Shrivastava},
  year={2024},
  eprint={2409.06703},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2409.06703},
}