Various 3D modalities have been proposed for high-precision imitation learning tasks to compensate for the short-comings of RGB-only policies. Modalities that explicitly represent positions in Cartesian space have an inherent advantage over purely image-based ones, since they allow policies to reason about geometry. Point clouds are a common way to represent geometric information, and have several benefits such as permutation invariance and flexible observation size. Despite their effectiveness, a number of hybrid 2D/3D architectures have been proposed in the literature, indicating that this performance can often be task-dependent. We hypothesize that this may be due to the spectral bias of neural networks towards learning low frequency functions, which especially affects models conditioned on slow-moving Cartesian features. Building on prior work that uses a parametric projection from Cartesian space into high-dimensional Fourier space to overcome the innate low-pass filtering characteristic of neural networks, we apply Fourier features to several representative point cloud encoder architectures. We validate this approach on challenging manipulation tasks from the RoboCasa and ManiSkill3 benchmarks, and find that adding Fourier feature projections provides benefits across diverse encoder architectures and tasks, with meaningful improvements seen in the vast majority of tasks. We show that Fourier features are a general-purpose tool for point cloud-based imitation learning, which consistently improves performance by enabling policies to leverage geometric details more effectively than models conditioned on Cartesian features.
Adding a Fourier feature mapping from Cartesian coordinates into a higher-dimensional feature space improves performance for any point cloud encoder used for diffusion imitation learning. For high-precision policies, the network must learn to condition on fine details in the scene geometry to e.g. device whether to insert the leg into the slot or reposition it, yet neural networks learn the high frequency components of the target function only slowly, if at all. While neighbouring points in the scene have very similar Cartesian features, the high-dimensional Fourier features allow them to easily be distinguished.
Given a pointcloud, we first map each point and its neighbourhood to Fourier feature space. This amplifies subtle geometric differences in each neighborhood. The tokenizer extracts and aggregates features for each neighborhood to produce a set of tokens which are then forwarded to a goal-conditioned diffusion policy to denoise the next chunk of actions.