Abstract: Object-Centric Learning (OCL) can discover objects in images or videos by simply reconstructing the input. For better object discovery, representative OCL methods reconstruct the input as its Variational Autoencoder (VAE) intermediate representation, which suppresses pixel noise and promotes object separability by discretizing continuous super-pixels with template features. However, treating features as minimal units overlooks their composing attributes, thus impeding model generalization; indexing features with scalar numbers loses attribute-level similarities and differences, thus hindering model convergence. We propose \textit{Grouped Discrete Representation} (GDR) for OCL. We decompose features into combinatorial attributes via organized channel grouping, and compose these attributes into discrete representation via tuple indexes. Experiments show that our GDR consistently improves both Transformer- and Diffusion-based OCL methods on various datasets. Visualizations show that our GDR achieves better object separability.
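As a rough illustration of the grouping-plus-tuple-indexing idea, the PyTorch sketch below quantizes each channel group against its own small codebook; the names and shapes (e.g. `grouped_discretize`, `codebooks`) are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch of grouped discretization, assuming PyTorch; names and
# shapes are illustrative, not the paper's API.
import torch

def grouped_discretize(feats, codebooks):
    """Quantize features group-by-group, returning tuple indexes.

    feats:     (N, C) continuous features, e.g. VAE super-pixels.
    codebooks: (G, K, C // G) one small codebook per channel group (attribute).
    """
    N, C = feats.shape
    G, K, D = codebooks.shape
    assert C == G * D, "channels must split evenly into groups"
    groups = feats.view(N, G, D)  # decompose channels into G attribute groups
    # squared L2 distance from every group slice to every template attribute
    dists = ((groups.unsqueeze(2) - codebooks.unsqueeze(0)) ** 2).sum(-1)  # (N, G, K)
    tuple_idx = dists.argmin(-1)                   # (N, G): one sub-index per attribute
    quant = codebooks[torch.arange(G), tuple_idx]  # (N, G, D) nearest template attributes
    return tuple_idx, quant.reshape(N, C)          # compose attributes back into features

feats = torch.randn(6, 256)         # 6 super-pixels with 256 channels
codebooks = torch.randn(4, 64, 64)  # 4 attribute groups, 64 templates each
tuple_idx, discrete = grouped_discretize(feats, codebooks)  # (6, 4), (6, 256)
```

Under this setup, 4 groups of 64 templates can express 64^4 distinct discrete features from only 4x64 stored vectors, whereas a flat codebook of equal expressivity would need 64^4 template features.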
Abstract: Representing images or videos as object-level feature vectors, rather than pixel-level feature maps, facilitates advanced visual tasks. Object-Centric Learning (OCL) primarily achieves this by reconstructing the input under the guidance of Variational Autoencoder (VAE) intermediate representations, driving the so-called \textit{slots} to aggregate as much object information as possible. However, existing VAE guidance does not explicitly address the fact that objects can vary in pixel size, while models typically excel at specific pattern scales. We propose \textit{Multi-Scale Fusion} (MSF) to enhance VAE guidance for OCL training. To ensure objects of all sizes fall within the VAE's comfort zone, we adopt the \textit{image pyramid}, which produces intermediate representations at multiple scales; to foster scale-invariance/variance in object super-pixels, we devise \textit{inter}/\textit{intra-scale fusion}, which augments low-quality object super-pixels at one scale with corresponding high-quality super-pixels from another scale. On standard OCL benchmarks, our technique improves mainstream methods, including state-of-the-art diffusion-based ones. The source code is available in the supplemental material.
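A minimal sketch of the image-pyramid half of this pipeline is given below, assuming PyTorch; `encoder` stands in for any VAE encoder, and the mean fusion at the end is only a placeholder for the paper's inter-/intra-scale fusion rules.

```python
# Minimal sketch of multi-scale VAE representations from an image pyramid,
# assuming PyTorch; the mean fusion is a placeholder, not the paper's rule.
import torch
import torch.nn.functional as F

def pyramid_features(image, encoder, scales=(1.0, 0.5)):
    """Encode an image at several scales and align features to one grid."""
    feats = []
    for s in scales:
        x = image if s == 1.0 else F.interpolate(
            image, scale_factor=s, mode="bilinear", align_corners=False)
        feats.append(encoder(x))
    target = feats[0].shape[-2:]  # finest feature grid
    return [f if f.shape[-2:] == target else
            F.interpolate(f, size=target, mode="bilinear", align_corners=False)
            for f in feats]

encoder = torch.nn.Conv2d(3, 64, kernel_size=4, stride=4)  # toy stand-in encoder
feats = pyramid_features(torch.randn(2, 3, 128, 128), encoder)
fused = torch.stack(feats, 0).mean(0)  # placeholder for inter-/intra-scale fusion
```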
Abstract: Object-Centric Learning (OCL) represents dense image or video pixels as sparse object features. Representative methods utilize discrete representations composed of Variational Autoencoder (VAE) template features to suppress pixel-level information redundancy and guide object-level feature aggregation. The most recent advancement, Grouped Discrete Representation (GDR), further decomposes these template features into attributes. However, its naive channel grouping as decomposition may erroneously group channels belonging to different attributes together and discretize them as sub-optimal template attributes, which loses information and harms expressivity. We propose Organized GDR (OGDR) to organize channels belonging to the same attributes together for correct decomposition from features into attributes. In unsupervised segmentation experiments, OGDR is consistently superior to GDR in augmenting classical Transformer-based OCL methods; it even improves state-of-the-art diffusion-based ones. Codebook PCA and representation similarity analyses show that, compared with GDR, our OGDR eliminates redundancy and better preserves information for guiding object representation learning. The source code is available in the supplementary material.
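To make "organized" concrete, here is a small PyTorch sketch that reorders channels so highly correlated ones (a rough proxy for belonging to the same attribute) land in the same group before discretization; the greedy correlation clustering is an assumption for illustration, not necessarily the paper's mechanism.

```python
# Minimal sketch of organizing channels before grouping, assuming PyTorch;
# greedy correlation clustering is an illustrative proxy, not the paper's method.
import torch

def organize_channels(feats, num_groups):
    """Reorder channels so correlated ones fall into the same group.

    feats: (N, C) features pooled over a batch; returns a length-C permutation.
    """
    N, C = feats.shape
    size = C // num_groups
    x = (feats - feats.mean(0)) / (feats.std(0) + 1e-6)
    corr = (x.T @ x / N).abs()               # (C, C) channel correlation magnitude
    used = torch.zeros(C, dtype=torch.bool)
    order = []
    for _ in range(num_groups):
        seed = int((~used).nonzero()[0])     # first unused channel seeds a group
        used[seed] = True
        sims = corr[seed].clone()
        sims[used] = -1.0                    # never re-pick assigned channels
        mates = sims.topk(size - 1).indices  # most correlated unused channels
        used[mates] = True
        order += [seed] + mates.tolist()
    return torch.tensor(order)

perm = organize_channels(torch.randn(1024, 256), num_groups=4)
# feats[:, perm] can then be split into contiguous groups, e.g. by the
# grouped discretizer sketched under the GDR abstract above
```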
Abstract: Similar to humans perceiving visual scenes as objects, Object-Centric Learning (OCL) can abstract dense images or videos into sparse object-level features. Transformer-based OCL handles complex textures well due to the decoding guidance of discrete representation, obtained by discretizing noisy features in image or video feature maps using template features from a codebook. However, treating features as minimal units overlooks their composing attributes, thus impeding model generalization; indexing features with natural numbers loses attribute-level commonalities and characteristics, thus diminishing the heuristics available for model convergence. We propose \textit{Grouped Discrete Representation} (GDR) to address these issues by grouping features into attributes and indexing them with tuple numbers. In extensive experiments across different query initializations, dataset modalities, and model architectures, GDR consistently improves convergence and generalizability. Visualizations show that our method effectively captures attribute-level information in features. The source code will be available upon acceptance.
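The contrast between scalar and tuple indexing fits in a few lines of plain Python; the base-K digit decomposition below is an illustrative assumption about how a flat index maps onto per-attribute sub-indexes.

```python
# Minimal sketch of scalar vs. tuple indexing; the base-K decomposition is
# an illustrative assumption, with G attribute groups of K templates each.
def scalar_to_tuple(index, num_groups, codes_per_group):
    """Decompose a flat scalar code index into per-attribute digits."""
    digits = []
    for _ in range(num_groups):
        digits.append(index % codes_per_group)
        index //= codes_per_group
    return tuple(reversed(digits))

# Scalar indexes 5 and 6 carry no notion of similarity, but their tuples
# (0, 0, 0, 5) and (0, 0, 0, 6) visibly share three of four attributes.
print(scalar_to_tuple(5, num_groups=4, codes_per_group=64))  # (0, 0, 0, 5)
print(scalar_to_tuple(6, num_groups=4, codes_per_group=64))  # (0, 0, 0, 6)
```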
Abstract: Existing convolution techniques in artificial neural networks suffer from high computational complexity, whereas biological neural networks work in a far more powerful yet efficient way. Inspired by the biological plasticity of dendritic topology and synaptic strength, our method, Learnable Heterogeneous Convolution, realizes joint learning of kernel shape and weights, unifying existing handcrafted convolution techniques in a data-driven way. A model based on our method can converge with structurally sparse weights and then be accelerated by devices of high parallelism. In experiments, our method either reduces the computation of VGG16/19 and ResNet34/50 by nearly 5x on CIFAR10 and 2x on ImageNet without harming performance, while compressing the weights by 10x and 4x respectively, or improves accuracy by up to 1.0% on CIFAR10 and 0.5% on ImageNet with slightly higher efficiency. The code will be available on www.github.com/Genera1Z/LearnableHeterogeneousConvolution.
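One way to picture joint learning of kernel shape and weights is a convolution whose taps are gated by learnable logits; the soft-mask relaxation below is a hedged PyTorch sketch, not the paper's exact plasticity mechanism.

```python
# Minimal sketch of jointly learning kernel shape and weights, assuming
# PyTorch; the sigmoid gate is an illustrative relaxation, not the paper's
# exact dendritic/synaptic plasticity mechanism.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedConv2d(nn.Module):
    """Convolution whose kernel shape emerges from a learnable per-tap gate."""
    def __init__(self, cin, cout, k=3, padding=1):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(cout, cin, k, k) * 0.1)  # synaptic strength
        self.gate = nn.Parameter(torch.zeros(cout, cin, k, k))          # topology logits
        self.padding = padding

    def forward(self, x):
        mask = torch.sigmoid(self.gate)  # soft kernel shape in [0, 1]
        # taps whose mask converges near zero can be pruned after training,
        # yielding structured sparsity that parallel hardware can exploit
        return F.conv2d(x, self.weight * mask, padding=self.padding)

y = MaskedConv2d(3, 8)(torch.randn(1, 3, 32, 32))  # (1, 8, 32, 32)
```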