Multi-aspect controllable text generation aims to generate fluent sentences that possess multiple desired attributes simultaneously. Traditional methods either combine many operators at the decoding stage, often with costly iteration or search in the discrete text space, or train a separate controller for each aspect, which degrades text quality because of discrepancies between aspects. To address these limitations, we introduce MacLaSa, an approach to multi-aspect control that estimates a compact latent space for multiple aspects and samples from it efficiently with a robust sampler based on ordinary differential equations (ODEs). To eliminate the domain gaps between aspects, we use a Variational Autoencoder (VAE) network to map text sequences from different data sources into nearby latent representations. The estimated latent space allows us to formulate joint energy-based models (EBMs) and to plug in arbitrary attribute discriminators to achieve multi-aspect control. We then draw latent vectors with the ODE-based sampler and feed them to the VAE decoder to produce the target text sequences. Experimental results demonstrate that MacLaSa outperforms several strong baselines on attribute relevance and textual quality while maintaining a high inference speed.
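The latent-space pipeline summarized above can be illustrated with a toy sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the latent space is two-dimensional, each attribute discriminator is a hypothetical logistic classifier `(u, b)`, the joint energy is a Gaussian prior term plus weighted negative log-probabilities, and the ODE is a plain gradient flow `dz/dt = -grad E(z)` integrated with fixed-step Euler, which stands in for the ODE-based sampler. The VAE encoder and decoder are omitted entirely.

```python
import math


def log_sigmoid(s):
    """Numerically stable log(sigmoid(s))."""
    if s >= 0:
        return -math.log1p(math.exp(-s))
    return s - math.log1p(math.exp(s))


def dot(u, z):
    return sum(ui * zi for ui, zi in zip(u, z))


def joint_energy(z, discriminators, weights):
    """Toy joint energy: E(z) = 0.5*||z||^2 - sum_i w_i * log p_i(target | z)."""
    energy = 0.5 * dot(z, z)  # assumed standard Gaussian prior on the latent
    for (u, b), w in zip(discriminators, weights):
        energy -= w * log_sigmoid(dot(u, z) + b)
    return energy


def joint_energy_grad(z, discriminators, weights):
    """Analytic gradient of joint_energy with respect to z."""
    grad = list(z)  # gradient of the prior term 0.5*||z||^2
    for (u, b), w in zip(discriminators, weights):
        p = math.exp(log_sigmoid(dot(u, z) + b))  # p_i(target | z)
        for k in range(len(z)):
            # d/dz [-w * log sigmoid(u.z + b)] = -w * (1 - p) * u
            grad[k] -= w * (1.0 - p) * u[k]
    return grad


def integrate_latent_ode(z0, discriminators, weights, steps=200, dt=0.05):
    """Euler integration of the gradient-flow ODE dz/dt = -grad E(z)."""
    z = list(z0)
    for _ in range(steps):
        g = joint_energy_grad(z, discriminators, weights)
        z = [zi - dt * gi for zi, gi in zip(z, g)]
    return z


# Two hypothetical aspects, each scored by its own logistic discriminator.
discriminators = [([2.0, 0.0], 0.0), ([0.0, 2.0], 0.0)]
weights = [1.0, 1.0]

z_start = [-1.0, -1.0]  # a latent that both discriminators reject
z_final = integrate_latent_ode(z_start, discriminators, weights)

print("energy before:", joint_energy(z_start, discriminators, weights))
print("energy after: ", joint_energy(z_final, discriminators, weights))
```

The point of the sketch is the composition: because all aspects share one latent space, adding another aspect only means appending another term to the energy sum. The integration is deterministic given `z_start`; in a sampling setting the randomness would come from drawing the initial latent from the prior, and the resulting vector would be passed to a decoder to produce text.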