Hybrid ventilation (coupling natural and mechanical ventilation) is an energy-efficient way to provide fresh air in most climates, provided that it has a reliable control system. Operating such systems optimally requires a high-fidelity, control-oriented model that can forecast the indoor air temperature and humidity in near-real time from operational conditions such as window opening and HVAC schedules. However, widely used physics-based simulation models (i.e., white-box models) are labour-intensive to develop and computationally expensive to run. Alternatively, black-box models based on artificial neural networks can be trained to estimate building dynamics well. This paper investigates the capability of a multivariate, multi-head attention-based long short-term memory (LSTM) encoder-decoder neural network to predict the indoor air conditions of a building equipped with hybrid ventilation. The deep neural network used in this study aims to predict indoor air temperature dynamics when a window is opened or closed. Training and test data were generated from a detailed multi-zone office building model in EnergyPlus. The deep neural network accurately predicts the indoor air temperature of five zones whenever a window is opened or closed.
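
An encoder-decoder forecaster of this kind is usually trained on pairs of a multivariate input window and the target sequence that follows it. The abstract does not describe the data preparation, so the following is only a minimal sketch under stated assumptions: the function name `make_windows`, the feature layout (indoor temperature, outdoor temperature, window-open flag), the window lengths, and the toy values are all hypothetical and are not taken from the study.

```python
from typing import List, Sequence, Tuple

Row = Sequence[float]
Sample = Tuple[List[List[float]], List[float]]


def make_windows(series: Sequence[Row], n_in: int, n_out: int,
                 target_idx: int = 0) -> List[Sample]:
    """Slice a multivariate time series into (encoder input, decoder target) pairs.

    Each encoder input holds n_in consecutive rows with all features;
    each decoder target holds the next n_out values of the target feature only.
    """
    if n_in <= 0 or n_out <= 0:
        raise ValueError("n_in and n_out must be positive")

    samples: List[Sample] = []
    # Last valid start index leaves room for a full input and target window.
    last_start = len(series) - n_in - n_out
    for start in range(last_start + 1):
        split = start + n_in
        encoder_input = [list(row) for row in series[start:split]]
        decoder_target = [row[target_idx] for row in series[split:split + n_out]]
        samples.append((encoder_input, decoder_target))
    return samples


# Hypothetical toy data, one row per time step:
# [indoor temperature (degC), outdoor temperature (degC), window open (0/1)]
toy_series = [
    [22.0, 15.0, 0],
    [22.1, 15.2, 0],
    [21.6, 15.1, 1],
    [21.0, 15.0, 1],
    [20.8, 14.9, 1],
    [21.3, 14.8, 0],
]

# Three past steps in, two future indoor-temperature steps out.
samples = make_windows(toy_series, n_in=3, n_out=2)
```

Keeping the window state as an input feature is what would let such a model condition its forecast on whether a window is open or closed; a multi-zone setup would simply use a list of target indices instead of a single one.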