This paper presents the details of the Audio-Visual Scene Classification task in the DCASE 2021 Challenge (Task 1 Subtask B). The task is concerned with classification using audio and video modalities, using a dataset of synchronized recordings. Here we describe the datasets and baseline systems. After the challenge submission deadline, challenge results and analysis of the submissions will be added.