Despite the popularity of decentralized controller learning, very few successes have been demonstrated in learning to control large robot swarms from raw visual observations. To fill this gap, we present Vision-based Graph Aggregation and Inference (VGAI), a decentralized learning-to-control framework that directly maps raw visual observations to agent actions, aided by sparse local communication among neighboring agents only. Our framework is implemented as a cascade of convolutional neural networks (CNNs) and one graph neural network (GNN): the CNNs address agent-level visual perception and feature learning, while the GNN addresses swarm-level local information aggregation and agent action inference. Using drone flocking as an application example, we show that VGAI performs comparably to or better than other decentralized controllers, and even the centralized controller that learns from global information. In particular, it scales to learning over large swarms (e.g., 50 agents), thanks to the integration of visual perception and local communication.
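
To make the swarm-level aggregation step concrete, the following is a minimal NumPy sketch of multi-hop local aggregation followed by action inference. It is an illustration under stated assumptions, not the VGAI implementation: the per-agent features are random stand-ins for CNN outputs, and the communication radius, the number of hops, the 2-D action dimension, the single linear readout with `tanh`, and all function names are hypothetical choices made only for this example.

```python
import numpy as np

def neighbor_graph(positions, radius):
    """Binary adjacency: two agents are neighbors if closer than `radius`."""
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    adj = (dist < radius).astype(float)
    np.fill_diagonal(adj, 0.0)  # no self-loops
    return adj

def aggregate(features, adj, hops):
    """Concatenate features gathered over 0..hops rounds of neighbor exchange."""
    gathered = [features]
    for _ in range(hops):
        # Each round, every agent sums what its direct neighbors held last round.
        gathered.append(adj @ gathered[-1])
    return np.concatenate(gathered, axis=1)

def act(features, adj, weights, hops):
    """Map aggregated features to bounded actions with a shared linear readout."""
    return np.tanh(aggregate(features, adj, hops) @ weights)

# Assumed sizes for illustration only.
rng = np.random.default_rng(0)
n_agents, feat_dim, hops, action_dim = 50, 8, 2, 2
positions = rng.uniform(0.0, 10.0, size=(n_agents, 2))
features = rng.normal(size=(n_agents, feat_dim))  # stand-in for CNN features
weights = 0.1 * rng.normal(size=(feat_dim * (hops + 1), action_dim))

adj = neighbor_graph(positions, radius=2.0)
actions = act(features, adj, weights, hops)
```

Because the readout weights are shared across agents and each round of aggregation uses only neighbor-to-neighbor exchange, the same parameters apply unchanged to a swarm of any size in this sketch.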