We present an optimization-based theory describing spiking cortical ensembles equipped with Spike-Timing-Dependent Plasticity (STDP) learning, as empirically observed in the visual cortex. Using our methods, we build a class of fully-connected, convolutional and action-based feature descriptors for event-based cameras, which we assess on N-MNIST, the challenging CIFAR10-DVS, and the IBM DVS128 Gesture dataset, respectively. We report significant accuracy improvements over conventional state-of-the-art event-based feature descriptors (+8% on CIFAR10-DVS) and large accuracy improvements over state-of-the-art STDP-based systems (+10% on N-MNIST, +7.74% on IBM DVS128 Gesture). Beyond ultra-low-power learning in neuromorphic edge devices, our work helps pave the way towards a biologically-realistic, optimization-based theory of cortical vision.