The past few years have witnessed increasing research interest in covariance-based feature representation. A variety of methods have been proposed to boost its efficacy, with some recent ones resorting to nonlinear kernel techniques. Noting that the essence of this feature representation is to characterise the underlying structure of visual features, this paper argues that an equally, if not more, important approach to boosting its efficacy is to improve the quality of this characterisation. Following this idea, we propose to exploit the structure sparsity of visual features in skeletal human action recognition, and to compute the sparse inverse covariance estimate (SICE) as the feature representation. We discuss the advantages of this new representation in handling small sample sizes and high dimensionality, and in its modelling capability. Furthermore, utilising the monotonicity property of SICE, we efficiently generate a hierarchy of SICE matrices that characterise the structure of visual features at different sparsity levels, and we develop two discriminative learning algorithms that adaptively integrate these matrices to perform recognition. As demonstrated by extensive experiments, the proposed representation leads to significantly improved recognition performance over comparable state-of-the-art methods. In particular, although it is based entirely on linear techniques, it is comparable to or even better than methods employing nonlinear kernel techniques. This result demonstrates the value of exploiting structure sparsity for covariance-based feature representation.
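
For orientation, SICE is commonly posed as an $\ell_1$-penalised maximum-likelihood problem. The sketch below assumes that standard formulation. The symbols $C$ (sample covariance of the visual features), $S$ (the estimated inverse covariance), $\lambda$ (the sparsity-controlling regularisation parameter), and $K$ (the number of sparsity levels) are notation introduced here for illustration and do not appear in the text above.

\[
S^{\star}(\lambda) \;=\; \arg\max_{S \succ 0} \; \log\det(S) \;-\; \operatorname{tr}(C S) \;-\; \lambda \lVert S \rVert_{1},
\qquad \lVert S \rVert_{1} = \sum_{i,j} \lvert S_{ij} \rvert .
\]

Under this formulation, solving for an increasing sequence $\lambda_{1} < \lambda_{2} < \dots < \lambda_{K}$ gives matrices $S^{\star}(\lambda_{1}), \dots, S^{\star}(\lambda_{K})$ that are, in general, progressively sparser, which is one way to read the hierarchy of SICE matrices at different sparsity levels described above.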