In video-based person re-identification, spatial and temporal features are known to provide orthogonal cues for effective representations. Such representations are typically obtained by aggregating frame-level features using max/avg pooling at different points of the model. However, these operations also reduce the amount of discriminative information available, which is particularly harmful when the classes are poorly separable. To alleviate this problem, this paper introduces a symbolic temporal pooling method, where frame-level features are represented in distribution-valued symbolic form, obtained by fitting an Empirical Cumulative Distribution Function (ECDF) to each feature. Also, considering that the original triplet loss formulation cannot be applied directly to this kind of representation, we introduce a symbolic triplet loss function that measures the similarity between two symbolic objects. In an extensive empirical evaluation against the state-of-the-art on four well-known datasets (MARS, iLIDS-VID, PRID2011 and P-DESTRE), the observed results point to consistent improvements in performance over the previous best performing techniques.
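To make the idea concrete, the sketch below illustrates one plausible reading of ECDF-based symbolic pooling and a triplet loss over the resulting descriptors; the function names, the fixed quantile grid, and the 1-D Wasserstein-style distance are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def ecdf_symbolic_pooling(frame_feats, n_points=10):
    """Represent each feature dimension of a tracklet by its empirical
    distribution, sampled at `n_points` evenly spaced probability levels.

    frame_feats: array of shape (T, D) -- T frame-level feature vectors.
    Returns an array of shape (D, n_points): one quantile profile per
    feature dimension (the inverse ECDF on a fixed probability grid),
    instead of a single max/avg scalar per dimension.
    """
    probs = np.linspace(0.0, 1.0, n_points)
    return np.quantile(frame_feats, probs, axis=0).T

def symbolic_distance(sym_a, sym_b):
    """Distance between two symbolic (distribution-valued) descriptors:
    mean L1 gap between quantile profiles, i.e. a discretised 1-D
    Wasserstein distance, summed over feature dimensions (assumed form)."""
    return np.abs(sym_a - sym_b).mean(axis=1).sum()

def symbolic_triplet_loss(anchor, positive, negative, margin=0.3):
    """Triplet loss adapted to symbolic representations: pull same-identity
    tracklets together, push different identities apart by `margin`."""
    d_ap = symbolic_distance(anchor, positive)
    d_an = symbolic_distance(anchor, negative)
    return max(0.0, d_ap - d_an + margin)
```

Under these assumptions, each tracklet yields a (D, n_points) symbolic descriptor that preserves the per-feature distribution across frames, and the triplet loss operates on distances between whole distributions rather than pooled scalars.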