Sequential recommender systems (SRS) capture dynamic user preferences by modeling historical behaviors ordered in time. Despite their effectiveness, focusing only on the \textit{collaborative signals} from behaviors does not fully capture user interests. It is also important to model the \textit{semantic relatedness} reflected in content features, e.g., images and text. To that end, in this paper we aim to enhance SRS tasks by effectively unifying collaborative signals and semantic relatedness. Notably, we show empirically that achieving this goal is nontrivial because of semantic gap issues. We therefore propose an end-to-end two-stream architecture for sequential recommendation, named TSSR, to learn user preferences from both ID-based and content-based sequences. Specifically, we first present a novel hierarchical contrasting module, comprising a coarse user-grained term and a fine item-grained term, to align representations across modalities. Furthermore, we design a two-stream architecture to learn the dependencies within each intra-modality sequence and the complex interactions between inter-modality sequences, which can yield greater expressive capacity in understanding user interests. We conduct extensive experiments on five public datasets. The experimental results show that TSSR outperforms competitive baselines. We also make our experimental code publicly available at https://anonymous.4open.science/r/TSSR-2A27/.