Traditional end-to-end task-oriented dialog systems first convert dialog context into dialog state and action state, before generating the system response. In this paper, we first empirically investigate the relationship between dialog/action state and generated system response. The empirical exploration shows that the system response performance is significantly affected by the quality of dialog state and action state. Based on these findings, we argue that enhancing the relationship modeling between dialog context and dialog/action state is beneficial to improving the quality of the dialog state and action state, which further improves the generated response quality. Therefore, we propose Mars, an end-to-end task-oriented dialog system with semantic-aware contrastive learning strategies to model the relationship between dialog context and dialog/action state. Empirical results show our proposed Mars achieves state-of-the-art performance on the MultiWOZ 2.0, CamRest676, and CrossWOZ.