Building a socially intelligent agent involves many challenges, one of which is to track the agent's mental state transition and teach the agent to make rational decisions guided by its utility like a human. Towards this end, we propose to incorporate a mental state parser and utility model into dialogue agents. The hybrid mental state parser extracts information from both the dialogue and event observations and maintains a graphical representation of the agent's mind; Meanwhile, the utility model is a ranking model that learns human preferences from a crowd-sourced social commonsense dataset, Social IQA. Empirical results show that the proposed model attains state-of-the-art performance on the dialogue/action/emotion prediction task in the fantasy text-adventure game dataset, LIGHT. We also show example cases to demonstrate: (\textit{i}) how the proposed mental state parser can assist agent's decision by grounding on the context like locations and objects, and (\textit{ii}) how the utility model can help the agent make reasonable decisions in a dilemma. To the best of our knowledge, we are the first work that builds a socially intelligent agent by incorporating a hybrid mental state parser for both discrete events and continuous dialogues parsing and human-like utility modeling.