In real-world human-robot systems, it is essential for a robot to comprehend human objectives and respond accordingly while performing an extended series of motor actions. Although human objective alignment has recently emerged as a promising paradigm in the realm of physical human-robot interaction, its application is typically confined to generating simple motions due to inherent theoretical limitations. In this work, our goal is to develop a general formulation to learn manipulation functional modules and long-term task goals simultaneously from physical human-robot interaction. We show the feasibility of our framework in enabling robots to align their behaviors with the long-term task objectives inferred from human interactions.