This paper presents a novel layered framework that integrates visual foundation models to improve robot manipulation tasks and motion planning. The framework consists of five layers: Perception, Cognition, Planning, Execution, and Learning. Using visual foundation models, we enhance the robot's perception of its environment, enabling more efficient task understanding and accurate motion planning. This approach allows for real-time adjustments and continual learning, leading to significant improvements in task execution. Experimental results demonstrate the effectiveness of the proposed framework in various robot manipulation tasks and motion planning scenarios, highlighting its potential for practical deployment in dynamic environments.