The current landscape of Machine Learning (ML) and Deep Learning (DL) is rife with non-uniform frameworks, models, and system stacks, but lacks standard tools to facilitate the evaluation and measurement of models. Due to the absence of such tools, the current practice for evaluating and comparing the benefits of proposed AI innovations (be it hardware or software) on end-to-end AI pipelines is both arduous and error-prone -- stifling the adoption of the innovations. We propose MLModelScope -- a hardware/software agnostic platform to facilitate the evaluation, measurement, and introspection of ML models within AI pipelines. MLModelScope aids application developers in discovering and experimenting with models, data scientists in replicating results and evaluating models for publication, and system architects in understanding the performance of AI workloads. We describe the design and implementation of MLModelScope and show how it gives users a holistic view into the execution of models within AI pipelines. Using AlexNet as a case study, we demonstrate how MLModelScope aids in identifying deviations in accuracy, helps in pinpointing the source of system bottlenecks, and automates the evaluation and performance aggregation of models across frameworks and systems.