Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Yongjoo Park

Cache-of-Thought: Master-Apprentice Framework for Cost-Effective Vision Language Model Inference

Feb 27, 2025

Mingyuan Wu, Jize Jiang, Haozhen Zheng, Meitang Li, Zhaoheng Li, Beitong Tian, Bo Chen, Yongjoo Park, Minjia Zhang, Chengxiang Zhai(+1 more)

Abstract:Vision Language Models (VLMs) have achieved remarkable success in a wide range of vision applications of increasing complexity and scales, yet choosing the right VLM model size involves a trade-off between response quality and cost. While smaller VLMs are cheaper to run, they typically produce responses only marginally better than random guessing on benchmarks such as MMMU. In this paper, we propose Cache of Thought (CoT), a master apprentice framework for collaborative inference between large and small VLMs. CoT manages high quality query results from large VLMs (master) in a cache, which are then selected via a novel multi modal retrieval and in-context learning to aid the performance of small VLMs (apprentice). We extensively evaluate CoT on various widely recognized and challenging general VQA benchmarks, and show that CoT increases overall VQA performance by up to 7.7% under the same budget, and specifically boosts the performance of apprentice VLMs by up to 36.6%.

* Mingyuan, Jize, and Haozhen contributed equally, while Minjia, Chengxiang, and Klara advised equally

Via

Access Paper or Ask Questions

Transactional Python for Durable Machine Learning: Vision, Challenges, and Feasibility

May 15, 2023

Supawit Chockchowwat, Zhaoheng Li, Yongjoo Park

Abstract:In machine learning (ML), Python serves as a convenient abstraction for working with key libraries such as PyTorch, scikit-learn, and others. Unlike DBMS, however, Python applications may lose important data, such as trained models and extracted features, due to machine failures or human errors, leading to a waste of time and resources. Specifically, they lack four essential properties that could make ML more reliable and user-friendly -- durability, atomicity, replicability, and time-versioning (DART). This paper presents our vision of Transactional Python that provides DART without any code modifications to user programs or the Python kernel, by non-intrusively monitoring application states at the object level and determining a minimal amount of information sufficient to reconstruct a whole application. Our evaluation of a proof-of-concept implementation with public PyTorch and scikit-learn applications shows that DART can be offered with overheads ranging 1.5%--15.6%.

* 5 pages, 5 figures, to appear at DEEM 2023

Via

Access Paper or Ask Questions

Airphant: Cloud-oriented Document Indexing

Dec 26, 2021

Supawit Chockchowwat, Chaitanya Sood, Yongjoo Park

Figure 1 for Airphant: Cloud-oriented Document Indexing

Figure 2 for Airphant: Cloud-oriented Document Indexing

Figure 3 for Airphant: Cloud-oriented Document Indexing

Figure 4 for Airphant: Cloud-oriented Document Indexing

Abstract:Modern data warehouses can scale compute nodes independently of storage. These systems persist their data on cloud storage, which is always available and cost-efficient. Ad-hoc compute nodes then fetch necessary data on-demand from cloud storage. This ability to quickly scale or shrink data systems is highly beneficial if query workloads may change over time. We apply this new architecture to search engines with a focus on optimizing their latencies in cloud environments. However, simply placing existing search engines (e.g., Apache Lucene) on top of cloud storage significantly increases their end-to-end query latencies (i.e., more than 6 seconds on average in one of our studies). This is because their indexes can incur multiple network round-trips due to their hierarchical structure (e.g., skip lists, B-trees, learned indexes). To address this issue, we develop a new statistical index (called IoU Sketch). For lookup, IoU Sketch makes multiple asynchronous network requests in parallel. While IoU Sketch may fetch more bytes than existing indexes, it significantly reduces the index lookup time because parallel requests do not block each other. Based on IoU Sketch, we build an end-to-end search engine, called Airphant; we describe how Airphant builds, optimizes, and manages IoU Sketch; and ultimately, supports keyword-based querying. In our experiments with four real datasets, Airphant's average end-to-end latencies are between 13 milliseconds and 300 milliseconds, being up to 8.97x faster than Apache Lucence and 113.39x faster than Elasticsearch.

* 17 pages, to be published in ICDE 2022

Via

Access Paper or Ask Questions

BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees

Dec 26, 2018

Yongjoo Park, Jingyi Qing, Xiaoyang Shen, Barzan Mozafari

Figure 1 for BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees

Figure 2 for BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees

Figure 3 for BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees

Figure 4 for BlinkML: Efficient Maximum Likelihood Estimation with Probabilistic Guarantees

Abstract:The rising volume of datasets has made training machine learning (ML) models a major computational cost in the enterprise. Given the iterative nature of model and parameter tuning, many analysts use a small sample of their entire data during their initial stage of analysis to make quick decisions (e.g., what features or hyperparameters to use) and use the entire dataset only in later stages (i.e., when they have converged to a specific model). This sampling, however, is performed in an ad-hoc fashion. Most practitioners cannot precisely capture the effect of sampling on the quality of their model, and eventually on their decision-making process during the tuning phase. Moreover, without systematic support for sampling operators, many optimizations and reuse opportunities are lost. In this paper, we introduce BlinkML, a system for fast, quality-guaranteed ML training. BlinkML allows users to make error-computation tradeoffs: instead of training a model on their full data (i.e., full model), BlinkML can quickly train an approximate model with quality guarantees using a sample. The quality guarantees ensure that, with high probability, the approximate model makes the same predictions as the full model. BlinkML currently supports any ML model that relies on maximum likelihood estimation (MLE), which includes Generalized Linear Models (e.g., linear regression, logistic regression, max entropy classifier, Poisson regression) as well as PPCA (Probabilistic Principal Component Analysis). Our experiments show that BlinkML can speed up the training of large-scale ML tasks by 6.26x-629x while guaranteeing the same predictions, with 95% probability, as the full model.

* 22 pages, SIGMOD 2019

Via

Access Paper or Ask Questions

Database Learning: Toward a Database that Becomes Smarter Every Time

Mar 28, 2017

Yongjoo Park, Ahmad Shahab Tajik, Michael Cafarella, Barzan Mozafari

Figure 1 for Database Learning: Toward a Database that Becomes Smarter Every Time

Figure 2 for Database Learning: Toward a Database that Becomes Smarter Every Time

Figure 3 for Database Learning: Toward a Database that Becomes Smarter Every Time

Figure 4 for Database Learning: Toward a Database that Becomes Smarter Every Time

Abstract:In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their answers stem from the same underlying distribution that has produced the entire dataset. Exploiting and refining this knowledge should allow us to answer queries more analytically, rather than by reading enormous amounts of raw data. Also, processing more queries should continuously enhance our knowledge of the underlying distribution, and hence lead to increasingly faster response times for future queries. We call this novel idea---learning from past query answers---Database Learning. We exploit the principle of maximum entropy to produce answers, which are in expectation guaranteed to be more accurate than existing sample-based approximations. Empowered by this idea, we build a query engine on top of Spark SQL, called Verdict. We conduct extensive experiments on real-world query traces from a large customer of a major database vendor. Our results demonstrate that Verdict supports 73.7% of these queries, speeding them up by up to 23.0x for the same accuracy level compared to existing AQP systems.

* This manuscript is an extended report of the work published in ACM SIGMOD conference 2017

Via

Access Paper or Ask Questions