Abstract:Recent advancements in Large Vision Language Models (LVLMs) have enabled the development of LVLM-based Graphical User Interface (GUI) agents under various paradigms. Training-based approaches, such as CogAgent and SeeClick, struggle with cross-dataset and cross-platform generalization due to their reliance on dataset-specific training. Generalist LVLMs, such as GPT-4V, employ Set-of-Marks (SoM) for action grounding, but obtaining SoM labels requires metadata like HTML source, which is not consistently available across platforms. Moreover, existing methods often specialize in singular GUI tasks rather than achieving comprehensive GUI understanding. To address these limitations, we introduce TRISHUL, a novel, training-free agentic framework that enhances generalist LVLMs for holistic GUI comprehension. Unlike prior works that focus on either action grounding (mapping instructions to GUI elements) or GUI referring (describing GUI elements given a location), TRISHUL seamlessly integrates both. At its core, TRISHUL employs Hierarchical Screen Parsing (HSP) and the Spatially Enhanced Element Description (SEED) module, which work synergistically to provide multi-granular, spatially, and semantically enriched representations of GUI elements. Our results demonstrate TRISHUL's superior performance in action grounding across the ScreenSpot, VisualWebBench, AITW, and Mind2Web datasets. Additionally, for GUI referring, TRISHUL surpasses the ToL agent on the ScreenPR benchmark, setting a new standard for robust and adaptable GUI comprehension.
Abstract:With the multitude of companies and organizations abound today, ranking them and choosing one out of the many is a difficult and cumbersome task. Although there are many available metrics that rank companies, there is an inherent need for a generalized metric that takes into account the different aspects that constitute employee opinions of the companies. In this work, we aim to overcome the aforementioned problem by generating aspect-sentiment based embedding for the companies by looking into reliable employee reviews of them. We created a comprehensive dataset of company reviews from the famous website Glassdoor.com and employed a novel ensemble approach to perform aspect-level sentiment analysis. Although a relevant amount of work has been done on reviews centered on subjects like movies, music, etc., this work is the first of its kind. We also provide several insights from the collated embeddings, thus helping users gain a better understanding of their options as well as select companies using customized preferences.