Andrew Shin

Can A Gamer Train A Mathematical Reasoning Model?

Jun 10, 2025

Andrew Shin

Abstract: While large language models (LLMs) have achieved remarkable performance in various tasks including mathematical reasoning, their development typically demands prohibitive computational resources. Recent advancements have reduced costs for training capable models, yet even these approaches rely on high-end hardware clusters. In this paper, we demonstrate that a single average gaming GPU can train a solid mathematical reasoning model by integrating reinforcement learning and memory optimization techniques. Specifically, we train a 1.5B-parameter mathematical reasoning model on an RTX 3080 Ti with 16GB of memory that achieves comparable or better performance on mathematical reasoning benchmarks than models several times larger, in resource-constrained environments. Our results challenge the paradigm that state-of-the-art mathematical reasoning necessitates massive infrastructure, democratizing access to high-performance AI research. https://github.com/shinandrew/YouronMath.

Large Language Models Lack Understanding of Character Composition of Words

May 18, 2024

Andrew Shin, Kunitake Kaneko

Abstract: Large language models (LLMs) have demonstrated remarkable performance on a wide range of natural language tasks. Yet, LLMs' successes have been largely restricted to tasks concerning words, sentences, or documents, and it remains questionable how much they understand the minimal units of text, namely characters. In this paper, we examine contemporary LLMs regarding their ability to understand the character composition of words, and show that most of them fail to reliably carry out even simple tasks that humans can handle perfectly. We analyze their behavior in comparison with token-level performance, and discuss potential directions for future research.

The Lost Melody: Empirical Observations on Text-to-Video Generation From A Storytelling Perspective

May 13, 2024

Andrew Shin, Yusuke Mori, Kunitake Kaneko

Abstract: The text-to-video generation task has witnessed notable progress, with the generated outcomes reflecting the text prompts with high fidelity and impressive visual quality. However, current text-to-video generation models are invariably focused on conveying the visual elements of a single scene, and have so far been indifferent to another important potential of the medium, namely storytelling. In this paper, we examine text-to-video generation from a storytelling perspective, which has hardly been investigated, and make empirical remarks that spotlight the limitations of the current text-to-video generation scheme. We also propose an evaluation framework for storytelling aspects of videos, and discuss potential future directions.

* To appear at CVPR 2024 Workshop on AI for Content Creation (AI4CC)

LADER: Log-Augmented DEnse Retrieval for Biomedical Literature Search

Apr 10, 2023

Qiao Jin, Andrew Shin, Zhiyong Lu

Abstract: Queries with similar information needs tend to have similar document clicks, especially in biomedical literature search engines where queries are generally short and top documents account for most of the total clicks. Motivated by this, we present a novel architecture for biomedical literature search, namely Log-Augmented DEnse Retrieval (LADER), which is a simple plug-in module that augments a dense retriever with the click logs retrieved from similar training queries. Specifically, LADER finds both similar documents and similar queries to the given query with a dense retriever. Then, LADER scores relevant (clicked) documents of similar queries weighted by their similarity to the input query. The final document scores by LADER are the average of (1) the document similarity scores from the dense retriever and (2) the aggregated document scores from the click logs of similar queries. Despite its simplicity, LADER achieves new state-of-the-art (SOTA) performance on TripClick, a recently released benchmark for biomedical literature retrieval. On the frequent (HEAD) queries, LADER largely outperforms the best retrieval model by 39% relative NDCG@10 (0.338 vs. 0.243). LADER also achieves better performance on the less frequent (TORSO) queries with 11% relative NDCG@10 improvement over the previous SOTA (0.303 vs. 0.272). On the rare (TAIL) queries where similar queries are scarce, LADER still compares favorably to the previous SOTA method (NDCG@10: 0.310 vs. 0.295). On all queries, LADER can improve the performance of a dense retriever by 24%-37% relative NDCG@10 while not requiring additional training, and further performance improvement is expected from more logs. Our regression analysis has shown that queries that are more frequent and that have higher entropy of query similarity and lower entropy of document similarity tend to benefit more from log augmentation.

* SIGIR 2023
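
The scoring rule described in the LADER abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `lader_scores`, the dot-product similarity, the `top_k` cutoff, and the absence of any score normalization are all assumptions made for illustration. It shows only the two ingredients the abstract names, a dense similarity score per document and a click-log score aggregated from similar training queries, averaged into a final score.

```python
from collections import defaultdict


def dot(a, b):
    """Dot product of two equal-length vectors (stand-in for a dense retriever's similarity)."""
    return sum(x * y for x, y in zip(a, b))


def lader_scores(query_vec, doc_vecs, train_query_vecs, click_logs, top_k=2):
    """Hypothetical sketch of log-augmented scoring.

    doc_vecs:         {doc_id: embedding}
    train_query_vecs: {train_query_id: embedding}
    click_logs:       {train_query_id: [clicked doc_id, ...]}
    """
    # (1) Document similarity scores from the dense retriever.
    dense = {d: dot(query_vec, v) for d, v in doc_vecs.items()}

    # Find the training queries most similar to the input query.
    similar = sorted(
        ((dot(query_vec, v), q) for q, v in train_query_vecs.items()),
        reverse=True,
    )[:top_k]

    # (2) Aggregate clicked documents of similar queries,
    # weighted by each query's similarity to the input query.
    log_score = defaultdict(float)
    for sim, q in similar:
        for d in click_logs.get(q, []):
            log_score[d] += sim

    # Final score: average of the dense score and the click-log score.
    # Assumption: only documents known to the dense index are scored.
    return {d: 0.5 * (dense[d] + log_score.get(d, 0.0)) for d in dense}


if __name__ == "__main__":
    scores = lader_scores(
        query_vec=[1.0, 0.0],
        doc_vecs={"d1": [0.5, 0.0], "d2": [0.25, 0.0]},
        train_query_vecs={"q1": [1.0, 0.0]},
        click_logs={"q1": ["d2"]},
        top_k=1,
    )
    print(scores)
```

In this toy input, `d2` has the lower dense score but was clicked for a training query similar to the input query, so the averaged score ranks it above `d1`. No retraining of the retriever is involved, which matches the plug-in nature the abstract describes.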

Perspectives and Prospects on Transformer Architecture for Cross-Modal Tasks with Language and Vision

Mar 06, 2021

Andrew Shin, Masato Ishii, Takuya Narihira

Abstract: Transformer architectures have brought about fundamental changes to the field of computational linguistics, which had been dominated by recurrent neural networks for many years. Their success also implies drastic changes in cross-modal tasks with language and vision, and many researchers have already tackled the issue. In this paper, we review some of the most critical milestones in the field, as well as overall trends in how the transformer architecture has been incorporated into visuolinguistic cross-modal tasks. Furthermore, we discuss its current limitations and speculate upon some of the prospects that we find imminent.

Neural Network Libraries: A Deep Learning Framework Designed from Engineers' Perspectives

Feb 12, 2021

Akio Hayakawa, Masato Ishii, Yoshiyuki Kobayashi, Akira Nakamura, Takuya Narihira, Yukio Obuchi, Andrew Shin, Takuya Yashima, Kazuki Yoshiyama

Abstract: While there exists a plethora of deep learning tools and frameworks, the fast-growing complexity of the field brings new demands and challenges, such as more flexible network design, speedy computation in distributed settings, and compatibility between different tools. In this paper, we introduce Neural Network Libraries (https://nnabla.org), a deep learning framework designed from engineers' perspectives, with emphasis on usability and compatibility as its core design principles. We elaborate on each of our design principles and its merits, and validate our attempts via experiments.

* https://nnabla.org

Reference-Based Video Colorization with Spatiotemporal Correspondence

Nov 25, 2020

Naofumi Akimoto, Akio Hayakawa, Andrew Shin, Takuya Narihira

Abstract: We propose a novel reference-based video colorization framework with spatiotemporal correspondence. Reference-based methods colorize grayscale frames by referencing a user-provided color frame. Existing methods suffer from color leakage between objects and the emergence of average colors, both derived from non-local semantic correspondence in space. To address this issue, we warp colors only from the regions of the reference frame restricted by correspondence in time. We propagate masks as temporal correspondences, using two complementary tracking approaches: off-the-shelf instance tracking for high-performance segmentation, and newly proposed dense tracking to track various types of objects. By restricting color referencing to temporally related regions, our approach propagates faithful colors throughout the video. Experiments demonstrate that our method outperforms state-of-the-art methods quantitatively and qualitatively.

Customized Image Narrative Generation via Interactive Visual Question Generation and Answering

Apr 27, 2018

Andrew Shin, Yoshitaka Ushiku, Tatsuya Harada

Abstract: The image description task has invariably been examined in a static manner, with qualitative presumptions held to be universally applicable regardless of the scope or target of the description. In practice, however, different viewers may pay attention to different aspects of the image, and yield different descriptions or interpretations under various contexts. Such diversity in perspectives is difficult to derive with conventional image description techniques. In this paper, we propose a customized image narrative generation task, in which users are interactively engaged in the generation process by providing answers to questions. We further attempt to learn the user's interest by repeating such interactive stages, and to automatically reflect that interest in descriptions for new images. Experimental results demonstrate that our model can generate a variety of descriptions from a single image that cover a wider range of topics than conventional models, while being customizable to the target user of interaction.

* To Appear at CVPR 2018 as spotlight presentation

DualNet: Domain-Invariant Network for Visual Question Answering

May 04, 2017

Kuniaki Saito, Andrew Shin, Yoshitaka Ushiku, Tatsuya Harada

Abstract: The visual question answering (VQA) task not only bridges the gap between images and language, but also requires that specific contents within the image be understood as indicated by the linguistic context of the question, in order to generate accurate answers. Thus, it is critical to build an efficient embedding of images and texts. We implement DualNet, which takes full advantage of the discriminative power of both image and textual features by separately performing two operations. Building an ensemble of DualNet further boosts the performance. Contrary to common belief, our method proved effective on both real images and abstract scenes, in spite of the significantly different properties of the respective domains. Our method was able to outperform previous state-of-the-art methods in the real images category even without explicitly employing an attention mechanism, and also outperformed our own state-of-the-art method in the abstract scenes category, which recently won first place in VQA Challenge 2016.

* Accepted as an oral paper by ICME 2017

The Color of the Cat is Gray: 1 Million Full-Sentences Visual Question Answering

Sep 21, 2016

Andrew Shin, Yoshitaka Ushiku, Tatsuya Harada

Abstract: The Visual Question Answering (VQA) task has showcased a new stage of interaction between language and vision, two of the most pivotal components of artificial intelligence. However, it has mostly focused on generating short and repetitive answers, mostly single words, which fall short of the rich linguistic capabilities of humans. We introduce the Full-Sentence Visual Question Answering (FSVQA) dataset, consisting of nearly 1 million pairs of questions and full-sentence answers for images, built by applying a number of rule-based natural language processing techniques to the original VQA dataset and captions in the MS COCO dataset. This adds many complexities to the conventional VQA task, and we provide a baseline for approaching and evaluating the task, on top of which we invite the research community to build further improvements.
