Hiroyuki Shindo

Graph-Structured Trajectory Extraction from Travelogues

Oct 22, 2024

Aitaro Yamamoto, Hiroyuki Otomo, Hiroki Ouchi, Shohei Higashiyama, Hiroki Teranishi, Hiroyuki Shindo, Taro Watanabe

Abstract: Previous studies on sequence-based extraction of human movement trajectories suffer from inadequate trajectory representation. Specifically, a pair of locations cannot always be lined up in a sequence, especially when one location geographically includes the other. In this study, we propose a graph representation that retains information on the geographic hierarchy as well as the temporal order of visited locations, and construct a benchmark dataset for graph-structured trajectory extraction. Experiments with our baselines demonstrate that visited locations and the order among them can be predicted accurately, but predicting the hierarchical relations remains a challenge.
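
To make the idea of a graph-structured trajectory concrete, here is a minimal sketch of a structure that stores geographic inclusion and temporal order as two separate edge types. The class name, the edge names (`includes`, `then`), and the example locations are hypothetical illustrations, not the annotation scheme or data of the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TrajectoryGraph:
    """Visited locations with two illustrative (hypothetical) edge types."""

    # parent location -> set of locations it geographically contains
    includes: dict = field(default_factory=dict)
    # location -> set of locations visited directly afterwards
    then: dict = field(default_factory=dict)

    def add_inclusion(self, parent, child):
        self.includes.setdefault(parent, set()).add(child)

    def add_transition(self, src, dst):
        self.then.setdefault(src, set()).add(dst)

    def visit_order(self, start):
        """Follow transition edges from `start`, assuming one successor per node."""
        order, seen, node = [], set(), start
        while node is not None and node not in seen:
            order.append(node)
            seen.add(node)
            successors = self.then.get(node, set())
            node = next(iter(successors), None)
        return order


graph = TrajectoryGraph()
# "Kyoto" contains both temples, so it cannot sit between them in a flat sequence.
graph.add_inclusion("Kyoto", "Kiyomizu-dera")
graph.add_inclusion("Kyoto", "Fushimi Inari")
# Temporal order is recorded only among locations at comparable levels.
graph.add_transition("Kiyomizu-dera", "Fushimi Inari")
graph.add_transition("Fushimi Inari", "Nara")

print(graph.visit_order("Kiyomizu-dera"))
```

Keeping the two relations apart means a containing region never has to be forced into the visit sequence alongside the places inside it.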

Distilling Named Entity Recognition Models for Endangered Species from Large Language Models

Mar 13, 2024

Jesse Atuhurra, Seiveright Cargill Dujohn, Hidetaka Kamigaito, Hiroyuki Shindo, Taro Watanabe

Abstract: Natural language processing (NLP) practitioners are leveraging large language models (LLMs) to create structured datasets from semi-structured and unstructured data sources such as patents, papers, and theses, without having domain-specific knowledge. At the same time, ecological experts are searching for a variety of means to preserve biodiversity. To contribute to these efforts, we focused on endangered species and distilled knowledge from GPT-4 through in-context learning. In effect, we created datasets for both named entity recognition (NER) and relation extraction (RE) via a two-stage process: 1) we generated synthetic data from GPT-4 for four classes of endangered species, and 2) humans verified the factual accuracy of the synthetic data, resulting in gold data. Our novel dataset contains a total of 3.6K sentences, evenly divided between 1.8K NER and 1.8K RE sentences. Because GPT-4 is resource intensive, the constructed dataset was then used to fine-tune both general BERT and domain-specific BERT variants, completing the knowledge distillation process from GPT-4 to BERT. Experiments show that our knowledge transfer approach is effective at creating an NER model suitable for detecting endangered species in texts.

Arukikata Travelogue Dataset with Geographic Entity Mention, Coreference, and Link Annotation

May 23, 2023

Shohei Higashiyama, Hiroki Ouchi, Hiroki Teranishi, Hiroyuki Otomo, Yusuke Ide, Aitaro Yamamoto, Hiroyuki Shindo, Yuki Matsuda, Shoko Wakamiya, Naoya Inoue(+2 more)

Abstract: Geoparsing is a fundamental technique for analyzing geo-entity information in text. We focus on document-level geoparsing, which considers geographic relatedness among geo-entity mentions, and present a Japanese travelogue dataset designed for evaluating document-level geoparsing systems. Our dataset comprises 200 travelogue documents with rich geo-entity information: 12,171 mentions, 6,339 coreference clusters, and 2,551 geo-entities linked to geo-database entries.

Arukikata Travelogue Dataset

May 19, 2023

Hiroki Ouchi, Hiroyuki Shindo, Shoko Wakamiya, Yuki Matsuda, Naoya Inoue, Shohei Higashiyama, Satoshi Nakamura, Taro Watanabe

Abstract: We have constructed the Arukikata Travelogue Dataset and released it free of charge for academic research. This dataset is a Japanese text dataset with a total of over 31 million words, comprising 4,672 Japanese domestic travelogues and 9,607 overseas travelogues. Before our dataset was provided, widely available travelogue data for research purposes was scarce, and each researcher had to prepare their own data. This hindered the replication of existing studies and fair comparative analysis of experimental results. Our dataset enables any researcher to conduct investigations on the same data and to ensure transparency and reproducibility in research. In this paper, we describe the academic significance, characteristics, and prospects of our dataset.

* The application website for Arukikata Travelogue Dataset: https://www.nii.ac.jp/dsc/idr/arukikata/

LUKE: Deep Contextualized Entity Representations with Entity-aware Self-attention

Oct 02, 2020

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, Yuji Matsumoto

Abstract: Entity representations are useful in natural language tasks involving entities. In this paper, we propose new pretrained contextualized representations of words and entities based on the bidirectional transformer. The proposed model treats words and entities in a given text as independent tokens, and outputs contextualized representations of them. Our model is trained using a new pretraining task based on the masked language model of BERT. The task involves predicting randomly masked words and entities in a large entity-annotated corpus retrieved from Wikipedia. We also propose an entity-aware self-attention mechanism that is an extension of the self-attention mechanism of the transformer, and considers the types of tokens (words or entities) when computing attention scores. The proposed model achieves impressive empirical performance on a wide range of entity-related tasks. In particular, it obtains state-of-the-art results on five well-known datasets: Open Entity (entity typing), TACRED (relation classification), CoNLL-2003 (named entity recognition), ReCoRD (cloze-style question answering), and SQuAD 1.1 (extractive question answering). Our source code and pretrained representations are available at https://github.com/studio-ousia/luke.

* EMNLP 2020
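
One way to picture attention scores that depend on token types is to select a different query matrix for each (attending type, attended type) pair. The sketch below is a toy, single-head illustration in plain Python under that assumption; the function names, the tiny matrices, and the shared key matrix are made up for the example and are not taken from the released LUKE code.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def entity_aware_attention(xs, types, queries, key):
    """Return attention weights where the query matrix depends on token types.

    xs:      list of input vectors, one per token
    types:   "word" or "entity" for each token
    queries: dict mapping (attending type, attended type) -> query matrix
             (assumption for this sketch: one matrix per type pair)
    key:     key matrix shared by all tokens
    """
    dim = len(xs[0])
    keys = [matvec(key, x) for x in xs]
    weights = []
    for x_i, t_i in zip(xs, types):
        scores = []
        for k_j, t_j in zip(keys, types):
            q = matvec(queries[(t_i, t_j)], x_i)
            scores.append(sum(a * b for a, b in zip(q, k_j)) / math.sqrt(dim))
        # Numerically stable softmax over the attended positions.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


identity = [[1.0, 0.0], [0.0, 1.0]]
doubled = [[2.0, 0.0], [0.0, 2.0]]
queries = {
    ("word", "word"): identity,
    ("word", "entity"): doubled,  # words score entities with a different matrix
    ("entity", "word"): identity,
    ("entity", "entity"): identity,
}

xs = [[1.0, 0.0], [1.0, 0.0]]  # identical inputs, different token types
weights = entity_aware_attention(xs, ["word", "entity"], queries, identity)
print(weights)
```

Because both tokens have the same input vector, any difference between the two rows of `weights` comes only from the type-dependent query matrices.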

Length-controllable Abstractive Summarization by Guiding with Summary Prototype

Jan 21, 2020

Itsumi Saito, Kyosuke Nishida, Kosuke Nishida, Atsushi Otsuka, Hisako Asano, Junji Tomita, Hiroyuki Shindo, Yuji Matsumoto

Abstract: We propose a new length-controllable abstractive summarization model. Recent state-of-the-art abstractive summarization models based on encoder-decoder models generate only one summary per source text. However, controllable summarization, especially of the length, is an important aspect for practical applications. Previous studies on length-controllable abstractive summarization incorporate length embeddings in the decoder module for controlling the summary length. Although the length embeddings can control where to stop decoding, they do not decide which information should be included in the summary within the length constraint. Unlike the previous models, our length-controllable abstractive summarization model incorporates a word-level extractive module in the encoder-decoder model instead of length embeddings. Our model generates a summary in two steps. First, our word-level extractor extracts a sequence of important words (we call it the "prototype text") from the source text according to the word-level importance scores and the length constraint. Second, the prototype text is used as additional input to the encoder-decoder model, which generates a summary by jointly encoding and copying words from both the prototype text and source text. Since the prototype text is a guide to both the content and length of the summary, our model can generate an informative and length-controlled summary. Experiments with the CNN/Daily Mail dataset and the NEWSROOM dataset show that our model outperformed previous models in length-controlled settings.
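
The first step, extracting a prototype text under a length constraint, can be sketched as a top-K selection over word-level scores. The function name, the example sentence, and the score values below are invented for illustration; in the paper the scores come from a trained extractor, and keeping exactly the K highest-scoring words in source order is an assumption of this sketch.

```python
def extract_prototype(words, scores, length):
    """Keep the `length` highest-scoring words, preserving source order."""
    if len(words) != len(scores):
        raise ValueError("words and scores must have the same length")
    # Rank positions by descending score; ties go to the earlier position.
    ranked = sorted(range(len(words)), key=lambda i: (-scores[i], i))
    keep = sorted(ranked[:max(length, 0)])
    return [words[i] for i in keep]


# Hypothetical importance scores for a toy source sentence.
words = ["the", "council", "approved", "a", "new", "budget", "on", "monday"]
scores = [0.05, 0.90, 0.80, 0.02, 0.40, 0.85, 0.03, 0.30]

print(extract_prototype(words, scores, 3))  # ['council', 'approved', 'budget']
print(extract_prototype(words, scores, 5))  # ['council', 'approved', 'new', 'budget', 'monday']
```

Changing `length` changes both how much content is selected and which words survive, which is the sense in which the prototype guides content as well as length.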

Neural Attentive Bag-of-Entities Model for Text Classification

Sep 10, 2019

Ikuya Yamada, Hiroyuki Shindo

Abstract: This study proposes a Neural Attentive Bag-of-Entities model, which is a neural network model that performs text classification using entities in a knowledge base. Entities provide unambiguous and relevant semantic signals that are beneficial for capturing semantics in texts. We combine simple high-recall entity detection based on a dictionary, to detect entities in a document, with a novel neural attention mechanism that enables the model to focus on a small number of unambiguous and relevant entities. We tested the effectiveness of our model using two standard text classification datasets (i.e., the 20 Newsgroups and R8 datasets) and a popular factoid question answering dataset based on a trivia quiz game. As a result, our model achieved state-of-the-art results on all datasets. The source code of the proposed model is available online at https://github.com/wikipedia2vec/wikipedia2vec.

* Accepted to CoNLL 2019

Pre-training of Deep Contextualized Embeddings of Words and Entities for Named Entity Disambiguation

Sep 01, 2019

Ikuya Yamada, Hiroyuki Shindo

Abstract: Deep contextualized embeddings trained using unsupervised language modeling (e.g., ELMo and BERT) are successful in a wide range of NLP tasks. In this paper, we propose a new contextualized embedding model of words and entities for named entity disambiguation (NED). Our model is based on the bidirectional transformer encoder and produces contextualized embeddings for words and entities in the input text. The embeddings are trained using a new masked entity prediction task that aims to train the model by predicting randomly masked entities in entity-annotated texts. We trained the model using entity-annotated texts obtained from Wikipedia. We evaluated our model by addressing NED using a simple NED model based on the trained contextualized embeddings. As a result, we achieved state-of-the-art or competitive results on several standard NED datasets.

Gated Graph Recursive Neural Networks for Molecular Property Prediction

Aug 31, 2019

Hiroyuki Shindo, Yuji Matsumoto

Abstract: Molecular property prediction is a fundamental problem for computer-aided drug discovery and materials science. Quantum-chemical simulations such as density functional theory (DFT) have been widely used for calculating molecular properties; however, because of the heavy computational cost, it is difficult to search a huge number of potential chemical compounds. Machine learning methods for molecular modeling are attractive alternatives; however, the development of expressive, accurate, and scalable graph neural networks for learning molecular representations is still challenging. In this work, we propose a simple and powerful graph neural network for molecular property prediction. We model a molecule as a directed complete graph in which each atom has a spatial position, and introduce a recursive neural network with a simple gating function. We also feed input embeddings to every layer as skip connections to accelerate training. Experimental results show that our model achieves state-of-the-art performance on the standard benchmark dataset for molecular property prediction.

Improving Multi-Word Entity Recognition for Biomedical Texts

Aug 15, 2019

Hamada A. Nayel, H. L. Shashirekha, Hiroyuki Shindo, Yuji Matsumoto

Abstract: Biomedical Named Entity Recognition (BioNER), which aims at extracting biomedical named entities from a given text, is a crucial step for analyzing biomedical texts. Different supervised machine learning algorithms have been applied to BioNER by various researchers. The main requirement of these approaches is an annotated dataset used for learning the parameters of machine learning algorithms. Segment Representation (SR) models comprise different tag sets used for representing the annotated data, such as IOB2, IOE2 and IOBES. In this paper, we propose an extension of the IOBES model to improve the performance of BioNER. The proposed SR model, FROBES, improves the representation of multi-word entities. We used a Bidirectional Long Short-Term Memory (BiLSTM) network, an instance of Recurrent Neural Networks (RNN), to design a baseline system for BioNER and evaluated the new SR model on two datasets, the i2b2/VA 2010 challenge dataset and the JNLPBA 2004 shared task dataset. The proposed SR model outperforms other models for multi-word entities with length greater than two. Further, the outputs of different SR models have been combined using a majority voting ensemble method, which outperforms the baseline models' performance.

* International Journal of Pure and Applied Mathematics, Volume 118 No. 16, 2018
* 13 pages, 2 figures, International Conference on Cognitive Informatics and Soft Computing (ICCISC-2017)
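
For readers unfamiliar with segment representation models, the sketch below converts entity spans into IOB2 and IOBES tags, two of the standard schemes named in the abstract. The function name, the example sentence, and the `protein` label are illustrative assumptions. FROBES is not implemented here because the abstract does not define its tag set.

```python
def tag_spans(n_tokens, spans, scheme="IOBES"):
    """Convert (start, end, label) spans, with `end` exclusive, into tags."""
    tags = ["O"] * n_tokens
    for start, end, label in spans:
        for i in range(start, end):
            if scheme == "IOB2":
                # B marks the first token of an entity, I every later token.
                prefix = "B" if i == start else "I"
            elif scheme == "IOBES":
                # S marks single-token entities; E marks the last token.
                if end - start == 1:
                    prefix = "S"
                elif i == start:
                    prefix = "B"
                elif i == end - 1:
                    prefix = "E"
                else:
                    prefix = "I"
            else:
                raise ValueError(f"unknown scheme: {scheme}")
            tags[i] = f"{prefix}-{label}"
    return tags


tokens = ["Mutations", "in", "tumor", "necrosis", "factor", "alpha", "and", "p53"]
spans = [(2, 6, "protein"), (7, 8, "protein")]  # hypothetical annotations

for scheme in ("IOB2", "IOBES"):
    print(scheme, list(zip(tokens, tag_spans(len(tokens), spans, scheme))))
```

In IOBES the four-token entity is tagged B, I, I, E and the single-token entity is tagged S, so entity boundaries are explicit at both ends, whereas IOB2 marks only the beginning.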
