Margreta Kuijper

Lightweight Conceptual Dictionary Learning for Text Classification Using Information Compression

Apr 28, 2024

Li Wan, Tansu Alpcan, Margreta Kuijper, Emanuele Viterbo

Abstract: We propose a novel, lightweight supervised dictionary learning framework for text classification based on data compression and representation. This two-phase algorithm first employs the Lempel-Ziv-Welch (LZW) algorithm to construct a dictionary from text datasets, focusing on the conceptual significance of dictionary elements. The dictionary is then refined using label data, optimizing dictionary atoms to enhance discriminative power based on mutual information and class distribution. This process generates discriminative numerical representations, facilitating the training of simple classifiers such as SVMs and neural networks. We evaluate our algorithm's information-theoretic performance using information bottleneck principles and introduce the information plane area rank (IPAR) as a novel metric to quantify it. Tested on six benchmark text datasets, our algorithm closely matches top-performing models on limited-vocabulary datasets, deviating by only about 2% while using just 10% of their parameters. However, it falls short on diverse-vocabulary datasets, likely due to the LZW algorithm's constraints with low-repetition data. This contrast highlights its efficiency and limitations across different dataset types.
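
The first phase described in the abstract builds a dictionary with LZW. The following is a minimal sketch of textbook LZW dictionary construction over a character string, not the authors' implementation. The function name `build_lzw_dictionary`, the `max_size` cap, and seeding the dictionary with only the characters seen in the input are all assumptions made for illustration. The label-aware refinement phase is not shown.

```python
def build_lzw_dictionary(text, max_size=4096):
    """Build an LZW phrase dictionary from a string (illustrative sketch).

    Assumption: the dictionary is seeded with the distinct characters of
    `text`; classic LZW usually seeds with all 256 byte values instead.
    """
    # Seed entries: one code per distinct character, in sorted order.
    dictionary = {ch: code for code, ch in enumerate(sorted(set(text)))}
    current = ""
    for ch in text:
        candidate = current + ch
        if candidate in dictionary:
            # Keep extending the current phrase while it is already known.
            current = candidate
        else:
            # New phrase: register it (if there is room) and restart from ch.
            if len(dictionary) < max_size:
                dictionary[candidate] = len(dictionary)
            current = ch
    return dictionary


if __name__ == "__main__":
    phrases = build_lzw_dictionary("abababab")
    print(list(phrases))
```

On repetitive input the dictionary accumulates progressively longer phrases, which is consistent with the abstract's remark that low-repetition data limits what LZW can capture.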

* 12 pages, TKDE format
