Jeroen Van Hautte

Unified Work Embeddings: Contrastive Learning of a Bidirectional Multi-task Ranker

Nov 11, 2025

Matthias De Lange, Jens-Joris Decorte, Jeroen Van Hautte

Abstract: Workforce transformation across diverse industries has driven an increased demand for specialized natural language processing capabilities. Nevertheless, tasks derived from work-related contexts inherently reflect real-world complexities, characterized by long-tailed distributions, extreme multi-label target spaces, and scarce data availability. The rise of generalist embedding models prompts the question of their performance in the work domain, especially as progress in the field has focused mainly on individual tasks. To this end, we introduce WorkBench, the first unified evaluation suite spanning six work-related tasks formulated explicitly as ranking problems, establishing a common ground for multi-task progress. Based on this benchmark, we find significant positive cross-task transfer, and use this insight to compose task-specific bipartite graphs from real-world data, synthetically enriched through grounding. This leads to Unified Work Embeddings (UWE), a task-agnostic bi-encoder that exploits our training-data structure with a many-to-many InfoNCE objective, and leverages token-level embeddings with task-agnostic soft late interaction. UWE demonstrates zero-shot ranking performance on unseen target spaces in the work domain, enables low-latency inference by caching the task target space embeddings, and shows significant gains in macro-averaged MAP and RP@10 over generalist embedding models.

* Preprint

Efficient Text Encoders for Labor Market Analysis

May 30, 2025

Jens-Joris Decorte, Jeroen Van Hautte, Chris Develder, Thomas Demeester

Abstract: Labor market analysis relies on extracting insights from job advertisements, which provide valuable yet unstructured information on job titles and corresponding skill requirements. While state-of-the-art methods for skill extraction achieve strong performance, they depend on large language models (LLMs), which are computationally expensive and slow. In this paper, we propose ConTeXT-match, a novel contrastive learning approach with token-level attention that is well-suited for the extreme multi-label classification task of skill classification. ConTeXT-match significantly improves skill extraction efficiency and performance, achieving state-of-the-art results with a lightweight bi-encoder model. To support robust evaluation, we introduce Skill-XL, a new benchmark with exhaustive, sentence-level skill annotations that explicitly address the redundancy in the large label space. Finally, we present JobBERT V2, an improved job title normalization model that leverages extracted skills to produce high-quality job title representations. Experiments demonstrate that our models are efficient, accurate, and scalable, making them ideal for large-scale, real-time labor market analysis.

* This work has been submitted to the IEEE for possible publication

SkillMatch: Evaluating Self-supervised Learning of Skill Relatedness

Oct 07, 2024

Jens-Joris Decorte, Jeroen Van Hautte, Thomas Demeester, Chris Develder

Abstract: Accurately modeling the relationships between skills is a crucial part of human resources processes such as recruitment and employee development. Yet, no benchmarks exist to evaluate such methods directly. We construct and release SkillMatch, a benchmark for the task of skill relatedness, based on expert knowledge mining from millions of job ads. Additionally, we propose a scalable self-supervised learning technique to adapt a Sentence-BERT model based on skill co-occurrence in job ads. This new method greatly surpasses traditional models for skill relatedness as measured on SkillMatch. By releasing SkillMatch publicly, we aim to contribute a foundation for research towards increased accuracy and transparency of skill-based recommendation systems.

* Accepted to the International workshop on AI for Human Resources and Public Employment Services (AI4HR&PES) as part of ECML-PKDD 2024

On the Biased Assessment of Expert Finding Systems

Oct 07, 2024

Jens-Joris Decorte, Jeroen Van Hautte, Chris Develder, Thomas Demeester

Abstract: In large organisations, identifying experts on a given topic is crucial in leveraging the internal knowledge spread across teams and departments. So-called enterprise expert retrieval systems automatically discover and structure employees' expertise based on the vast amount of heterogeneous data available about them and the work they perform. Evaluating these systems requires comprehensive ground truth expert annotations, which are hard to obtain. Therefore, the annotation process typically relies on automated recommendations of knowledge areas to validate. This case study provides an analysis of how these recommendations can impact the evaluation of expert finding systems. We demonstrate on a popular benchmark that system-validated annotations lead to overestimated performance of traditional term-based retrieval models and even invalidate comparisons with more recent neural methods. We also augment knowledge areas with synonyms to uncover a strong bias towards literal mentions of their constituent words. Finally, we propose constraints to the annotation process to prevent these biased evaluations, and show that this still allows annotation suggestions of high utility. These findings should inform benchmark creation or selection for expert finding, to guarantee meaningful comparison of methods.

* Accepted to the 4th Workshop on Recommender Systems for Human Resources (RecSys in HR 2024) as part of RecSys 2024

Career Path Prediction using Resume Representation Learning and Skill-based Matching

Oct 24, 2023

Jens-Joris Decorte, Jeroen Van Hautte, Johannes Deleu, Chris Develder, Thomas Demeester

Abstract: The impact of person-job fit on job satisfaction and performance is widely acknowledged, which highlights the importance of providing workers with next steps at the right time in their career. This task of predicting the next step in a career is known as career path prediction, and has diverse applications such as turnover prevention and internal job mobility. Existing methods for career path prediction rely on large amounts of private career history data to model the interactions between job titles and companies. We propose leveraging the unexplored textual descriptions that are part of work experience sections in resumes. We introduce a structured dataset of 2,164 anonymized career histories, annotated with ESCO occupation labels. Based on this dataset, we present a novel representation learning approach, CareerBERT, specifically designed for work history data. We develop a skill-based model and a text-based model for career path prediction, which achieve 35.24% and 39.61% recall@10 respectively on our dataset. Finally, we show that both approaches are complementary, as a hybrid approach achieves the strongest result with 43.01% recall@10.

* Accepted to the 3rd Workshop on Recommender Systems for Human Resources (RecSys in HR 2023) as part of RecSys 2023

Extreme Multi-Label Skill Extraction Training using Large Language Models

Jul 20, 2023

Jens-Joris Decorte, Severine Verlinden, Jeroen Van Hautte, Johannes Deleu, Chris Develder, Thomas Demeester

Abstract: Online job ads serve as a valuable source of information for skill requirements, playing a crucial role in labor market analysis and e-recruitment processes. Since such ads are typically formatted in free text, natural language processing (NLP) technologies are required to automatically process them. We specifically focus on the task of detecting skills (mentioned literally, or implicitly described) and linking them to a large skill ontology, making it a challenging case of extreme multi-label classification (XMLC). Given that no sizable labeled (training) dataset is available for this specific XMLC task, we propose techniques to leverage general Large Language Models (LLMs). We describe a cost-effective approach to generate an accurate, fully synthetic labeled dataset for skill extraction, and present a contrastive learning strategy that proves effective in the task. Our results across three skill extraction benchmarks show a consistent increase of between 15 and 25 percentage points in R-Precision@5 compared to previously published results that relied solely on distant supervision through literal matches.

* Accepted to the International workshop on AI for Human Resources and Public Employment Services (AI4HR&PES) as part of ECML-PKDD 2023
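
The abstract above reports its gains in R-Precision@5. As a rough illustration of what that metric measures, here is a minimal sketch. It assumes the definition commonly used in extreme multi-label classification: the number of gold labels among the top K predictions, divided by min(K, number of gold labels). The function names and the example skill labels are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Set


def r_precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """RP@K for one query (assumed definition): hits in the top K / min(K, #gold labels)."""
    if not relevant or k <= 0:
        return 0.0
    hits = sum(1 for label in ranked[:k] if label in relevant)
    return hits / min(k, len(relevant))


def macro_rp_at_k(rankings: List[Sequence[str]], gold: List[Set[str]], k: int) -> float:
    """Average RP@K over all queries (e.g. over all job-ad sentences)."""
    scores = [r_precision_at_k(r, g, k) for r, g in zip(rankings, gold)]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical example: one sentence with two gold skills, both ranked in the top 5.
ranking = ["python", "teamwork", "sql", "leadership", "docker"]
gold_skills = {"python", "sql"}
score = r_precision_at_k(ranking, gold_skills, k=5)
```

Dividing by min(K, number of gold labels) rather than by K means a query with fewer than K gold labels can still reach a perfect score, which matters when most sentences mention only one or two skills.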

Design of Negative Sampling Strategies for Distantly Supervised Skill Extraction

Sep 13, 2022

Jens-Joris Decorte, Jeroen Van Hautte, Johannes Deleu, Chris Develder, Thomas Demeester

Abstract: Skills play a central role in the job market and many human resources (HR) processes. In the wake of other digital experiences, today's online job market has candidates expecting to see the right opportunities based on their skill set. Similarly, enterprises increasingly need to use data to guarantee that the skills within their workforce remain future-proof. However, structured information about skills is often missing, and processes building on self- or manager-assessment have been shown to struggle with issues around adoption, completeness, and freshness of the resulting data. Extracting skills is a highly challenging task, given the many thousands of possible skill labels mentioned either explicitly or merely described implicitly and the lack of finely annotated training corpora. Previous work on skill extraction overly simplifies the task to an explicit entity detection task or builds on manually annotated training data that would be infeasible if applied to a complete vocabulary of skills. We propose an end-to-end system for skill extraction, based on distant supervision through literal matching. We propose and evaluate several negative sampling strategies, tuned on a small validation dataset, to improve the generalization of skill extraction towards implicitly mentioned skills, despite the lack of such implicit skills in the distantly supervised data. We observe that using the ESCO taxonomy to select negative examples from related skills yields the biggest improvements, and combining three different strategies in one model further increases the performance, up to 8 percentage points in RP@5. We introduce a manually annotated evaluation benchmark for skill extraction based on the ESCO taxonomy, on which we validate our models. We release the benchmark dataset for research purposes to stimulate further research on the task.

* Accepted to the 2nd Workshop on Recommender Systems for Human Resources (RecSys in HR 2022) as part of RecSys 2022

JobBERT: Understanding Job Titles through Skills

Sep 20, 2021

Jens-Joris Decorte, Jeroen Van Hautte, Thomas Demeester, Chris Develder

Abstract: Job titles form a cornerstone of today's human resources (HR) processes. Within online recruitment, they allow candidates to understand the contents of a vacancy at a glance, while internal HR departments use them to organize and structure many of their processes. As job titles are a compact, convenient, and readily available data source, modeling them with high accuracy can greatly benefit many HR tech applications. In this paper, we propose a neural representation model for job titles, by augmenting a pre-trained language model with co-occurrence information from skill labels extracted from vacancies. Our JobBERT method leads to considerable improvements compared to using generic sentence encoders, for the task of job title normalization, for which we release a new evaluation benchmark.

* Accepted to the International workshop on Fair, Effective And Sustainable Talent management using data science (FEAST) as part of ECML-PKDD 2021

Leveraging the Inherent Hierarchy of Vacancy Titles for Automated Job Ontology Expansion

Apr 06, 2020

Jeroen Van Hautte, Vincent Schelstraete, Mikaël Wornoo

Abstract: Machine learning plays an ever-bigger part in online recruitment, powering intelligent matchmaking and job recommendations across many of the world's largest job platforms. However, the main text is rarely enough to fully understand a job posting: more often than not, much of the required information is condensed into the job title. Several organised efforts have been made to map job titles onto a hand-made knowledge base so as to provide this information, but these only cover around 60% of online vacancies. We introduce a novel, purely data-driven approach towards the detection of new job titles. Our method is conceptually simple, extremely efficient and competitive with traditional NER-based approaches. Although the standalone application of our method does not outperform a finetuned BERT model, it can be applied as a preprocessing step as well, substantially boosting accuracy across several architectures.

* Accepted to the Proceedings of the 6th International Workshop on Computational Terminology (COMPUTERM 2020)

Bad Form: Comparing Context-Based and Form-Based Few-Shot Learning in Distributional Semantic Models

Oct 01, 2019

Jeroen Van Hautte, Guy Emerson, Marek Rei

Abstract: Word embeddings are an essential component in a wide range of natural language processing applications. However, distributional semantic models are known to struggle when only a small number of context sentences are available. Several methods have been proposed to obtain higher-quality vectors for these words, leveraging both this context information and sometimes the word forms themselves through a hybrid approach. We show that the current tasks do not suffice to evaluate models that use word-form information, as such models can easily leverage word forms in the training data that are related to word forms in the test data. We introduce 3 new tasks, allowing for a more balanced comparison between models. Furthermore, we show that hyperparameters that have largely been ignored in previous work can consistently improve the performance of both baseline and advanced models, achieving a new state of the art on 4 out of 6 tasks.

* Accepted to the Proceedings of the Second Workshop on Deep Learning for Low-Resource NLP (DeepLo 2019)
