UMR TETIS, Cirad
Abstract:The Land Matrix initiative (https://landmatrix.org) and its global observatory aim to provide reliable data on large-scale land acquisitions to inform debates and actions in sectors such as agriculture, extraction, or energy in low- and middle-income countries. Although these data are recognized in the academic world, they remain underutilized in public policy, mainly due to the complexity of access and exploitation, which requires technical expertise and a good understanding of the database schema. The objective of this work is to simplify access to data from different database systems. The methods proposed in this article are evaluated using data from the Land Matrix. This work presents various comparisons of Large Language Models (LLMs) as well as combinations of LLM adaptations (Prompt Engineering, RAG, Agents) to query different database systems (GraphQL and REST queries). The experiments are reproducible, and a demonstration is available online: https://github.com/tetis-nlp/landmatrix-graphql-python.
Abstract:In response to current crises (climatic, social, economic), self-sufficiency, a set of practices that combine energy sobriety, self-production of food and energy, and self-construction, is attracting growing interest. The CNRS STAY project (Savoirs Techniques pour l'Auto-suffisance, sur YouTube) explores this topic by analyzing techniques shared on YouTube. We present Agro-STAY, a platform designed for the collection, processing, and visualization of data from YouTube videos and their comments. We use Natural Language Processing (NLP) techniques and language models, which enable a fine-grained analysis of the alternative agricultural practices described online.
Abstract:Pre-trained vision-language models (VLMs), such as CLIP, demonstrate impressive zero-shot classification capabilities with free-form prompts and even show some generalization in specialized domains. However, their performance on satellite imagery is limited by the underrepresentation of such data in their training sets, which consist predominantly of ground-level images. Existing prompting techniques for satellite imagery are often restricted to generic phrases like "a satellite image of ...", limiting their effectiveness for zero-shot land-use and land-cover (LULC) mapping. To address these challenges, we introduce SenCLIP, which transfers CLIP's representation to Sentinel-2 imagery by leveraging a large dataset of Sentinel-2 images paired with geotagged ground-level photos from across Europe. We evaluate SenCLIP alongside other SOTA remote sensing VLMs on zero-shot LULC mapping tasks using the EuroSAT and BigEarthNet datasets with both aerial and ground-level prompting styles. Our approach, which aligns ground-level representations with satellite imagery, demonstrates significant improvements in classification accuracy across both prompt styles, opening new possibilities for applying free-form textual descriptions in zero-shot LULC mapping.
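As an illustration of the zero-shot classification mechanism mentioned in the abstract above, the following minimal sketch scores one image embedding against one text embedding per class prompt using cosine similarity followed by a softmax. It is a generic CLIP-style sketch, not the SenCLIP implementation: the function names, the temperature value, and the three-dimensional toy embeddings are all assumptions made for illustration, and a real system would obtain the embeddings from trained image and text encoders.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, prompt_embs, temperature=0.01):
    """Return (best_label, probabilities) for one image embedding.

    prompt_embs maps a class label to the embedding of its text prompt.
    The temperature is an assumed value, not one taken from the paper.
    """
    img = normalize(image_emb)
    logits = {
        label: sum(a * b for a, b in zip(img, normalize(emb))) / temperature
        for label, emb in prompt_embs.items()
    }
    # Numerically stable softmax over the class logits.
    peak = max(logits.values())
    exps = {label: math.exp(value - peak) for label, value in logits.items()}
    total = sum(exps.values())
    probs = {label: value / total for label, value in exps.items()}
    return max(probs, key=probs.get), probs


# Toy, hand-made embeddings (hypothetical): real ones come from the encoders.
toy_prompt_embs = {
    "forest": [0.9, 0.1, 0.0],
    "river": [0.1, 0.9, 0.1],
    "residential area": [0.0, 0.2, 0.9],
}
toy_image_emb = [0.8, 0.2, 0.1]

if __name__ == "__main__":
    label, probs = zero_shot_classify(toy_image_emb, toy_prompt_embs)
    print(label, probs)
```

In this setup, changing the prompt style (aerial versus ground-level descriptions) only changes the text embeddings passed in `prompt_embs`; the scoring step stays the same.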
Abstract:Language models are now essential tools for improving efficiency in many professional tasks such as writing, coding, or learning. For this reason, it is imperative to identify their inherent biases. In the field of Natural Language Processing, five sources of bias are well identified: data, annotation, representation, models, and research design. This study focuses on biases related to geographical knowledge. We explore the connection between geography and language models by highlighting their tendency to misrepresent spatial information, which leads to distortions in the representation of geographical distances. This study introduces four indicators to assess these distortions by comparing geographical and semantic distances. Experiments with these four indicators are conducted on ten widely used language models. Results underscore the critical necessity of inspecting and rectifying spatial biases in language models to ensure accurate and equitable representations.
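To make the idea of comparing geographical and semantic distances concrete, here is a minimal sketch that computes a great-circle distance between places, a cosine distance between their embeddings, and the correlation between the two sets of distances. It does not reproduce the four indicators of the study: the correlation measure is a stand-in chosen for illustration, and the helper names and the toy embedding vectors are assumptions (only the city coordinates are real, approximate values).

```python
import math
from itertools import combinations


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def cosine_distance(u, v):
    """One minus the cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norms = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norms


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Approximate real coordinates; the embeddings are made-up toy vectors.
coords = {"Paris": (48.8566, 2.3522), "Lyon": (45.7640, 4.8357), "Tokyo": (35.6762, 139.6503)}
toy_embeddings = {"Paris": [0.9, 0.1, 0.2], "Lyon": [0.8, 0.3, 0.1], "Tokyo": [0.1, 0.9, 0.4]}


def distance_agreement(coords, embeddings):
    """Correlate geographic and semantic distances over all pairs of places."""
    pairs = list(combinations(sorted(coords), 2))
    geo = [haversine_km(coords[a], coords[b]) for a, b in pairs]
    sem = [cosine_distance(embeddings[a], embeddings[b]) for a, b in pairs]
    return pearson(geo, sem)


if __name__ == "__main__":
    print(distance_agreement(coords, toy_embeddings))
```

A correlation near 1 would suggest that places far apart on the globe are also far apart in the embedding space; lower values would point to the kind of distortion the abstract describes.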
Abstract:In the context of Epidemic Intelligence, many Event-Based Surveillance (EBS) systems have been proposed in the literature to promote the early identification and characterization of potential health threats from online sources of any nature. Each EBS system has its own surveillance definitions and priorities, which makes selecting the most appropriate EBS system for a given situation a challenge for end-users. In this work, we propose a new evaluation framework to address this issue. It first transforms the raw input epidemiological event data into a set of normalized events at multiple levels of granularity, then conducts a descriptive retrospective analysis based on four evaluation objectives: spatial, temporal, thematic and source analysis. We illustrate its relevance by applying it to an Avian Influenza dataset collected by a selection of EBS systems, and show how our framework helps identify their strengths and drawbacks in terms of epidemic surveillance.
Abstract:Nowadays, modern Earth Observation systems continuously collect massive amounts of satellite information. The unprecedented possibility to acquire high-resolution Satellite Image Time Series (SITS) data (series of images with a high revisit time period over the same geographical area) is opening new opportunities to monitor different aspects of the Earth's surface but, at the same time, it is raising new challenges in terms of suitable methods to analyze and exploit such a huge amount of rich and complex image data. One of the main tasks associated with SITS data analysis is land cover mapping, where satellite data are exploited via learning methods to recover the status of the Earth's surface, i.e., the corresponding land cover classes. Due to operational constraints, the collected label information on which machine learning strategies are trained is often limited in volume and obtained at coarse granularity, carrying inexact and weak knowledge that can affect the whole process. To cope with such issues, in the context of object-based SITS land cover mapping, we propose a new deep learning framework, named TASSEL (aTtentive weAkly Supervised Satellite image time sEries cLassifier), that is able to intelligently exploit the weak supervision provided by the coarse-granularity labels. Furthermore, our framework also produces additional side information that supports model interpretability, with the aim of making the black box gray. Such side information makes it possible to associate a spatial interpretation with the model decision via visual inspection.
Abstract:Nowadays, there is general agreement on the need to better characterize agricultural monitoring systems in response to global changes. Timely and accurate land use/land cover mapping can support this vision by providing useful information at fine scale. Here, a deep learning approach is proposed to deal with multi-source land cover mapping at object level. The approach is based on an extension of a Recurrent Neural Network enriched via an attention mechanism dedicated to the multi-temporal data context. Moreover, a new hierarchical pretraining strategy is introduced, designed to exploit specific domain knowledge available as hierarchical relationships among land cover classes. Experiments carried out on Reunion Island, a French overseas department, demonstrate the significance of the proposal compared to standard remote sensing approaches for land cover mapping.
Abstract:The European satellite missions Sentinel-1 (S1) and Sentinel-2 (S2) provide, at high spatial resolution and high revisit time, radar and optical images, respectively, that support a wide range of Earth surface monitoring tasks such as Land Use/Land Cover mapping. A long-standing challenge in the remote sensing community is how to efficiently exploit multiple sources of information and leverage their complementarity; in this particular case, how to get the most out of radar and optical satellite image time series (SITS). Here, we propose to deal with land cover mapping through a deep learning framework especially tailored to leverage the multi-source complementarity provided by radar and optical SITS. The proposed architecture is based on an extension of the Recurrent Neural Network (RNN) enriched via a customized attention mechanism capable of fitting the specificity of SITS data. In addition, we propose a new pretraining strategy that exploits domain expert knowledge to guide the model parameter initialization. Thorough experimental evaluations involving several machine learning competitors, on two contrasted study sites, have demonstrated the suitability of our new attention mechanism combined with the extended RNN model, as well as the benefits and limits of injecting domain expert knowledge into the neural network training process.
Abstract:Recurrent Neural Networks (RNNs) can be seriously affected by the initial parameter assignment, which may result in poor generalization performance on new, unseen data. To tackle this crucial issue in the context of RNN-based classification, we propose a new supervised layer-wise pretraining strategy to initialize network parameters. The proposed approach leverages a data-aware strategy that sets up a taxonomy of classification problems automatically derived from the model behavior. To the best of our knowledge, despite the great interest in RNN-based classification, this is the first data-aware strategy dealing with the initialization of such models. The proposed strategy has been tested on four benchmarks from two different domains, i.e., Speech Recognition and Remote Sensing. Results underline the significance of our approach and point out that data-aware strategies positively support the initialization of RNN-based classification models.
Abstract:Nowadays, modern Earth Observation systems continuously generate huge amounts of data. A notable example is the Sentinel-2 mission, which provides images at high spatial resolution (up to 10m) with a high temporal revisit period (every 5 days), which can be organized in Satellite Image Time Series (SITS). While the use of SITS has proved beneficial in the context of Land Use/Land Cover (LULC) map generation, machine learning approaches commonly leveraged in the remote sensing field fail to take advantage of the spatio-temporal dependencies present in such data. Recently, new-generation deep learning methods have significantly advanced research in this field. These approaches have generally focused on a single type of neural network, i.e., Convolutional Neural Networks (CNNs) or Recurrent Neural Networks (RNNs), which model different but complementary information: spatial autocorrelation (CNNs) and temporal dependencies (RNNs). In this work, we propose the first deep learning architecture for the analysis of SITS data, namely DuPLO (DUal view Point deep Learning architecture for time series classificatiOn), that combines Convolutional and Recurrent neural networks to exploit their complementarity. Our hypothesis is that, since CNNs and RNNs capture different aspects of the data, a combination of both models would produce a more diverse and complete representation of the information for the underlying land cover classification task. Experiments carried out on two study sites with different land cover characteristics (i.e., the Gard site in France and Reunion Island in the Indian Ocean) demonstrate the significance of our proposal.