Time series analysis comprises statistical methods for analyzing a sequence of data points collected over an interval of time to identify interesting patterns and trends.
This study applies symbolic time series analysis and Markov modeling to explore the phonological structure of Evgenij Onegin-as captured through a graphemic vowel/consonant (V/C) encoding-and one contemporary Italian translation. Using a binary encoding inspired by Markov's original scheme, we construct minimalist probabilistic models that capture both local V/C dependencies and large-scale sequential patterns. A compact four-state Markov chain is shown to be descriptively accurate and generative, reproducing key features of the original sequences such as autocorrelation and memory depth. All findings are exploratory in nature and aim to highlight structural regularities while suggesting hypotheses about underlying narrative dynamics. The analysis reveals a marked asymmetry between the Russian and Italian texts: the original exhibits a gradual decline in memory depth, whereas the translation maintains a more uniform profile. To further investigate this divergence, we introduce phonological probes-short symbolic patterns that link surface structure to narrative-relevant cues. Tracked across the unfolding text, these probes reveal subtle connections between graphemic form and thematic development, particularly in the Russian original. By revisiting Markov's original proposal of applying symbolic analysis to a literary text and pairing it with contemporary tools from computational statistics and data science, this study shows that even minimalist Markov models can support exploratory analysis of complex poetic material. When complemented by a coarse layer of linguistic annotation, such models provide a general framework for comparative poetics and demonstrate that stylized structural patterns remain accessible through simple representations grounded in linguistic form.
The offshore wind energy sector is expanding rapidly, increasing the need for independent, high-temporal-resolution monitoring of infrastructure deployment and operation at global scale. While Earth Observation based offshore wind infrastructure mapping has matured for spatial localization, existing open datasets lack temporally dense and semantically fine-grained information on construction and operational dynamics. We introduce a global Sentinel-1 synthetic aperture radar (SAR) time series data corpus that resolves deployment and operational phases of offshore wind infrastructure from 2016Q1 to 2025Q1. Building on an updated object detection workflow, we compile 15,606 time series at detected infrastructure locations, with overall 14,840,637 events as analysis-ready 1D SAR backscatter profiles, one profile per Sentinel-1 acquisition and location. To enable direct use and benchmarking, we release (i) the analysis ready 1D SAR profiles, (ii) event-level baseline semantic labels generated by a rule-based classifier, and (iii) an expert-annotated benchmark dataset of 553 time series with 328,657 event labels. The baseline classifier achieves a macro F1 score of 0.84 in event-wise evaluation and an area under the collapsed edit similarity-quality threshold curve (AUC) of 0.785, indicating temporal coherence. We demonstrate that the resulting corpus supports global-scale analyses of deployment dynamics, the identification of differences in regional deployment patterns, vessel interactions, and operational events, and provides a reference for developing and comparing time series classification methods for offshore wind infrastructure monitoring.
A notable difference between the ordinary and Hadamard products is that the Hadamard product of two singular positive semidefinite matrices can be nonsingular, and one of the factors can even be indefinite. We present an eigenvalue lower bound for a Hadamard product that depends on the rank, effective condition number, and diagonal entries of one factor, and the smallest eigenvalues of certain principal submatrices of the other factor. We give numerical examples and discuss its applications in array signal processing and matrix time series analysis.
Evaluating the reasoning capabilities of Large Language Models (LLMs) for complex, quantitative financial tasks is a critical and unsolved challenge. Standard benchmarks often fail to isolate an agent's core ability to parse queries and orchestrate computations. To address this, we introduce a novel evaluation methodology and benchmark designed to rigorously measure an LLM agent's reasoning for financial time-series analysis. We apply this methodology in a large-scale empirical study using our framework, Time Series Augmented Generation (TSAG), where an LLM agent delegates quantitative tasks to verifiable, external tools. Our benchmark, consisting of 100 financial questions, is used to compare multiple SOTA agents (e.g., GPT-4o, Llama 3, Qwen2) on metrics assessing tool selection accuracy, faithfulness, and hallucination. The results demonstrate that capable agents can achieve near-perfect tool-use accuracy with minimal hallucination, validating the tool-augmented paradigm. Our primary contribution is this evaluation framework and the corresponding empirical insights into agent performance, which we release publicly to foster standardized research on reliable financial AI.
Root cause analysis (RCA) for time-series anomaly detection is critical for the reliable operation of complex real-world systems. Existing explanation methods often rely on unrealistic feature perturbations and ignore temporal and cross-feature dependencies, leading to unreliable attributions. We propose a conditional attribution framework that explains anomalies relative to contextually similar normal system states. Instead of using marginal or randomly sampled baselines, our method retrieves representative normal instances conditioned on the anomalous observation, enabling dependency-preserving and operationally meaningful explanations. To support high-dimensional time-series data, contextual retrieval is performed in learned low-dimensional representations using both variational autoencoder latent spaces and UMAP manifold embeddings. By grounding the retrieval process in the system's learned manifold, this strategy avoids out-of-distribution artifacts and ensures attribution fidelity while maintaining computational efficiency. We further introduce confidence-aware and temporal evaluation metrics for assessing explanation reliability and responsiveness. Experiments on the SWaT and MSDS benchmarks demonstrate that the proposed approach consistently improves root-cause identification accuracy, temporal localization, and robustness across multiple anomaly detection models. These results highlight the practical utility of conditional attribution for explainable anomaly diagnosis in complex time-series systems. Code and models will be publicly released.
This paper addresses the Variable Gapped Longest Common Subsequence (VGLCS) problem, a generalization of the classical LCS problem involving flexible gap constraints between consecutive solutions' characters. The problem arises in molecular sequence comparison, where structural distance constraints between residues must be respected, and in time-series analysis where events are required to occur within specified temporal delays. We propose a search framework based on the root-based state graph representation, in which the state space comprises a generally large number of rooted state subgraphs. To cope with the resulting combinatorial explosion, an iterative beam search strategy is employed, dynamically maintaining a global pool of promising candidate root nodes, enabling effective control of diversification across iterations. To exploit the search for high-quality solutions, several known heuristics from the LCS literature are utilized into the standalone beam search procedure. To the best of our knowledge, this is the first comprehensive computational study on the VGLCS problem comprising 320 synthetic instances with up to 10 input sequences and up to 500 characters. Experimental results show robustness of the designed approach over the baseline beam search in comparable runtimes.
In principle, deep generative models can be used to perform domain adaptation; i.e. align the input feature representations of test data with that of a separate discriminative model's training data. This can help improve the discriminative model's performance on the test data. However, generative models are prone to producing hallucinations and artefacts that may degrade the quality of generated data, and therefore, predictive performance when processed by the discriminative model. While uncertainty quantification can provide a means to assess the quality of adapted data, the standard framework for evaluating the quality of predicted uncertainties may not easily extend to generative models due to the common lack of ground truths (among other reasons). Even with ground truths, this evaluation is agnostic to how the generated outputs are used on the downstream task, limiting the extent to which the uncertainty reliability analysis provides insights about the utility of the uncertainties with respect to the intended use case of the adapted examples. Here, we describe how decision-theoretic uncertainty quantification can address these concerns and provide a convenient framework for evaluating the trustworthiness of generated outputs, in particular, for domain adaptation. We consider a case study in photoplethysmography time series denoising for Atrial Fibrillation classification. This formalises a well-known heuristic method of using a downstream classifier to assess the quality of generated outputs.
Open Radio Access Network (O-RAN) is an important 5G network architecture enabling flexible communication with adaptive strategies for different verticals. However, testing for O-RAN deployments involve massive volumes of time-series data (e.g., key performance indicators), creating critical challenges for scalable, unsupervised monitoring without labels or high computational overhead. To address this, we present ESN-DAGMM, a lightweight adaptation of the Deep Autoencoding Gaussian Mixture Model (DAGMM) framework for time series analysis. Our model utilizes an Echo State Network (ESN) to efficiently model temporal dependencies, proving effective in O-RAN networks where training samples are highly limited. Combined with DAGMM's integratation of dimensionality reduction and density estimation, we present a scalable framework for unsupervised monitoring of high volume network telemetry. When trained on only 10% of an O-RAN video-streaming dataset, ESN-DAGMM achieved on average 269.59% higher quality clustering than baselines under identical conditions, all while maintaining competitive reconstruction error. By extending DAGMM to capture temporal dynamics, ESN-DAGMM offers a practical solution for time-series analysis using very limited training samples, outperforming baselines and enabling operator's control over the clustering-reconstruction trade-off.
Diabetes devices, including Continuous Glucose Monitoring (CGM), Smart Insulin Pens, and Automated Insulin Delivery systems, generate rich time-series data widely used in research and machine learning. However, inconsistent data formats across sources hinder sharing, integration, and analysis. We present DIAX (DIAbetes eXchange), a standardized JSON-based format for unifying diabetes time-series data, including CGM, insulin, and meal signals. DIAX promotes interoperability, reproducibility, and extensibility, particularly for machine learning applications. An open-source repository provides tools for dataset conversion, cross-format compatibility, visualization, and community contributions. DIAX is a translational resource, not a data host, ensuring flexibility without imposing data-sharing constraints. Currently, DIAX is compatible with other standardization efforts and supports major datasets (DCLP3, DCLP5, IOBP2, PEDAP, T1Dexi, Loop), totaling over 10 million patient-hours of data. https://github.com/Center-for-Diabetes-Technology/DIAX
Heart rate variability (HRV) analysis is important for the assessment of autonomic cardiovascular regulation. The inverse Gaussian process (IGP) has been widely used for beat-to-beat HRV modeling, as it gives a physiological relevant interpretation of heart depolarization process. A key challenge in IGP-based heartbeat modeling is the accurate estimation of time-varying parameters. In this study, we investigated whether recurrent neural networks (RNNs) can be used for IGP parameter identification and thereby enhance probabilistic modeling of R-R dynamics. Specifically, four representative RNN architectures, namely, GRU, LSTM, Structured State Space sequence model (S4), and Mamba, were evaluated using the Kolmogorov-Smirnov statistics. The results demonstrate the possibility of combining neural sequence models with the IGP framework for beat-wise R-R series modeling. This approach provides a flexible basis for probabilistic HRV modeling and for future incorporation of more complex physiological mechanisms and dynamic conditions.