Get our free extension to see links to code for papers anywhere online!Free add-on: code for papers everywhere!Free add-on: See code for papers anywhere!

Add to Chrome

Add to Firefox

Add to Edge

Andrea Bartolini

Assessing Tenstorrent's RISC-V MatMul Acceleration Capabilities

May 09, 2025

Hiari Pizzini Cavagna, Daniele Cesarini, Andrea Bartolini

Abstract:The increasing demand for generative AI as Large Language Models (LLMs) services has driven the need for specialized hardware architectures that optimize computational efficiency and energy consumption. This paper evaluates the performance of the Tenstorrent Grayskull e75 RISC-V accelerator for basic linear algebra kernels at reduced numerical precision, a fundamental operation in LLM computations. We present a detailed characterization of Grayskull's execution model, gridsize, matrix dimensions, data formats, and numerical precision impact computational efficiency. Furthermore, we compare Grayskull's performance against state-of-the-art architectures with tensor acceleration, including Intel Sapphire Rapids processors and two NVIDIA GPUs (V100 and A100). Whilst NVIDIA GPUs dominate raw performance, Grayskull demonstrates a competitive trade-off between power consumption and computational throughput, reaching a peak of 1.55 TFLOPs/Watt with BF16.

* Accepted to the Computational Aspects of Deep Learning Workshop at ISC High Performance 2025. To appear in the ISC High Performance 2025 Workshop Proceedings

Via

Access Paper or Ask Questions

Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading

Jun 17, 2024

Maicol Ciani, Emanuele Parisi, Alberto Musa, Francesco Barchi, Andrea Bartolini, Ari Kulmala, Rafail Psiakis, Angelo Garofalo, Andrea Acquaviva, Davide Rossi

Figure 1 for Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading

Figure 2 for Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading

Figure 3 for Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading

Figure 4 for Unleashing OpenTitan's Potential: a Silicon-Ready Embedded Secure Element for Root of Trust and Cryptographic Offloading

Abstract:The rapid advancement and exploration of open-hardware RISC-V platforms are driving significant changes in sectors like autonomous vehicles, smart-city infrastructure, and medical devices. OpenTitan stands out as a groundbreaking open-source RISC-V design with a comprehensive security toolkit as a standalone system-on-chip (SoC). OpenTitan includes Earl Grey, a fully implemented and silicon-proven SoC, and Darjeeling, announced but not yet fully implemented. Earl Grey targets standalone SoC implementations, while Darjeeling is for integrable implementations. The literature lacks a silicon-ready embedded implementation of an open-source Root of Trust, despite lowRISC's efforts on Darjeeling. We address the limitations of existing implementations by optimizing data transfer latency between memory and cryptographic accelerators to prevent under-utilization and ensure efficient task acceleration. Our contributions include a comprehensive methodology for integrating custom extensions and IPs into the Earl Grey architecture, architectural enhancements for system-level integration, support for varied boot modes, and improved data movement across the platform. These advancements facilitate deploying OpenTitan in broader SoCs, even without specific technology-dependent IPs, providing a deployment-ready research vehicle for the community. We integrated the extended Earl Grey architecture into a reference architecture in a 22nm FDX technology node, benchmarking the enhanced architecture's performance. The results show significant improvements in cryptographic processing speed, achieving up to 2.7x speedup for SHA-256/HMAC and 1.6x for AES accelerators compared to the baseline Earl Grey architecture.

Via

Access Paper or Ask Questions

DECICE: Device-Edge-Cloud Intelligent Collaboration Framework

May 04, 2023

Julian Kunkel, Christian Boehme, Jonathan Decker, Fabrizio Magugliani, Dirk Pleiter, Bastian Koller, Karthee Sivalingam, Sabri Pllana, Alexander Nikolov, Mujdat Soyturk(+4 more)

Abstract:DECICE is a Horizon Europe project that is developing an AI-enabled open and portable management framework for automatic and adaptive optimization and deployment of applications in computing continuum encompassing from IoT sensors on the Edge to large-scale Cloud / HPC computing infrastructures. In this paper, we describe the DECICE framework and architecture. Furthermore, we highlight use-cases for framework evaluation: intelligent traffic intersection, magnetic resonance imaging, and emergency response.

Via

Access Paper or Ask Questions

Experimenting with Emerging ARM and RISC-V Systems for Decentralised Machine Learning

Feb 15, 2023

Gianluca Mittone, Nicolò Tonci, Robert Birke, Iacopo Colonnelli, Doriana Medić, Andrea Bartolini, Roberto Esposito, Emanuele Parisi, Francesco Beneventi, Mirko Polato(+3 more)

Figure 1 for Experimenting with Emerging ARM and RISC-V Systems for Decentralised Machine Learning

Figure 2 for Experimenting with Emerging ARM and RISC-V Systems for Decentralised Machine Learning

Figure 3 for Experimenting with Emerging ARM and RISC-V Systems for Decentralised Machine Learning

Figure 4 for Experimenting with Emerging ARM and RISC-V Systems for Decentralised Machine Learning

Abstract:Decentralised Machine Learning (DML) enables collaborative machine learning without centralised input data. Federated Learning (FL) and Edge Inference are examples of DML. While tools for DML (especially FL) are starting to flourish, many are not flexible and portable enough to experiment with novel systems (e.g., RISC-V), non-fully connected topologies, and asynchronous collaboration schemes. We overcome these limitations via a domain-specific language allowing to map DML schemes to an underlying middleware, i.e. the \ff parallel programming library. We experiment with it by generating different working DML schemes on two emerging architectures (ARM-v8, RISC-V) and the x86-64 platform. We characterise the performance and energy efficiency of the presented schemes and systems. As a byproduct, we introduce a RISC-V porting of the PyTorch framework, the first publicly available to our knowledge.

Via

Access Paper or Ask Questions

RUAD: unsupervised anomaly detection in HPC systems

Aug 28, 2022

Martin Molan, Andrea Borghesi, Daniele Cesarini, Luca Benini, Andrea Bartolini

Figure 1 for RUAD: unsupervised anomaly detection in HPC systems

Figure 2 for RUAD: unsupervised anomaly detection in HPC systems

Figure 3 for RUAD: unsupervised anomaly detection in HPC systems

Figure 4 for RUAD: unsupervised anomaly detection in HPC systems

Abstract:The increasing complexity of modern high-performance computing (HPC) systems necessitates the introduction of automated and data-driven methodologies to support system administrators' effort toward increasing the system's availability. Anomaly detection is an integral part of improving the availability as it eases the system administrator's burden and reduces the time between an anomaly and its resolution. However, current state-of-the-art (SoA) approaches to anomaly detection are supervised and semi-supervised, so they require a human-labelled dataset with anomalies - this is often impractical to collect in production HPC systems. Unsupervised anomaly detection approaches based on clustering, aimed at alleviating the need for accurate anomaly data, have so far shown poor performance. In this work, we overcome these limitations by proposing RUAD, a novel Recurrent Unsupervised Anomaly Detection model. RUAD achieves better results than the current semi-supervised and unsupervised SoA approaches. This is achieved by considering temporal dependencies in the data and including long-short term memory cells in the model architecture. The proposed approach is assessed on a complete ten-month history of a Tier-0 system (Marconi100 from CINECA with 980 nodes). RUAD achieves an area under the curve (AUC) of 0.763 in semi-supervised training and an AUC of 0.767 in unsupervised training, which improves upon the SoA approach that achieves an AUC of 0.747 in semi-supervised training and an AUC of 0.734 in unsupervised training. It also vastly outperforms the current SoA unsupervised anomaly detection approach based on clustering, achieving the AUC of 0.548.

Via

Access Paper or Ask Questions

Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers

Dec 12, 2020

Emanuele Parisi, Francesco Barchi, Andrea Bartolini, Giuseppe Tagliavini, Andrea Acquaviva

Figure 1 for Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers

Figure 2 for Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers

Figure 3 for Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers

Figure 4 for Source Code Classification for Energy Efficiency in Parallel Ultra Low-Power Microcontrollers

Abstract:The analysis of source code through machine learning techniques is an increasingly explored research topic aiming at increasing smartness in the software toolchain to exploit modern architectures in the best possible way. In the case of low-power, parallel embedded architectures, this means finding the configuration, for instance in terms of the number of cores, leading to minimum energy consumption. Depending on the kernel to be executed, the energy optimal scaling configuration is not trivial. While recent work has focused on general-purpose systems to learn and predict the best execution target in terms of the execution time of a snippet of code or kernel (e.g. offload OpenCL kernel on multicore CPU or GPU), in this work we focus on static compile-time features to assess if they can be successfully used to predict the minimum energy configuration on PULP, an ultra-low-power architecture featuring an on-chip cluster of RISC-V processors. Experiments show that using machine learning models on the source code to select the best energy scaling configuration automatically is viable and has the potential to be used in the context of automatic system configuration for energy minimisation.

Via

Access Paper or Ask Questions

A Machine Learning Approach to Online Fault Classification in HPC Systems

Jul 27, 2020

Alessio Netti, Zeynep Kiziltan, Ozalp Babaoglu, Alina Sirbu, Andrea Bartolini, Andrea Borghesi

Figure 1 for A Machine Learning Approach to Online Fault Classification in HPC Systems

Figure 2 for A Machine Learning Approach to Online Fault Classification in HPC Systems

Figure 3 for A Machine Learning Approach to Online Fault Classification in HPC Systems

Figure 4 for A Machine Learning Approach to Online Fault Classification in HPC Systems

Abstract:As High-Performance Computing (HPC) systems strive towards the exascale goal, failure rates both at the hardware and software levels will increase significantly. Thus, detecting and classifying faults in HPC systems as they occur and initiating corrective actions before they can transform into failures becomes essential for continued operation. Central to this objective is fault injection, which is the deliberate triggering of faults in a system so as to observe their behavior in a controlled environment. In this paper, we propose a fault classification method for HPC systems based on machine learning. The novelty of our approach rests with the fact that it can be operated on streamed data in an online manner, thus opening the possibility to devise and enact control actions on the target system in real-time. We introduce a high-level, easy-to-use fault injection tool called FINJ, with a focus on the management of complex experiments. In order to train and evaluate our machine learning classifiers, we inject faults to an in-house experimental HPC system using FINJ, and generate a fault dataset which we describe extensively. Both FINJ and the dataset are publicly available to facilitate resiliency research in the HPC systems field. Experimental results demonstrate that our approach allows almost perfect classification accuracy to be reached for different fault types with low computational overhead and minimal delay.

* Future Generation Computer Systems, Volume 110, September 2020, Pages 1009-1022
* arXiv admin note: text overlap with arXiv:1807.10056, arXiv:1810.11208

Via

Access Paper or Ask Questions

pAElla: Edge-AI based Real-Time Malware Detection in Data Centers

Apr 07, 2020

Antonio Libri, Andrea Bartolini, Luca Benini

Figure 1 for pAElla: Edge-AI based Real-Time Malware Detection in Data Centers

Figure 2 for pAElla: Edge-AI based Real-Time Malware Detection in Data Centers

Figure 3 for pAElla: Edge-AI based Real-Time Malware Detection in Data Centers

Figure 4 for pAElla: Edge-AI based Real-Time Malware Detection in Data Centers

Abstract:The increasing use of Internet-of-Things (IoT) devices for monitoring a wide spectrum of applications, along with the challenges of "big data" streaming support they often require for data analysis, is nowadays pushing for an increased attention to the emerging edge computing paradigm. In particular, smart approaches to manage and analyze data directly on the network edge, are more and more investigated, and Artificial Intelligence (AI) powered edge computing is envisaged to be a promising direction. In this paper, we focus on Data Centers (DCs) and Supercomputers (SCs), where a new generation of high-resolution monitoring systems is being deployed, opening new opportunities for analysis like anomaly detection and security, but introducing new challenges for handling the vast amount of data it produces. In detail, we report on a novel lightweight and scalable approach to increase the security of DCs/SCs, that involves AI-powered edge computing on high-resolution power consumption. The method -- called pAElla -- targets real-time Malware Detection (MD), it runs on an out-of-band IoT-based monitoring system for DCs/SCs, and involves Power Spectral Density of power measurements, along with AutoEncoders. Results are promising, with an F1-score close to 1, and a False Alarm and Malware Miss rate close to 0%. We compare our method with State-of-the-Art MD techniques and show that, in the context of DCs/SCs, pAElla can cover a wider range of malware, significantly outperforming SoA approaches in terms of accuracy. Moreover, we propose a methodology for online training suitable for DCs/SCs in production, and release open dataset and code.

Via

Access Paper or Ask Questions

Anomaly Detection using Autoencoders in High Performance Computing Systems

Nov 13, 2018

Andrea Borghesi, Andrea Bartolini, Michele Lombardi, Michela Milano, Luca Benini

Figure 1 for Anomaly Detection using Autoencoders in High Performance Computing Systems

Figure 2 for Anomaly Detection using Autoencoders in High Performance Computing Systems

Figure 3 for Anomaly Detection using Autoencoders in High Performance Computing Systems

Figure 4 for Anomaly Detection using Autoencoders in High Performance Computing Systems

Abstract:Anomaly detection in supercomputers is a very difficult problem due to the big scale of the systems and the high number of components. The current state of the art for automated anomaly detection employs Machine Learning methods or statistical regression models in a supervised fashion, meaning that the detection tool is trained to distinguish among a fixed set of behaviour classes (healthy and unhealthy states). We propose a novel approach for anomaly detection in High Performance Computing systems based on a Machine (Deep) Learning technique, namely a type of neural network called autoencoder. The key idea is to train a set of autoencoders to learn the normal (healthy) behaviour of the supercomputer nodes and, after training, use them to identify abnormal conditions. This is different from previous approaches which where based on learning the abnormal condition, for which there are much smaller datasets (since it is very hard to identify them to begin with). We test our approach on a real supercomputer equipped with a fine-grained, scalable monitoring infrastructure that can provide large amount of data to characterize the system behaviour. The results are extremely promising: after the training phase to learn the normal system behaviour, our method is capable of detecting anomalies that have never been seen before with a very good accuracy (values ranging between 88% and 96%).

* 9 pages, 3 figures

Via

Access Paper or Ask Questions

Robust online identification of thermal models for in-production HPC clusters with machine learning-based data selection

Oct 03, 2018

Federico Pittino, Roberto Diversi, Luca Benini, Andrea Bartolini

Figure 1 for Robust online identification of thermal models for in-production HPC clusters with machine learning-based data selection

Figure 2 for Robust online identification of thermal models for in-production HPC clusters with machine learning-based data selection

Figure 3 for Robust online identification of thermal models for in-production HPC clusters with machine learning-based data selection

Figure 4 for Robust online identification of thermal models for in-production HPC clusters with machine learning-based data selection

Abstract:Power and thermal management are critical components of high performance computing (HPC) systems, due to their high power density and large total power consumption. The assessment of thermal dissipation by means of compact models directly from the thermal response of the final device enables more robust and precise thermal control strategies as well as automated diagnosis. However, when dealing with large scale systems "in production" the accuracy of learned thermal models depends on the dynamics of the power excitation, which depends also on the executed workload, and measurement nonidealities, such as quantization. In this paper we show that, using an advanced system identification algorithm, we are able to generate very accurate thermal models (average error lower than our sensors quantization step of 1{\deg}C) for a large scale HPC system on real workloads. However, we also show that: 1) not all real workloads allow for the identification of a good model; 2) starting from the theory of system identification it is very difficult to evaluate if a trace of data leads to a good estimated model. We then propose and validate a set of techniques based on machine learning and deep learning algorithms for the choice of data traces to be used for model identification. We also show that only via deep learning techniques these traces can be correctly chosen up to 96% of the times.

Via

Access Paper or Ask Questions