Abstract:In the woodworking industry, a huge amount of effort has to be invested into the initial quality assessment of the raw material. In this study we present an AI model to detect, quantify and localize defects on wooden logs. This model aims to both automate the quality control process and provide a more consistent and reliable quality assessment. For this purpose a dataset of 1424 sample images of wood logs is created. A total of 5 annotators possessing different levels of expertise is involved in dataset creation. An inter-annotator agreement analysis is conducted to analyze the impact of expertise on the annotation task and to highlight subjective differences in annotator judgement. We explore, train and fine-tune the state-of-the-art InternImage and ONE-PEACE architectures for semantic segmentation. The best model created achieves an average IoU of 0.71, and shows detection and quantification capabilities close to the human annotators.
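The reported average IoU can be illustrated with a small sketch. The abstract does not say how the average is taken, so the per-class averaging below is an assumption, and the names `class_iou` and `mean_iou` are hypothetical helpers, not code from the paper.

```python
def class_iou(pred, target, cls):
    """IoU of one class between two equally sized label maps (lists of rows)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            in_pred, in_target = p == cls, t == cls
            inter += 1 if (in_pred and in_target) else 0
            union += 1 if (in_pred or in_target) else 0
    # A class absent from both maps has no defined IoU; report None.
    return inter / union if union else None


def mean_iou(pred, target, classes):
    """Average the per-class IoU over all classes that occur (assumed averaging)."""
    scores = [class_iou(pred, target, c) for c in classes]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)


# Toy 2x3 label maps: 0 = background, 1 and 2 = two hypothetical defect classes.
pred = [[0, 1, 1],
        [0, 2, 2]]
target = [[0, 1, 0],
          [0, 2, 2]]
score = mean_iou(pred, target, classes=[0, 1, 2])
print(round(score, 3))
```

Here class 0 scores 2/3, class 1 scores 1/2 and class 2 scores 1, so the mean is 13/18.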
Abstract:In the food industry, reprocessing returned products is a vital step to increase resource efficiency. [SBB23] presented an AI application that automates the tracking of returned bread buns. We extend their work by creating an expanded dataset comprising 2432 images and a wider range of baked goods. To increase model robustness, we use the generative models pix2pix and CycleGAN to create synthetic images. We train the state-of-the-art object detection models YOLOv9 and YOLOv8 on our detection task. Our overall best-performing model achieved an average precision AP@0.5 of 90.3% on our test set.
Abstract:In industrial manufacturing of glass bottles, quality control of bottle prints is necessary as numerous factors can negatively affect the printing process. Even minor defects in the bottle prints must be detected despite reflections in the glass or manufacturing-related deviations. In cooperation with our medium-sized industrial partner, two ML-based approaches for quality control of these bottle prints were developed and evaluated, which can also be used in this challenging scenario. Our first approach utilized different filters to suppress reflections (e.g. Sobel or Canny) and image quality metrics for image comparison (e.g. MSE or SSIM) as features for different supervised classification models (e.g. SVM or k-Neighbors), which resulted in an accuracy of 84%. The images were aligned based on the ORB algorithm, which allowed us to estimate the rotations of the prints, which may serve as an indicator for anomalies in the manufacturing process. In our second approach, we fine-tuned different pre-trained CNN models (e.g. ResNet or VGG) for binary classification, which resulted in an accuracy of 87%. Utilizing Grad-CAM on our fine-tuned ResNet-34, we were able to localize and visualize frequently defective bottle print regions. This method allowed us to provide insights that could be used to optimize the actual manufacturing process. This paper also describes our general approach and the challenges we encountered in practice with data collection during ongoing production, unsupervised preselection, and labeling.
Abstract:Speech pauses, alongside content and structure, offer a valuable and non-invasive biomarker for detecting dementia. This work investigates the use of pause-enriched transcripts in transformer-based language models to differentiate the cognitive states of subjects with no cognitive impairment, mild cognitive impairment, and Alzheimer's dementia based on their speech from a clinical assessment. We address three binary classification tasks: onset, monitoring, and dementia exclusion. The performance is evaluated through experiments on a German Verbal Fluency Test and a Picture Description Test, comparing the model's effectiveness across different speech production contexts. Starting from a textual baseline, we investigate the effect of incorporating pause information and acoustic context. We show that the test should be chosen depending on the task and that, similarly, lexical pause information and acoustic cross-attention contribute differently.
Abstract:This paper explores the improvement of post-training quantization (PTQ) after knowledge distillation in the Whisper speech foundation model family. We address the challenge of outliers in weights and activation tensors, known to impede quantization quality in transformer-based language and vision models. Extending this observation to Whisper, we demonstrate that these outliers are also present when transformer-based models are trained to perform automatic speech recognition, necessitating mitigation strategies for PTQ. We show that outliers can be reduced by a recently proposed gating mechanism in the attention blocks of the student model, enabling effective 8-bit quantization, and lower word error rates compared to student models without the gating mechanism in place.
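Why outliers hurt 8-bit quantization can be seen in a minimal sketch. It uses generic symmetric absmax quantization, which is an assumption chosen for illustration; it is neither the PTQ scheme nor the gating mechanism of the paper, and the activation values are made up.

```python
def quantize_dequantize_int8(values):
    """Symmetric absmax quantization to the int8 range, then back to floats."""
    scale = max(abs(v) for v in values) / 127.0
    if scale == 0.0:
        return list(values)
    # Each value is snapped to one of 255 evenly spaced levels in [-127, 127].
    return [max(-127, min(127, round(v / scale))) * scale for v in values]


def mean_abs_error(original, restored):
    """Average absolute difference between two equally long sequences."""
    return sum(abs(a - b) for a, b in zip(original, restored)) / len(original)


# Illustrative activations (invented numbers), with and without one outlier.
activations = [0.11, -0.23, 0.37, -0.42, 0.05, 0.29]
with_outlier = activations + [60.0]

err_clean = mean_abs_error(activations, quantize_dequantize_int8(activations))
# Measure the error on the regular values only; the outlier itself is exact.
restored = quantize_dequantize_int8(with_outlier)[: len(activations)]
err_outlier = mean_abs_error(activations, restored)

print(err_clean, err_outlier)
```

A single large value stretches the quantization step from about 0.003 to about 0.47, so most regular activations collapse onto zero or one level. Reducing outliers before quantization keeps the step small.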
Abstract:Accurately detecting dysfluencies in spoken language can help to improve the performance of automatic speech and language processing components and support the development of more inclusive speech and language technologies. Inspired by the recent trend towards the deployment of large language models (LLMs) as universal learners and processors of non-lexical inputs, such as audio and video, we approach the task of multi-label dysfluency detection as a language modeling problem. We present hypothesis candidates generated with an automatic speech recognition system and acoustic representations extracted from an audio encoder model to an LLM, and fine-tune the system to predict dysfluency labels on three datasets containing English and German stuttered speech. The experimental results show that our system effectively combines acoustic and lexical information and achieves competitive results on the multi-label stuttering detection task.
Abstract:In this work, we optimize speculative sampling for parallel hardware accelerators to improve sampling speed. We notice that substantial portions of the intermediate matrices necessary for speculative sampling can be computed concurrently. This allows us to distribute the workload across multiple GPU threads, enabling simultaneous operations on matrix segments within thread blocks. Additionally, we use fast on-chip memory to store intermediate results, thereby minimizing the frequency of slow read and write operations across different types of memory. This results in profiling time improvements ranging from 6% to 13% relative to the baseline implementation, without compromising accuracy. To further accelerate speculative sampling, probability distributions parameterized by softmax are approximated by sigmoid. This approximation approach results in significantly greater relative improvements in profiling time, ranging from 37% to 94%, with a slight decline in accuracy. We conduct extensive experiments on both automatic speech recognition and summarization tasks to validate the effectiveness of our optimization methods.
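The quantities involved can be sketched with the standard speculative sampling rule: a drafted token is accepted with probability min(1, p/q), and on rejection a token is drawn from the normalized positive part of p - q. The abstract does not specify how the sigmoid values are scaled or used, so the element-wise sigmoid below is only an assumed stand-in, and the logits are invented.

```python
import math


def softmax(logits):
    """Normalized distribution; needs a reduction over the whole vocabulary."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logits):
    """Element-wise and unnormalized; no reduction over the vocabulary (assumed form)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def acceptance_probability(target, draft, token):
    """Probability of keeping a drafted token: min(1, p(token) / q(token))."""
    return min(1.0, target[token] / draft[token])


def residual_distribution(target, draft):
    """Distribution used after a rejection: normalized max(0, p - q)."""
    diffs = [max(0.0, p - q) for p, q in zip(target, draft)]
    total = sum(diffs)
    # If target and draft coincide there is no residual mass; fall back to target.
    return [d / total for d in diffs] if total > 0 else list(target)


# Hypothetical logits over a three-token vocabulary.
target_logits = [2.0, 1.0, 0.0]
draft_logits = [1.0, 1.5, 0.0]
token = 1  # token proposed by the draft model

accept_softmax = acceptance_probability(softmax(target_logits), softmax(draft_logits), token)
accept_sigmoid = acceptance_probability(sigmoid(target_logits), sigmoid(draft_logits), token)
residual = residual_distribution(softmax(target_logits), softmax(draft_logits))

print(accept_softmax, accept_sigmoid, residual)
```

The two acceptance probabilities differ, which is consistent with the abstract's note that the approximation trades a slight decline in accuracy for speed.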
Abstract:The Semmeldetector is a machine learning application that utilizes object detection models to detect, classify and count baked goods in images. Our application allows commercial bakers to track unsold baked goods, which allows them to optimize production and increase resource efficiency. We compiled a dataset comprising 1151 images that distinguishes between 18 different types of baked goods to train our detection models. To facilitate model training, we used a Copy-Paste augmentation pipeline to expand our dataset. We trained the state-of-the-art object detection model YOLOv8 on our detection task. We tested the impact of different training data, model scale, and online image augmentation pipelines on model performance. Our overall best-performing model achieved an AP@0.5 of 89.1% on our test set. Based on our results, we conclude that machine learning can be a valuable tool even for unexpected industries like bakeries, even with very limited datasets.
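The idea behind Copy-Paste augmentation can be sketched as follows. The abstract gives no details of the pipeline, so this is only a generic illustration on tiny grayscale grids: `paste_object` is a hypothetical helper, and handling of occluded existing objects is deliberately left out.

```python
def paste_object(background, patch, mask, top, left):
    """Paste the masked pixels of `patch` onto a copy of `background`.

    Returns the new image and the bounding box of the pasted object as
    (x_min, y_min, x_max, y_max) with exclusive maxima.
    """
    height, width = len(patch), len(patch[0])
    if top < 0 or left < 0 or top + height > len(background) or left + width > len(background[0]):
        raise ValueError("patch does not fit inside the background")
    result = [row[:] for row in background]  # leave the input image untouched
    for y in range(height):
        for x in range(width):
            if mask[y][x]:  # copy only pixels that belong to the object
                result[top + y][left + x] = patch[y][x]
    return result, (left, top, left + width, top + height)


# Toy 4x5 background and a 2x2 object cut out with its segmentation mask.
background = [[0] * 5 for _ in range(4)]
patch = [[9, 9],
         [9, 9]]
mask = [[1, 0],
        [1, 1]]

augmented, box = paste_object(background, patch, mask, top=1, left=2)
for row in augmented:
    print(row)
print(box)
```

Each pasted object yields a new labeled box for the detector, which is how a small dataset can be expanded without additional photographs.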
Abstract:Automatic speech recognition (ASR) has in recent years reached a level of accuracy that even outperforms humans in transcribing speech to text. Nevertheless, all current ASR approaches show a certain weakness against ambient noise. To reduce this weakness, audio-visual speech recognition (AVSR) approaches additionally consider visual information from lip movements for transcription. This additional modality increases the computational cost for training models from scratch. We propose an approach that builds on a pre-trained ASR model and extends it with an adaptive upstream module that fuses audio and visual information. Since we do not need to train the transformer structure from scratch, our approach requires a fraction of the computational resources compared to traditional AVSR models. Compared to current SOTA systems like AV-HuBERT, our approach achieves an average improvement of 8.3% in word error rate across different model sizes, noise categories and a broad SNR range. The approach allows up to 21% smaller models and requires only a fraction of the computational resources for training and inference compared to common AVSR approaches.
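For readers unfamiliar with the metric, word error rate is the word-level edit distance between reference and hypothesis divided by the number of reference words. The sketch below is a plain implementation of that definition; the example sentences are invented, and text normalization (casing, punctuation) is assumed to have been done beforehand.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One deleted word ("the") and one substituted word ("off" -> "of").
wer = word_error_rate("turn the lights off now", "turn lights of now")
print(wer)  # 2 errors over 5 reference words
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.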
Abstract:User-generated content has become an important information source in crisis situations. However, classification models suffer from noise and event-related biases, so the task remains challenging and requires sophisticated task adaptation. To address these challenges, we propose the use of contrastive task-specialized sentence encoders for downstream classification. We apply the task specialization to the CrisisLex, HumAID, and TrecIS information type classification tasks and show performance gains in terms of F1 score. Furthermore, we analyse the cross-corpus and cross-lingual capabilities for two German event relevancy classification datasets.