Abstract:Recent advances in foundational Vision Language Models (VLMs) have reshaped the evaluation paradigm in computer vision tasks. These foundational models, especially CLIP, have accelerated research in open-vocabulary computer vision tasks, including Open-Vocabulary Semantic Segmentation (OVSS). Although the initial results are promising, the dense prediction capabilities of VLMs still require further improvement. In this study, we enhance the semantic segmentation performance of CLIP by introducing new modules and modifications: 1) architectural changes in the last layer of ViT and the incorporation of attention maps from the middle layers with the last layer, 2) Image Engineering: applying data augmentations to enrich input image representations, and 3) using Large Language Models (LLMs) to generate definitions and synonyms for each class name to leverage CLIP's open-vocabulary capabilities. Our training-free method, ITACLIP, outperforms current state-of-the-art approaches on segmentation benchmarks such as COCO-Stuff, COCO-Object, Pascal Context, and Pascal VOC. Our code is available at https://github.com/m-arda-aydn/ITACLIP.
Abstract:Seasonal forecasting is a crucial task when it comes to detecting the extreme heat and colds that occur due to climate change. Confidence in the predictions should be reliable since a small increase in the temperatures in a year has a big impact on the world. Calibration of the neural networks provides a way to ensure our confidence in the predictions. However, calibrating regression models is an under-researched topic, especially in forecasters. We calibrate a UNet++ based architecture, which was shown to outperform physics-based models in temperature anomalies. We show that with a slight trade-off between prediction error and calibration error, it is possible to get more reliable and sharper forecasts. We believe that calibration should be an important part of safety-critical machine learning applications such as weather forecasters.
Abstract:Point clouds are extensively employed in a variety of real-world applications such as robotics, autonomous driving and augmented reality. Despite the recent success of point cloud neural networks, especially for safety-critical tasks, it is essential to also ensure the robustness of the model. A typical way to assess a model's robustness is through adversarial attacks, where test-time examples are generated based on gradients to deceive the model. While many different defense mechanisms are studied in 2D, studies on 3D point clouds have been relatively limited in the academic field. Inspired from PointDP, which denoises the network inputs by diffusion, we propose Point Cloud Layerwise Diffusion (PCLD), a layerwise diffusion based 3D point cloud defense strategy. Unlike PointDP, we propagated the diffusion denoising after each layer to incrementally enhance the results. We apply our defense method to different types of commonly used point cloud models and adversarial attacks to evaluate its robustness. Our experiments demonstrate that the proposed defense method achieved results that are comparable to or surpass those of existing methodologies, establishing robustness through a novel technique. Code is available at https://github.com/batuceng/diffusion-layer-robustness-pc.
Abstract:Point clouds and meshes are widely used 3D data structures for many computer vision applications. While the meshes represent the surfaces of an object, point cloud represents sampled points from the surface which is also the output of modern sensors such as LiDAR and RGB-D cameras. Due to the wide application area of point clouds and the recent advancements in deep neural networks, studies focusing on robust classification of the 3D point cloud data emerged. To evaluate the robustness of deep classifier networks, a common method is to use adversarial attacks where the gradient direction is followed to change the input slightly. The previous studies on adversarial attacks are generally evaluated on point clouds of daily objects. However, considering 3D faces, these adversarial attacks tend to affect the person's facial structure more than the desired amount and cause malformation. Specifically for facial expressions, even a small adversarial attack can have a significant effect on the face structure. In this paper, we suggest an adversarial attack called $\epsilon$-Mesh Attack, which operates on point cloud data via limiting perturbations to be on the mesh surface. We also parameterize our attack by $\epsilon$ to scale the perturbation mesh. Our surface-based attack has tighter perturbation bounds compared to $L_2$ and $L_\infty$ norm bounded attacks that operate on unit-ball. Even though our method has additional constraints, our experiments on CoMA, Bosphorus and FaceWarehouse datasets show that $\epsilon$-Mesh Attack (Perpendicular) successfully confuses trained DGCNN and PointNet models $99.72\%$ and $97.06\%$ of the time, with indistinguishable facial deformations. The code is available at https://github.com/batuceng/e-mesh-attack.
Abstract:Codebook collapse is a common problem in training deep generative models with discrete representation spaces like Vector Quantized Variational Autoencoders (VQ-VAEs). We observe that the same problem arises for the alternatively designed discrete variational autoencoders (dVAEs) whose encoder directly learns a distribution over the codebook embeddings to represent the data. We hypothesize that using the softmax function to obtain a probability distribution causes the codebook collapse by assigning overconfident probabilities to the best matching codebook elements. In this paper, we propose a novel way to incorporate evidential deep learning (EDL) instead of softmax to combat the codebook collapse problem of dVAE. We evidentially monitor the significance of attaining the probability distribution over the codebook embeddings, in contrast to softmax usage. Our experiments using various datasets show that our model, called EdVAE, mitigates codebook collapse while improving the reconstruction performance, and enhances the codebook usage compared to dVAE and VQ-VAE based models.
Abstract:Diffusion models are generative models that have shown significant advantages compared to other generative models in terms of higher generation quality and more stable training. However, the computational need for training diffusion models is considerably increased. In this work, we incorporate prototype learning into diffusion models to achieve high generation quality faster than the original diffusion model. Instead of randomly initialized class embeddings, we use separately learned class prototypes as the conditioning information to guide the diffusion process. We observe that our method, called ProtoDiffusion, achieves better performance in the early stages of training compared to the baseline method, signifying that using the learned prototypes shortens the training time. We demonstrate the performance of ProtoDiffusion using various datasets and experimental settings, achieving the best performance in shorter times across all settings.
Abstract:The problem of text-guided image generation is a complex task in Computer Vision, with various applications, including creating visually appealing artwork and realistic product images. One popular solution widely used for this task is the diffusion model, a generative model that generates images through an iterative process. Although diffusion models have demonstrated promising results for various image generation tasks, they may only sometimes produce satisfactory results when applied to more specific domains, such as the generation of textile patterns based on text guidance. This study presents a fine-tuned diffusion model specifically trained for textile pattern generation by text guidance to address this issue. The study involves the collection of various textile pattern images and their captioning with the help of another AI model. The fine-tuned diffusion model is trained with this newly created dataset, and its results are compared with the baseline models visually and numerically. The results demonstrate that the proposed fine-tuned diffusion model outperforms the baseline models in terms of pattern quality and efficiency in textile pattern generation by text guidance. This study presents a promising solution to the problem of text-guided textile pattern generation and has the potential to simplify the design process within the textile industry.
Abstract:Existing multi-label frameworks only exploit the information deduced from the bipartition of the labels into a positive and negative set. Therefore, they do not benefit from the ranking order between positive labels, which is the concept we introduce in this paper. We propose a novel multi-label ranking method: GaussianMLR, which aims to learn implicit class significance values that determine the positive label ranks instead of treating them as of equal importance, by following an approach that unifies ranking and classification tasks associated with multi-label ranking. Due to the scarcity of public datasets, we introduce eight synthetic datasets generated under varying importance factors to provide an enriched and controllable experimental environment for this study. On both real-world and synthetic datasets, we carry out extensive comparisons with relevant baselines and evaluate the performance on both of the two sub-tasks. We show that our method is able to accurately learn a representation of the incorporated positive rank order, which is not only consistent with the ground truth but also proportional to the underlying information. We strengthen our claims empirically by conducting comprehensive experimental studies. Code is available at https://github.com/MrGranddy/GaussianMLR.
Abstract:Understanding seasonal climatic conditions is critical for better management of resources such as water, energy and agriculture. Recently, there has been a great interest in utilizing the power of artificial intelligence methods in climate studies. This paper presents a cutting-edge deep learning model (UNet++) trained by state-of-the-art global CMIP6 models to forecast global temperatures a month ahead using the ERA5 reanalysis dataset. ERA5 dataset was also used for finetuning as well performance analysis in the validation dataset. Three different setups (CMIP6; CMIP6 + elevation; CMIP6 + elevation + ERA5 finetuning) were used with both UNet and UNet++ algorithms resulting in six different models. For each model 14 different sequential and non-sequential temporal settings were used. The Mean Absolute Error (MAE) analysis revealed that UNet++ with CMIP6 with elevation and ERA5 finetuning model with "Year 3 Month 2" temporal case provided the best outcome with an MAE of 0.7. Regression analysis over the validation dataset between the ERA5 data values and the corresponding AI model predictions revealed slope and $R^2$ values close to 1 suggesting a very good agreement. The AI model predicts significantly better than the mean CMIP6 ensemble between 2016 and 2021. Both models predict the summer months more accurately than the winter months.
Abstract:Multi-label ranking maps instances to a ranked set of predicted labels from multiple possible classes. The ranking approach for multi-label learning problems received attention for its success in multi-label classification, with one of the well-known approaches being pairwise label ranking. However, most existing methods assume that only partial information about the preference relation is known, which is inferred from the partition of labels into a positive and negative set, then treat labels with equal importance. In this paper, we focus on the unique challenge of ranking when the order of the true label set is provided. We propose a novel dedicated loss function to optimize models by incorporating penalties for incorrectly ranked pairs, and make use of the ranking information present in the input. Our method achieves the best reported performance measures on both synthetic and real world ranked datasets and shows improvements on overall ranking of labels. Our experimental results demonstrate that our approach is generalizable to a variety of multi-label classification and ranking tasks, while revealing a calibration towards a certain ranking ordering.