Abstract:While federated learning leverages distributed client resources, it faces challenges due to heterogeneous client capabilities. This necessitates allocating models suited to clients' resources and careful parameter aggregation to accommodate this heterogeneity. We propose HypeMeFed, a novel federated learning framework for supporting client heterogeneity by combining a multi-exit network architecture with hypernetwork-based model weight generation. This approach aligns the feature spaces of heterogeneous model layers and resolves per-layer information disparity during weight aggregation. To practically realize HypeMeFed, we also propose a low-rank factorization approach to minimize computation and memory overhead associated with hypernetworks. Our evaluations on a real-world heterogeneous device testbed indicate that HypeMeFed enhances accuracy by 5.12% over FedAvg, reduces the hypernetwork memory requirements by 98.22%, and accelerates its operations by 1.86 times compared to a naive hypernetwork approach. These results demonstrate HypeMeFed's effectiveness in leveraging and engaging heterogeneous clients for federated learning.
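The low-rank factorization idea can be illustrated with a small sketch. The following is a minimal, hypothetical example (not the HypeMeFed implementation): a hypernetwork that maps a client embedding to a target layer's weights, emitting two low-rank factors instead of the full weight matrix so the output head shrinks from out*in to (out + in) * rank parameters. All names and sizes here are assumptions for illustration.

```python
# Minimal sketch of a low-rank-factorized hypernetwork (illustrative only).
import torch
import torch.nn as nn

class LowRankHyperNet(nn.Module):
    def __init__(self, embed_dim, out_features, in_features, rank=8):
        super().__init__()
        self.out_features, self.in_features, self.rank = out_features, in_features, rank
        # Emit two low-rank factors instead of the full (out x in) weight matrix.
        self.head_u = nn.Linear(embed_dim, out_features * rank)
        self.head_v = nn.Linear(embed_dim, rank * in_features)

    def forward(self, client_embedding):
        u = self.head_u(client_embedding).view(self.out_features, self.rank)
        v = self.head_v(client_embedding).view(self.rank, self.in_features)
        return u @ v  # reconstructed weight matrix for the target layer

# Usage: generate weights for a 256->128 layer from a 32-d client embedding.
hyper = LowRankHyperNet(embed_dim=32, out_features=128, in_features=256, rank=8)
w = hyper(torch.randn(32))
print(w.shape)  # torch.Size([128, 256])
```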
Abstract:In multi-speaker environments, cochlear implant (CI) users may attend to a target sound source in a different manner from normal-hearing (NH) individuals during a conversation. This study investigated the effect of conversational sound levels on the mechanisms adopted by CI and NH listeners in selective auditory attention and how it affects their daily conversation. Nine CI users (five bilateral, three unilateral, and one bimodal) and eight NH listeners participated in this study. Behavioral speech recognition scores were collected using a matrix sentence test, and neural tracking of the speech envelope was recorded using electroencephalography (EEG). Speech stimuli were presented at three different levels (75, 65, and 55 dB SPL) in the presence of two maskers from three spatially separated speakers. Different combinations of assisted/impaired hearing modes were evaluated for CI users, and the outcomes were analyzed in three categories: electric hearing only, acoustic hearing only, and electric+acoustic hearing. Our results showed that increasing the conversational sound level degraded selective auditory attention in electric hearing. On the other hand, increasing the sound level improved selective auditory attention for the acoustic hearing group. In NH listeners, however, increasing the sound level did not cause a significant change in auditory attention. Our results imply that the effect of sound level on selective auditory attention varies depending on the hearing mode and that loudness control is necessary for CI users to attend to a conversation with ease.
Abstract:Electrical hearing through cochlear implants (CIs) may be fundamentally different from acoustic hearing by normal-hearing (NH) listeners, presumably leading to unequal speech quality perception in various noise environments. Noise reduction (NR) algorithms used in CIs reduce noise in favor of signal-to-noise ratio (SNR), regardless of the accompanying distortions that may degrade speech quality perception. To gain a better understanding of CI speech quality perception, the present work aimed to investigate speech quality perception under diverse noise conditions, including the factors of speech/noise levels, noise type, and distortions caused by NR algorithms. Fifteen NH and seven CI subjects participated in this study. Speech sentences were set to two different levels (65 and 75 dB SPL). Two types of noise (Cafeteria and Babble) at three levels (55, 65, and 75 dB SPL) were used. Sentences were processed using two NR algorithms to investigate the perceptual sensitivity of CI and NH listeners to the distortion. All sentences processed with the combinations of these conditions were presented to CI and NH listeners, who were asked to rate the sound quality of the speech as they perceived it. The effect of each factor on the perceived speech quality was investigated based on the group-averaged quality ratings of CI and NH listeners. Consistent with previous studies, CI listeners were not as sensitive as NH listeners to the distortion introduced by the NR algorithms. Statistical analysis showed that the speech level has a significant effect on quality perception. At the same SNR, the quality of 65 dB speech was rated higher than that of 75 dB speech by CI users, but vice versa by NH listeners. Therefore, the present study showed that the perceived speech quality patterns differed between CI and NH listeners in terms of their sensitivity to distortion and speech level in complex listening environments.
Abstract:Modifying facial images with desired attributes is an important yet challenging task in computer vision, which aims to modify single or multiple attributes of a face image. Existing methods are based either on attribute-independent approaches, where the modification is done in the latent representation, or on attribute-dependent approaches. The attribute-independent methods are limited in performance, as they require paired data for changing the desired attributes. Moreover, the attribute-independent constraint may result in the loss of information and, hence, fail to generate the required attributes in the face image. In contrast, attribute-dependent approaches are effective, as they are capable of modifying the required features while preserving the information in the given image. However, attribute-dependent approaches are sensitive and require careful model design to generate high-quality results. To address this problem, we propose an attribute-dependent face modification approach. The proposed approach is based on two generators and two discriminators that utilize the binary as well as the real representation of the attributes and, in return, generate high-quality attribute modification results. Experiments on the CelebA dataset show that our method effectively performs multiple attribute editing while keeping other facial details intact.
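As a rough illustration of attribute-conditioned generation and attribute-aware discrimination, the sketch below shows a generator that takes an image plus a target attribute vector and a discriminator with both a real/fake head and an attribute-prediction head. This is a generic, hypothetical design with placeholder layer sizes, not the two-generator/two-discriminator architecture proposed in the paper.

```python
# Illustrative attribute-conditioned generator/discriminator pair (not the paper's model).
import torch
import torch.nn as nn

class AttrGenerator(nn.Module):
    def __init__(self, n_attrs):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + n_attrs, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),
        )

    def forward(self, img, attrs):
        # Broadcast the attribute vector to a spatial map and concatenate it with the image.
        a = attrs.view(attrs.size(0), -1, 1, 1).expand(-1, -1, img.size(2), img.size(3))
        return self.net(torch.cat([img, a], dim=1))

class AttrDiscriminator(nn.Module):
    def __init__(self, n_attrs):
        super().__init__()
        self.features = nn.Sequential(nn.Conv2d(3, 64, 4, 2, 1), nn.LeakyReLU(0.2))
        self.adv_head = nn.Conv2d(64, 1, 3, 1, 1)  # real/fake score map
        self.attr_head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                       nn.Linear(64, n_attrs))  # attribute prediction

    def forward(self, img):
        h = self.features(img)
        return self.adv_head(h), self.attr_head(h)

g, d = AttrGenerator(n_attrs=5), AttrDiscriminator(n_attrs=5)
fake = g(torch.randn(2, 3, 64, 64), torch.randint(0, 2, (2, 5)).float())
adv_score, attr_pred = d(fake)
```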
Abstract:When a deep neural network is trained on data with only image-level labeling, the regions activated in each image tend to identify only a small region of the target object. We propose a method of using videos automatically harvested from the web to identify a larger region of the target object by using temporal information, which is not present in the static image. The temporal variations in a video allow different regions of the target object to be activated. We obtain an activated region in each frame of a video, and then aggregate the regions from successive frames into a single image, using a warping technique based on optical flow. The resulting localization maps cover more of the target object and can then be used as proxy ground truth to train a segmentation network. This simple approach outperforms existing methods under the same level of supervision, and even methods relying on extra annotations. Based on VGG-16 and ResNet-101 backbones, our method achieves mIoUs of 65.0 and 67.4, respectively, on PASCAL VOC 2012 test images, which represents a new state of the art.
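The flow-based aggregation step can be sketched as follows. This is a minimal illustration, not the authors' pipeline; it assumes per-frame localization maps `cams[t]` are already available as HxW float arrays and that frames are grayscale uint8 images. Maps from neighboring frames are warped onto a reference frame with dense optical flow and combined by an element-wise maximum.

```python
# Illustrative flow-based warping and aggregation of per-frame localization maps.
import cv2
import numpy as np

def warp_with_flow(src_map, flow):
    """Warp src_map into the reference frame using backward flow (reference -> source)."""
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    return cv2.remap(src_map, map_x, map_y, interpolation=cv2.INTER_LINEAR)

def aggregate_cams(frames, cams, ref_idx=0):
    """Aggregate per-frame maps onto frames[ref_idx] by taking a pixel-wise maximum."""
    agg = cams[ref_idx].copy()
    for t, (frame, cam) in enumerate(zip(frames, cams)):
        if t == ref_idx:
            continue
        # Dense flow from the reference frame to frame t (backward warping).
        flow = cv2.calcOpticalFlowFarneback(frames[ref_idx], frame, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        agg = np.maximum(agg, warp_with_flow(cam, flow))
    return agg
```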
Abstract:The main obstacle to weakly supervised semantic image segmentation is the difficulty of obtaining pixel-level information from coarse image-level annotations. Most methods based on image-level annotations use localization maps obtained from the classifier, but these only focus on the small discriminative parts of objects and do not capture precise boundaries. FickleNet explores diverse combinations of locations on feature maps created by generic deep neural networks. It selects hidden units randomly and then uses them to obtain activation scores for image classification. FickleNet implicitly learns the coherence of each location in the feature maps, resulting in a localization map which identifies both discriminative and other parts of objects. The ensemble effects are obtained from a single network by selecting random hidden unit pairs, which means that a variety of localization maps are generated from a single image. Our approach does not require any additional training steps and only adds a simple layer to a standard convolutional neural network; nevertheless, it outperforms recent comparable techniques on the Pascal VOC 2012 benchmark in both weakly and semi-supervised settings.
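The general idea of random hidden unit selection can be approximated with stochastic dropout kept active at inference, as in the minimal sketch below. This is illustrative only and differs from FickleNet's actual stochastic selection layer: each forward pass drops a random subset of feature-map units, producing a different class activation map, and the maps are aggregated into a single localization map.

```python
# Illustrative stochastic-selection CAM aggregation (not FickleNet's exact layer).
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(nn.Conv2d(3, 64, 3, padding=1), nn.ReLU())
classifier = nn.Conv2d(64, 20, 1)   # 20 classes, 1x1 conv as classifier
dropout = nn.Dropout2d(p=0.5)        # kept stochastic below (no eval())

def stochastic_cam(img, target_class, n_samples=8):
    feats = backbone(img)
    maps = []
    for _ in range(n_samples):
        # A different random combination of feature-map units contributes each pass.
        score_map = classifier(dropout(feats))[:, target_class]
        maps.append(F.relu(score_map))
    return torch.stack(maps, dim=0).max(dim=0).values  # aggregate the sampled maps

cam = stochastic_cam(torch.randn(1, 3, 64, 64), target_class=3)
print(cam.shape)  # torch.Size([1, 64, 64])
```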
Abstract:A meningioma is a type of brain tumor that requires follow-up of tumor volume in order to reach appropriate clinical decisions. A fully automated tool for meningioma detection is necessary for reliable and consistent tumor surveillance. Various studies on automated lesion detection have been carried out, including applications of convolutional neural network (CNN)-based methods, which have achieved state-of-the-art performance in various computer vision tasks. However, the applicable diseases are limited, owing to the lack of strongly annotated data in medical image analysis. To resolve this issue, we propose pyramid gradient-based class activation mapping (PG-CAM), a novel method for tumor localization that can be trained in a weakly supervised manner. PG-CAM uses a densely connected encoder-decoder-based feature pyramid network (DC-FPN) as a backbone structure and extracts a multi-scale Grad-CAM that captures hierarchical features of a tumor. We tested our model using meningioma brain magnetic resonance (MR) data collected from the collaborating hospital. In our experiments, PG-CAM outperformed Grad-CAM, delivering 23 percent higher localization accuracy on the validation set.
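The core of a multi-scale Grad-CAM can be sketched as below: compute standard Grad-CAM at several pyramid levels and fuse the upsampled maps. This is a generic illustration, not the PG-CAM implementation; the toy two-scale encoder stands in for the DC-FPN levels and all names are hypothetical.

```python
# Illustrative multi-scale Grad-CAM fusion (not the PG-CAM code).
import torch
import torch.nn.functional as F

def grad_cam(feature_map, class_score):
    """Standard Grad-CAM: weight channels by spatially pooled gradients."""
    grads = torch.autograd.grad(class_score, feature_map, retain_graph=True)[0]
    weights = grads.mean(dim=(2, 3), keepdim=True)            # GAP over space
    return F.relu((weights * feature_map).sum(dim=1, keepdim=True))

def multi_scale_grad_cam(feature_maps, class_score, out_size):
    """Fuse Grad-CAMs from multiple levels by upsampling and averaging."""
    cams = [F.interpolate(grad_cam(f, class_score), size=out_size,
                          mode="bilinear", align_corners=False)
            for f in feature_maps]
    return torch.stack(cams, dim=0).mean(dim=0)

# Usage with a toy two-scale encoder (placeholders for pyramid levels).
conv1 = torch.nn.Conv2d(1, 8, 3, padding=1)
conv2 = torch.nn.Conv2d(8, 8, 3, stride=2, padding=1)
fc = torch.nn.Linear(8, 2)

x = torch.randn(1, 1, 64, 64)
f1 = conv1(x)
f2 = conv2(f1)
score = fc(f2.mean(dim=(2, 3)))[0, 1]      # score for class 1
cam = multi_scale_grad_cam([f1, f2], score, out_size=(64, 64))
print(cam.shape)  # torch.Size([1, 1, 64, 64])
```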
Abstract:The extraction of meaningful features from videos is important, as such features can be used in various applications. Despite its importance, video representation learning has not been studied much, because it is challenging to deal with both content and motion information. We present a Mutual Suppression network (MSnet) to learn disentangled motion and content features in videos. The MSnet is trained in such a way that content features do not contain motion information and motion features do not contain content information; this is achieved by having the two feature extractors suppress each other through adversarial training. We utilize the disentangled features from the MSnet for several tasks, such as frame reproduction, pixel-level video frame prediction, and dense optical flow estimation, to demonstrate the strength of MSnet. The proposed model outperforms state-of-the-art methods in pixel-level video frame prediction. The source code will be made publicly available.
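The mutual-suppression idea can be sketched as follows: an auxiliary predictor tries to recover motion information from the content features, while the content encoder is trained adversarially to prevent this (and symmetrically in the other direction). In this illustrative sketch a gradient-reversal layer stands in for the alternating adversarial updates; this is an assumption for brevity, not MSnet's exact training scheme, and all modules are toy placeholders.

```python
# Illustrative mutual-suppression loss via gradient reversal (not MSnet's scheme).
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output  # flip gradients so the encoder works against the predictor

content_enc = nn.Linear(128, 32)          # placeholder encoders over flattened frames
motion_enc = nn.Linear(128, 32)
motion_from_content = nn.Linear(32, 32)   # suppression branch: predict motion from content

frames = torch.randn(4, 128)              # toy inputs
content = content_enc(frames)
motion = motion_enc(frames)

# The predictor learns to recover motion from content features, while the content
# encoder receives the reversed gradient and learns to suppress motion information.
pred_motion = motion_from_content(GradReverse.apply(content))
suppression_loss = nn.functional.mse_loss(pred_motion, motion.detach())
suppression_loss.backward()
```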