Abstract:Autoformalization, the task of automatically translating natural language descriptions into a formal language, poses a significant challenge across various domains, especially in mathematics. Recent advancements in large language models (LLMs) have unveiled their promising capabilities to formalize even competition-level math problems. However, we observe a considerable discrepancy between pass@1 and pass@k accuracies in LLM-generated formalizations. To address this gap, we introduce a novel framework that scores and selects the best result from k autoformalization candidates based on two complementary self-consistency methods: symbolic equivalence and semantic consistency. Elaborately, symbolic equivalence identifies the logical homogeneity among autoformalization candidates using automated theorem provers, and semantic consistency evaluates the preservation of the original meaning by informalizing the candidates and computing the similarity between the embeddings of the original and informalized texts. Our extensive experiments on the MATH and miniF2F datasets demonstrate that our approach significantly enhances autoformalization accuracy, achieving up to 0.22-1.35x relative improvements across various LLMs and baseline methods.
Abstract:Weakly Supervised Semantic Segmentation (WSSS) with image-level labels has long been suffering from fragmentary object regions led by Class Activation Map (CAM), which is incapable of generating fine-grained masks for semantic segmentation. To guide CAM to find more non-discriminating object patterns, this paper turns to an interesting working mechanism in agent learning named Complementary Learning System (CLS). CLS holds that the neocortex builds a sensation of general knowledge, while the hippocampus specially learns specific details, completing the learned patterns. Motivated by this simple but effective learning pattern, we propose a General-Specific Learning Mechanism (GSLM) to explicitly drive a coarse-grained CAM to a fine-grained pseudo mask. Specifically, GSLM develops a General Learning Module (GLM) and a Specific Learning Module (SLM). The GLM is trained with image-level supervision to extract coarse and general localization representations from CAM. Based on the general knowledge in the GLM, the SLM progressively exploits the specific spatial knowledge from the localization representations, expanding the CAM in an explicit way. To this end, we propose the Seed Reactivation to help SLM reactivate non-discriminating regions by setting a boundary for activation values, which successively identifies more regions of CAM. Without extra refinement processes, our method is able to achieve breakthrough improvements for CAM of over 20.0% mIoU on PASCAL VOC 2012 and 10.0% mIoU on MS COCO 2014 datasets, representing a new state-of-the-art among existing WSSS methods.
Abstract:Autonomous grasping of novel objects that are previously unseen to a robot is an ongoing challenge in robotic manipulation. In the last decades, many approaches have been presented to address this problem for specific robot hands. The UniGrasp framework, introduced recently, has the ability to generalize to different types of robotic grippers; however, this method does not work on grippers with closed-loop constraints and is data-inefficient when applied to robot hands with multigrasp configurations. In this paper, we present EfficientGrasp, a generalized grasp synthesis and gripper control method that is independent of gripper model specifications. EfficientGrasp utilizes a gripper workspace feature rather than UniGrasp's gripper attribute inputs. This reduces memory use by 81.7% during training and makes it possible to generalize to more types of grippers, such as grippers with closed-loop constraints. The effectiveness of EfficientGrasp is evaluated by conducting object grasping experiments both in simulation and real-world; results show that the proposed method also outperforms UniGrasp when considering only grippers without closed-loop constraints. In these cases, EfficientGrasp shows 9.85% higher accuracy in generating contact points and 3.10% higher grasping success rate in simulation. The real-world experiments are conducted with a gripper with closed-loop constraints, which UniGrasp fails to handle while EfficientGrasp achieves a success rate of 83.3%. The main causes of grasping failures of the proposed method are analyzed, highlighting ways of enhancing grasp performance.
Abstract:Identification of the mechanically exfoliated graphene flakes and classification of the thickness is important in the nanomanufacturing of next-generation materials and devices that overcome the bottleneck of Moore's Law. Currently, identification and classification of exfoliated graphene flakes are conducted by human via inspecting the optical microscope images. The existing state-of-the-art automatic identification by machine learning is not able to accommodate images with different backgrounds while different backgrounds are unavoidable in experiments. This paper presents a deep learning method to automatically identify and classify the thickness of exfoliated graphene flakes on Si/SiO2 substrates from optical microscope images with various settings and background colors. The presented method uses a hierarchical deep convolutional neural network that is capable of learning new images while preserving the knowledge from previous images. The deep learning model was trained and used to classify exfoliated graphene flakes into monolayer (1L), bi-layer (2L), tri-layer (3L), four-to-six-layer (4-6L), seven-to-ten-layer (7-10L), and bulk categories. Compared with existing machine learning methods, the presented method possesses high accuracy and efficiency as well as robustness to the backgrounds and resolutions of images. The results indicated that our deep learning model has accuracy as high as 99% in identifying and classifying exfoliated graphene flakes. This research will shed light on scaled-up manufacturing and characterization of graphene for advanced materials and devices.
Abstract:Face deblurring aims to restore a clear face image from a blurred input image with more explicit structure and facial details. However, most conventional image and face deblurring methods focus on the whole generated image resolution without consideration of special face part texture and generally produce unsufficient details. Considering that faces and backgrounds have different distribution information, in this study, we designed an effective face deblurring network based on separable normalization and adaptive denormalization (SNADNet). First, We fine-tuned the face parsing network to obtain an accurate face structure. Then, we divided the face parsing feature into face foreground and background. Moreover, we constructed a new feature adaptive denormalization to regularize fafcial structures as a condition of the auxiliary to generate more harmonious and undistorted face structure. In addition, we proposed a texture extractor and multi-patch discriminator to enhance the generated facial texture information. Experimental results on both CelebA and CelebA-HQ datasets demonstrate that the proposed face deblurring network restores face structure with more facial details and performs favorably against state-of-the-art methods in terms of structured similarity indexing method (SSIM), peak signal-to-noise ratio (PSNR), Frechet inception distance (FID) and L1, and qualitative comparisons.
Abstract:Prior knowledge of face shape and location plays an important role in face inpainting. However, traditional facing inpainting methods mainly focus on the generated image resolution of the missing portion but without consideration of the special particularities of the human face explicitly and generally produce discordant facial parts. To solve this problem, we present a stable variational latent generative model for large inpainting of face images. We firstly represent only face regions with the latent variable space but simultaneously constraint the random vectors to offer control over the distribution of latent variables, and combine with the non-face parts textures to generate a face image with plausible contents. Two adversarial discriminators are finally used to judge whether the generated distribution is close to the real distribution or not. It can not only synthesize novel image structures but also explicitly utilize the latent space with Eigenfaces to make better predictions. Furthermore, our work better evaluates the side face impainting problem. Experiments on both CelebA and CelebA-HQ face datasets demonstrate that our proposed approach generates higher quality inpainting results than existing ones.
Abstract:Existing approaches to train neural networks that use large images require to either crop or down-sample data during pre-processing, use small batch sizes, or split the model across devices mainly due to the prohibitively limited memory capacity available on GPUs and emerging accelerators. These techniques often lead to longer time to convergence or time to train (TTT), and in some cases, lower model accuracy. CPUs, on the other hand, can leverage significant amounts of memory. While much work has been done on parallelizing neural network training on multiple CPUs, little attention has been given to tune neural network training with large images on CPUs. In this work, we train a multi-scale convolutional neural network (M-CNN) to classify large biomedical images for high content screening in one hour. The ability to leverage large memory capacity on CPUs enables us to scale to larger batch sizes without having to crop or down-sample the input images. In conjunction with large batch sizes, we find a generalized methodology of linearly scaling of learning rate and train M-CNN to state-of-the-art (SOTA) accuracy of 99% within one hour. We achieve fast time to convergence using 128 two socket Intel Xeon 6148 processor nodes with 192GB DDR4 memory connected with 100Gbps Intel Omnipath architecture.