Abstract:How do you ensure that, in a reverberant room, several people can speak simultaneously to several other people, making themselves perfectly understood and without any crosstalk between messages? In this work, we report a conceptual solution to this problem by developing an intelligent acoustic wall, which can be reconfigured electronically and is controlled by a learning algorithm that adapts to the geometry of the room and the positions of sources and receivers. To this end, a portion of the room boundaries is covered with a smart mirror made of a broadband acoustic reconfigurable metasurface (ARMs) designed to provide a two-state (0 or {\pi}) phase shift in the reflected waves by 200 independently tunable units. The whole phase pattern is optimized to maximize the Shannon capacity while minimizing crosstalk between the different sources and receivers. We demonstrate the control of multi-spectral sound fields covering a spectrum much larger than the coherence bandwidth of the room for diverse striking functionalities, including crosstalk-free acoustic communication, frequency-multiplexed communications, and multi-user communications. An experiment conducted with two music sources for two different people demonstrates a crosstalk-free simultaneous music playback. Our work opens new routes for the control of sound waves in complex media and for a new generation of acoustic devices.
Abstract:Image captioning models require the high-level generalization ability to describe the contents of various images in words. Most existing approaches treat the image-caption pairs equally in their training without considering the differences in their learning difficulties. Several image captioning approaches introduce curriculum learning methods that present training data with increasing levels of difficulty. However, their difficulty measurements are either based on domain-specific features or prior model training. In this paper, we propose a simple yet efficient difficulty measurement for image captioning using cross-modal similarity calculated by a pretrained vision-language model. Experiments on the COCO and Flickr30k datasets show that our proposed approach achieves superior performance and competitive convergence speed to baselines without requiring heuristics or incurring additional training costs. Moreover, the higher model performance on difficult examples and unseen data also demonstrates the generalization ability.
Abstract:Digitization of scanned receipts aims to extract text from receipt images and save it into structured documents. This is usually split into two sub-tasks: text localization and optical character recognition (OCR). Most existing OCR models only focus on the cropped text instance images, which require the bounding box information provided by a text region detection model. Introducing an additional detector to identify the text instance images in advance is inefficient, however instance-level OCR models have very low accuracy when processing the whole image for the document-level OCR, such as receipt images containing multiple text lines arranged in various layouts. To this end, we propose a localization-free document-level OCR model for transcribing all the characters in a receipt image into an ordered sequence end-to-end. Specifically, we finetune the pretrained Transformer-based instance-level model TrOCR with randomly cropped image chunks, and gradually increase the image chunk size to generalize the recognition ability from instance images to full-page images. In our experiments on the SROIE receipt OCR dataset, the model finetuned with our strategy achieved 64.4 F1-score and a 22.8% character error rates (CER) on the word-level and character-level metrics, respectively, which outperforms the baseline results with 48.5 F1-score and 50.6% CER. The best model, which splits the full image into 15 equally sized chunks, gives 87.8 F1-score and 4.98% CER with minimal additional pre or post-processing of the output. Moreover, the characters in the generated document-level sequences are arranged in the reading order, which is practical for real-world applications.