In underwater vision research, matching images from sonar sensors with images from optical cameras has long been a challenging problem. Because the two sensors rely on different imaging mechanisms, the gray values, texture, contrast, and other properties of acoustic and optical images differ even at corresponding local regions, which renders traditional matching methods designed for optical images ineffective. The difficulty and high cost of underwater data acquisition further hinder research on acousto-optic data fusion. To make the most of underwater sensor data and promote the development of multi-sensor information fusion (MSIF), this study applies a deep-learning-based image attribute transfer method to the acousto-optic image matching problem; its core idea is to eliminate the imaging differences between the two modalities as far as possible. In addition, an advanced local feature descriptor is introduced to address the challenging acousto-optic matching problem. Experimental results show that the proposed method preprocesses acousto-optic images effectively and obtains accurate matching results. Additionally, the method builds on a combination of deep semantic layers of the images and can indirectly reveal the local feature matching relationship between the original image pair, which provides a new solution to the underwater multi-sensor image matching problem.