We propose the first tensorized optical multimodal fusion network architecture with a self-attention mechanism and low-rank tensor fusion. Simulation results show a $51.3\times$ reduction in hardware requirements and an energy efficiency of $3.7\times 10^{13}$ MAC/J.
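
To make the phrase "low-rank tensor fusion" concrete, the following is a minimal NumPy sketch of one common formulation. The abstract does not state the exact factorization, so this is an assumption: each modality feature is extended with a constant 1, projected by rank-many factor matrices, multiplied element-wise across modalities, and summed over the rank index. The function names, shapes, and the two-modality reference check are illustrative only and say nothing about the optical implementation.

```python
import numpy as np

def low_rank_fusion(features, factors):
    """Fuse modality features without building the full fusion tensor.

    features: list of 1-D arrays z_m with length d_m (hypothetical inputs).
    factors:  list of arrays with shape (rank, out_dim, d_m + 1), one per modality.
    """
    fused = None
    for z, w in zip(features, factors):
        z1 = np.append(z, 1.0)   # append a constant 1 to keep lower-order terms
        proj = w @ z1            # shape (rank, out_dim)
        fused = proj if fused is None else fused * proj
    return fused.sum(axis=0)     # sum over the rank index -> (out_dim,)

def full_tensor_fusion(features, factors):
    """Two-modality reference: build the full weight tensor explicitly."""
    wa, wb = factors
    za1, zb1 = (np.append(z, 1.0) for z in features)
    full = np.einsum("rka,rkb->kab", wa, wb)        # (out_dim, da + 1, db + 1)
    return np.einsum("kab,a,b->k", full, za1, zb1)  # contract with the outer product

rng = np.random.default_rng(0)
za, zb = rng.normal(size=4), rng.normal(size=3)
wa = rng.normal(size=(2, 5, 5))   # rank 2, out_dim 5, da + 1 = 5
wb = rng.normal(size=(2, 5, 4))   # rank 2, out_dim 5, db + 1 = 4
h = low_rank_fusion([za, zb], [wa, wb])
```

Under this assumption, the factorized form stores `rank * out_dim * ((da + 1) + (db + 1))` weights instead of `out_dim * (da + 1) * (db + 1)` for the explicit tensor, and the two functions agree numerically for two modalities.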