Abstract:Synthesizer is a type of electronic musical instrument that is now widely used in modern music production and sound design. Each parameters configuration of a synthesizer produces a unique timbre and can be viewed as a unique instrument. The problem of estimating a set of parameters configuration that best restore a sound timbre is an important yet complicated problem, i.e.: the synthesizer parameters estimation problem. We proposed a multi-modal deep-learning-based pipeline Sound2Synth, together with a network structure Prime-Dilated Convolution (PDC) specially designed to solve this problem. Our method achieved not only SOTA but also the first real-world applicable results on Dexed synthesizer, a popular FM synthesizer.