Motivated by an increasing need for privacy-preserving voice communications, we investigate the original idea of sending encrypted data and speech as pseudo-speech signals in the audio domain. Being less constrained than military ``Crypto Phones'' and open to genuine public evaluation, this approach is promising for public unsecured voice communication infrastructures, such as 3G cellular networks and VoIP. A cornerstone of secure voice communications is the authenticated exchange of cryptographic keys using the voice channel as the sole resource, with neither a Public Key Infrastructure (PKI) nor a Certificate Authority (CA). In this paper, we detail our new robust double authentication mechanism based on signatures and Short Authentication Strings (SAS), which ensures strong authentication between the users while mitigating errors caused by unreliable voice channels and providing identity protection against passive eavesdroppers. In the symbolic model, our protocol has been formally proof-checked for security and fully validated with the Tamarin Prover.