ParalMGC: Multiple Audio Representations for Synthetic Human Speech Attribution

Neri, M.; Ferrarotti, A.; De Luisa, L.; Salimbeni, A.; Carli, M.
2022-01-01

Abstract

Deep learning models enable the creation of deepfake synthetic audio that is difficult to distinguish from natural speech. Moreover, recognizing which algorithm generated a given synthetic audio sample is even more challenging. This task, scarcely explored in the literature, is the focus of this paper. We introduce a deep learning approach to identify which synthesis algorithm produced a given speech recording. Specifically, the proposed system exploits two parallel branches for processing 2D audio features, i.e., Mel-Frequency and GammaTone coefficients. The extracted features are concatenated and then refined by two convolutional layers. The performance of the model is evaluated on the 2022 IEEE Signal Processing Cup dataset. Different configurations of the proposed framework, involving several audio features and deep learning architectures, are discussed. The proposed approach achieves an accuracy of 98.1% on the validation set of the synthetic dataset.
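The abstract gives no implementation details, so the following is only a rough NumPy sketch of the fusion scheme it describes (two parallel 2D feature representations concatenated and refined by two convolutional layers). All shapes, kernel sizes, and the naive convolution routine are assumptions for illustration, not the authors' code:

```python
import numpy as np

def conv2d(x, kernel):
    """Naive 'valid'-mode 2D convolution (illustrative only)."""
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)

# Hypothetical 2D time-frequency features from the two parallel branches:
# rows = coefficients, columns = time frames (sizes are assumptions).
mfcc_feat = rng.standard_normal((40, 100))  # Mel-Frequency branch output
gtcc_feat = rng.standard_normal((40, 100))  # GammaTone branch output

# Concatenate the two representations along the coefficient axis...
fused = np.concatenate([mfcc_feat, gtcc_feat], axis=0)  # shape (80, 100)

# ...then refine with two convolutional layers (random 3x3 kernels here,
# standing in for learned filters).
refined = conv2d(conv2d(fused, rng.standard_normal((3, 3))),
                 rng.standard_normal((3, 3)))
print(refined.shape)  # (76, 96)
```

In a real system the random kernels would be learned layers (e.g. in PyTorch or TensorFlow) and the features would come from MFCC/GTCC extraction of the input waveform; the sketch only shows how the two branch outputs are combined.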
ISBN: 978-1-6654-6623-3
Neri, M., Ferrarotti, A., De Luisa, L., Salimbeni, A., Carli, M. (2022). ParalMGC: Multiple Audio Representations for Synthetic Human Speech Attribution. In 10th European Workshop on Visual Information Processing (pp.1-6). 345 E 47TH ST, NEW YORK, NY 10017 USA : IEEE [10.1109/EUVIP53989.2022.9922861].
Files for this record:
No files are associated with this record.

Documents in IRIS are protected by copyright and all rights are reserved, unless otherwise indicated.

Use this identifier to cite or link to this document: https://hdl.handle.net/11590/426208
Citations
  • Scopus: 0
  • Web of Science: 0