[Paper Review Study] Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech


by 원준천 2023. 3. 30. 17:11

Author: 천원준 (16th cohort)

GitHub - jaywalnut310/vits: VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech


This model is known as VITS (Variational Inference with adversarial learning for end-to-end Text-to-Speech). At the time of writing, it shows some of the best performance among TTS models. The paper builds on HiFi-GAN and Glow-TTS, so I recommend reading those two papers first.

Introduction

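Text-to-Speech (TTS) converts written text into speech.
Existing TTS models work in two stages:
- Generate a mel-spectrogram, an intermediate speech representation, from the text
- Generate the raw waveform from the mel-spectrogram
This step-by-step approach does not make efficient use of modern parallel processing.
The paper therefore proposes a parallel end-to-end TTS method:
- A VAE connects the two TTS modules
- To synthesize high-quality audio, normalizing flows are applied to the conditional prior and adversarial training is applied in the waveform domain
- A stochastic duration predictor handles the one-to-many problem, where a single sentence can be spoken in many different ways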

Method

Variational Inference

Overview

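VITS is a conditional VAE trained to maximize the evidence lower bound (ELBO) of the marginal log-likelihood log p(x|c).
The training loss is the negative ELBO, which can be viewed as the sum of a reconstruction loss -log p(x|z) and a KL divergence term log q(z|x) - log p(z|c).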
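For reference, the bound can be written out in the paper's notation, where x is the target speech, c the condition, z the latent variable, q_phi the approximate posterior, and p_theta the prior and likelihood:

```latex
\log p_\theta(x \mid c) \;\ge\;
\mathbb{E}_{q_\phi(z \mid x)}\!\left[
  \log p_\theta(x \mid z)
  - \log \frac{q_\phi(z \mid x)}{p_\theta(z \mid c)}
\right]
```

Negating the right-hand side gives the training loss: the first term becomes the reconstruction loss and the second the KL divergence.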

Reconstruction Loss

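The decoder upsamples the latent variables z to a waveform y in the waveform domain.
y is then transformed into the mel-spectrogram domain, giving x_mel.
The L1 loss between the predicted and target mel-spectrograms is used as the reconstruction loss.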
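A minimal sketch of the L1 comparison, assuming the two mel-spectrograms are already available as equally shaped frames-by-bins lists. The function name `l1_loss` and the mean reduction are assumptions made for illustration; the waveform-to-mel transform itself is not shown.

```python
def l1_loss(target, predicted):
    """Mean absolute error between two equally shaped 2-D lists (frames x mel bins)."""
    if len(target) != len(predicted):
        raise ValueError("frame counts differ")
    total, count = 0.0, 0
    for t_frame, p_frame in zip(target, predicted):
        if len(t_frame) != len(p_frame):
            raise ValueError("mel-bin counts differ")
        for t, p in zip(t_frame, p_frame):
            total += abs(t - p)
            count += 1
    return total / count


# Toy 2-frame, 2-bin spectrograms (made-up numbers, not real audio features).
x_mel = [[0.0, 1.0], [2.0, 3.0]]        # target
x_mel_hat = [[0.5, 1.0], [2.0, 2.0]]    # predicted
recon_loss = l1_loss(x_mel, x_mel_hat)
```

In a real training loop the same comparison would typically be done on tensors so that gradients flow back through the decoder.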

KL Divergence

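The condition c fed to the prior encoder consists of the phonemes c_text extracted from the text and the alignment A between phonemes and latent variables.
The posterior encoder takes the linear-scale spectrogram x_lin of the target speech as input, which gives it higher-resolution information than a mel-spectrogram.

Factorized normal distributions are used to parameterize the prior and posterior encoders.
The authors found that increasing the expressiveness of the prior distribution is important for generating more realistic samples.
They therefore apply a normalizing flow f_theta on top of the factorized normal prior.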
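Written out in the paper's notation, the KL term and the two distributions look as follows. The posterior is a factorized normal, and the prior is a factorized normal evaluated on f_theta(z), corrected by the change-of-variables determinant:

```latex
L_{kl} = \log q_\phi(z \mid x_{lin}) - \log p_\theta(z \mid c_{text}, A),
\qquad z \sim q_\phi(z \mid x_{lin}) = \mathcal{N}\big(z;\ \mu_\phi(x_{lin}),\ \sigma_\phi(x_{lin})\big)

p_\theta(z \mid c) = \mathcal{N}\big(f_\theta(z);\ \mu_\theta(c),\ \sigma_\theta(c)\big)
\left| \det \frac{\partial f_\theta(z)}{\partial z} \right|,
\qquad c = [c_{text}, A]
```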

Alignment Estimation

Monotonic Alignment Search

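Monotonic Alignment Search (MAS) is used to estimate the alignment A between the input text and the target speech.
Applying MAS directly is difficult, however.
- Why? The objective here is the ELBO, whereas MAS maximizes the exact log-likelihood
- MAS is therefore redefined to find the alignment that maximizes the ELBO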
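The search itself is a dynamic program. The sketch below assumes a precomputed score matrix `log_lik[i][j]` (the log-likelihood of latent frame j under the prior of text token i) and returns, for every frame, the index of the token it is aligned to. The function name and the toy numbers are illustrative only, and this is not the optimized implementation shipped with the official repository.

```python
NEG_INF = float("-inf")


def monotonic_alignment_search(log_lik):
    """Return the token index assigned to each frame along the best monotonic path.

    The path starts at (token 0, frame 0), ends at (last token, last frame),
    and at each new frame either stays on the current token or moves to the next.
    """
    n_text, n_frames = len(log_lik), len(log_lik[0])
    if n_frames < n_text:
        raise ValueError("need at least one frame per text token")

    # q[i][j]: best cumulative score of any valid path ending at (i, j).
    q = [[NEG_INF] * n_frames for _ in range(n_text)]
    q[0][0] = log_lik[0][0]
    for j in range(1, n_frames):
        for i in range(min(j + 1, n_text)):
            stay = q[i][j - 1]
            advance = q[i - 1][j - 1] if i > 0 else NEG_INF
            q[i][j] = max(stay, advance) + log_lik[i][j]

    # Backtrack from the final cell to recover the alignment.
    path = [0] * n_frames
    i = n_text - 1
    for j in range(n_frames - 1, -1, -1):
        path[j] = i
        if j > 0 and i > 0 and (i == j or q[i - 1][j - 1] > q[i][j - 1]):
            i -= 1
    return path


# Toy scores: token 0 fits frames 0-1, token 1 fits frames 2-3.
scores = [
    [-1.0, -1.0, -5.0, -5.0],
    [-5.0, -5.0, -1.0, -1.0],
]
alignment = monotonic_alignment_search(scores)
durations = [alignment.count(i) for i in range(len(scores))]
```

Counting how many frames land on each token gives the per-phoneme durations, which is what the duration predictor is trained on.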

Duration Prediction from Text

A stochastic duration predictor is used to generate human-like speech rhythm.
Stochastic duration predictor
- A flow-based generative model trained with maximum likelihood estimation
Applying maximum likelihood estimation directly is difficult, however.
- Why? The duration of each input phoneme is
  1. a discrete integer, so it must be dequantized
  2. a scalar, so the invertibility requirement rules out high-dimensional transformations
Variational dequantization and variational data augmentation are therefore used.
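A toy illustration of the two tricks, under loud assumptions: in the paper the noise variables u and nu come from a learned approximate posterior, whereas this sketch simply draws u uniformly from [0, 1) and nu from a standard normal as stand-ins. The helper name `dequantize_and_augment` is hypothetical.

```python
import random


def dequantize_and_augment(durations, rng):
    """Turn integer durations into (continuous value, extra channel) pairs.

    d - u with u in [0, 1) makes each duration a positive real number
    (dequantization); pairing it with nu adds a second channel so a flow
    has more than one dimension to work with (data augmentation).
    """
    pairs = []
    for d in durations:
        if d < 1:
            raise ValueError("durations must be positive integers")
        u = rng.random()            # stand-in for the learned q(u | d, c_text)
        nu = rng.gauss(0.0, 1.0)    # stand-in for the learned q(nu | d, c_text)
        pairs.append((d - u, nu))
    return pairs


rng = random.Random(0)
durations = [3, 1, 5]
pairs = dequantize_and_augment(durations, rng)
```

Each dequantized value stays in the interval (d - 1, d], so the original integer can be recovered by rounding up.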

Adversarial Training

Two losses are used for speech synthesis:
- Least-squares loss function for adversarial training
- Additional feature-matching loss for generator training
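The two losses can be sketched on plain lists as follows. This assumes the discriminator outputs and its intermediate feature maps have already been computed and flattened; the function names are illustrative, and the per-layer mean is an assumption about the normalization.

```python
def mean(values):
    return sum(values) / len(values)


def discriminator_loss(d_real, d_fake):
    """Least-squares loss: push outputs on real audio to 1 and on generated audio to 0."""
    return mean([(r - 1.0) ** 2 for r in d_real]) + mean([f ** 2 for f in d_fake])


def generator_adv_loss(d_fake):
    """Least-squares loss: push discriminator outputs on generated audio to 1."""
    return mean([(f - 1.0) ** 2 for f in d_fake])


def feature_matching_loss(feats_real, feats_fake):
    """Sum over discriminator layers of the mean L1 distance between feature maps."""
    total = 0.0
    for real_layer, fake_layer in zip(feats_real, feats_fake):
        total += mean([abs(r - f) for r, f in zip(real_layer, fake_layer)])
    return total


# Toy values: two discriminator outputs, two layers of flattened features.
loss_d = discriminator_loss([1.0, 1.0], [0.0, 0.0])
loss_g = generator_adv_loss([0.0, 1.0])
loss_fm = feature_matching_loss([[1.0, 2.0], [0.0]], [[1.5, 2.0], [1.0]])
```

The feature-matching term compares hidden activations rather than final scores, which gives the generator a denser training signal than the adversarial loss alone.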

Final Loss

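The total loss for training the conditional VAE is the sum of the losses above: L_vae = L_recon + L_kl + L_dur + L_adv(G) + L_fm(G)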

Model Architecture

Posterior Encoder

Uses non-causal WaveNet residual blocks.
- Each block consists of dilated convolution layers with a gated activation unit and a skip connection
A linear projection layer produces the mean and variance of the normal posterior distribution.
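Two small pieces of this encoder can be sketched without any convolution code: the WaveNet-style gated activation, and drawing z from the projected statistics. The sketch assumes the projection outputs a mean and a log standard deviation (a common parameterization, not something stated above), and both function names are illustrative.

```python
import math


def gated_activation(filter_in, gate_in):
    """WaveNet-style gate: tanh(filter) * sigmoid(gate), element-wise."""
    return [math.tanh(f) / (1.0 + math.exp(-g)) for f, g in zip(filter_in, gate_in)]


def reparameterize(mean, log_std, eps):
    """Sample z = mean + std * eps, with eps drawn from a standard normal."""
    return [m + math.exp(s) * e for m, s, e in zip(mean, log_std, eps)]


# Toy inputs; in the real model these come from dilated convolutions.
hidden = gated_activation([0.0, 1.0], [0.0, 0.0])
# eps is fixed here so the result is reproducible.
z = reparameterize([1.0, -1.0], [0.0, 0.0], [2.0, 0.5])
```

Passing the noise in explicitly keeps the sampling step differentiable with respect to the mean and log standard deviation.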

Prior Encoder

Consists of a text encoder that processes the input phonemes c_text and a normalizing flow f_theta.
The text encoder is a transformer encoder that uses relative positional encoding.
- It maps c_text to the hidden representation h_text
The normalizing flow is a stack of affine coupling layers.
- For simplicity, it is designed as a volume-preserving transformation with a Jacobian determinant of one
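A volume-preserving coupling layer can be sketched as a shift-only transform: half of the input passes through unchanged and determines a shift that is added to the other half. No scaling is applied, so the Jacobian determinant is one and the log-determinant is zero. The `toy_shift` function below is a hypothetical stand-in for the learned network.

```python
def toy_shift(x_a, n_out):
    """Hypothetical stand-in for the learned network that predicts the shift."""
    total = sum(x_a)
    return [0.5 * total + k for k in range(n_out)]


def coupling_forward(x):
    """Shift the second half of x by a function of the first half."""
    half = len(x) // 2
    x_a, x_b = x[:half], x[half:]
    shift = toy_shift(x_a, len(x_b))
    y_b = [b + s for b, s in zip(x_b, shift)]
    log_det = 0.0  # shift-only transform, so the determinant is one
    return x_a + y_b, log_det


def coupling_inverse(y):
    """Undo the shift; the first half is unchanged, so the shift can be recomputed."""
    half = len(y) // 2
    y_a, y_b = y[:half], y[half:]
    shift = toy_shift(y_a, len(y_b))
    return y_a + [b - s for b, s in zip(y_b, shift)]


x = [1.0, 2.0, 3.0, 4.0]
y, log_det = coupling_forward(x)
x_restored = coupling_inverse(y)
```

Because the log-determinant is always zero, the change-of-variables term drops out of the prior's log-density, which is the simplification the paper refers to.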

Decoder

Uses the HiFi-GAN V1 generator.

Discriminator

Uses the multi-period discriminator from HiFi-GAN.

Stochastic Duration Predictor

Estimates the distribution of phoneme durations from the conditional input h_text.
Built by stacking residual blocks with dilated and depth-separable convolutional layers.
Neural spline flows are applied to improve the expressiveness of the transformation.

Results

https://jaywalnut310.github.io/vits-demo/index.html

Audio Samples from "Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech"


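Wow! Synthesis is also faster than the previous models!

Since the paper itself evaluates with MOS (mean opinion score), a subjective human evaluation, I recommend listening to the samples yourself.

~~Listen for yourself and decide. I'd like you to make up your own mind.~~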

Conclusion

This paper proposes a parallel end-to-end TTS model. It outperforms the existing two-stage models. The text preprocessing problem remains, but applying self-supervised learning to language representations may be a way to address it.


KUBIG 2023-1 Activity Blog
