Abstract
Modeling nonlinear dynamical systems is a challenging task in fields such as speech processing, music generation, and video prediction. This paper introduces a hierarchical framework for Deep State Space Models (DSSMs), categorizing them by their conditional independence properties and Markov assumptions, and positioning existing models, including the Stochastic Recurrent Neural Network (SRNN), Variational Recurrent Neural Network (VRNN), and Recurrent State Space Model (RSSM), within this framework. We discuss different options for the inference networks and demonstrate how integrating normalizing flows can enhance model flexibility by capturing complex distributions. Our work not only clarifies the relationships among existing models but also paves the way for the development of new, more effective approaches for modeling nonlinear dynamics. In particular, we propose the Autoregressive State Space Model (ArSSM) and evaluate its effectiveness in speech and polyphonic music modeling tasks.
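To make the categorization concrete, consider a generic deep state space model over observations x_{1:T} and latent states z_{1:T}. The factorization below is an illustrative sketch of the standard DSSM form, not the paper's exact notation:

```latex
p(x_{1:T}, z_{1:T})
  = \prod_{t=1}^{T}
    \underbrace{p(z_t \mid z_{t-1}, x_{1:t-1})}_{\text{transition}}
    \;
    \underbrace{p(x_t \mid z_t, x_{1:t-1})}_{\text{emission}}
```

Which conditioning variables are retained or dropped in the transition and emission terms (for instance, whether the emission depends on past observations x_{1:t-1} in addition to the current latent z_t) is one natural axis along which models such as SRNN, VRNN, and RSSM differ, and it is this kind of conditional independence structure that the hierarchical framework organizes.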
| Original language | English |
| --- | --- |
| Title of host publication | Proceedings of the 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
| Publisher | IEEE |
| ISBN (Electronic) | 9798350368741 |
| DOIs | |
| Publication status | E-pub ahead of print - 7 Mar 2025 |