Can We Hear from Events?
Generating Speech from Event Camera

Abstract

Traditional RGB-based speech generation suffers from a Temporal Granularity Mismatch: fixed camera exposure times inevitably blur the high-frequency articulatory transients that are essential for rendering emotional speech. To break this ceiling, we propose EventSpeech, a text-conditioned framework that pioneers the use of neuromorphic events for expressive speech generation, exploiting the fact that these microsecond-precise events naturally align with acoustic waveform dynamics. Our architecture combines a dedicated Event Encoder, which models sparse neuromorphic events, with a multi-scale Audio Encoder featuring a Hierarchical Wavelet Contextualizer (HWC). A bidirectional alignment mechanism synchronizes linguistic content and visual dynamics with dense acoustic features. Furthermore, we construct EVT-SPK, the first benchmark comprising large-scale synthetic data and real-world recordings from specialized neuromorphic hardware. Extensive evaluations demonstrate that EventSpeech significantly outperforms current baselines in preserving fine-grained emotions and resisting motion blur, establishing a new paradigm for multimodal speech generation.
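To make the notion of sparse, microsecond-stamped events more concrete, the sketch below bins a toy event stream into a time-resolved voxel grid, a representation commonly used in event-based vision. This page does not state how the Event Encoder actually ingests events, so the voxel grid, the function name `events_to_voxel_grid`, and the toy data are all illustrative assumptions rather than the EventSpeech implementation.

```python
import numpy as np

def events_to_voxel_grid(t, x, y, p, num_bins, height, width):
    """Accumulate signed event polarities into a (num_bins, height, width) grid.

    Hypothetical helper for illustration only; not taken from EventSpeech.
    t: integer timestamps in microseconds, x/y: pixel coordinates,
    p: polarity (1 = brightness increase, 0 = decrease).
    """
    grid = np.zeros((num_bins, height, width), dtype=np.float32)
    if t.size == 0:
        return grid
    t0 = t.min()
    span = max(int(t.max() - t0), 1)  # avoid division by zero
    # Map each timestamp to a temporal bin index in [0, num_bins - 1].
    bins = ((t - t0) / span * num_bins).astype(np.int64)
    bins = np.minimum(bins, num_bins - 1)
    # Signed contribution per event: +1 for ON events, -1 for OFF events.
    signs = np.where(p > 0, 1.0, -1.0).astype(np.float32)
    # np.add.at accumulates correctly even when indices repeat.
    np.add.at(grid, (bins, y, x), signs)
    return grid

# Toy stream of four events (timestamps in microseconds).
t = np.array([0, 1000, 2000, 4000])
x = np.array([0, 1, 1, 0])
y = np.array([0, 0, 0, 1])
p = np.array([1, 1, 0, 1])

grid = events_to_voxel_grid(t, x, y, p, num_bins=4, height=2, width=2)
print(grid.shape)
```

In this toy setting each temporal bin spans about one millisecond, whereas a conventional 30 fps camera would integrate roughly 33 ms of motion into a single frame. Choosing `num_bins` to match the acoustic frame rate is one possible way to line visual dynamics up with audio features, though that too is an assumption here.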

Video Demo

Method

Method Framework

Comparisons of Generated Speech Between EventSpeech and Baselines

Sample 1: No, the man was not drunk. You wanted how he could take up with this stranger.

GT

GT Spectrogram 1

EventSpeech-T (Ours)

EventSpeech-T Spectrogram 1

EventSpeech (Ours)

Ours Spectrogram 1

VALL-E 2

VALL-E2 Spectrogram 1

MATCHA-TTS

MATCHA-TTS Spectrogram 1

MMAudio+AS

MMAudio+AS Spectrogram 1

Diff-Foley+AS

Diff-Foley+AS Spectrogram 1

VTS

VTS Spectrogram 1

VoiceCraft-Dub

VoiceCraft-Dub Spectrogram 1

HPMDubbing

HPMDubbing Spectrogram 1

StyleDubber

StyleDubber Spectrogram 1

VTS+VE

VTS+VE Spectrogram 1

Sample 2: Dogs are sitting by the door.

GT

GT Spectrogram 2

EventSpeech-T (Ours)

EventSpeech-T Spectrogram 2

EventSpeech (Ours)

Ours Spectrogram 2

VALL-E 2

VALL-E2 Spectrogram 2

MATCHA-TTS

MATCHA-TTS Spectrogram 2

MMAudio+AS

MMAudio+AS Spectrogram 2

Diff-Foley+AS

Diff-Foley+AS Spectrogram 2

VTS

VTS Spectrogram 2

VoiceCraft-Dub

VoiceCraft-Dub Spectrogram 2

HPMDubbing

HPMDubbing Spectrogram 2

StyleDubber

StyleDubber Spectrogram 2

VTS+VE

VTS+VE Spectrogram 2

Sample 3: His superiors had also christis, saying it was the way for ittererhoner.

GT

GT Spectrogram 3

EventSpeech-T (Ours)

EventSpeech-T Spectrogram 3

EventSpeech (Ours)

Ours Spectrogram 3

VALL-E 2

VALL-E2 Spectrogram 3

MATCHA-TTS

MATCHA-TTS Spectrogram 3

MMAudio+AS

MMAudio+AS Spectrogram 3

Diff-Foley+AS

Diff-Foley+AS Spectrogram 3

VTS

VTS Spectrogram 3

VoiceCraft-Dub

VoiceCraft-Dub Spectrogram 3

HPMDubbing

HPMDubbing Spectrogram 3

StyleDubber

StyleDubber Spectrogram 3

VTS+VE

VTS+VE Spectrogram 3

Sample 4: Dogs are sitting by the door.

GT

GT Spectrogram 4

EventSpeech-T (Ours)

EventSpeech-T Spectrogram 4

EventSpeech (Ours)

Ours Spectrogram 4

VALL-E 2

VALL-E2 Spectrogram 4

MATCHA-TTS

MATCHA-TTS Spectrogram 4

MMAudio+AS

MMAudio+AS Spectrogram 4

Diff-Foley+AS

Diff-Foley+AS Spectrogram 4

VTS

VTS Spectrogram 4

VoiceCraft-Dub

VoiceCraft-Dub Spectrogram 4

HPMDubbing

HPMDubbing Spectrogram 4

StyleDubber

StyleDubber Spectrogram 4

VTS+VE

VTS+VE Spectrogram 4