Intelligent Speech Technology
Fall 2025
Course Information
Semester: Fall 2025
Credits: 2
Class Hours: 36 (18×2)
Instructor: Shuai Wang
Location: Nanyong Building, Room West 209
Time: Monday 2:00 - 3:50pm
Course Description
Speech is the most natural form of human communication, and intelligent speech technology has given machines the ability to “understand” and “speak”. From the emergence of voice assistants such as Siri, to the widespread adoption of smart home and in-car voice systems, to breakthroughs in multimodal large models like GPT-4o, this technology is profoundly transforming our way of life.
This course will systematically explore the core principles and cutting-edge applications of intelligent speech technology. We will begin with human speech production mechanisms and auditory systems, then delve into traditional technologies including speech recognition, voiceprint modeling, speech synthesis, voice conversion and speech separation, while also examining new developments in speech technology in the era of large language models. The course follows a teaching approach that combines theory with practice, helping students master both theoretical foundations and practical application skills through hands-on projects.
Learning Objectives
By the end of this course, students will be able to:
- Understand the fundamental principles of speech signal processing
- Master key technologies in speech recognition, synthesis, and enhancement
- Apply machine learning techniques to speech processing tasks
- Implement practical speech processing applications
- Critically evaluate research papers in the field
- Understand the latest developments in speech technology
Prerequisites
- Basic knowledge of linear algebra and calculus
- Programming experience in Python
- Familiarity with machine learning concepts
- Basic knowledge of deep learning and its tools (PyTorch, TensorFlow, etc.)
Course Schedule
| Week | Date | Topic | Materials | Comments |
|---|---|---|---|---|
| 1 | 2025.8.25 | Course Introduction & Overview of Speech Technology | Slides Demos | - |
| 2 | 2025.9.1 | Overview of Speech Technology (continued) | Slides | - |
| 3 | 2025.9.8 | Fundamentals of Speech Signal Processing | Slides | - |
| 4 | 2025.9.15 | Introduction to Automatic Speech Recognition | Slides WER Demo | - |
| 5 | 2025.9.22 | Traditional ASR Models (GMM-HMM / DNN-HMM) | Slides | Assignment 1 out |
| 6 | 2025.9.29 | End-to-End ASR Models | Slides | - |
| 7 | 2025.10.6 | - | - | National Holiday |
| 8 | 2025.10.13 | Speaker Modeling (Part 1) | Slides | - |
| 9 | 2025.10.20 | Speaker Modeling (Part 2) | Slides | - |
| 10 | 2025.10.27 | Speech Synthesis (Part 1) | Slides | Invited talk by Jingbei Li, StepAudio |
| 11 | 2025.11.3 | Speech Synthesis (Part 2) | Slides | - |
| 12 | 2025.11.10 | Speech Synthesis (Part 3) | Slides | - |
| 13 | 2025.11.17 | Voice Conversion | Slides | Assignment 2 out |
| 14 | 2025.11.24 | Speech Separation | Slides | - |
| 15 | 2025.12.1 | Self-Supervised Learning for Speech | Slides | - |
| 16 | 2025.12.8 | Speech Processing with Large Language Models | Slides | - |
| 17 | 2025.12.15 | Applications of Speech Processing in Industry | Slides | Invited Talks |
| 18 | 2025.12.22 | Final Project Presentation | - | Last Class |
| 20 | 2026.1.5 | - | - | Final Project due |
Course Materials
Recommended Textbooks
- [1] Jurafsky D, Martin J H. Speech and language processing: An introduction to natural language processing, computational linguistics, and speech recognition with language models. 2025 Edition.
- [2] https://speechprocessingbook.aalto.fi/index.html
- [3] Xuedong Huang, Alex Acero, Hsiao-Wuen Hon. Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, 2001.
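- [4] Han Jiqing, Zhang Lei, Zheng Tieran. Speech Signal Processing (《语音信号处理》). Tsinghua University Press.
- [5] Hong Qingyang, Li Lin. Speech Recognition: Principles and Applications (《语音识别:原理与应用》). Publishing House of Electronics Industry.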
Recommended Readings
Recent papers from top conferences (ICASSP, Interspeech, ACL, etc.)
Software & Tools
- Python 3.8+ - Primary programming language
- PyTorch/TensorFlow - Deep learning frameworks
- Librosa - Audio processing library
- WeNet/WeSpeaker/WeSep - Open-source speech processing toolkits
- ESPNet/SpeechBrain/Kaldi - Additional speech processing frameworks
Grading Policy
- Attendance: 15% (from Week 4 to Week 18)
- Homework Assignments: 20% (2 assignments)
- Final Project: 65% (1 project)
Assignments
Homework 1: TBD
Due: Week 8
Description: TBD
Homework 2: TBD
Due: Week 13
Description: TBD
Final Project: TBD
Due: Week 20
Description: TBD