In the post-epidemic era, smart voice, as an important interface for human-computer interaction, is entering a golden age of development. By 2030, consumer-facing intelligent voice applications are estimated to exceed 70 billion yuan, and enterprise-level scenarios are expected to reach roughly 100 billion yuan.
Automatic Speech Recognition (ASR) takes speech as its object of study, using speech signal processing and pattern recognition to let machines automatically recognize and understand human spoken language. In other words, speech recognition technology enables machines to convert voice signals into corresponding text or commands through a process of recognition and understanding. It is a wide-ranging interdisciplinary field, closely related to acoustics, phonetics, linguistics, information theory, pattern recognition theory, and neurobiology.
The current mainstream speech recognition technology is based on the basic theory of statistical pattern recognition. A complete speech recognition system can be roughly divided into three parts:
1. Speech feature extraction: The purpose is to extract from the speech waveform a sequence of speech features that changes over time.
2. Acoustic model and pattern matching (recognition algorithm): The acoustic model is the underlying model of the recognition system and its most critical component. It is usually trained on the acquired speech features, with the goal of establishing a pronunciation template for each modeling unit. During recognition, the features of unknown speech are matched against the acoustic model by computing the distance between the unknown feature-vector sequence and each pronunciation template. The design of the acoustic model is closely tied to the pronunciation characteristics of the language, and the choice of modeling unit (word pronunciation model, half-syllable model, or phoneme model) strongly affects the amount of training data required, the system's recognition rate, and its flexibility.
3. Semantic understanding: The computer performs grammatical and semantic analysis on the recognition results in order to understand the meaning of the language and respond accordingly. This is usually supported by a language model.
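The template-matching step described above can be sketched with dynamic time warping (DTW), a classic way to measure the distance between two feature sequences of different lengths. This is a minimal illustration, not a production recognizer: the one-dimensional "feature vectors" and the `recognize` helper are invented stand-ins for real MFCC sequences and a real decoder.

```python
import numpy as np

def dtw_distance(seq_a, seq_b):
    """Dynamic time warping distance between two feature sequences.

    seq_a: (n, d) array and seq_b: (m, d) array of d-dimensional frames.
    """
    n, m = len(seq_a), len(seq_b)
    # cost[i, j] = minimal accumulated distance aligning seq_a[:i] with seq_b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(seq_a[i - 1] - seq_b[j - 1])  # frame distance
            cost[i, j] = d + min(cost[i - 1, j],      # skip a frame in seq_b
                                 cost[i, j - 1],      # skip a frame in seq_a
                                 cost[i - 1, j - 1])  # match both frames
    return cost[n, m]

def recognize(features, templates):
    """Return the label of the pronunciation template closest to the input."""
    return min(templates, key=lambda label: dtw_distance(features, templates[label]))

# Hypothetical tiny templates (1-D frames) standing in for trained pronunciations.
templates = {
    "yes": np.array([[0.0], [1.0], [0.0]]),
    "no":  np.array([[1.0], [1.0], [1.0]]),
}
unknown = np.array([[0.1], [0.9], [0.1], [0.0]])  # slightly longer, noisy "yes"
result = recognize(unknown, templates)
```

Because DTW allows frames to be skipped or repeated along the alignment path, it tolerates the timing variations that make fixed frame-by-frame comparison unreliable for speech.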
Speech Recognition Application Scenarios
● Intelligent Driving
In intelligent travel, AI voice technology provides a variety of new interaction methods spanning vehicle control, social interaction, and entertainment, so the driver's attention is no longer tied up by complicated settings and buttons. This improves the driving experience and can enhance driving safety to a certain extent.
● Smart Home
Natural interaction with machines through listening, speaking, and watching makes smart home appliances more human-friendly. This AI voice technology also brings great convenience to the use and operation of everyday entertainment products.
● Education
A virtual teacher built on AI voice interaction combined with VR technology is not limited by the number of available teachers: it can teach one-on-one and analyze performance precisely to improve students' learning outcomes. Speech assessment and human-computer dialogue technology, combined with semantic analysis, can quickly correct pronunciation, rhythm, and grammatical errors, and are gradually being applied to examination scenarios.
● Healthcare
Voice chatbots can address long-standing inefficiencies in the medical market. They reduce costs and the time burden on medical staff while improving the experience for patients.
● Finance
Using ASR, banks have implemented basic services such as voice navigation, voice transactions, and business handling. The insurance industry uses ASR for the recording and transcription of conversations between salespeople and customers.
● Logistics
In voice picking, warehouse workers interact with a voice system through a Bluetooth headset. Traditionally, picking instructions were relayed from person to person, which is time-consuming and costly; with speech recognition and synthesis technology, warehouse operators can communicate directly with the warehouse management system.
Speech Recognition Challenge
1. Recognition in harsh scenarios
The first challenge a speech recognition system faces is recognition in harsh scenarios. In complex conditions such as long-distance and noisy environments, various noises, reverberation, and even interjected speech from other people can easily corrupt and pollute the speech signal, greatly reducing recognition accuracy. This places higher demands on the robustness of the speech recognition model.
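One common way to harden a model against such conditions is data augmentation: mixing clean training utterances with noise at controlled signal-to-noise ratios. The sketch below uses a synthetic tone and white noise purely for illustration; the sample rate and SNR value are arbitrary choices, not recommendations.

```python
import numpy as np

def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` and add it to `clean` so the result has the target SNR.

    The noise is scaled so that 10 * log10(P_clean / P_noise) == snr_db.
    """
    clean_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(clean_power / (noise_power * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Synthetic example: one second of a 440 Hz tone polluted with white noise.
rng = np.random.default_rng(0)
sr = 16000                                  # assumed sample rate (Hz)
t = np.arange(sr) / sr
clean = np.sin(2 * np.pi * 440 * t)
noisy = mix_at_snr(clean, rng.normal(size=sr), snr_db=10)
```

Training on such artificially corrupted copies, across a range of SNRs and noise types, is a standard way to improve robustness without collecting new field recordings.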
2. Mixed language recognition
Language mixing is another important problem in the speech recognition field. In traditional solutions, speech recognition systems for different languages are modeled independently, so Chinese-English mixed recognition raises several hard questions: how to effectively integrate and distinguish modeling units across languages, and how to acquire voice and text data for mixed Chinese-English scenes.
3. Terminology recognition
The recognition accuracy of professional vocabulary largely depends on the coverage of the language model's training corpus. Because industrial applications span so many fields, the training corpus inevitably suffers from sparsity: professional terms usually occur far less often than general-domain words, so they run a greater risk of being recognized as common words with similar pronunciation.
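This effect can be shown with a toy unigram language model using add-one (Laplace) smoothing: a professional term absent from a general-domain corpus keeps only the smoothing mass, so the decoder prefers a common near-homophone. The corpus, the term "stent", and its stand-in homophone "spent" are all invented for illustration.

```python
from collections import Counter

def laplace_unigram(corpus_tokens, vocab):
    """Unigram probabilities with add-one smoothing over a fixed vocabulary."""
    counts = Counter(corpus_tokens)
    total = len(corpus_tokens) + len(vocab)  # one extra count per vocab word
    return {w: (counts[w] + 1) / total for w in vocab}

# Toy general-domain corpus: the medical term "stent" never occurs,
# while its near-homophone "spent" is common.
corpus = "we spent time and we spent money and we spent effort".split()
vocab = set(corpus) | {"stent"}
probs = laplace_unigram(corpus, vocab)
```

With these counts, `probs["spent"]` exceeds `probs["stent"]`, so acoustically ambiguous input is resolved toward the common word; this is why domain-specific corpora or customized hot-word lists are needed for terminology-heavy applications.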
4. Low-resource language recognition
Speech recognition systems rely on large amounts of supervised data. For widely spoken languages such as Chinese and English, data resources are abundant and accuracy has reached a usable level. Low-resource languages, however, have small speaker populations and limited usage, so collecting data is relatively difficult and labeling it is costly.
About
Founded in 2011, Nexdata is a professional artificial intelligence data service provider committed to delivering high-quality training data and data services to global AI companies. Relying on its own data resources, technical advantages, and extensive data processing experience, Nexdata serves 1,000+ companies and institutions worldwide.
If you need data services, please feel free to contact us: info@nexdata.ai